
The Gini Index and Entropy are both impurity measures used in decision tree algorithms to determine how well a dataset is split, but they have slightly different interpretations and properties.
<aside> 💡
Impurity quantifies how mixed or homogeneous the classes within a dataset are.
But how are they different? Both metrics aim to measure the impurity of a dataset, but they capture it in different ways.
</aside>
$\color{black}\rule{365px}{3px}$
<aside> <img src="/icons/bookmark_lightgray.svg" alt="/icons/bookmark_lightgray.svg" width="40px" />
Technical Definition
$$ \text{Gini}(D)=1- \sum_{i=1}^C{p_i}^2
$$
where:
- $C$ is the number of classes
- $p_i$ is the proportion of samples in $D$ belonging to class $i$
Example
Imagine a dataset of colors $D=$ $\{$Red, Blue$\}$ where $p_{\text{Red}}=0.7$ and $p_{\text{Blue}}=0.3$.
Then,
$$ \text{Gini}(D)=1-(0.7^2+0.3^2)=1-(0.49+0.09)=1-0.58=0.42 $$
</aside>
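The worked example above can be sketched in a few lines of Python. This is a minimal illustrative implementation (the function name `gini` and the toy dataset are ours, not from any particular library):

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum of p_i^2."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# Dataset with 70% Red and 30% Blue, as in the example above
data = ["Red"] * 7 + ["Blue"] * 3
print(round(gini(data), 2))  # 0.42
```

A pure node (a single class) gives `gini(...) == 0`, the minimum, while an even split over many classes pushes the value toward 1.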
<aside> <img src="/icons/bookmark_lightgray.svg" alt="/icons/bookmark_lightgray.svg" width="40px" />
Intuitive Explanation
The Gini Index measures the probability that two samples drawn at random from the dataset belong to different classes. The formula computes this by subtracting the probability that two random samples belong to the same class from 1 (the total probability).

Letβs revisit the formula and interpret it.
$$ \text{Gini}(D)=1- \sum_{i=1}^C{p_i}^2
$$
$$ \begin{align*} \text{Gini}(D) &= P(\text{two random samples being different class}) \\&= 1 - P(\text{two random samples being same class})\\ &= 1 - \sum_{i=1}^Cp_i^2 \end{align*} $$
So, in the end, the Gini Index measures how likely it is to misclassify a randomly chosen sample, by examining how uneven the distribution of classes in a node is. Imagine a node containing only one class: there is absolutely NO chance of misclassifying a sample in that node, and the Gini Index is 0. However, if there were many classes with an equally likely distribution, a random guess would very likely be wrong, and the Gini Index approaches its maximum!
</aside>
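The pair-sampling interpretation above can be checked empirically: draw two samples at random (with replacement) many times and count how often they disagree. This is a hypothetical simulation sketch using the same 70/30 Red/Blue dataset; the variable names are ours:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

data = ["Red"] * 7 + ["Blue"] * 3

# Estimate P(two random samples belong to different classes)
trials = 100_000
diff = sum(random.choice(data) != random.choice(data) for _ in range(trials))
print(diff / trials)  # close to 0.42, matching Gini(D)
```

The empirical frequency converges to $1-(0.7^2+0.3^2)=0.42$, confirming that the Gini Index really is the probability of drawing a mismatched pair.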