What is entropy in a decision tree?

Entropy - And other measures of impurity in data

This article is part 1 of 4 in the series of articles, Machine Learning Using Decision Tree Techniques.

Hierarchical classification models, which include decision tree methods, split a set of data iteratively or recursively with the aim of separating the target values (classes) as cleanly as possible during training (the training phase of supervised learning), i.e. of obtaining unambiguous class assignments for particular feature values. The data is split according to an information gain, which is calculated for the classification from a measure of impurity (in the next article of this series we will calculate with the entropy!).

Entropy as a measure of impurity in data

Entropy is the most commonly used measure of impurity in computer science. It is used with a similar meaning in thermodynamics and fluid dynamics, where it refers not to data but to the distribution of temperature or of gas/liquid mixtures. For example, if you imagine a KiBa drink (a cherry-banana mixture), the purity is high if you see a two-tone drink in front of you (e.g. yellow at the bottom, red at the top). The impurity is high if the drink is well mixed, so that the two colors can no longer be distinguished.

In machine classification we prefer purity, because the more distinguishable two (or more) classes are, the more reliable the classification becomes. We must therefore be able to evaluate data with regard to its impurity, which we do with entropy in the computer-science sense:

With two classes to be separated, where p is the probability of one class (and 1 - p that of the other), the entropy is H(p) = - p * log2(p) - (1 - p) * log2(1 - p).

Implemented in Python, the calculation could look like this, for example:

import math

def entropy(p):
    # Binary entropy for class probability p
    return - p * math.log2(p) - (1 - p) * math.log2(1 - p)

The entropy is at its maximum (1.00 with two classes) when the classes within a set are equally frequent. If one class becomes quantitatively dominant, the probability of that class increases accordingly and the entropy decreases. As a measure of impurity, entropy only works mathematically for probabilities greater than 0 and less than 1, which we should catch explicitly:

def entropy(p):
    # A pure set (p = 0 or p = 1) has zero entropy; this also avoids evaluating log2(0)
    if p == 0 or p == 1:
        return 0.0
    return - p * math.log2(p) - (1 - p) * math.log2(1 - p)
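
As a quick sanity check, the guarded function could be called like this:

print(entropy(0.5))   # 1.0 -> maximum impurity, both classes equally frequent
print(entropy(0.9))   # ~0.469 -> one class dominates, the entropy decreases
print(entropy(1.0))   # 0.0 -> pure set, caught by the guard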

Let's look at the entropy visually:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 1.0, 0.001)
ent = [entropy(p) for p in x]

fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(111)
ax.plot(x, ent, color='red', lw=2)
plt.show()

The Gini coefficient as a measure of impurity

The Gini coefficient works similarly to entropy and reaches its maximum, 0.5, when the classes occur with exactly the same frequency. Compared to entropy, the Gini coefficient therefore grows more flatly (degressively) and reacts somewhat less sensitively to changes in the class frequencies. For two classes it is Gini(p) = p * (1 - p) + (1 - p) * (1 - (1 - p)) = 2 * p * (1 - p).

In Python:

def gini(p):
    # Gini impurity for two classes, equivalent to 2 * p * (1 - p)
    return p * (1 - p) + (1 - p) * (1 - (1 - p))
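
A quick comparison of the two maxima, again only as a sanity check with the functions defined above:

print(gini(0.5))      # 0.5 -> maximum of the Gini coefficient
print(entropy(0.5))   # 1.0 -> maximum of the entropy
print(gini(0.75))     # 0.375 -> drops as one class starts to dominate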

The classification error as a measure of impurity

The classification error is even less sensitive to changes in the class frequencies (= class probabilities). For two classes it is simply Error(p) = 1 - max(p, 1 - p):

In Python:

import numpy as np

def error(p):
    # Misclassification error: 1 - max(p, 1 - p)
    return 1 - np.max([p, 1 - p])
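
The linear behavior is easy to see from a few values (again only an illustrative check):

print(error(0.5))     # 0.5 -> maximum, classes equally frequent
print(error(0.75))    # 0.25
print(error(0.875))   # 0.125 -> decreases linearly with growing class dominance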

The classification error should not be used for growing decision trees, especially for multi-class problems, but it is a recommended measure for the subsequent pruning of tree branches (decision trees tend towards large, complex models and overfitting, which can be corrected by pruning).
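
As a practical aside, a minimal sketch with scikit-learn (not used elsewhere in this article; note that scikit-learn prunes via cost-complexity pruning rather than directly via the classification error) shows where the impurity measure and the pruning strength are chosen:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion selects the impurity measure for the splits: 'gini' or 'entropy'
# ccp_alpha > 0 activates cost-complexity pruning and shrinks the tree
tree = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.01)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())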

Impurity - Bringing it all together

Let us now bring entropy, the Gini coefficient and the classification error together in one plot:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 1.0, 0.001)
ent = [entropy(p) for p in x]
ent_scaled = [e * 0.5 for e in ent]
gin = [gini(p) for p in x]
err = [error(p) for p in x]

fig = plt.figure(figsize=(8, 7))
ax = plt.subplot(111)

# the linestyles are chosen here only to keep the four curves distinguishable
for i, lab, ls, c in zip([ent, ent_scaled, gin, err],
                         ['Entropy', 'Entropy (scaled)', 'Gini coefficient', 'Classification error'],
                         ['-', '-', '--', '-.'],
                         ['red', 'orange', 'black', 'black']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)

ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.30), ncol=3)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='-')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='-')
plt.show()

In addition to the “normal” entropy we have also plotted the scaled entropy (= entropy / 2) to show that the entropy rises noticeably more steeply than the Gini coefficient.

It is also noticeable that the classification error (visible as a triangle in the plot) reacts significantly less sensitively to changes in the class frequencies than the entropy and also the Gini coefficient, namely linearly in a two-class system.


Benjamin Aunkofer

Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.
