What is entropy in a decision tree?
Entropy and other measures of impurity in data
This article is part 1 of 4 in the series of articles, Machine Learning Using Decision Tree Techniques.
Hierarchical classification models, which include decision trees, break a data set down iteratively or recursively with the aim of separating the target values (classes) as cleanly as possible during the training phase of supervised learning, i.e. of arriving at unambiguous class assignments for particular feature values. The data is split according to an information gain, which is calculated for the classification using a measure of impurity (in the next article in the series we will calculate it using entropy!).
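As a preview of the calculation covered in the next article, here is a minimal sketch of how information gain could be computed for a binary split, assuming entropy as the impurity measure. The function names (`entropy_of_labels`, `information_gain`) are illustrative, not from the article; note that this helper takes a list of class labels rather than a probability.

```python
import math

def entropy_of_labels(labels):
    """Shannon entropy of a list of class labels (illustrative helper)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Impurity of the parent node minus the weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy_of_labels(left) \
             + (len(right) / n) * entropy_of_labels(right)
    return entropy_of_labels(parent) - weighted

# A split that separates the two classes perfectly recovers the full
# parent entropy as gain:
parent = ['A', 'A', 'B', 'B']
print(information_gain(parent, ['A', 'A'], ['B', 'B']))  # 1.0
```

A split that leaves both children as mixed as the parent would yield a gain of 0; the tree-building algorithm greedily picks the split with the highest gain.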
Entropy as a measure of impurity in data
Entropy is the most commonly used measure of impurity in computer science. It is used with a similar meaning in thermodynamics and fluid dynamics, where it refers not to data but to the distribution of temperature / gas / liquid mixtures. For example, if you imagine a KiBa drink (a cherry-banana mixture), the purity is high if you see a two-tone drink in front of you (e.g. yellow below, red above). The impurity is high if the drink is so well mixed that the two colors can no longer be distinguished.
In machine classification, we prefer purity, because the more distinguishable two (or more) classes are, the more reliable the classification. We must therefore be able to evaluate data with regard to its impurity, which we do with entropy in the computer-science sense:
With two classes to be separated, the entropy is: H(p) = −p · log₂(p) − (1 − p) · log₂(1 − p)
Implemented in Python, the calculation could look as follows, for example:

import math

def entropy(p):
    return - p * math.log2(p) - (1 - p) * math.log2(1 - p)
The entropy is at its maximum (with two classes: 1.00) when the classes within a superset are equally frequent. If one class becomes quantitatively dominant, the probability for that class increases accordingly and the entropy decreases. Mathematically, entropy as a measure of impurity only works with probabilities greater than 0 and less than 1, which we should guard against:
import math

def entropy(p):
    if p == 0 or p == 1:
        return 0  # a pure node has no impurity
    return - p * math.log2(p) - (1 - p) * math.log2(1 - p)
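A quick sanity check of the guarded function at a few probabilities (the definition is repeated here so the snippet runs on its own; values rounded):

```python
import math

def entropy(p):
    if p == 0 or p == 1:
        return 0  # a pure node has no impurity
    return - p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5))           # 1.0 -- maximum impurity: two equally frequent classes
print(round(entropy(0.9), 3)) # 0.469 -- one class dominates, entropy drops
print(entropy(1.0))           # 0 -- a pure node
```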
Let's look at the entropy visually:

[Figure: entropy curve over the class probability p, peaking at 1.0 for p = 0.5]
The Gini coefficient as a measure of impurity
The Gini coefficient behaves similarly to entropy and also reaches its maximum when the classes are exactly equally frequent; for two classes this maximum is 0.5. Compared to entropy, the Gini coefficient therefore grows in a flatter, degressive manner and reacts somewhat less sensitively to changes in class frequencies.
In Python:
def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))
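For two classes this expression simplifies algebraically to 2 · p · (1 − p), which peaks at 0.5 for p = 0.5. A quick check (the simplification is basic algebra, not from the article):

```python
def gini(p):
    # Gini impurity for two classes: sum over classes of p_i * (1 - p_i)
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

print(gini(0.5))  # 0.5 -- the maximum, at equal class frequencies
print(gini(0.9))  # ~0.18 -- same value as 2 * 0.9 * 0.1
```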
The classification error as a measure of impurity
The classification error is even less sensitive to changes in the class frequencies (= class probabilities):
In Python:
import numpy as np

def error(p):
    return 1 - np.max([p, 1 - p])
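The same spot checks for the classification error; it is simply the share of samples that a majority vote at the node would misclassify (a plain-Python variant without NumPy could use the built-in max(p, 1 - p) instead):

```python
import numpy as np

def error(p):
    # 1 minus the probability of the majority class
    return 1 - np.max([p, 1 - p])

print(error(0.5))  # 0.5 -- maximum: no majority class exists
print(error(0.9))  # ~0.1 -- 10% of samples would be misclassified
```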
The classification error should not be used to construct decision trees, especially with more than two classes, but it is the recommended measure for the subsequent pruning of tree branches (decision trees tend to produce large, complex models and to overfit, which pruning can correct).
Impurity: bringing it all together
Let us now bring entropy, the Gini coefficient and the classification error together in one plot:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 1.0, 0.001)
ent = [entropy(p) for p in x]
ent_scaled = [e * 0.5 for e in ent]
err = [error(i) for i in x]

fig = plt.figure(figsize=(8, 7))
ax = plt.subplot(111)
for i, lab, ls, c in zip([ent, ent_scaled, gini(x), err],
                         ['Entropy', 'Entropy (scaled)', 'Gini coefficient', 'Classification error'],
                         ['-', '-', '--', '-.'],
                         ['red', 'orange', 'black', 'black']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.30), ncol=3)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.show()
In addition to the "normal" entropy, we have also plotted the scaled entropy (= entropy / 2) to show that entropy still rises more steeply than the Gini coefficient.
It is also noticeable that the classification error (visible as a triangle in the figure) reacts significantly less sensitively to changes in the class frequencies than entropy and also the Gini coefficient; in a two-class system it is in fact linear.
References to literature
The following books provide a good basis for machine learning and also explain the decision tree procedure:
Benjamin Aunkofer
Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.