Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning (2023)

Though Decision Trees look simple and intuitive, there is nothing very simple about how the algorithm decides on splits and how tree pruning occurs. In this post I take you through a simple example to understand the inner workings of Decision Trees.


Decision Trees are a popular and surprisingly effective technique, particularly for classification problems. But the seemingly intuitive interface hides complexities. The criteria for selecting variables and their hierarchy can be tricky to grasp, not to mention the Gini index, Entropy (wait, isn't that physics?) and information gain (isn't that information theory?). As you can see, there are lots of tricky points on which you can get stuck. The best way to understand Decision Trees is to work through a small example with sufficient complexity to demonstrate the common points at which one suddenly goes, 'not sure what happens here…?'.

This post is therefore more like a tutorial or a demo where I will work through a toy dataset that I have created to understand the following:

1. What is a decision tree: root node, sub nodes, terminal/leaf nodes

2. Splitting criteria: Entropy, Information Gain vs Gini Index

3. How do sub nodes split

4. Why do trees overfit and how to stop this

5. How to predict using a decision tree

So, let’s get demonstrating…

1. What does a Decision Tree do?

Let's begin at the real beginning, with the core problem. Suppose we are trying to classify whether a patient is diabetic or not based on various predictor variables such as fasting blood sugar, BMI, BP, etc. This is obviously a prediction problem for a new patient. We also have 1000 patient records to help us develop an understanding of which features are most useful in predicting. Unlike other classification algorithms such as Logistic Regression, Decision Trees have a somewhat different way of functioning and of identifying which variables are important.

The first thing to understand about Decision Trees is that they split the predictor space into sub groups that are relatively more homogeneous from the perspective of the target variable. For example, if the target variable is binary, with categories 1 and 0 (shown by green and red dots in the image below), then the decision tree works to split the space into sub groups that are more homogeneous in terms of having either 1's or 0's.

[Figure: the predictor space split into sub groups that are more homogeneous in 1's (green dots) and 0's (red dots)]

That is the overall concept. Let us begin with understanding the various elements of a decision tree.

Understanding components of a Decision Tree

A decision tree is a branching flow diagram or tree chart. It comprises the following components:

  • A target variable, such as diabetic or not, and its initial distribution
  • A root node: this is the node that begins the splitting process by finding the variable that best splits the target variable
  • Decision nodes: these are subsequent or intermediate nodes, where the target variable is again split further by other variables
  • Node purity: decision nodes are typically impure, i.e., a mixture of both classes of the target variable (0 and 1, or green and red dots in the image). Pure nodes are those that have only one class — hence the term pure. They have either green or red dots only in the image.
  • Leaf nodes or terminal nodes: these are the end points of the branches, where a class or numerical prediction is made

Let's see this visually.

[Figure: a decision tree with its root node, decision nodes and leaf nodes labelled]

In general, a decision tree takes a statement, hypothesis or condition and then decides whether the condition holds or not. The conditions are shown along the branches and the outcome of the condition, as applied to the target variable, is shown on the node.

Arrows leading away from a node indicate a condition which is being applied to the node. Arrows pointing to a node indicate a condition that is being satisfied.

This is the first level of the Decision Tree — understanding the flow of splitting the decision space into smaller spaces which ultimately become more and more homogenous in the target variable. This ultimately leads to a prediction.

Decision Trees offer tremendous flexibility in that we can use both numeric and categorical variables for splitting the target data. Categorical data is split along the different classes in the variable. Numeric data is a little trickier, as we have to choose thresholds for the condition being tested, such as <18 and ≥18, for example. A numeric variable can appear multiple times in the tree with different cut-offs or thresholds. Also, the same final classification can appear in more than one leaf.

The important things from a data science perspective are:

1. Flow of information through the Decision Tree

2. How does a Decision Tree select which variable to split on at decision nodes?

3. How does it decide that the tree has enough branches and that it should stop splitting?

Now let us look at a simplified toy example to understand the above process more concretely.

First the problem:

We have data for 15 students on whether they passed or failed an online ML exam. To understand the basic process we begin with a dataset comprising a binary target variable (Pass/Fail) and various binary or categorical predictor variables such as:

  • whether the student is enrolled in other online courses
  • whether the student is from a maths, computer science or other background
  • whether the student is working or not working

The dataset is given below:

[Table: the toy dataset of 15 students, with Student Background, Working Status, Other Online Courses and exam Result (Pass/Fail)]

Notice that only one variable, 'Student Background', has more than 2 levels or categories — Maths, CS, Others. It is one of the advantages of Decision Trees, compared to other classification models such as Logistic Regression or SVM, that we do not need to carry out one-hot encoding to turn these into dummy variables.

Let us first look at the flow of how a decision tree works and then we will dive into the complexities of how the decisions are actually made…

Flow of a Decision Tree

A decision tree begins with the target variable. This is usually called the parent node. The Decision Tree then makes a sequence of splits in hierarchical order of impact on this target variable. From the analysis perspective, the first node is the root node, associated with the first variable that splits the target variable.

To identify the root node we would evaluate the impact of all the variables that we have currently on the target variable to identify the variable that splits the exam Pass/Fail classes into the most homogenous groups. Our candidates for splitting this are: Background, Working Status and Other Online Courses.

What do we hope to achieve with this split? Suppose we begin with Working Status as the root node. This splits the data into 2 sub nodes, one each for Working and Not Working. The Pass/Fail counts are then tallied within each sub node.

[Figure: the Working Status split, with Working (5 Pass, 4 Fail) and Not Working (5 Pass, 1 Fail) sub nodes]

So, this is the basic flow of the Decision Tree. As long as there is a mixture of Pass and Fail in a sub node, there is scope to split further to try to get it down to only one category. This is termed the purity of the node. For example, Not Working has 5 Pass and 1 Fail, hence it is purer than the Working node, which has 5 Pass and 4 Fail. A leaf node would be one which contains either the Pass or the Fail class only.

A node which is impure can be branched further for improving purity. However, most of the time we do not necessarily go down to the point where each leaf is ‘pure’. It is also important to understand that each node is standalone and hence the attribute that best splits the ‘Working’ node may not be the one that best splits the ‘Not Working’ node.

Now, let us move on to the core of Decision Trees and its key questions:

How does the Tree decide which variable to branch on at each level?

Greedy top down approach

Decision trees follow a top-down, greedy approach known as recursive binary splitting. The approach is top-down because it begins at the top of the tree and then successively splits the predictor space. At each split the predictor space gets divided into 2, shown via two new branches pointing downwards. The algorithm is termed greedy because at each step of the process, the best split is made for that step. It does not project forwards and try to pick a split that might be more optimal for the overall tree.

The algorithm therefore evaluates all variables on some statistical criteria and then chooses the variable that performs best on the criteria.

Variable selection criterion

Here is where the true complexity and sophistication of decision trees lies. Variables are selected using a statistical criterion which is applied at each decision node. Variable selection in Decision Trees can be done via two approaches:

1. Entropy and Information Gain

2. Gini Index

Both criteria are broadly similar and seek to determine which variable would split the data to lead to the underlying child nodes being most homogenous or pure. Both are used in different Decision Tree algorithms. To add to the confusion, it is not clear which one is the preferred approach. So, one has to have an understanding of both.

Let us begin with Entropy and Information Gain criterion

What is Entropy?

Entropy is a term that comes from physics and means a measure of disorder. More specifically, we can define it as:

Entropy is a scientific concept as well as a measurable physical property that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the microscopic description of nature in statistical physics, and to the principles of information theory.

https://en.wikipedia.org/wiki/Entropy

In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes.

https://en.wikipedia.org/wiki/Entropy_(information_theory)

In the context of Decision Trees, entropy is a measure of disorder or impurity in a node. Thus, a node with a more mixed composition, such as 2 Pass and 2 Fail, would be considered to have higher Entropy than a node which has only Pass or only Fail. For a binary target, the maximum level of entropy or disorder is given by 1 and the minimum entropy is given by a value of 0.

Leaf nodes which have all instances belonging to 1 class would have an entropy of 0. Whereas, the entropy for a node where the classes are divided equally would be 1.

Entropy is measured by the formula:

Entropy = −Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the probability of randomly selecting an example in class i. Let us understand this a bit better in the context of our example. The initial entropy at the parent node is given by the probabilities of getting a Pass vs a Fail. In our dataset, the target variable has 9 passes and 6 fails. Hence the probabilities for the entropy formula are:

[Figure: the Pass and Fail probabilities plugged into the entropy formula for the parent node]
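To make the entropy formula concrete, here is a minimal Python sketch of the node entropy calculation (the function name and the use of raw class counts are my own choices for illustration):

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the class counts in that node, e.g. [n_pass, n_fail]."""
    total = sum(counts)
    # By convention 0 * log2(0) = 0, so classes with a zero count are skipped.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Example: the 'Not Working' node mentioned earlier has 5 Pass and 1 Fail
print(entropy([5, 1]))  # ~0.650, much purer than a 50/50 node (entropy = 1.0)
```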

Now, essentially what a Decision Tree does to determine the root node is to calculate the entropy for each variable and its potential splits. For each variable we form the potential split, calculate the average entropy across all of its sub nodes, and then the change in entropy relative to the parent node. This change in entropy is termed Information Gain and represents how much information a feature provides for the target variable.

Information Gain = Entropy_parent − Entropy_children

Entropy_parent is the entropy of the parent node and Entropy_children represents the weighted average entropy of the child nodes that follow this variable. In the current case there are 3 variables for which this calculation must be done from the perspective of the split:

1. Work Status

2. Online Course Status

3. Student Background

To calculate entropy, first let us put our formulas for Entropy and Information Gain in terms of variables in our dataset:

  1. Probability of Pass and Fail at each node, i.e., the pᵢ:

pᵢ = (no. of observations of class i in the node) / (total no. of observations in the node)

2. Entropy:

Entropy = −p(Pass) log₂ p(Pass) − p(Fail) log₂ p(Fail)

3. Average Entropy at child nodes:

Entropy_children = Σⱼ (nⱼ / N) × Entropy(child node j), where nⱼ is the number of observations in child node j and N is the total number of observations being split

Note, average Entropy is the weighted average of all the sub nodes that a parent node splits into. Thus, in our example this would be 2 sub nodes for Working Status and 3 sub nodes for Student Background.

4. Information Gain:

Information Gain = Entropy_parent − Entropy_children
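Putting steps 1 to 4 together, a small sketch of the weighted child entropy and Information Gain calculations might look like this (it reuses the entropy helper from the earlier sketch, and the list-of-class-counts representation of a split is an assumption for illustration):

```python
def weighted_child_entropy(splits):
    """Weighted average entropy of the child nodes of a split.

    splits: one list of class counts per child node,
    e.g. [[5, 4], [5, 1]] for the Working / Not Working split.
    """
    total = sum(sum(s) for s in splits)
    return sum((sum(s) / total) * entropy(s) for s in splits)

def information_gain(parent_counts, splits):
    """Entropy of the parent node minus the weighted entropy of its children."""
    return entropy(parent_counts) - weighted_child_entropy(splits)
```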

Parent Node Calculations

First we will calculate parent node entropy using the formula above. Use any log2 calculator online to calculate the log values. In our case they work out to:

[Figure: parent node entropy calculation using the Pass and Fail probabilities]

(Mathematical note: log to base 2 of anything less than 1 is a negative number, hence we multiply by a minus sign to get a positive number)

So far, this is just the entropy of the parent node. Now we have to decide which attribute or variable to use to split this to get the root node.

Calculating the Root Node

For this we have to form the potential split for each variable, calculate the average entropy across its sub nodes and then the change in entropy relative to the parent node.

Let us begin with Work Status variable and calculate the entropy of the split.

[Figure: entropy calculations for the Working and Not Working sub nodes]

We then calculate the average entropy for the Working Status split as a weighted average, with the weights being the share of total observations that fall in each sub node.

[Figure: weighted average entropy for the Working Status split]

Information Gain = Entropy_Parent − Entropy_Children = 0.9183 − 0.8119 = 0.1858

(Calculations are shown in the spreadsheet below)

In a similar fashion we can evaluate the entropy and information gain for the Student Background and Online Courses variables. The spreadsheet below shows the Entropy calculations for all the variables:

[Spreadsheet: entropy and Information Gain calculations for Work Status, Online Courses and Student Background]

To find the root node attribute we look at the Information Gain of each variable relative to the initial parent entropy. Student Background shows the maximum reduction, of 0.5370. Hence, this is the attribute that would be selected as the root node. The other variables, 'Working Status' and 'Online Courses', show a much lower decrease in entropy relative to the parent node.

So, on the basis of the above calculations, we have determined what the root node would be. The tree would now look as follows:

[Figure: the tree after the root split on Student Background, with Maths, CS and Others sub nodes]

Student Background splits the target variable into 3 groups. Everyone from CS background clearly passes and hence this is a terminal or leaf node. Everyone from Other backgrounds fails and this is also a terminal node. Maths background is split into 3 Pass and 4 Fail and hence it is impure and there is some scope for further splitting to attain greater purity.

Now to split the Maths background sub node, we need to calculate Entropy and Information Gain for the remaining variables, i.e., Working Status and Online Courses. We would then select the variable that shows the highest Information Gain.

The Entropy and Information Gain calculations for the Maths Background node can be seen in the table below. Notice that the Maths Background node is now the node being split, hence the average Entropy for the splits is calculated using it as the base.

Note: plugging log(0) into a calculator throws an error, but mathematically we can use the limit. Normally we just don't include the pⱼ = 0 case in the calculation. However, I have included it just to show the complete calculation.

By convention 0 log 0 = 0, since y log y → 0 as y → 0.

https://www.icg.isy.liu.se/courses/infotheory/lect1.pdf

The Entropy for each potential split is:

[Table: entropy and Information Gain for splitting the Maths node by Working Status and by Online Courses]

As we can see Information Gain is higher for the Working Status variable. Hence this is the variable used to continue branching.

[Figure: the tree after splitting the Maths node on Working Status]

We now see that the Maths node has split into one terminal node on the right and one node which is still impure. Notice that almost all our nodes are now terminal nodes; there is only one node which is not terminal. We could try splitting it further using Other Online Courses, but you get the picture. In any case, most Decision Trees do not necessarily split to the point where every node is a terminal node. Most algorithms have built-in stops, which we will discuss a little further down. Further, if the Decision Tree continues to split, we run into another problem, that of overfitting. We shall discuss that below, after we have briefly reviewed an alternative approach to building a Decision Tree using the Gini Index.


Gini Index

The other way of splitting a decision tree is via the Gini Index. The Entropy and Information Gain method focuses on purity and impurity in a node. The Gini Index or Gini Impurity measures the probability of a random instance being misclassified when chosen randomly. The lower the Gini Index, the lower the likelihood of misclassification.

The formula for the Gini Index is:

Gini = 1 − Σⱼ P(j)²

where j indexes the classes in the target variable (Pass and Fail in our example) and P(j) is the proportion of observations in the node belonging to class j.

So, let's take an example from the decision tree above. Let's begin with the root node and calculate the Gini Index for each of the splits. The Gini Index has a minimum (highest level of purity) of 0. For a two-class target it has a maximum value of 0.5; a Gini Index of 0.5 indicates a random assignment of classes.

Now let us calculate the Gini Index at the root for the Student Background attribute. In this case we have 3 sub nodes. The Gini formula requires us to calculate the Gini Index for each sub node and then take a weighted average to get the overall Gini Index for the split.

Maths sub node: 4 Pass, 3 Fail

Gini(Maths) = 1 − (4/7)² − (3/7)² ≈ 0.49

CS sub node: 4 Pass, 0 Fail

Gini(CS) = 1 − (4/4)² − (0/4)² = 0

Others sub node: 0 Pass, 4 Fail

Gini(Others) = 1 − (0/4)² − (4/4)² = 0

As we can see, the probability of misclassification in the CS node is zero, since everyone passes. Similarly there is no scope for misclassification in the Others node, as everyone fails. Only the Maths node has a possibility of misclassification, and at roughly 0.49 it is quite high, given that the maximum Gini Index is 0.5.

The overall Gini Index for this split is calculated, similarly to the entropy case, as a weighted average across the 3 sub nodes:

Gini(Student Background split) = (7/15)(0.49) + (4/15)(0) + (4/15)(0) ≈ 0.23
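The same sub node counts can be plugged into a small Gini helper, sketched below (the function names and the list-of-counts format are illustrative assumptions):

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(splits):
    """Weighted average Gini impurity over the child nodes of a split."""
    total = sum(sum(s) for s in splits)
    return sum((sum(s) / total) * gini(s) for s in splits)

# Student Background split: Maths (4 Pass, 3 Fail), CS (4, 0), Others (0, 4)
print(gini([4, 3]))                              # ~0.49
print(gini([4, 0]), gini([0, 4]))                # 0.0 0.0
print(weighted_gini([[4, 3], [4, 0], [0, 4]]))   # ~0.23
```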

Similarly, we can also compute the Gini Index for Working Status and Online Courses. These are given below:

Working/Not working

[Figure: Gini Index calculation for the Working Status split]

Online Courses

[Figure: Gini Index calculation for the Online Courses split]

The Gini Index is lowest for the Student Background variable. Hence, just as with the Entropy and Information Gain criterion, we pick this variable for the root node. In a similar fashion we would proceed down the tree, carrying out splits where node purity is lower.

Gini Index vs Information Gain

Depending on which impurity measurement is used, tree classification results can vary. This can have a small (or sometimes large) impact on your model. There is no single preferred approach across Decision Tree algorithms. For example, CART uses Gini, while ID3 and C4.5 use Entropy.

The Gini Index has a maximum impurity of 0.5 and a maximum purity (minimum value) of 0, whereas Entropy has a maximum impurity of 1 and a maximum purity of 0.
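As a quick check using the entropy and gini helper sketches from earlier in this post, a perfectly mixed two-class node hits both maxima:

```python
print(entropy([1, 1]))  # 1.0, maximum entropy for a 50/50 binary node
print(gini([1, 1]))     # 0.5, maximum Gini impurity for a binary node
```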

How does a prediction get made in Decision Trees

Now that we have understood, hopefully in detail, how Decision Trees carry out splitting and variable selection, we can move on to how they make predictions. Actually, once a tree is trained and tested, prediction is easy. The tree basically provides a flow chart based on various predictor variables. Suppose we have a new instance entering the flow along with its values for the different predictor variables. Unlike the training and test data, it will not have a class for the target attribute. We are trying to predict this class by moving the instance down the tree, testing its values of different predictor variables at different branches. Ultimately, the new instance will land in a leaf node and will be classified according to the class prevailing in that leaf node.

Suppose it looks like the below configuration:

[Figure: the new observation's attribute values, i.e. Maths background and Working]

Based on our tree, we would first follow the Maths branch and then the Working = Yes branch. That, as we have seen, is a leaf node, and the new observation would be classified on the basis of the majority vote in this node; since that is Pass, this new observation would also be predicted to be Pass.

In practice, when the algorithm is evaluating a new example and reaches a leaf node, the prediction is based on the modal value of categories in the leaf node. As seen in the above case, the Working node is not fully pure. However, we go with the prediction of the modal value, which is Pass. In general, most leaf nodes are not pure and, hence for categorical prediction, we use the modal value for prediction. If it is a numerical prediction (regression tree), we predict the mean value of the target values at each leaf node.
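To see the prediction mechanics end to end, here is a minimal, hypothetical scikit-learn sketch; the tiny DataFrame and its column names are invented for illustration and are not the toy dataset used in this post:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with the same flavour as the toy example
df = pd.DataFrame({
    "background":     ["Maths", "CS", "Others", "Maths", "CS", "Others"],
    "working":        ["Yes", "No", "Yes", "No", "Yes", "No"],
    "online_courses": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "result":         ["Pass", "Pass", "Fail", "Pass", "Pass", "Fail"],
})

# scikit-learn needs numeric inputs, so one-hot encode the categorical predictors
X = pd.get_dummies(df.drop(columns="result"))
y = df["result"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Encode a new student the same way, aligning columns with the training data,
# then let the tree route it to a leaf and return that leaf's majority class.
new = pd.get_dummies(pd.DataFrame(
    {"background": ["Maths"], "working": ["Yes"], "online_courses": ["No"]}
)).reindex(columns=X.columns, fill_value=0)
print(clf.predict(new))
```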

Overfitting and Decision Trees

Overfitting can be a big challenge with Decision Trees. Even in our toy example, we can see the algorithm continues to split until it reaches a leaf node. Often a leaf node may have just one or two instances. This clearly leads to a complex tree structure which may not generalize well to a test scenario, because each leaf represents a very specific combination of attribute values seen in the training data, and the tree will not be able to classify attribute combinations not seen in training. There are several ways to prevent the decision tree from becoming too unwieldy; three broad approaches to avoiding overfitting are usually distinguished:

  1. Pre-pruning or early stopping: preventing the tree from growing too big or deep
  2. Post-pruning: allowing the tree to grow to its full depth and then removing branches based on various criteria
  3. Ensembling, or averaging multiple models, such as Random Forest

We will only briefly review pre- and post-pruning techniques here. Ensemble techniques such as Random Forest require more explanation and hence will be tackled in a separate article.


Pruning

Pre pruning

The pre-pruning technique refers to the early stopping of the growth of the decision tree: we constrain the tree via hyperparameters before training. The hyperparameters of the DecisionTreeClassifier in scikit-learn include max_depth, min_samples_leaf and min_samples_split, which can be tuned to stop the growth of the tree early and prevent the model from overfitting. A good way is to use scikit-learn's GridSearchCV to find the best set of hyperparameters for a Decision Tree model.
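As a hedged sketch of this kind of pre-pruning, the hyperparameters mentioned above can be tuned with GridSearchCV (shown on a built-in dataset, and the grid values are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the early-stopping settings that cross-validate best
```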

A challenge with the early stopping approach is the 'horizon' problem: stopping early may prevent some more fruitful splits further down the tree.

Post Pruning

In this technique, we allow the tree to grow to its maximum depth and then remove parts of the tree to prevent overfitting. We effectively consider subtrees of the full tree, evaluate them on a criterion and then remove them, going 'up' the tree and converting subtrees back into leaves. The criterion for whether a particular consolidation goes through is usually MSE for regression trees and classification error for classification trees.

A challenge with post-pruning is that a decision tree can grow very deep and large, and hence evaluating every branch can be computationally expensive. An important post-pruning technique is Cost Complexity Pruning (CCP), which provides a more efficient solution in this regard. CCP is parameterized by the parameter α (ccp_alpha) in scikit-learn's DecisionTreeClassifier.

So, how does CCP work and what does it do?

The basic problem that CCP addresses is: how do we determine the best way to prune a tree? Intuitively, we would select a sub tree to prune such that its removal leads to a lower test error rate. This can be done using cross validation or, if we have a sufficient sample, the validation set approach. However, given the number of sub trees in a fully grown tree, even for a small sample, this is likely to be a very computation- and time-intensive process.

Cost complexity pruning — also known as weakest link pruning — gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α. …

The tuning parameter α controls a trade-off between the subtree’s complexity and its fit to the training data. When α = 0, then the subtree T will simply equal T0, because then (8.4) just measures the training error. However, as α increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree

An Introduction to Statistical Learning, P 333

Essentially the parameter α plays a role very similar to the penalty term in Lasso regression. The basic equation for CCP (equation 8.4 in James et al.) is given below:

Σ_{m=1}^{|T|} Σ_{xᵢ ∈ Rₘ} (yᵢ − ŷ_Rₘ)² + α|T|

This is one complex equation, but let's try to understand it a bit further. Some definitions:

Rₘ: the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node

ŷ_Rₘ: the predicted response associated with the mth terminal node

(yᵢ − ŷ_Rₘ)²: the squared error for the observations falling in terminal node m (we are using the regression tree version of the equation; for simplicity I am following the equation and approach in James et al. (2013))

|T|: the number of terminal nodes in the tree T

Now let's see what the equation is doing. We are essentially minimizing the cost or loss given by (yᵢ − ŷ_Rₘ)² summed across all terminal nodes. α is a term that multiplies the total number of terminal nodes in the tree. If α = 0, we are simply minimizing the training loss, and the pruned tree will be the same as the original tree. However, with α > 0, we add a penalty which increases with the number of terminal nodes |T|, so the overall cost tends to be minimized for a smaller subtree.
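A toy numerical sketch (with made-up loss values and leaf counts, purely for illustration) shows how the penalty tilts the choice towards smaller subtrees as α grows:

```python
# Hypothetical candidate subtrees: (training SSE, number of terminal nodes |T|)
candidates = {"full tree": (10.0, 8), "medium subtree": (12.0, 4), "stump": (20.0, 1)}

for alpha in [0.0, 1.0, 3.0]:
    costs = {name: sse + alpha * leaves for name, (sse, leaves) in candidates.items()}
    print(alpha, "->", min(costs, key=costs.get))
# alpha = 0.0 picks the full tree; larger alphas pick progressively smaller subtrees.
```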

Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first.

https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

What does this mean? It means that the algorithm hunts out subtrees whose removal barely increases the training loss; these have a small effective α and are pruned first. Subtrees that reduce the training loss substantially can justify their extra terminal nodes even with a larger penalty term.

To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process.

https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
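Following the scikit-learn example cited above, a minimal sketch of obtaining the candidate alphas might look like this (a built-in dataset is used as a stand-in for your own data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Effective alphas and the corresponding total leaf impurities at each pruning step
print(path.ccp_alphas[:5], path.impurities[:5])
```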

As α increases, more of the tree is pruned and we trade off bias against variance. With a larger ccp_alpha we effectively increase the bias of the model, i.e., we simplify it. On the con side, this means we have to tolerate increasing levels of impurity in the terminal nodes. We see that as α increases, both the number of nodes and the tree depth decrease.

How to determine the optimal α

Plotting ccp_alpha against train and test accuracy, we see that when α = 0 (keeping the other default parameters of DecisionTreeClassifier) the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better. At some point, however, further increases in α actually lead to a decrease in test accuracy as the model becomes too simplified. In this example, setting ccp_alpha = 0.015 maximizes the testing accuracy (refer to https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html for details).

[Figure: training and testing accuracy vs ccp_alpha, from the scikit-learn cost complexity pruning example]
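Continuing the sketch above, one way to pick α is to fit one tree per candidate alpha and compare train and test accuracy, mirroring the scikit-learn example (the best value will depend on your data):

```python
# Fit one pruned tree per effective alpha found by cost_complexity_pruning_path
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]
train_scores = [c.score(X_train, y_train) for c in clfs]
test_scores = [c.score(X_test, y_test) for c in clfs]

best = max(range(len(clfs)), key=lambda i: test_scores[i])
print(f"best ccp_alpha ~ {path.ccp_alphas[best]:.4f}, test accuracy {test_scores[best]:.3f}")
```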

Advantages and Disadvantages of Decision Trees

Advantages

1. Trees give a visual schema of the relationship of variables used for classification and hence are more explainable. The hierarchy of the tree provides insight into variable importance.

2. At times they can actually mirror decision making processes.

3. Decision Trees are a white box model: results are explainable and we can trace back how each prediction was made. This is in contrast to black box models such as neural networks.

4. In general there is less need to prepare and clean the data, such as normalizing, one-hot encoding categorical variables, or imputing missing values.

Note that the scikit-learn implementation currently does not support categorical variables, so we do need to create dummy variables. Similarly, it does not support missing values. Both can, however, be handled by the algorithm in theory.

5. Model can be validated statistically

Disadvantages

1. Prone to overfitting and hence lower predictive accuracy

2. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem can be mitigated by, for example, using decision trees within an ensemble.

3. Can be non-robust, i.e., a small change in the data can cause a large change in the final estimated tree

4. Predictions are approximate, based on the relevant terminal nodes. Hence it may not be the best method for extrapolating the results of the model to unseen cases.

5. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.

So, that's it for Decision Trees, from the start to at least two thirds of the way; there are a lot of complexities, hence I cannot say 'to the end'. I hope you liked this blog on the inner workings of Decision Trees. One thing is clear: this is far from a simple technique. I have so far reviewed only how the variable hierarchy is chosen, how the tree structure is built up and how pruning is done. There are many types of Decision Tree algorithms, including ID3, C4.5, C5.0 and CART (scikit-learn uses an optimised version of CART). Exploration of these models is for another blog.


References

  1. G. James, D. Witten, T. Hastie, and R. Tibshirani (2013), An Introduction to Statistical Learning: with Applications in R, Springer.
  2. Satyam Kumar, '3 Techniques to Avoid Overfitting of Decision Trees', Towards Data Science, https://towardsdatascience.com/3-techniques-to-avoid-overfitting-of-decision-trees-1e7d3d985a09
  3. https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
  4. https://www.sciencedirect.com/topics/computer-science/cost-complexity#:~:text=Cost%2Dcomplexity%20pruning%20is%20based,leaf%20of%20the%20subtree%20concerned.
  5. https://en.wikipedia.org/wiki/Entropy_(information_theory)
