Decision trees are intuitive, flowchart-like models widely used in machine learning. They also serve as a fundamental building block for more complex ensemble models like Random Forests and Gradient Boosting Machines.
In this article, I’ll explore their mechanism with walkthrough examples.
What is a Decision Tree?
A decision tree is a tree-structured model used for both regression and classification problems in sectors such as finance, medicine, and education.
Decision trees’ key features include:
- Hierarchical: A tree structure (cf. boosting’s sequential structure).
- Non-parametric: No fixed number of parameters, no assumptions on data distribution.
- Versatile: Effectively used for both regression and classification.
- Impurity-Based Optimization: Optimize splitting points of a given dataset to maximize purity (or minimize impurity) of child nodes.
- To compute purity, the algorithm uses an impurity measure as its decision function.
- Recursive Data Splitting: Uses greedy algorithms to optimize the data split until it reaches a point where the data in each node achieves maximum purity.
- Interpretable: The resulting tree provides clear, interpretable decision rules that can be readily understood and applied for predictions or classifications.
How the Decision Tree Algorithm Works
The decision tree algorithm aims to identify optimal split points that categorize data into discrete groups.
Specifically, the process begins at the root node, which represents the entire dataset.
From the root, the data is successively split by decision nodes based on specific features, forming branches.
This splitting continues until the structure reaches leaf nodes (also called terminal nodes).
Each leaf node represents an outcome or a decision.
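As a quick sketch of this structure in code (a minimal example assuming Scikit-learn and its bundled iris dataset rather than the article’s data), fitting a shallow tree and printing its rules makes the root node, decision nodes, and leaf nodes concrete:

```python
# Minimal sketch with Scikit-learn's bundled iris data (not the article's dataset).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree shallow so the printed decision rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# export_text prints the root, decision nodes, and leaf nodes as nested if/else rules.
print(export_text(tree, feature_names=iris.feature_names))
```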
Objective of Tree Algorithm
Tree algorithms’ primary goal is to find a split point in the dataset that can achieve maximum purity (or minimum impurity) in the child nodes.
How to Define the “Purity”
Purity means creating subsets of data as homogeneous as possible with respect to the target variable.
The figure below illustrates a child node with high impurity (left) and maximum purity (right), which is chosen as the optimal split point:

In this section, I’ll walk you through Entropy and Gini Impurity. Both are impurity measures, but optimize different metrics.
1. Entropy: Maximizing Information Gain
Entropy quantifies the amount of uncertainty or randomness in a dataset; how mixed the classes are within the node.
Entropy ranges from zero (perfect purity) to log2(C) (perfect impurity, where the samples in the node are evenly distributed among the C classes).
- High Entropy = High Uncertainty = Lots of “surprise” potential. For instance, flipping a fair coin gives a 50:50 chance of heads or tails; you don’t know which you’ll get.
- Low Entropy = Low Uncertainty = Little to no “surprise” potential. For instance, picking a ball from a bag full of yellow balls; you know exactly what you’ll get.
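As a tiny sketch of these two cases (standalone Python with hypothetical class proportions), entropy follows directly from the proportions:

```python
import math

def entropy(proportions):
    """Shannon entropy in bits for a list of class proportions."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, maximum uncertainty for two classes
print(entropy([1.0]))       # bag of only yellow balls: 0.0, no uncertainty
```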
If an algorithm uses Entropy, it aims to maximize Information Gain, which means maximizing the reduction in entropy from the parent node to the child nodes.
The Information Gain
Information gain quantifies a feature’s effectiveness within a tree structure, indicating its usefulness in separating data into distinct classes.
This usefulness is defined by how much a data split reduces impurity during classification.
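As a minimal sketch (hand-rolled helpers, not a library API), information gain is the parent node’s entropy minus the size-weighted entropy of its child nodes:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = (labels.count(v) for v in set(labels))
    return sum(-(c / n) * math.log2(c / n) for c in counts)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy example: a split that separates the classes perfectly yields the maximum gain.
parent = ["Yes", "Yes", "No", "No"]
print(information_gain(parent, [["Yes", "Yes"], ["No", "No"]]))  # 1.0
```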
2. Gini Impurity: Maximizing Gini Gain
Gini Impurity quantifies how often a randomly chosen element from the set would be incorrectly labeled.
Key characteristics of Gini Impurity include:
- Ranges from zero (perfect purity) to 1 − 1/C (perfect impurity).
- A lower Gini impurity indicates a more homogeneous node.
- Computationally faster than entropy-based methods.
If an algorithm uses Gini Impurity, it aims to maximize Gini Gain (or minimize Gini Impurity), meaning maximizing the reduction in Gini impurity from a parent node to its child nodes.
The Gini Gain
Gini impurity is calculated by summing the squared probabilities of each class and subtracting that sum from one. Gini gain is then the parent node’s Gini impurity minus the size-weighted Gini impurity of its child nodes.
Gini impurity is the default splitting criterion used by the CART (Classification and Regression Trees) algorithm, which is widely implemented in ML libraries like Scikit-learn.
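As a minimal sketch (hand-rolled helpers rather than a library API), Gini impurity and Gini gain can be computed directly from class counts:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def gini_gain(parent, children):
    """Parent Gini impurity minus the size-weighted Gini of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# Toy example: a perfectly separating split recovers the full parent impurity as gain.
parent = ["Yes", "Yes", "Yes", "No", "No"]
print(round(gini(parent), 2))                                              # 0.48
print(round(gini_gain(parent, [["Yes", "Yes", "Yes"], ["No", "No"]]), 2))  # 0.48
```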
***
Defining how to compute the gain is crucial: it determines how decision tree algorithms learn and select the optimal split at each node to create effective predictive models.
Greedy Algorithm to Maximize Gain
After choosing an impurity measure to quantify the gain, the model uses a greedy algorithm to find the split points that generate the highest gain.
In this section, I’ll provide an overview of three major algorithms, then walk through examples in a later section.
1. Exact Greedy Algorithm
This method considers every unique value of a continuous feature as a potential split point.
- Pros: Can precisely calculate the gain for each, leading to the most accurate split finding.
- Cons: Computationally intensive and slow for large datasets with many unique values.
2. Approximate Greedy Algorithm
This method addresses scalability by proposing candidate split points based on quantiles of feature distributions.
It buckets continuous features and searches these approximate ranges for optimal splits, balancing accuracy with speed.
- Pros: Faster than the exact algorithm. Handles massive datasets that do not fit in memory, where full histograms cannot be built.
- Cons: Less precise and potentially less optimal than the Histogram method.
3. Histogram-based Algorithm
This method pre-bins continuous features into fixed discrete bins (histograms).
Splits are then rapidly found by iterating through these pre-built histograms, offering significant speedups at the cost of some precision.
- Pros: The most efficient method for large datasets.
- Cons: Might miss the absolute perfect split by a tiny margin (yet, practically often negligible).
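Before the walkthrough, here is a minimal sketch of the exact greedy idea on a tiny toy feature (hand-rolled helpers; Gini gain is used here, but entropy works the same way). It tries a threshold at every midpoint between sorted distinct values, without the class-change shortcut:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def best_split(values, labels):
    """Exact greedy: try a threshold at every midpoint between sorted distinct values."""
    xs = sorted(set(values))
    best_threshold, best_gain = None, -1.0
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = gini(labels) - (len(left) / len(labels) * gini(left)
                               + len(right) / len(labels) * gini(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

hours = [1.0, 2.0, 3.0, 4.0, 5.0]
target = ["No", "No", "Yes", "Yes", "Yes"]
print(best_split(hours, target))  # best threshold 2.5, gain ≈ 0.48
```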
The Walkthrough Example
Using the following scenario as an example, I’ll walk step by step through how a decision tree works with each of the three algorithms.
Scenario: Predicting Customer Subscription
Manager’s Goal:
A subscription service manager wants to predict which new sign-ups are most likely to convert into long-term subscribers (target variable: is_long_term: Yes/No).
They have access to data on internet_usage_hrs_day (how much time a user spends online daily) and device_preference (their primary device: Mobile, Desktop, or Tablet).
The Data:
| ID | internet_usage_hrs_day | device_preference | is_long_term |
|----|------------------------|-------------------|--------------|
| 1 | 1.2 | Mobile | No |
| 2 | 2.8 | Desktop | No |
| 3 | 3.1 | Mobile | Yes |
| 4 | 4.5 | Desktop | Yes |
| 5 | 5.9 | Mobile | Yes |
| 6 | 6.3 | Desktop | Yes |
| 7 | 7.7 | Tablet | Yes |
| 8 | 8.4 | Mobile | No |
| 9 | 9.1 | Desktop | No |
| 10 | 10.5 | Tablet | Yes |
* internet_usage_hrs_day contains continuous values, while device_preference contains discrete (categorical) values.
Step 1. Compute Initial Impurity
The root node begins by processing the entire dataset:
- Total Samples: 10
- Class Distribution: 6 ‘Yes’, 4 ‘No’
The initial proportion of samples belonging to each class (Yes, No) is:
- p_Yes = 6/10 = 0.6
- p_No = 4/10 = 0.4
Substituting these values into the impurity measures:
- Initial Entropy: E(D) = −(0.6 log2(0.6) + 0.4 log2(0.4)) ≈ 0.971
- Initial Gini Impurity: G(D) = 1 − (0.6² + 0.4²) = 0.48
*For demonstration purposes, I’ll compute both measures. In practice, we’d choose one and stick with it until the end of the process.
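These two values can be verified with a couple of lines of standalone Python:

```python
import math

p_yes, p_no = 6 / 10, 4 / 10

initial_entropy = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
initial_gini = 1 - (p_yes ** 2 + p_no ** 2)

print(round(initial_entropy, 3))  # 0.971
print(round(initial_gini, 3))     # 0.48
```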
Step 2. Choosing the Optimization Algorithm
The manager selects the split-finding algorithm based on their priorities, which vary across the following scenarios.
Scenario 1.
Manager: “For this initial pilot, our dataset is quite small, and I don’t think computational expense would be a problem. I prioritize achieving the most accurate split and generate insightful decision rules first.”
→ Choose Exact Greedy Algorithm
Scenario 2.
Manager: “If this pilot scales to millions of users, an exact search would be too slow. Let me start the process with random split points instead of all the splits to see if we still get a good enough split without freezing our system.”
→ Choose Approximate Greedy Algorithm
Scenario 3.
Manager: “I’m considering expanding our datasets with many more features to gain deeper insights. The expansion would require intensive model training. So, I need to prioritize the training speed. I’m willing to accept a slight loss of precision in split points for the speed increase.”
→ Choose Histogram-based Algorithm
Step 3. Computing Gains
I’ll demonstrate how to compute gains and decide which feature to prioritize to make a prediction on long-term subscribers.
The common steps across the scenarios are:
- Select split points by feature,
- Compute entropy or Gini impurity, then compute the gains, and
- Choose the split point with the maximum gain.
Scenario 1. Apply Exact Greedy Algorithm
Feature 1: internet_usage_hrs_day
With continuous values, the number of potential split points is technically unbounded. In practice, algorithms usually consider midpoints between sorted, distinct values where the target class changes:
- Split 1: internet_usage_hrs_day <= 2.95 (midpoint of 2.8 and 3.1)
- Split 2: internet_usage_hrs_day <= 8.05 (midpoint of 7.7 and 8.4)

I’ll evaluate these splits by computing the entropy and Gini impurity of each child node, and then computing the gains.
Split 1.
Left Child
- 2 samples: 2 ‘No’, 0 ‘Yes’
- p_No = 2/2=1
- p_Yes = 0/2=0
- Entropy: E(D) = 0 (Perfectly pure)
- Gini: G(D) = 0 (Perfectly pure)
Right Child
- 8 samples: 2 ‘No’, 6 ‘Yes’
- p_No = 2/8 = 0.25
- p_Yes = 6/8 = 0.75
- Entropy: E(D) = −((0.25) log2(0.25) + (0.75) log2(0.75)) ≈ −(−0.5 − 0.311) ≈ 0.811
- Gini: G(D) = 1 − (0.25² + 0.75²) = 1 − 0.625 = 0.375
→ Gains
- Information Gain (Entropy): IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322
- Gini Gain (Gini Impurity): ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.30 = 0.180
The following tree illustrates the computation of information gain (left) and Gini gain (right) from the root node to the first level:

Split 2.
Left Child
- 7 samples: 2 ‘No’, 5 ‘Yes’
- p_No = 2/7
- p_Yes = 5/7
- Entropy: E(D)=− ((2/7) log2(2/7) + (5/7) log2(5/7) ) ≈ −(−0.517 − 0.346) ≈ 0.863
- Gini: G(D)=1− ((2/7)²+(5/7)²) = 1 − 29/49 ≈0.408
Right Child
- 3 samples: 2 ‘No’, 1 ‘Yes’
- p_No = 2/3
- p_Yes = 1/3
- Entropy: E(D) = −((2/3) log2(2/3) + (1/3) log2(1/3) ) ≈ −(−0.390 − 0.528) ≈ 0.918
- Gini: G(D) = 1−((2/3)² + (1/3)²) = 1 − 5/9 ≈ 0.444
→ Gains
- Information Gain: IG = 0.971 − (7/10 × 0.863 + 3/10 × 0.918) = 0.971 − 0.879 = 0.092
- Gini Gain: ΔG = 0.48 − (7/10 × 0.408 + 3/10 × 0.444) = 0.48 − 0.419 = 0.061
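These gains can be reproduced with a short standalone script (the entropy and gini helpers below are hand-rolled for illustration, not library functions):

```python
import math

hours = [1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5]
target = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"]

def entropy(ys):
    n = len(ys)
    return sum(-(ys.count(v) / n) * math.log2(ys.count(v) / n) for v in set(ys))

def gini(ys):
    n = len(ys)
    return 1.0 - sum((ys.count(v) / n) ** 2 for v in set(ys))

for threshold in (2.95, 8.05):
    left = [y for x, y in zip(hours, target) if x <= threshold]
    right = [y for x, y in zip(hours, target) if x > threshold]
    w_left, w_right = len(left) / len(target), len(right) / len(target)
    ig = entropy(target) - (w_left * entropy(left) + w_right * entropy(right))
    gg = gini(target) - (w_left * gini(left) + w_right * gini(right))
    print(f"<= {threshold}: IG={ig:.3f}, Gini gain={gg:.3f}")

# <= 2.95: IG=0.322, Gini gain=0.180
# <= 8.05: IG=0.091, Gini gain=0.061 (the 0.092 above reflects rounded intermediates)
```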
Feature 2: device_preference
For categorical features, each category can be assigned to a separate node for potential splitting:
Split 1: Node ‘Mobile’
- 3 Samples: 2 ‘No’, 1 ‘Yes’
- p_Yes = 1/3
- p_No = 2/3
- Entropy ≈ 0.918
- Gini ≈ 0.444
Split 2: Node ‘Desktop’
- 4 Samples: 3 ‘No’, 1 ‘Yes’
- p_Yes = 1/4
- p_No = 3/4
- Entropy ≈ 0.811
- Gini ≈ 0.375
Split 3: Node ‘Tablet’
- 2 Samples: 0 ‘No’, 2 ‘Yes’
- p_Yes = 2/2 = 1
- p_No = 0/2 = 0
- Entropy = 0 (purity)
- Gini = 0 (purity)
→ Gains:
- Information Gain: IG = 0.971 − (3/10 × 0.918 + 4/10 × 0.811 + 2/10 × 0) = 0.971 − 0.599 = 0.372
- Gini Gain: ΔG = 0.48 − (3/10 × 0.444 + 4/10 × 0.375 + 2/10 × 0) = 0.48 − 0.283 = 0.197
Decision Making
With either Entropy or Gini impurity, splitting by the categories of device_preference turns out to be the optimal split, as it outperforms the gains of internet_usage_hrs_day.
This result indicates that in this dataset, device preference is a stronger predictor than daily internet usage hours for whether a customer becomes a long-term subscriber.
| Impurity Method | Feature | Breaking Point | Gain |
|-----------------|---------|----------------|------|
| Entropy | internet_usage_hrs_day | ≤2.95 | 0.322 |
| Entropy | internet_usage_hrs_day | ≤8.05 | 0.092 |
| Entropy | device_preference | Mobile, Desktop, Tablet | 0.372 ← WINNER |
| Gini | internet_usage_hrs_day | ≤2.95 | 0.180 |
| Gini | internet_usage_hrs_day | ≤8.05 | 0.061 |
| Gini | device_preference | Mobile, Desktop, Tablet | 0.197 ← WINNER |
(Note – Gini and Entropy shouldn’t be compared directly as their mathematical formulations differ as we saw in the previous section.)
Scenario 2. Apply Approximate Greedy Algorithm
Feature 1: internet_usage_hrs_day
For numerical features, approximate greedy algorithms evaluate only a random subset of potential split points to save computation.
So, the manager decides to split at random points of 5.0 and 9.0.
- Split 1: internet_usage_hrs_day <= 5.0
- Split 2: internet_usage_hrs_day <= 9.0

Split 1.
Left Child
- 5 samples: 2 ‘No’, 3 ‘Yes’
- p_YES = 3/5 = 0.6
- p_No = 2/5 = 0.4 (happens to be the same distribution as the root node)
- Entropy ≈ 0.971
- Gini = 0.48
Right Child
- 5 samples: 2 ‘No’, 3 ‘Yes’
- p_YES = 3/5 = 0.6
- p_No = 2/5 = 0.4
- Entropy ≈ 0.971
- Gini = 0.48
→ Gains:
- Information Gain: IG = 0.971 − (5/10 × 0.971 + 5/10 × 0.971) = 0.971 − 0.971 = 0.0
- Gini Gain: ΔG = 0.48 − (5/10 × 0.48 + 5/10 × 0.48) = 0.48 − 0.48 = 0.0
Split 2
Left Child
- 9 samples: 4 ‘No’, 5 ‘Yes’
- p_YES = 5/9
- p_No = 4/9
- Entropy ≈0.991
- Gini ≈0.494
Right Child
- 1 sample: 0 ‘No’, 1 ‘Yes’
- p_YES = 1
- p_No = 0
- Entropy = 0
- Gini = 0
→ Gains:
- Information Gain: IG = 0.971 − (9/10 × 0.991 + 1/10 × 0) = 0.971 − 0.892 = 0.079
- Gini Gain: ΔG = 0.48 − (9/10 × 0.494 + 1/10 × 0) = 0.48 − 0.445 = 0.035
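For reference, rather than hand-picking points, approximate methods more commonly propose candidate thresholds from quantiles of the feature distribution, as described earlier. A small sketch assuming NumPy:

```python
import numpy as np

hours = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])

# Propose candidate thresholds at the quartiles instead of checking
# every midpoint as the exact greedy approach would.
candidates = np.quantile(hours, [0.25, 0.5, 0.75])
print(candidates)  # [3.45  6.1   8.225]
```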
Feature 2: device_preference
As with Exact Greedy, the algorithm typically still evaluates all categories of a categorical feature, so I’ll reuse the results from Scenario 1.
- Information Gain: IG = 0.372
- Gini Gain: ΔG = 0.197
Decision Making
Similar to Scenario 1, under either impurity measure, splitting by the categories of device_preference performed the best.
| Impurity Method | Feature | Breaking Point | Gain |
|-----------------|---------|----------------|------|
| Entropy | internet_usage_hrs_day | ≤5.0 | 0.0 |
| Entropy | internet_usage_hrs_day | ≤9.0 | 0.079 |
| Entropy | device_preference | Mobile, Desktop, Tablet | 0.372 ← Best |
| Gini | internet_usage_hrs_day | ≤5.0 | 0.0 |
| Gini | internet_usage_hrs_day | ≤9.0 | 0.035 |
| Gini | device_preference | Mobile, Desktop, Tablet | 0.197 ← Best |
Scenario 3. Apply Histogram-based Algorithm
Feature 1: internet_usage_hrs_day
The histogram-based greedy algorithm first bins numerical features into a fixed number of bins and then only considers splits at the bin boundaries.
The manager decides to create 3 bins: Hrs <= 3.0, 3.0 < Hrs <= 7.0, and Hrs > 7.0, so that only splits at Hrs <= 3.0 and Hrs <= 7.0 are considered.
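A small sketch of this pre-binning step, assuming NumPy (the bin edges 3.0 and 7.0 are the manager’s choices above):

```python
import numpy as np

hours = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])

# Pre-bin the continuous feature at fixed boundaries; splits are then
# only considered at these bin edges.
bin_edges = [3.0, 7.0]
bins = np.digitize(hours, bin_edges, right=True)  # 0: <=3.0, 1: (3.0, 7.0], 2: >7.0
print(np.bincount(bins))  # [2 4 4] samples per bin
```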
Split 1. Hrs <= 3.0
Left Child
- 2 samples: 2 ‘No’, 0 ‘Yes’
- p_Yes = 0
- p_No = 1
- Entropy = 0
- Gini = 0
Right Child
- 8 samples: 2 ‘No’, 6 ‘Yes’
- p_Yes = 6/8
- p_No = 2/8
- Entropy ≈ 0.811
- Gini ≈ 0.375
→ Gains:
- Information Gain: IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322
- Gini Gain: ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.30 = 0.180
Split 2. Hrs <= 7.0
Left Child
- 7 samples: 2 ‘No’, 5 ‘Yes’
- p_Yes = 5/7
- p_No = 2/7
- Entropy ≈ 0.863
- Gini ≈ 0.408
Right Child
- 3 samples: 2 ‘No’, 1 ‘Yes’
- p_Yes = 1/3
- p_No = 2/3
- Entropy ≈ 0.918
- Gini ≈ 0.444
→ Gains:
- Information Gain: IG = 0.971 − (7/10 × 0.863 + 3/10 × 0.918) = 0.971 − 0.879 = 0.092
- Gini Gain: ΔG = 0.48 − (7/10 × 0.408 + 3/10 × 0.444) = 0.48 − 0.419 = 0.061
Feature 2: device_preference
For categorical features, the histogram-based algorithm takes the same approach as Exact and Approximate Greedy.
- Information Gain: IG = 0.372
- Gini Gain: ΔG = 0.197
Decision Making
Similar to the other scenarios, splitting by the categories of device_preference performed the best.
However, for the numerical feature internet_usage_hrs_day, the split at Hrs <= 3.0 resulted in an Information Gain of 0.322 and a Gini Gain of 0.180.
Interestingly, these gains are identical to the highest gains found by the Exact Greedy algorithm for internet_usage_hrs_day (at ≤2.95).
This demonstrates that if bin boundaries align closely with optimal split points, the Histogram-based method can achieve similar results with much higher efficiency.
| Impurity Method | Feature | Breaking Point | Gain |
|-----------------|---------|----------------|------|
| Entropy | internet_usage_hrs_day | ≤3.0 | 0.322 |
| Entropy | internet_usage_hrs_day | ≤7.0 | 0.092 |
| Entropy | device_preference | Mobile, Desktop, Tablet | 0.372 ← Best |
| Gini | internet_usage_hrs_day | ≤3.0 | 0.180 |
| Gini | internet_usage_hrs_day | ≤7.0 | 0.061 |
| Gini | device_preference | Mobile, Desktop, Tablet | 0.197 ← Best |
Summary of Findings
Across all three greedy algorithms, device_preference consistently yields the highest gain and thus represents the optimal split point for the initial node.
Dominance of the Categorical Feature
- In Exact Greedy, device_preference achieved an Information Gain of 0.372 and a Gini Gain of 0.197, the highest gains across all evaluated splits.
- In Approximate Greedy, device_preference maintained its Information Gain of 0.372 and Gini Gain of 0.197, again surpassing all tested splits.
- In the Histogram-based Greedy approach, device_preference continued to exhibit gains of 0.372 (Entropy) and 0.197 (Gini), outperforming its numerical counterparts.
Variations in Numerical Feature Handling
- Exact Greedy performed an exhaustive search, finding the highest gains for internet_usage_hrs_day at the ≤2.95 split (Entropy: 0.322, Gini: 0.180). This represents the maximum gain achievable for this feature under this split-derivation method.
- Approximate Greedy, by evaluating only a random subset of split points (≤5.0 and ≤9.0), yielded significantly lower gains. This clearly demonstrates the trade-off: the simplified search missed the higher gain available for the numerical feature.
- Histogram-based Greedy, by evaluating splits only at predefined bin boundaries (≤3.0 and ≤7.0), found a split at ≤3.0 that produced an Information Gain of 0.322 and a Gini Gain of 0.180. These gains are identical to the maximum achieved by the Exact Greedy algorithm for this feature, indicating that the chosen bin boundary aligned closely with a highly effective split point in this particular dataset.
Conclusion
Decision trees are versatile and interpretable models that make predictions by recursively splitting the data on the most informative input features.
In the experiment, we observed that the choice of algorithm and impurity measure significantly influences a decision tree’s structure and predictive performance.
We also noted that decision trees are prone to overfitting.
In practice, tree-based models employ regularization techniques, such as limiting tree depth or pruning, to improve generalization and prevent overfitting.
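For example, a few common regularization knobs in Scikit-learn look like this (illustrative values, not tuned for any particular dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Limit depth, require a minimum number of samples per leaf,
# and prune with cost-complexity pruning (ccp_alpha).
tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=3,
    min_samples_leaf=5,
    ccp_alpha=0.01,
    random_state=42,
)
```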
Author: Kuriko IWAI
All images, unless otherwise noted, are by the author. The article utilizes synthetic datasets.