
3 Greedy Algorithms for Decision Trees, Explained with Examples

Decision trees are intuitive, flowchart-like models widely used in machine learning.

In machine learning, they serve as a fundamental building block for more complex ensemble models like Random Forests and Gradient Boosting Machines.

In this article, I’ll explore their mechanism with walkthrough examples.

What is a Decision Tree?

A decision tree is a tree-structured model used for both regression and classification problems in sectors such as finance, medicine, and education.

Decision trees’ key features include:

  • Hierarchical: A tree-structured model (cf. boosting, which builds models sequentially).
  • Non-parametric: No fixed number of parameters, no assumptions on data distribution.
  • Versatile: Effectively used for both regression and classification.
  • Impurity-Based Optimization: Optimizes the split points of a given dataset to maximize the purity (or minimize the impurity) of the child nodes, using an impurity measure as the decision function.
  • Recursive Data Splitting: Uses greedy algorithms to split the data recursively until each node reaches maximum purity (or another stopping criterion is met).
  • Interpretable: The resulting tree provides clear, interpretable decision rules that can be readily understood and applied for predictions or classifications.

How the Decision Tree Algorithm Works

The decision tree algorithm aims to identify optimal split points that categorize data into discrete groups.

Specifically, the process begins at the root node, which represents the entire dataset.

From the root, the data is successively split by decision nodes based on specific features, forming branches.

This splitting continues until the structure reaches leaf nodes (also called terminal nodes).

Each leaf node represents an outcome or a decision.

Fig. Decision Tree Structure (Created by Kuriko IWAI)

Objective of Tree Algorithm

Tree algorithms’ primary goal is to find a split point in the dataset that can achieve maximum purity (or minimum impurity) in the child nodes.

How to Define the “Purity”

Purity means creating subsets of data that are as homogeneous as possible with respect to the target variable.

The figure below illustrates a child node with high impurity (left) and one with maximum purity (right), which is chosen as the optimal split point:

Fig. Finding the optimal split with maximum purity (Created by Kuriko IWAI)

In this section, I’ll walk you through Entropy and Gini Impurity. Both are impurity measures, but they optimize different metrics.

1. Entropy: Maximizing Information Gain

Entropy quantifies the amount of uncertainty or randomness in a dataset, i.e., how mixed the classes are within a node. For a node with class proportions p_c over C classes, E(D) = −Σ_c p_c log₂(p_c).

Entropy ranges from zero (perfect purity) to log₂(C) (maximum impurity, where the samples in the node are evenly distributed among the C classes).

  • High Entropy = High Uncertainty = Lots of “surprise” potential. For instance, flipping a fair coin gives a 50:50 chance of heads or tails. You don’t know if it’s heads or tails.
  • Low Entropy = Low Uncertainty = Little to no “surprise” potential. For instance, picking a yellow ball from a bag full of yellow balls. You know what you’ll get, 100% of the time.

If an algorithm uses Entropy, it aims to maximize Information Gain, which means maximizing the reduction in entropy from the parent node to the child nodes.

The Information Gain

Information gain quantifies a feature’s effectiveness within a tree structure, indicating its usefulness in separating data into distinct classes.

This usefulness is defined by how much a data split reduces impurity during classification.
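
To make this concrete, here is a minimal NumPy sketch of entropy and information gain (the helper names `entropy` and `information_gain` are my own for illustration, not functions of any particular library):

```python
import numpy as np

def entropy(labels):
    """Entropy of a node: -sum(p_c * log2(p_c)) over the classes c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Reduction in entropy from the parent node to its child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A node with 6 'Yes' and 4 'No' samples (the root node of the walkthrough below)
print(round(entropy(["Yes"] * 6 + ["No"] * 4), 3))  # 0.971
```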

2. Gini Impurity: Maximizing Gini Gain

Gini Impurity quantifies how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of classes in the node.

Key characteristics of Gini Impurity include:

  • Ranges from zero (perfect purity) to 1 − 1/C (maximum impurity).
  • A lower Gini impurity indicates a more homogeneous node.
  • Computationally faster than entropy-based methods.

If an algorithm uses Gini Impurity, it aims to maximize Gini Gain (or minimize Gini Impurity), meaning maximizing the reduction in Gini impurity from a parent node to its child nodes.

The Gini Gain

The Gini impurity of a node is calculated by summing the squared probabilities of each class and subtracting that sum from one. Gini gain is then the parent’s Gini impurity minus the weighted average Gini impurity of the child nodes.

Gini impurity is the default criterion used by the CART (Classification and Regression Trees) algorithm, which is widely implemented in ML libraries like Scikit-learn.
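
Analogously, here is a minimal sketch of Gini impurity and Gini gain (again, `gini` and `gini_gain` are illustrative helper names rather than library functions):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum(p_c^2) over the classes c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, children):
    """Parent Gini impurity minus the weighted Gini impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# The same node with 6 'Yes' and 4 'No' samples
print(round(gini(["Yes"] * 6 + ["No"] * 4), 2))  # 0.48
```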

***

Defining how to compute the gain is crucial to how decision tree algorithms learn and choose the optimal split at each node to create effective predictive models.

Greedy Algorithm to Maximize Gain

After defining an impurity measure to quantify the gain, the model uses a greedy algorithm to find the split point that yields the highest gain at each node of the tree.

In this section, I’ll provide an overview of three major algorithms, then walk through examples in a later section.

1. Exact Greedy Algorithm

This method considers every unique value of a continuous feature as a potential split point (see the sketch below).

  • Pros: Precisely calculates the gain for every candidate, finding the most accurate split.
  • Cons: Computationally intensive and slow for large datasets with many unique values.
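
A rough sketch of this exhaustive search over candidate midpoints, assuming an entropy-based gain (`node_entropy` and `exact_greedy_split` are hypothetical helper names, not library functions):

```python
import numpy as np

def node_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def exact_greedy_split(x, y):
    """Evaluate every midpoint between consecutive sorted feature values and
    return the threshold with the highest information gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = node_entropy(y)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                               # skip duplicate values
        threshold = (x[i] + x[i - 1]) / 2          # candidate split point
        left, right = y[x <= threshold], y[x > threshold]
        weighted = (len(left) * node_entropy(left)
                    + len(right) * node_entropy(right)) / len(y)
        gain = parent - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

x = np.array([1.2, 2.8, 3.1, 4.5])
y = np.array(["No", "No", "Yes", "Yes"])
print(exact_greedy_split(x, y))   # ≈ (2.95, 1.0): the midpoint where the class flips
```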

2. Approximate Greedy Algorithm

This method addresses scalability by proposing candidate split points based on quantiles of feature distributions.

It buckets continuous features and searches these approximate ranges for optimal splits, balancing accuracy with speed (see the example below).

  • Pros: Faster than the exact algorithm, and works well with massive datasets that do not fit in memory.
  • Cons: Less precise than the exact algorithm, so the chosen split may be suboptimal.
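
To illustrate the idea of quantile-based candidates (the real quantile-sketch procedures used by libraries such as XGBoost are more sophisticated; this simply uses NumPy percentiles as an assumption-laden stand-in):

```python
import numpy as np

def quantile_candidates(x, num_candidates=3):
    """Propose candidate split points at evenly spaced quantiles of a feature."""
    qs = np.linspace(0, 100, num_candidates + 2)[1:-1]   # skip the 0th/100th percentiles
    return np.unique(np.percentile(x, qs))

usage_hours = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])
print(quantile_candidates(usage_hours))   # ≈ [3.45, 6.1, 8.225] (25th/50th/75th percentiles)
```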

3. Histogram-based Algorithm

This method pre-bins continuous features into fixed discrete bins (histograms).

Splits are then found rapidly by iterating through these pre-built histograms, offering significant speedups at the cost of some precision (see the binning example below).

  • Pros: The most efficient method for large datasets.
  • Cons: Might miss the absolute best split by a tiny margin (though in practice this is often negligible).
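
A sketch of the pre-binning step using NumPy (real histogram-based implementations, such as LightGBM or scikit-learn's HistGradientBoosting estimators, build and cache these histograms far more efficiently):

```python
import numpy as np

usage_hours = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])

# Pre-bin the continuous feature into a fixed number of bins
edges = np.histogram_bin_edges(usage_hours, bins=3)   # ≈ [1.2, 4.3, 7.4, 10.5]
bin_ids = np.digitize(usage_hours, edges[1:-1])       # bin index for each sample

# Splits are only considered at the interior bin edges (≈ 4.3 and 7.4 here),
# so at most (bins - 1) candidate thresholds are evaluated per feature.
print(edges[1:-1], bin_ids)
```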

The Walkthrough Example

Using the following scenario as an example, I’ll walk through, step by step, how a decision tree works with each of the three algorithms.

Scenario: Predicting Customer Subscription

Manager’s Goal:

A subscription service manager wants to predict which new sign-ups are most likely to convert into long-term subscribers (target variable: is_long_term: Yes/No).

They have access to data on internet_usage_hrs_day (how much time a user spends online daily) and device_preference (their primary device: Mobile, Desktop, Tablet).

The Data:

ID internet_usage_hrs_day device_preference is_long_term
1 1.2 Mobile No
2 2.8 Desktop No
3 3.1 Mobile Yes
4 4.5 Desktop Yes
5 5.9 Mobile Yes
6 6.3 Desktop Yes
7 7.7 Tablet Yes
8 8.4 Mobile No
9 9.1 Desktop No
10 10.5 Tablet Yes

* internet_usage_hrs_day contains continuous values; device_preference contains discrete (categorical) values.
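
For reference, here is the same dataset as a small pandas DataFrame (pandas is used only for convenience in this sketch):

```python
import pandas as pd

data = pd.DataFrame({
    "internet_usage_hrs_day": [1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5],
    "device_preference": ["Mobile", "Desktop", "Mobile", "Desktop", "Mobile",
                          "Desktop", "Tablet", "Mobile", "Desktop", "Tablet"],
    "is_long_term": ["No", "No", "Yes", "Yes", "Yes",
                     "Yes", "Yes", "No", "No", "Yes"],
}, index=pd.RangeIndex(1, 11, name="ID"))

print(data["is_long_term"].value_counts())   # Yes: 6, No: 4
```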

Step 1. Compute Initial Impurity

The root node begins by processing the entire dataset:

  • Total Samples: 10
  • Class Distribution: 6 ‘Yes’, 4 ‘No’

The initial proportion of samples belonging to each class (Yes, No) is:

  • p_Yes = 6/10 = 0.6
  • p_No = 4/10 = 0.4

Substituting these values into the impurity measures:

  • Initial Entropy: E(D) = −(0.6 log₂(0.6) + 0.4 log₂(0.4)) ≈ 0.971
  • Initial Gini Impurity: G(D) = 1 − (0.6² + 0.4²) = 0.48

*For demonstration purposes, I’ll compute both measures. In practice, we’d choose one and stick with it for the entire process.
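
These root-node values can be verified in a couple of lines (a quick check that computes the measures directly from the class proportions):

```python
import numpy as np

p = np.array([0.6, 0.4])                    # p_Yes, p_No at the root node

initial_entropy = -np.sum(p * np.log2(p))   # ≈ 0.971
initial_gini = 1.0 - np.sum(p ** 2)         # = 0.48

print(round(initial_entropy, 3), round(initial_gini, 2))
```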

Step 2. Choosing the Optimization Algorithm

The manager selects the optimization algorithm based on their priorities.

Scenario 1.

Manager: “For this initial pilot, our dataset is quite small, and I don’t think computational expense would be a problem. I prioritize achieving the most accurate split and generating insightful decision rules first.”

→ Choose Exact Greedy Algorithm

Scenario 2.

Manager: “If this pilot scales to millions of users, an exact search would be too slow. Let me start the process with a few candidate split points instead of evaluating every possible split, to see if we still get a good enough split without freezing our system.”

→ Choose Approximate Greedy Algorithm

Scenario 3.

Manager: “I’m considering expanding our datasets with many more features to gain deeper insights. The expansion would require intensive model training. So, I need to prioritize the training speed. I’m willing to accept a slight loss of precision in split points for the speed increase.”

→ Choose Histogram-based Algorithm

Step 3. Computing Gains

I’ll demonstrate how to compute gains and decide which feature to prioritize to make a prediction on long-term subscribers.

The common steps across the scenarios are:

  1. Select split points by feature,
  2. Compute entropy or Gini impurity, then compute the gains, and
  3. Choose the split point with the maximum gain.

Scenario 1. Apply Exact Greedy Algorithm

Feature 1: internet_usage_hrs_day

With float values, the number of potential split points can be large. In practice, algorithms usually consider midpoints between sorted, distinct values where the target class changes:

  • Split 1: internet_usage_hrs_day <= 2.95 (midpoint of 2.8 and 3.1)
  • Split 2: internet_usage_hrs_day <= 8.05 (midpoint of 7.7 and 8.4)
Fig. Applying Exact Greedy Algorithm to the feature (Created by Kuriko IWAI)

I’ll evaluate these splits by computing entropy and gini, and then compute the gains.

Split 1.

Left Child

  • 2 samples: 2 ‘No’, 0 ‘Yes’
  • p_No = 2/2 = 1
  • p_Yes = 0/2 = 0
  • Entropy: E(D) = 0 (perfectly pure)
  • Gini: G(D) = 0 (perfectly pure)

Right Child

  • 8 samples: 2 ‘No’, 6 ‘Yes’
  • p_No = 2/8 = 0.25
  • p_Yes = 6/8 = 0.75
  • Entropy: E(D) = −(0.25 log₂(0.25) + 0.75 log₂(0.75)) ≈ −(−0.5 − 0.311) ≈ 0.811
  • Gini: G(D) = 1 − (0.25² + 0.75²) = 1 − 0.625 = 0.375

→ Gains

  • Information Gain (Entropy): IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322
  • Gini Gain (Gini Impurity): ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.30 = 0.180

The following tree illustrates the computation of information gain (left) and Gini gain (right) from the root node to the first level:

Fig. Tree map of the subscription scenario based on exact greedy algorithm on Entropy (left) and Gini (right) impurity measures (Created by Kuriko IWAI)

Split 2.

Left Child

  • 7 samples: 2 ‘No’, 5 ‘Yes’
  • p_No = 2/7
  • p_Yes = 5/7
  • Entropy: E(D) = −((2/7) log₂(2/7) + (5/7) log₂(5/7)) ≈ −(−0.517 − 0.346) ≈ 0.863
  • Gini: G(D) = 1 − ((2/7)² + (5/7)²) = 1 − 29/49 ≈ 0.408

Right Child

  • 3 samples: 2 ‘No’, 1 ‘Yes’
  • p_No = 2/3
  • p_Yes = 1/3
  • Entropy: E(D) = −((2/3) log₂(2/3) + (1/3) log₂(1/3)) ≈ −(−0.390 − 0.528) ≈ 0.918
  • Gini: G(D) = 1 − ((2/3)² + (1/3)²) = 1 − 5/9 ≈ 0.444

→ Gains

  • Information Gain: IG = 0.971 − (7/10 × 0.863 + 3/10 × 0.918) = 0.971 − 0.879 = 0.092
  • Gini Gain: ΔG = 0.48 − (7/10 × 0.408 + 3/10 × 0.444) = 0.48 − 0.419 = 0.061

Feature 2: device_preference

For categorical features, each category can be assigned to a separate node for potential splitting:

Split 1: Node ‘Mobile’

  • 3 Samples: 2 ‘No’, 1 ‘Yes’
  • p_Yes = 1/3
  • p_No = 2/3
  • Entropy ≈ 0.918
  • Gini ≈ 0.444

Split 2: Node ‘Desktop’

  • 4 Samples: 3 ‘No’, 1 ‘Yes’
  • p_Yes = 1/4
  • p_No = 3/4
  • Entropy ≈ 0.811
  • Gini ≈ 0.375

Split 3: Node ‘Tablet’

  • 2 Samples: 0 ‘No’, 2 ‘Yes’
  • p_Yes = 2/2 = 1
  • p_No = 0/2 = 0
  • Entropy = 0 (purity)
  • Gini = 0 (purity)

→ Gains:

  • Information Gain: IG = 0.971 − (3/10 × 0.918 + 4/10 × 0.811 + 2/10 × 0) = 0.971 − 0.599 = 0.372
  • Gini Gain: ΔG = 0.48 − (3/10 × 0.444 + 4/10 × 0.375 + 2/10 × 0) = 0.48 − 0.283 = 0.197

Decision Making

Whether we use Entropy or Gini impurity, splitting by the categories of device_preference turns out to be the optimal split, as it outperforms the gains of internet_usage_hrs_day.

This result indicates that in this dataset, device preference is a stronger predictor than daily internet usage hours for whether a customer becomes a long-term subscriber.

Impurity Measure Feature Split Point Gain
Entropy internet_usage_hrs_day ≤2.95 0.322
Entropy internet_usage_hrs_day ≤8.05 0.092
Entropy device_preference Mobile, Desktop, Tablet 0.372 ← WINNER
Gini internet_usage_hrs_day ≤2.95 0.180
Gini internet_usage_hrs_day ≤8.05 0.061
Gini device_preference Mobile, Desktop, Tablet 0.197 ← WINNER

(Note: Gini and Entropy gains shouldn’t be compared directly because their mathematical formulations differ, as shown in the previous section.)
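
As a quick sanity check, the winning device_preference gains can be reproduced from the per-category class counts listed above (a minimal sketch; the helper names are illustrative):

```python
import numpy as np

def node_entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]                                 # ignore empty classes
    return -np.sum(p * np.log2(p))

def node_gini(counts):
    p = np.array(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

parent = [6, 4]                                  # root node: 6 'Yes', 4 'No'
children = [[1, 2], [1, 3], [2, 0]]              # Mobile, Desktop, Tablet: ['Yes', 'No'] counts
n = sum(sum(c) for c in children)

ig = node_entropy(parent) - sum(sum(c) / n * node_entropy(c) for c in children)
gg = node_gini(parent) - sum(sum(c) / n * node_gini(c) for c in children)
print(round(ig, 3), round(gg, 3))                # ≈ 0.371 and 0.197 (matches the table up to rounding)
```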


Scenario 2. Apply Approximate Greedy Algorithm

Feature 1: internet_usage_hrs_day

For numerical features, the approximate greedy algorithm evaluates only a reduced set of candidate split points (typically proposed from quantiles of the feature distribution) to save computation.

For this illustration, the manager decides to evaluate candidate splits at 5.0 and 9.0.

  • Split 1: internet_usage_hrs_day <= 5.0
  • Split 2: internet_usage_hrs_day <= 9.0
Fig. Applying Approximate Greedy Algorithm to the feature (Created by Kuriko IWAI)

Split 1.

Left Child

  • 5 samples: 2 ‘No’, 3 ‘Yes’
  • p_Yes = 3/5 = 0.6
  • p_No = 2/5 = 0.4 (happens to be the same distribution as the root node)
  • Entropy ≈ 0.971
  • Gini = 0.48

Right Child

  • 5 samples: 2 ‘No’, 3 ‘Yes’
  • p_Yes = 3/5 = 0.6
  • p_No = 2/5 = 0.4
  • Entropy ≈ 0.971
  • Gini = 0.48

→ Gains:

  • Information Gain: IG = 0.971 − (5/10 × 0.971 + 5/10 × 0.971) = 0.971 − 0.971 = 0.0
  • Gini Gain: ΔG = 0.48 − (5/10 × 0.48 + 5/10 × 0.48) = 0.48 − 0.48 = 0.0

Split 2

Left Child

  • 9 samples: 4 ‘No’, 5 ‘Yes’
  • p_Yes = 5/9
  • p_No = 4/9
  • Entropy ≈0.991
  • Gini ≈0.494

Right Child

  • 1 sample: 0 ‘No’, 1 ‘Yes’
  • p_Yes = 1
  • p_No = 0
  • Entropy = 0
  • Gini = 0

→ Gains:

  • Information Gain: IG = 0.971 − (9/10 × 0.991 + 1/10 × 0) = 0.971 − 0.892 = 0.079
  • Gini Gain: ΔG = 0.48 − (9/10 × 0.494 + 1/10 × 0) = 0.48 − 0.445 = 0.035

Feature 2: device_preference

As with Exact Greedy, the algorithm typically still evaluates all categories for categorical features, so I’ll reuse the results from Scenario 1.

  • Information Gain: IG = 0.372
  • Gini Gain: ΔG = 0.197

Decision Making

As in Scenario 1, with either impurity measure, splitting by the categories of device_preference performed best.

Impurity Measure Feature Split Point Gain
Entropy internet_usage_hrs_day ≤5.0 0.0
Entropy internet_usage_hrs_day ≤9.0 0.079
Entropy device_preference Mobile, Desktop, Tablet 0.372 ← Best
Gini internet_usage_hrs_day ≤5.0 0.0
Gini internet_usage_hrs_day ≤9.0 0.035
Gini device_preference Mobile, Desktop, Tablet 0.197 ← Best

 

Scenario 3. Apply Histogram-based Algorithm

Feature 1: internet_usage_hrs_day

The histogram-based greedy algorithm first bins numerical features into a fixed number of bins and then only considers splits at the bin boundaries.

The manager decides to create 3 bins: Hrs <= 3.0, 3.0 < Hrs <= 7.0, and Hrs > 7.0, so they only consider splits at Hrs <= 3.0 and Hrs <= 7.0.
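
A quick sketch of this manual binning with pandas.cut (the bin edges are the manager’s choice from the scenario, not derived from the data):

```python
import pandas as pd

usage_hours = pd.Series([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])

# Three fixed bins: Hrs <= 3.0, 3.0 < Hrs <= 7.0, Hrs > 7.0
bins = pd.cut(usage_hours,
              bins=[-float("inf"), 3.0, 7.0, float("inf")],
              labels=["<=3.0", "3.0-7.0", ">7.0"])
print(bins.value_counts().sort_index())   # 2, 4, and 4 samples per bin
```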

Split 1. Hrs <= 3.0

Left Child

  • 2 samples: 2 ‘No’, 0 ‘Yes’
  • p_Yes = 0
  • p_No = 1
  • Entropy = 0
  • Gini = 0

Right Child

  • 8 samples: 2 ‘No’, 6 ‘Yes’
  • p_Yes = 6/8
  • p_No = 2/8
  • Entropy ≈ 0.811
  • Gini ≈ 0.375

→ Gains:

  • Information Gain: IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322
  • Gini Gain: ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.3 = 0.180

Split 2. Hrs <= 7.0

Left Child

  • 7 samples: 2 ‘No’, 5 ‘Yes’
  • p_Yes = 5/7
  • p_No = 2/7
  • Entropy ≈ 0.863
  • Gini ≈ 0.408

Right Child

  • 3 samples: 2 ‘No’, 1 ‘Yes’
  • p_Yes = 1/3
  • p_No = 2/3
  • Entropy ≈ 0.918
  • Gini ≈ 0.444

→ Gains:

  • Information Gain: IG = 0.971 − (7/10 × 0.863 + 3/10 × 0.918) = 0.971 − 0.879 = 0.092
  • Gini Gain: ΔG = 0.48 − (7/10 × 0.408 + 3/10 × 0.444) = 0.48 − 0.419 = 0.061

Feature 2: device_preference

For categorical features, the histogram-based algorithm takes the same approach as the Exact and Approximate Greedy algorithms, so I’ll reuse the results from Scenario 1.

  • Information Gain: IG = 0.372
  • Gini Gain: ΔG = 0.197

Decision Making

As in the other scenarios, splitting by the categories of device_preference performed best.

Meanwhile, for the numerical feature internet_usage_hrs_day, the split at Hrs <= 3.0 resulted in an Information Gain of 0.322 and a Gini Gain of 0.180.

Interestingly, these gains are identical to the highest gains found by the Exact Greedy algorithm for internet_usage_hrs_day (at ≤2.95).

This demonstrates that if bin boundaries align closely with optimal split points, the Histogram-based method can achieve similar results with much higher efficiency.

Impurity Measure Feature Split Point Gain
Entropy internet_usage_hrs_day ≤3.0 0.322
Entropy internet_usage_hrs_day ≤7.0 0.092
Entropy device_preference Mobile, Desktop, Tablet 0.372 ← Best
Gini internet_usage_hrs_day ≤3.0 0.180
Gini internet_usage_hrs_day ≤7.0 0.061
Gini device_preference Mobile, Desktop, Tablet 0.197 ← Best

Summary of Findings

Across all three greedy algorithms, device_preference consistently yields the highest gain and thus represents the optimal split point for the initial node.

Dominance of Categorical Feature

  • In Exact Greedy, device_preference achieved an Information Gain of 0.372 and a Gini Gain of 0.197 – the highest gains across all evaluated splits.
  • In Approximate Greedy, device_preference maintained its Information Gain of 0.372 and Gini Gain of 0.197, again surpassing all tested splits.
  • In the Histogram-based Greedy approach, device_preference continued to exhibit gains of 0.372 (Entropy) and 0.197 (Gini), outperforming its numerical counterparts.

Variations in Numerical Feature Handling

  • Exact Greedy performed an exhaustive search, finding the highest gains for internet_usage_hrs_day at the ≤2.95 split (Entropy: 0.322, Gini: 0.180).
    This represents the maximum gain achievable for this feature under its specified split derivation method.
  • Approximate Greedy, by evaluating only a random subset of split points (e.g., ≤5.0 and ≤9.0), yielded significantly lower gains. This clearly demonstrates the trade-off, where a simplified search missed the higher gain possible for the numerical feature.
  • Histogram-based Greedy, by evaluating at predefined bin boundaries (e.g., ≤3.0 and ≤7.0), remarkably found a split at ≤3.0 that produced an Entropy Gain of 0.322 and Gini Gain of 0.180.
    These gains are identical to the maximum achieved by the Exact Greedy algorithm for this feature, indicating that the chosen bin boundary aligned perfectly with a highly effective split point in this particular dataset.

Conclusion

Decision trees are versatile and interpretable models that make predictions by recursively splitting the data on the most informative input features.

In the experiment, we observed that the choice of algorithm and impurity measure significantly influences a decision tree’s structure and predictive performance.

At the same time, decision trees are known to be prone to overfitting.

When trees are used in machine learning, models employ regularization techniques, such as limiting tree depth or pruning, to improve generalization and prevent overfitting.
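
For instance, a scikit-learn decision tree can be regularized by capping its depth, enforcing a minimum leaf size, or applying cost-complexity pruning (the hyperparameter values below are arbitrary illustrations, not tuned recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# A regularized tree: capping depth and leaf size curbs overfitting
clf = DecisionTreeClassifier(
    criterion="gini",       # impurity measure ("entropy" is also supported)
    max_depth=3,            # limit the number of levels in the tree
    min_samples_leaf=5,     # each leaf must contain at least 5 samples
    ccp_alpha=0.01,         # cost-complexity pruning strength
    random_state=42,
)
# clf.fit(X_train, y_train)   # assuming X_train / y_train are prepared elsewhere
```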

Author: Kuriko IWAI

Portfolio / Github

All images, unless otherwise noted, are by the author. The article utilizes synthetic datasets.
