C4.5 in detail and comparative analysis of decision tree algorithms

Previously, we talked about two decision tree algorithms:

1. Gini Index (while implementing CART – Classification)

2. ID3 – Iterative Dichotomiser 3 (while implementing CART – Regression)

There are a few more algorithms that you need to be aware of.

In this blog, we are going to learn about those algorithms and do a comparative study of all these algorithms.

In this post, you'll learn:

1. What the C4.5 algorithm is

2. What the C5.0 algorithm is

3. How these two algorithms differ from each other

4. Which algorithm is better and why: ID3, C4.5, or C5.0

If you would like to recap what Gini Index was about, visit: https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/ and scroll to “What is Gini Index?”

And for ID3, visit: https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/ and scroll to “What are Greedy Learners?”

Before diving straight into C4.5, we urge you to visit and read the blog on CART – Regression Tree. (https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/)

It will give you more clarity on topics like:

1. Greedy Learners

2. Entropy

3. Information Gain

4. Gain Ratio

5. SplitInfo

It is of utmost importance that these topics are clear in your mind before proceeding further.

C4.5 Algorithm

ID3 is one of the earliest and most influential decision tree algorithms. It was developed by Ross Quinlan in 1986. It creates a multi-way tree: at each node, it "greedily" picks the categorical feature that yields the largest "information gain". The tree is grown to its maximum size, after which a "pruning" step is usually applied to improve the tree's ability to generalize to unseen data.

C4.5 is the successor to the ID3 algorithm. It was also developed by Ross Quinlan. It would be safe to say that C4.5 is an evolution of ID3.

ID3 uses Information Gain as the splitting criterion, whereas C4.5 uses Gain Ratio. The main motivation behind creating C4.5 was to overcome the issues that ID3 had. We'll talk about these issues in a while.

Let’s understand how C4.5 works using the following dataset:

Dataset:

The data contains weather information: outlook, temperature, humidity, and wind. It is a small dataset of only 14 rows.

The column description is as follows:

| Data Code | Description |
|-----------|-------------|
| Outlook | Whether the day is Sunny, Overcast, or Rainy |
| Temperature | Values in Fahrenheit |
| Humidity | Humidity values |
| Wind | Weak or Strong |
| Decision | Did they go to play? (Yes or No) |
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

Process of applying C4.5

1. Calculate the global entropy using the formula:

Entropy = – ∑ p(i) * log2 p(i)

2. Then calculate the Gain Ratios (after calculating Gains, as we did in ID3) for all the attributes using the formula:

GainRatio(A) = Gain(A) / SplitInfo(A)

3. Compare the Gains and Gain Ratios for all the attributes and decide the split.

Calculating Global Entropy

There are 14 rows in our data. 9 of them lead to a "Yes" decision and 5 lead to a "No" decision.

Entropy = – ∑ p(i) * log2 p(i)

= – [p(Yes) * log2 p(Yes)] – [p(No) * log2 p(No)]

= – (9/14) * log2(9/14) – (5/14) * log2(5/14)

= 0.940
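If you'd like to verify the arithmetic as we go, here is a minimal Python sketch of the entropy formula (the `entropy` helper is our own naming, not a library function); the later snippets in this post reuse it:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Global entropy: 9 "Yes" and 5 "No" decisions out of 14 rows.
print(round(entropy([9, 5]), 3))  # 0.94
```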

Calculating Gain & Gain Ratios:

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = – ∑ (|Dj|/|D|) * log2(|Dj|/|D|)

No need to panic over the above formula: Dj is simply the number of cases taking a particular value of attribute A, and D is the total number of cases.
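One thing worth noticing: SplitInfo is just the entropy formula applied to the partition sizes |Dj|/|D|, so in code it can reuse the same helper (again, these function names are our own):

```python
def split_info(partition_sizes):
    """SplitInfo(A): entropy of the partition sizes |Dj|/|D|."""
    return entropy(partition_sizes)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain / split_info(partition_sizes)

# Outlook partitions the 14 rows into 5 (Sunny), 4 (Overcast), 5 (Rain):
print(round(split_info([5, 4, 5]), 3))  # 1.577
```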

I. Gain & Gain Ratio for Outlook Variable:

Outlook variable is nominal. It has 3 values: Sunny, Overcast, Rain.

Gain(Decision, Outlook) = Entropy(Decision) – ∑ [ p(Decision|Outlook) * Entropy(Decision|Outlook) ]

The above big formula is nothing but the formula for calculating gain. Let’s call this Equation 1.

The first part, i.e., Entropy(Decision), has already been calculated as 0.940.

The second part is the negative summation of the products of (i) the proportion of rows taking that Outlook value, and (ii) the entropy of the decisions within that Outlook value.

Let's calculate this second part, i.e., the entropy for each value of Outlook.

1. Entropy for Outlook = Sunny

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |

We have 3 No decisions and 2 Yes decisions.

Entropy(Decision|Outlook=Sunny)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(3/5).log2(3/5) – (2/5).log2(2/5)

= 0.441 + 0.528

= 0.970

2. Entropy for Outlook = Overcast

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

All decisions are Yes here.

Entropy(Decision|Outlook=Overcast)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(0/4)*log2(0/4) – (4/4)*log2(4/4)

[Here log2(0) is undefined, but we take 0 * log2(0) = 0, because if we consider x * log2(x), then as x tends to 0, x * log2(x) also tends to 0.]

= 0

3. Entropy for Outlook = Rain

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

We have 3 Yes and 2 No decisions.

Entropy(Decision|Outlook=Rain)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(2/5)*log2(2/5) – (3/5)*log2(3/5)

= 0.528 + 0.441

= 0.970

4. Gain for Outlook variable:

We are done calculating the entropies for the Outlook variable.

Putting these into Equation 1 above:

Gain(Decision, Outlook)

= 0.940 – (5/14)*(0.970) – (4/14)*(0) – (5/14)*(0.970)

= 0.247

5. SplitInfo for Outlook variable:

Sunny: 5 cases

Overcast: 4 cases

Rain: 5 cases

SplitInfo(Decision, Outlook)

= -(5/14)*log2(5/14) -(4/14)*log2(4/14) -(5/14)*log2(5/14)

= 1.577

6. Finally, Gain Ratio for Outlook variable:

GainRatio(Decision, Outlook)

= Gain(Decision, Outlook)/SplitInfo(Decision, Outlook)

= 0.247/1.577

= 0.156
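Before moving on, here is a short sketch that recomputes the Outlook numbers end to end. The dataset lists mirror the table above, and `gain_and_ratio` is our own helper (it reuses the `entropy` function from earlier); the snippets below reuse these lists too:

```python
from collections import Counter, defaultdict

# The 14-row weather dataset from the table above.
outlook  = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
            "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temp     = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
wind     = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
            "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
decision = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def gain_and_ratio(attribute, target):
    """Information gain and gain ratio of a nominal attribute."""
    total = len(target)
    global_h = entropy(list(Counter(target).values()))
    groups = defaultdict(list)
    for value, label in zip(attribute, target):
        groups[value].append(label)
    remainder = sum(len(g) / total * entropy(list(Counter(g).values()))
                    for g in groups.values())
    gain = global_h - remainder
    return gain, gain / entropy([len(g) for g in groups.values()])

print(gain_and_ratio(outlook, decision))  # ≈ (0.247, 0.156)
```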

More work needs to be done: this is the Gain Ratio for just one of the attributes. We have to calculate the Gain Ratio for the other variables too, so that we can compare them all at the end.

II. Gain & Gain Ratio for Wind Variable:

This is also a nominal variable. It has 2 values: Weak & Strong.

Gain(Decision, Wind) = Entropy(Decision) – ∑ [ p(Decision|Wind) * Entropy(Decision|Wind) ]

Let’s call this Equation 2.

1. Entropy for Wind = Weak

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

We have 6 Yes and 2 No decisions.

Entropy(Decision|Wind=Weak)                        

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= – (2/8) * log2(2/8) – (6/8) * log2(6/8)

= 0.811

2. Entropy for Wind = Strong

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 2 | Sunny | 80 | 90 | Strong | No |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

We have 3 Yes and 3 No decisions.

Entropy(Decision|Wind=Strong)

= – (3/6) * log2(3/6) – (3/6) * log2(3/6)

= 1

3. Gain for Wind variable:

Gain(Decision, Wind)

= 0.940 – (8/14)*(0.811) – (6/14)*(1)

= 0.940 – 0.463 – 0.429

= 0.048

4. SplitInfo for Wind variable:

Weak: 8 cases

Strong: 6 cases

SplitInfo(Decision, Wind)

= -(6/14)*log2(6/14) -(8/14)*log2(8/14)

= 0.524 + 0.461

= 0.985

5. Finally, Gain Ratio for Wind variable:

GainRatio(Decision, Wind)

= Gain(Decision, Wind)/SplitInfo(Decision, Wind)

= 0.048 / 0.985

= 0.049
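The `gain_and_ratio` helper from the Outlook section reproduces these numbers in one line (the unrounded gain is ≈ 0.048):

```python
print(gain_and_ratio(wind, decision))  # ≈ (0.048, 0.049)
```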

III. Gain & Gain Ratio for Humidity Variable:

This is where things get interesting, because Humidity is a continuous variable. How do we deal with it?

Step 1. Arrange the values in ascending order.

Step 2. Convert them to nominal values by performing a binary split on a threshold value.

[The threshold is chosen as the value at which the Gain for this variable is maximum.]

Step 3. The gain at this threshold value will be used for comparison of gains and gain ratios of all the variables.

1. Let's arrange the data in ascending order of Humidity:

| Day | Humidity | Decision |
|-----|----------|----------|
| 7 | 65 | Yes |
| 6 | 70 | No |
| 9 | 70 | Yes |
| 11 | 70 | Yes |
| 13 | 75 | Yes |
| 3 | 78 | Yes |
| 5 | 80 | Yes |
| 10 | 80 | Yes |
| 14 | 80 | No |
| 1 | 85 | No |
| 2 | 90 | No |
| 12 | 90 | Yes |
| 8 | 95 | No |
| 4 | 96 | Yes |

Now we need to calculate the gain (and gain ratio) at every distinct value of Humidity. The value that maximizes the gain becomes the threshold. For each candidate value, we split the dataset into 2 parts: (i) rows with values less than or equal to the current value, and (ii) rows with values greater than the current value.

2. Calculating Gains and Gain Ratios for all values:

2.a. For Humidity = 65

We have 1 Yes & 0 No decisions at <= 65 and 8 Yes & 5 No decisions at > 65

Entropy(Decision|Humidity<=65)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(0/1).log2(0/1) – (1/1).log2(1/1)

= 0

Entropy(Decision|Humidity>65)

= -(5/13).log2(5/13) – (8/13).log2(8/13)

= 0.530 + 0.431

= 0.961

Gain(Decision, Humidity <> 65)

= 0.940 – (1/14)*(0) – (13/14)*(0.961)

= 0.048

SplitInfo(Decision, Humidity <> 65)

= –(1/14)*log2(1/14) – (13/14)*log2(13/14)

= 0.371

GainRatio(Decision, Humidity <> 65)

= 0.048/0.371

= 0.129

2.b. For Humidity = 70

We have 3 Yes & 1 No decisions at <= 70 and 6 Yes & 4 No decisions at > 70

Entropy(Decision|Humidity<=70)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(1/4).log2(1/4) – (3/4).log2(3/4)

= 0.811

Entropy(Decision|Humidity>70)

= -(4/10).log2(4/10) – (6/10).log2(6/10)

= 0.971

Gain(Decision, Humidity <> 70)

= 0.940 – (4/14)*(0.811) – (10/14)*(0.971)

= 0.014

SplitInfo(Decision, Humidity <> 70)

= –(4/14)*log2(4/14) – (10/14)*log2(10/14)

= 0.863

GainRatio(Decision, Humidity <> 70)

= 0.014/0.863

= 0.016

Similarly, calculate the Gains and Gain Ratios for all other values of Humidity.

We found that the Gain was maximum for Humidity = 80.

[Note: Here is something interesting. You can use either the Gain or the Gain Ratio to choose the threshold, and the two can result in different decision trees. We are using the Gain.]

Gain(Decision, Humidity <> 80) = 0.101

GainRatio(Decision, Humidity <> 80) = 0.107
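Scanning every candidate threshold by hand is tedious, so here is the same scan as a short sketch (reusing the `entropy` helper and the `humidity`/`decision` lists from earlier; `best_binary_split` is our own name). Small differences in the last digit come from rounding:

```python
def best_binary_split(values, target):
    """Try every distinct value as a <=/> threshold and return the
    (threshold, gain) pair with the highest information gain."""
    total = len(target)
    global_h = entropy(list(Counter(target).values()))
    best = None
    for t in sorted(set(values))[:-1]:   # the max value gives an empty '>' side
        left  = [lab for v, lab in zip(values, target) if v <= t]
        right = [lab for v, lab in zip(values, target) if v > t]
        gain = (global_h
                - len(left) / total * entropy(list(Counter(left).values()))
                - len(right) / total * entropy(list(Counter(right).values())))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

print(best_binary_split(humidity, decision))  # (80, ≈0.10) – same threshold
```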

IV. Gain & Gain Ratio for Temp. Variable:

This is also a continuous variable. We will repeat the steps we followed for the Humidity variable.

1. Let's arrange the data in ascending order of Temp:

| Day | Temp. | Decision |
|-----|-------|----------|
| 7 | 64 | Yes |
| 6 | 65 | No |
| 5 | 68 | Yes |
| 9 | 69 | Yes |
| 4 | 70 | Yes |
| 14 | 71 | No |
| 8 | 72 | No |
| 12 | 72 | Yes |
| 10 | 75 | Yes |
| 11 | 75 | Yes |
| 2 | 80 | No |
| 13 | 81 | Yes |
| 3 | 83 | Yes |
| 1 | 85 | No |

2. Calculating Gains and Gain Ratios for all values:

2.a. For Temp = 64

We have 1 Yes & 0 No decisions at <= 64 and 8 Yes & 5 No decisions at > 64

Entropy(Decision|Temp<=64)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(0/1).log2(0/1) – (1/1).log2(1/1)

= 0

Entropy(Decision|Temp>64)

= -(5/13).log2(5/13) – (8/13).log2(8/13)

= 0.530 + 0.431

= 0.961

Gain(Decision, Temp <> 64)

= 0.940 – (1/14)*(0) – (13/14)*(0.961)

= 0.048

SplitInfo(Decision, Temp <> 64)

= –(1/14)*log2(1/14) – (13/14)*log2(13/14)

= 0.371

GainRatio(Decision, Temp <> 64)

= 0.048/0.371

= 0.129

2.b. For Temp = 65

We have 1 Yes & 1 No decisions at <= 65 and 8 Yes & 4 No decisions at > 65

Entropy(Decision|Temp<=65)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(1/2).log2(1/2) – (1/2).log2(1/2)

= 1

Entropy(Decision|Temp>65)

= -(4/12).log2(4/12) – (8/12).log2(8/12)

= 0.918

Gain(Decision, Temp <> 65)

= 0.940 – (2/14)*(1) – (12/14)*(0.918)

= 0.010

SplitInfo(Decision, Temp <> 65)

= –(2/14)*log2(2/14) – (12/14)*log2(12/14)

= 0.591

GainRatio(Decision, Temp <> 65)

= 0.010/0.591

= 0.017

Similarly, calculate the Gains and Gain Ratios for all other values of Temp.

We found that the Gain was maximum for Temp = 83.

Gain(Decision, Temp <> 83) = 0.113

GainRatio(Decision, Temp <> 83) = 0.305
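The same `best_binary_split` scan reproduces this result:

```python
print(best_binary_split(temp, decision))  # (83, ≈0.113)
```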

Comparison of Gains and Gain Ratios

| Attribute | Gain | Gain Ratio |
|-----------|------|------------|
| Wind | 0.048 | 0.049 |
| Outlook | 0.247 | 0.156 |
| Humidity <> 80 | 0.101 | 0.107 |
| Temp <> 83 | 0.113 | 0.305 |

If we use Gain, Outlook will be the root node (because it has the highest Gain value).

Similarly, if we use Gain Ratio, Temp will be the root node.

We will proceed using the Gain.
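In code, root selection is just an argmax over the gains computed by our earlier helpers:

```python
candidates = {
    "Outlook": gain_and_ratio(outlook, decision)[0],
    "Wind": gain_and_ratio(wind, decision)[0],
    "Humidity <> 80": best_binary_split(humidity, decision)[1],
    "Temp <> 83": best_binary_split(temp, decision)[1],
}
print(max(candidates, key=candidates.get))  # Outlook
```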

Outlook = Sunny

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |

If humidity > 80, decision is ‘No’

If humidity <= 80, decision is ‘Yes’

Outlook = Overcast

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

All decisions are ‘Yes’

Outlook = Rain

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

If Wind = Weak, the decision is 'Yes'

If Wind = Strong, the decision is 'No'

So, this is our final decision tree using the C4.5 algorithm.
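The finished tree is small enough to write down directly. Here is one way to encode it, as a sketch (the nested-dict structure and the `predict` helper are our own illustration, not C4.5's output format):

```python
# Root = Outlook; a Humidity threshold under Sunny, a Wind test under Rain.
tree = {
    "Sunny":    ("Humidity", 80, "Yes", "No"),    # <= 80 -> Yes, > 80 -> No
    "Overcast": "Yes",                            # always play
    "Rain":     {"Weak": "Yes", "Strong": "No"},  # test on Wind
}

def predict(outlook_value, humidity_value, wind_value):
    node = tree[outlook_value]
    if isinstance(node, str):            # leaf (Overcast)
        return node
    if isinstance(node, tuple):          # numeric test (Sunny)
        _, threshold, below, above = node
        return below if humidity_value <= threshold else above
    return node[wind_value]              # nominal test (Rain)

print(predict("Sunny", 85, "Weak"))  # No  – matches Day 1
print(predict("Rain", 96, "Weak"))   # Yes – matches Day 4
```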

Advantages of C4.5 over ID3

C4.5 is an evolution of ID3 by the same author (Quinlan). He made sure that the bottlenecks present in ID3 were not repeated in C4.5. The following are the improvements he made in C4.5:

1. It can handle both continuous and discrete variables.

2. It can handle missing values by marking them as '?'; such values are ignored in gain and entropy calculations. (A sketch of this idea follows this list.)

3. It prunes the tree and thereby avoids 'overfitting'.
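Here is the promised sketch of the missing-value idea. Note the hedge: this is a simplification (real C4.5 additionally distributes unknown cases fractionally down each branch); it reuses `gain_and_ratio` from earlier:

```python
def gain_with_missing(attribute, target, missing="?"):
    """Simplified C4.5-style handling of '?' values: compute the gain
    on rows where the attribute is known, then scale it by the
    fraction of known rows."""
    known = [(a, t) for a, t in zip(attribute, target) if a != missing]
    gain, _ = gain_and_ratio([a for a, _ in known], [t for _, t in known])
    return gain * len(known) / len(target)

# E.g., if two Wind values were '?', the gain would be computed on the
# 12 known rows and then multiplied by 12/14.
```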

C5.0 Algorithm

The C5.0 algorithm is an evolved version of the C4.5 algorithm, also created by Ross Quinlan. It has many advantages over C4.5:

1. It is much faster than C4.5 in computation.

2. It supports data types that the earlier algorithms do not, e.g., dates and timestamps.

3. It is more memory-efficient.

4. It improves the trees by boosting, which makes them more accurate.

C5.0 works the same way as C4.5; the underlying concepts are the same. It also uses Information Gain and Gain Ratio to build decision trees, just with the modifications we have mentioned above!
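If you want to play with these ideas in code: C5.0 itself is available through Quinlan's See5/C5.0 tool and the R package C50. scikit-learn implements CART rather than C4.5/C5.0, but as a rough analogue of C5.0's boosting step, you can boost shallow CART trees with AdaBoost (the integer encoding below is our own shortcut; it treats Outlook as ordinal, which a real C5.0 run would not):

```python
from sklearn.ensemble import AdaBoostClassifier

# Quick numeric encoding of the nominal columns from our toy dataset.
outlook_code = {"Sunny": 0, "Overcast": 1, "Rain": 2}
wind_code = {"Weak": 0, "Strong": 1}
X = [[outlook_code[o], t, h, wind_code[w]]
     for o, t, h, w in zip(outlook, temp, humidity, wind)]

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, decision)
print(boosted.score(X, decision))  # training accuracy on the 14 toy rows
```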

For the coding related to CART, please visit the previous blogs, the links to which are as follows:

CART – Regression Trees – https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/

CART – Classification – https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/

Conclusion

So, the Gini Index is used in CART, mainly for classification purposes. ID3 was the first algorithm in its family and was superseded by its successor, C4.5. C4.5 was a major improvement over ID3, and these are among the most commonly used algorithms in the industry. C5.0 is yet another improvement in this family, bringing some further modifications to C4.5.

You should try making a C4.5 decision tree using Gain Ratio instead of Gain. Try it and let us know in the comments if you face any issue.
