C4.5 in detail and comparative analysis of decision tree algorithms

Previously, we talked about two decision tree algorithms:

1. Gini Index (while implementing CART – Classification)

2. ID3 – Iterative Dichotomiser 3 (while implementing CART – Regression)

There are a few more algorithms that you need to be aware of.

In this blog, we are going to learn about those algorithms and do a comparative study of all these algorithms.

In this post, you'll learn:

1. What the C4.5 algorithm is

2. What the C5.0 algorithm is

3. How these two algorithms differ from each other

4. Which algorithm is better and why: ID3, C4.5, or C5.0

If you would like to recap what Gini Index was about, visit: https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/ and scroll to “What is Gini Index?”

And for ID3, visit: https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/ and scroll to “What are Greedy Learners?”

Before diving straight into C4.5, we urge you to visit and read the blog on CART – Regression Tree. (https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/)

It will give you more clarity on topics like:

1. Greedy Learners

2. Entropy

3. Information Gain

4. Gain Ratio

5. SplitInfo

It is of utmost importance that these topics are clear in your mind before proceeding further.

C4.5 Algorithm

ID3 is one of the earliest and most influential decision tree algorithms. It was developed by Ross Quinlan in 1986. It creates a multi-way tree: at each node, it "greedily" picks the categorical feature that yields the largest "information gain". The tree is grown to its maximum size, after which a "pruning" step is usually applied to improve the tree's ability to generalize to unseen data.

C4.5 is the successor to the ID3 algorithm. It was also developed by Ross Quinlan. It would be safe to say that C4.5 is an evolution of ID3.

ID3 uses Information Gain as the splitting criterion, whereas C4.5 uses Gain Ratio. The main motivation behind creating C4.5 was to overcome the issues that ID3 had. We'll talk about these issues in a while.

Let’s understand how C4.5 works using the following dataset:

Dataset:

The data contains weather information: outlook, temperature, humidity, and wind. It is a small dataset of only 14 rows.

The column description is as follows:

| Data Code | Description |
|-----------|-------------|
| Outlook | Whether the day is Sunny, Overcast, or Rainy |
| Temperature | Values in Fahrenheit |
| Humidity | Humidity values |
| Wind | Weak or Strong |
| Decision | Did they go to play? (Yes or No) |
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

Process of applying C4.5

1. Calculate the global entropy using the formula:

Entropy = – ∑ p(i) * log2 p(i)

2. Then calculate the Gain Ratios (after calculating Gains, as we did in ID3) for all the attributes using the formula:

GainRatio(A) = Gain(A) / SplitInfo(A)

3. Compare the Gains and Gain Ratios for all the attributes and decide the split.

Calculating Global Entropy

There are 14 rows in our data. 9 of them lead to a "Yes" decision and 5 lead to a "No" decision.

Entropy = – ∑ p(i) * log2 p(i)

= – [p(Yes) * log2 p(Yes)] – [p(No) * log2 p(No)]

= – (9/14) * log2(9/14) – (5/14) * log2(5/14)

= 0.940
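If you'd like to verify the arithmetic as we go, here is a minimal Python sketch of the entropy formula (the `entropy` helper is our own naming, not a library function); the later snippets in this post reuse it:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Global entropy: 9 "Yes" and 5 "No" decisions out of 14 rows.
print(round(entropy([9, 5]), 3))  # 0.94
```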

Calculating Gain & Gain Ratios:

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = – ∑ (|Dj|/|D|) * log2(|Dj|/|D|)

No need to panic over the above formula: Dj is simply the number of cases taking a particular value of attribute A, and D is the total number of cases.
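One thing worth noticing: SplitInfo is just the entropy formula applied to the partition sizes |Dj|/|D|, so in code it can reuse the same helper (again, these function names are our own):

```python
def split_info(partition_sizes):
    """SplitInfo(A): entropy of the partition sizes |Dj|/|D|."""
    return entropy(partition_sizes)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain / split_info(partition_sizes)

# Outlook partitions the 14 rows into 5 (Sunny), 4 (Overcast), 5 (Rain):
print(round(split_info([5, 4, 5]), 3))  # 1.577
```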

I. Gain & Gain Ratio for Outlook Variable:

Outlook variable is nominal. It has 3 values: Sunny, Overcast, Rain.

Gain(Decision, Outlook) = Entropy(Decision) – ∑ [ p(Decision|Outlook) * Entropy(Decision|Outlook) ]

The above big formula is nothing but the formula for calculating gain. Let’s call this Equation 1.

The first part, i.e., Entropy(Decision), has already been calculated as 0.940.

The second part is the negative summation of the products of (i) the proportion of rows taking that Outlook value, and (ii) the entropy of the decisions within that Outlook value.

Let's calculate this second part, i.e., the entropy for each value of Outlook.

1. Entropy for Outlook = Sunny

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |

We have 3 No decisions and 2 Yes decisions.

Entropy(Decision|Outlook=Sunny)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(3/5).log2(3/5) – (2/5).log2(2/5)

= 0.441 + 0.528

= 0.970

2. Entropy for Outlook = Overcast

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

All decisions are Yes here.

Entropy(Decision|Outlook=Overcast)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(0/4)*log2(0/4) – (4/4)*log2(4/4)

[Here log2(0) is undefined, but we take 0 * log2(0) = 0, because if we consider x * log2(x), then as x tends to 0, x * log2(x) also tends to 0.]

= 0

3. Entropy for Outlook = Rain

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

We have 3 Yes and 2 No decisions.

Entropy(Decision|Outlook=Rain)

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= -(2/5)*log2(2/5) – (3/5)*log2(3/5)

= 0.528 + 0.441

= 0.970

4. Gain for Outlook variable:

We are done calculating the entropies for the Outlook variable.

Putting these into Equation 1 above:

Gain(Decision, Outlook)

= 0.940 – (5/14)*(0.970) – (4/14)*(0) – (5/14)*(0.970)

= 0.247

5. SplitInfo for Outlook variable:

Sunny: 5 cases

Overcast: 4 cases

Rain: 5 cases

SplitInfo(Decision, Outlook)

= -(5/14)*log2(5/14) -(4/14)*log2(4/14) -(5/14)*log2(5/14)

= 1.577

6. Finally, Gain Ratio for Outlook variable:

GainRatio(Decision, Outlook)

= Gain(Decision, Outlook)/SplitInfo(Decision, Outlook)

= 0.247/1.577

= 0.156
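Before moving on, here is a short sketch that recomputes the Outlook numbers end to end. The dataset lists mirror the table above, and `gain_and_ratio` is our own helper (it reuses the `entropy` function from earlier); the snippets below reuse these lists too:

```python
from collections import Counter, defaultdict

# The 14-row weather dataset from the table above.
outlook  = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
            "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temp     = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
wind     = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
            "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
decision = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def gain_and_ratio(attribute, target):
    """Information gain and gain ratio of a nominal attribute."""
    total = len(target)
    global_h = entropy(list(Counter(target).values()))
    groups = defaultdict(list)
    for value, label in zip(attribute, target):
        groups[value].append(label)
    remainder = sum(len(g) / total * entropy(list(Counter(g).values()))
                    for g in groups.values())
    gain = global_h - remainder
    return gain, gain / entropy([len(g) for g in groups.values()])

print(gain_and_ratio(outlook, decision))  # ≈ (0.247, 0.156)
```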

More work needs to be done: this is the Gain Ratio for just one of the attributes. We have to calculate the Gain Ratio for the other variables too, so that we can compare them all at the end.

II. Gain & Gain Ratio for Wind Variable:

This is also a nominal variable. It has 2 values: Weak & Strong.

Gain(Decision, Wind) = Entropy(Decision) – ∑ [ p(Decision|Wind) * Entropy(Decision|Wind) ]

Let’s call this Equation 2.

1. Entropy for Wind = Weak

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

We have 6 Yes and 2 No decisions.

Entropy(Decision|Wind=Weak)                        

= – p(No) * log2p(No) – p(Yes) * log2p(Yes)

= – (2/8) * log2(2/8) – (6/8) * log2(6/8)

= 0.811

2. Entropy for Wind = Strong

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 2 | Sunny | 80 | 90 | Strong | No |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

We have 3 Yes and 3 No decisions.

Entropy(Decision|Wind=Strong)

= – (3/6) * log2(3/6) – (3/6) * log2(3/6)

= 1

3. Gain for Wind variable:

Gain(Decision, Wind)

= 0.940 – (8/14)*(0.811) – (6/14)*(1)

= 0.940 – 0.463 – 0.429

= 0.048

4. SplitInfo for Wind variable:

Weak: 8 cases

Strong: 6 cases

SplitInfo(Decision, Wind)

= -(6/14)*log2(6/14) -(8/14)*log2(8/14)

= 0.524 + 0.461

= 0.985

5. Finally, Gain Ratio for Wind variable:

GainRatio(Decision, Wind)

= Gain(Decision, Wind)/SplitInfo(Decision, Wind)

= 0.048 / 0.985

= 0.049
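The `gain_and_ratio` helper from the Outlook section reproduces these numbers in one line (the unrounded gain is ≈ 0.048):

```python
print(gain_and_ratio(wind, decision))  # ≈ (0.048, 0.049)
```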

III. Gain & Gain Ratio for Humidity Variable:

This is where things get interesting, because Humidity is a continuous variable. How do we deal with it?

Step 1. Arrange the values in ascending order.

Step 2. Convert them to nominal values by performing a binary split on a threshold value.

[The threshold is chosen as the value at which the Gain for this variable is maximum.]

Step 3. The gain at this threshold value will be used for comparison of gains and gain ratios of all the variables.

1. Let's arrange the data in ascending order of Humidity:

| Day | Humidity | Decision |
|-----|----------|----------|
| 7 | 65 | Yes |
| 6 | 70 | No |
| 9 | 70 | Yes |
| 11 | 70 | Yes |
| 13 | 75 | Yes |
| 3 | 78 | Yes |
| 5 | 80 | Yes |
| 10 | 80 | Yes |
| 14 | 80 | No |
| 1 | 85 | No |
| 2 | 90 | No |
| 12 | 90 | Yes |
| 8 | 95 | No |
| 4 | 96 | Yes |

Now we need to calculate the gain (and gain ratio) at every distinct value of Humidity. The value that maximizes the gain becomes the threshold. For each candidate value, we split the dataset into 2 parts: (i) rows with values less than or equal to the current value, and (ii) rows with values greater than the current value.

2. Calculating Gains and Gain Ratios for all values:

2.a. For Humidity = 65

We have 1 Yes & 0 No decisions at <= 65 and 8 Yes & 5 No decisions at > 65

Entropy(Decision|Humidity<=65)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(0/1).log2(0/1) – (1/1).log2(1/1)

= 0

Entropy(Decision|Humidity>65)

= -(5/13).log2(5/13) – (8/13).log2(8/13)

= 0.530 + 0.431

= 0.961

Gain(Decision, Humidity <> 65)

= 0.940 – (1/14)*(0) – (13/14)*(0.961)

= 0.048

SplitInfo(Decision, Humidity <> 65)

= –(1/14)*log2(1/14) – (13/14)*log2(13/14)

= 0.371

GainRatio(Decision, Humidity <> 65)

= 0.048/0.371

= 0.129

2.b. For Humidity = 70

We have 3 Yes & 1 No decisions at <= 70 and 6 Yes & 4 No decisions at > 70

Entropy(Decision|Humidity<=70)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(1/4).log2(1/4) – (3/4).log2(3/4)

= 0.811

Entropy(Decision|Humidity>70)

= -(4/10).log2(4/10) – (6/10).log2(6/10)

= 0.971

Gain(Decision, Humidity <> 70)

= 0.940 – (4/14)*(0.811) – (10/14)*(0.971)

= 0.014

SplitInfo(Decision, Humidity <> 70)

= –(4/14)*log2(4/14) – (10/14)*log2(10/14)

= 0.863

GainRatio(Decision, Humidity <> 70)

= 0.014/0.863

= 0.016

Similarly, calculate the Gains and Gain Ratios for all other values of Humidity.

We found that the Gain was maximum for Humidity = 80.

[Note: Here is something interesting. You can use either the Gain or the Gain Ratio to choose the threshold, and the two can result in different decision trees. We are using the Gain.]

Gain(Decision, Humidity <> 80) = 0.101

GainRatio(Decision, Humidity <> 80) = 0.107
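Scanning every candidate threshold by hand is tedious, so here is the same scan as a short sketch (reusing the `entropy` helper and the `humidity`/`decision` lists from earlier; `best_binary_split` is our own name). Small differences in the last digit come from rounding:

```python
def best_binary_split(values, target):
    """Try every distinct value as a <=/> threshold and return the
    (threshold, gain) pair with the highest information gain."""
    total = len(target)
    global_h = entropy(list(Counter(target).values()))
    best = None
    for t in sorted(set(values))[:-1]:   # the max value gives an empty '>' side
        left  = [lab for v, lab in zip(values, target) if v <= t]
        right = [lab for v, lab in zip(values, target) if v > t]
        gain = (global_h
                - len(left) / total * entropy(list(Counter(left).values()))
                - len(right) / total * entropy(list(Counter(right).values())))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

print(best_binary_split(humidity, decision))  # (80, ≈0.10) – same threshold
```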

IV. Gain & Gain Ratio for Temp. Variable:

This is also a continuous variable. We will repeat the steps we followed for the Humidity variable.

1. Let's arrange the data in ascending order of Temp:

| Day | Temp. | Decision |
|-----|-------|----------|
| 7 | 64 | Yes |
| 6 | 65 | No |
| 5 | 68 | Yes |
| 9 | 69 | Yes |
| 4 | 70 | Yes |
| 14 | 71 | No |
| 8 | 72 | No |
| 12 | 72 | Yes |
| 10 | 75 | Yes |
| 11 | 75 | Yes |
| 2 | 80 | No |
| 13 | 81 | Yes |
| 3 | 83 | Yes |
| 1 | 85 | No |

2. Calculating Gains and Gain Ratios for all values:

2.a. For Temp = 64

We have 1 Yes & 0 No decisions at <= 64 and 8 Yes & 5 No decisions at > 64

Entropy(Decision|Temp<=64)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(0/1).log2(0/1) – (1/1).log2(1/1)

= 0

Entropy(Decision|Temp>64)

= -(5/13).log2(5/13) – (8/13).log2(8/13)

= 0.530 + 0.431

= 0.961

Gain(Decision, Temp <> 64)

= 0.940 – (1/14)*(0) – (13/14)*(0.961)

= 0.048

SplitInfo(Decision, Temp <> 64)

= –(1/14)*log2(1/14) – (13/14)*log2(13/14)

= 0.371

GainRatio(Decision, Temp <> 64)

= 0.048/0.371

= 0.129

2.b. For Temp = 65

We have 1 Yes & 1 No decisions at <= 65 and 8 Yes & 4 No decisions at > 65

Entropy(Decision|Temp<=65)

= – p(No) . log2p(No) – p(Yes) . log2p(Yes)

= -(1/2).log2(1/2) – (1/2).log2(1/2)

= 1

Entropy(Decision|Temp>65)

= -(4/12).log2(4/12) – (8/12).log2(8/12)

= 0.918

Gain(Decision, Temp <> 65)

= 0.940 – (2/14)*(1) – (12/14)*(0.918)

= 0.010

SplitInfo(Decision, Temp <> 65)

= –(2/14)*log2(2/14) – (12/14)*log2(12/14)

= 0.591

GainRatio(Decision, Temp <> 65)

= 0.010/0.591

= 0.017

Similarly, calculate the Gains and Gain Ratios for all other values of Temp.

We found that the Gain was maximum for Temp = 83.

Gain(Decision, Temp <> 83) = 0.113

GainRatio(Decision, Temp <> 83) = 0.305
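The same `best_binary_split` scan reproduces this result:

```python
print(best_binary_split(temp, decision))  # (83, ≈0.113)
```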

Comparison of Gains and Gain Ratios

| Attribute | Gain | Gain Ratio |
|-----------|------|------------|
| Wind | 0.048 | 0.049 |
| Outlook | 0.247 | 0.156 |
| Humidity <> 80 | 0.101 | 0.107 |
| Temp <> 83 | 0.113 | 0.305 |

If we use Gain, Outlook will be the root node (because it has the highest Gain value).

Similarly, if we use Gain Ratio, Temp will be the root node.

We will proceed using the Gain.
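In code, root selection is just an argmax over the gains computed by our earlier helpers:

```python
candidates = {
    "Outlook": gain_and_ratio(outlook, decision)[0],
    "Wind": gain_and_ratio(wind, decision)[0],
    "Humidity <> 80": best_binary_split(humidity, decision)[1],
    "Temp <> 83": best_binary_split(temp, decision)[1],
}
print(max(candidates, key=candidates.get))  # Outlook
```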

Outlook = Sunny

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |

If humidity > 80, decision is ‘No’

If humidity <= 80, decision is ‘Yes’

Outlook = Overcast

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |

All decisions are ‘Yes’

Outlook = Rain

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

If Wind = Weak, the decision is 'Yes'

If Wind = Strong, the decision is 'No'

So, this is our final decision tree using the C4.5 algorithm.
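The finished tree is small enough to write down directly. Here is one way to encode it, as a sketch (the nested-dict structure and the `predict` helper are our own illustration, not C4.5's output format):

```python
# Root = Outlook; a Humidity threshold under Sunny, a Wind test under Rain.
tree = {
    "Sunny":    ("Humidity", 80, "Yes", "No"),    # <= 80 -> Yes, > 80 -> No
    "Overcast": "Yes",                            # always play
    "Rain":     {"Weak": "Yes", "Strong": "No"},  # test on Wind
}

def predict(outlook_value, humidity_value, wind_value):
    node = tree[outlook_value]
    if isinstance(node, str):            # leaf (Overcast)
        return node
    if isinstance(node, tuple):          # numeric test (Sunny)
        _, threshold, below, above = node
        return below if humidity_value <= threshold else above
    return node[wind_value]              # nominal test (Rain)

print(predict("Sunny", 85, "Weak"))  # No  – matches Day 1
print(predict("Rain", 96, "Weak"))   # Yes – matches Day 4
```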

Advantages of C4.5 over ID3

C4.5 is an evolution of ID3 by the same author (Quinlan). He made sure that the bottlenecks present in ID3 were not repeated in C4.5. The following are the improvements he made in C4.5:

1. It can handle both continuous and discrete variables.

2. It can handle missing values by marking them as '?'; such values are ignored in gain and entropy calculations. (A sketch of this idea follows this list.)

3. It prunes the tree and thereby avoids 'overfitting'.
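Here is the promised sketch of the missing-value idea. Note the hedge: this is a simplification (real C4.5 additionally distributes unknown cases fractionally down each branch); it reuses `gain_and_ratio` from earlier:

```python
def gain_with_missing(attribute, target, missing="?"):
    """Simplified C4.5-style handling of '?' values: compute the gain
    on rows where the attribute is known, then scale it by the
    fraction of known rows."""
    known = [(a, t) for a, t in zip(attribute, target) if a != missing]
    gain, _ = gain_and_ratio([a for a, _ in known], [t for _, t in known])
    return gain * len(known) / len(target)

# E.g., if two Wind values were '?', the gain would be computed on the
# 12 known rows and then multiplied by 12/14.
```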

C5.0 Algorithm

The C5.0 algorithm is an evolved version of the C4.5 algorithm, also created by Ross Quinlan. It has many advantages over C4.5:

1. It is much faster than C4.5 in computation.

2. It supports data types that the earlier algorithms do not, e.g., dates and timestamps.

3. It is more memory-efficient.

4. It improves the trees by boosting, which makes them more accurate.

C5.0 works the same way as C4.5; the underlying concepts are the same. It also uses Information Gain and Gain Ratio to build decision trees, just with the modifications we have mentioned above!
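If you want to play with these ideas in code: C5.0 itself is available through Quinlan's See5/C5.0 tool and the R package C50. scikit-learn implements CART rather than C4.5/C5.0, but as a rough analogue of C5.0's boosting step, you can boost shallow CART trees with AdaBoost (the integer encoding below is our own shortcut; it treats Outlook as ordinal, which a real C5.0 run would not):

```python
from sklearn.ensemble import AdaBoostClassifier

# Quick numeric encoding of the nominal columns from our toy dataset.
outlook_code = {"Sunny": 0, "Overcast": 1, "Rain": 2}
wind_code = {"Weak": 0, "Strong": 1}
X = [[outlook_code[o], t, h, wind_code[w]]
     for o, t, h, w in zip(outlook, temp, humidity, wind)]

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, decision)
print(boosted.score(X, decision))  # training accuracy on the 14 toy rows
```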

For the coding related to CART, please visit the previous blogs, the links to which are as follows:

CART – Regression Trees – https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/

CART – Classification – https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/

Conclusion

So, the Gini Index is used in CART, mainly for classification purposes. ID3 was the first algorithm in its family and was superseded by its successor, C4.5. C4.5 was a major improvement over ID3, and these are among the most commonly used algorithms in the industry. C5.0 is yet another improvement in this family, bringing some further modifications to C4.5.

You should try making a C4.5 decision tree using Gain Ratio instead of Gain. Try it and let us know in the comments if you face any issue.
