Previously, we have talked about two of the decision tree algorithms:
1. Gini Index (while implementing CART – Classification)
2. ID3 – Iterative Dichotomiser 3 (while implementing CART – Regression)
There are a few more algorithms that you need to be aware of.
In this blog, we are going to learn about those algorithms and do a comparative study of all these algorithms.
In this post, you’ll learn
1. What is C4.5 algorithm
2. What is C5.0 algorithm
3. How are the above 2 algorithms different from each other
4. Which algorithm is better and why: ID3, C4.5, C5.0
If you would like to recap what Gini Index was about, visit: https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/ and scroll to “What is Gini Index?”
And for ID3, visit: https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/ and scroll to “What are Greedy Learners?”
Before diving straight into C4.5, we urge you to visit and read the blog on CART – Regression Tree. (https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/)
It will give you more clarity on topics like:
1. Greedy Learners
2. Entropy
3. Information Gain
4. Gain Ratio
5. SplitInfo
It is of utmost importance to have these topics clear in your mind before proceeding further.
C4.5 Algorithm
ID3 is one of the best-known decision tree algorithms. It was developed by Ross Quinlan in 1986. It creates a multi-way tree. In ID3, we first "greedily" find, for every node, the categorical feature that yields the largest "information gain". After that, the tree is grown to its maximum size, following which a "pruning" step is (usually) applied to improve the tree's ability to generalize to unseen data.
C4.5 is the successor to the ID3 algorithm. It was also developed by Ross Quinlan. It would be safe to say that C4.5 is an evolution of ID3.
ID3 uses Information Gain as the splitting criterion, whereas C4.5 uses Gain Ratio. The main motivation behind creating C4.5 was to overcome the issues that ID3 had. We'll talk about these issues in a while.
Let’s understand how C4.5 works using the following dataset:
Dataset:
The data contains information on weather – related to temperature, humidity, wind, etc. This is a small dataset of 14 rows only.
The column description is as follows:
Data Code | Description |
Outlook | Defines whether the day is Sunny, Overcast, or Rainy |
Temperature | Values in Fahrenheit |
Humidity | Values of Humidity |
Wind | Weak or Strong |
Decision | Did they go to play? (Yes or no) |
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
Process of applying C4.5
1. Calculate the global entropy using the formula: Entropy = – ∑ p(i) * log2p(i)
2. Then calculate Gain Ratios (after calculating Gains as we did in ID3) for all the attributes using the formula: GainRatio(A) = Gain(A) / SplitInfo(A)
3. Compare the Gains and Gain Ratios of all the attributes and choose the best attribute to split on.
Calculating Global Entropy
There are 14 rows in our data. 9 of them lead to “Yes” decision and 5 lead to “No” decision.
Entropy = – ∑ p(i) * log2p(i)
= – [p(Yes) * log2p(Yes)] – [p(No) * log2p(No)]
= – (9/14) * log2(9/14) – (5/14) * log2(5/14)
= 0.940
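The global entropy above can be reproduced with a short Python sketch (the `entropy` helper name is ours, not part of any library):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts.
    Terms with a zero count are skipped, since x * log2(x) -> 0 as x -> 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 9 "Yes" and 5 "No" decisions out of 14 rows
global_entropy = entropy([9, 5])
print(round(global_entropy, 3))  # 0.94
```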
Calculating Gain & Gain Ratios:
GainRatio(A) = Gain(A) / SplitInfo(A)
SplitInfo(A) = -∑ |Dj|/|D| * log2|Dj|/|D|
No need to panic over the above formula. Dj is simply the number of cases taking a particular value j of the attribute, and D is the total number of cases.
I. Gain & Gain Ratio for Outlook Variable:
Outlook variable is nominal. It has 3 values: Sunny, Overcast, Rain.
Gain (Decision, Outlook) = Entropy(Decision) – ∑ [ p(Decision|Outlook) * Entropy(Decision|Outlook) ]
The above big formula is nothing but the formula for calculating gain. Let’s call this Equation 1.
The first part, i.e, Entropy(Decision) has already been calculated by us as 0.940
The second part is the summation, over the values of Outlook, of the products of (i) the proportion of cases taking that Outlook value, and (ii) the entropy of the Decision within those cases; this weighted sum is subtracted from the global entropy.
Let's calculate this second part, i.e., the conditional entropies.
1. Entropy for Outlook = Sunny
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
We have 3 No decisions and 2 Yes decisions.
Entropy(Decision|Outlook=Sunny)
= – p(No) * log2p(No) – p(Yes) * log2p(Yes)
= -(3/5).log2(3/5) – (2/5).log2(2/5)
= 0.441 + 0.528
= 0.970
2. Entropy for Outlook = Overcast
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | 83 | 78 | Weak | Yes |
7 | Overcast | 64 | 65 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
All decisions are Yes here.
Entropy(Decision|Outlook=Overcast)
= – p(No) * log2p(No) – p(Yes) * log2p(Yes)
= -(0/4)*log2(0/4) – (4/4)*log2(4/4)
[Here log2(0) on its own is undefined, but we take 0 * log2(0) to be 0, because x * log2(x) tends to 0 as x tends to 0.]
= 0
3. Entropy for Outlook = Rain
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
10 | Rain | 75 | 80 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
We have 3 Yes and 2 No decisions.
Entropy(Decision|Outlook=Rain)
= – p(No) * log2p(No) – p(Yes) * log2p(Yes)
= -(2/5)*log2(2/5) – (3/5)*log2(3/5)
= 0.528 + 0.441
= 0.970
4. Gain for Outlook variable:
We are done with calculating Entropies for Outlook variable.
Putting these in the Equation 1 above:
Gain(Decision, Outlook)
= 0.940 – (5/14)*(0.970) – (4/14)*(0) – (5/14)*(0.970)
= 0.247
5. SplitInfo for Outlook variable:
Sunny: 5 cases
Overcast: 4 cases
Rain: 5 cases
SplitInfo(Decision, Outlook)
= -(5/14)*log2(5/14) -(4/14)*log2(4/14) -(5/14)*log2(5/14)
= 1.577
6. Finally, Gain Ratio for Outlook variable:
GainRatio(Decision, Outlook)
= Gain(Decision, Outlook)/SplitInfo(Decision, Outlook)
= 0.247/1.577
= 0.156
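The Outlook numbers can be checked end-to-end with a short Python sketch (the helper and variable names are ours):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits); zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts per Outlook value, read off the 14-row table
groups = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
n = 14
global_entropy = entropy([9, 5])  # 0.940

# Gain = global entropy minus the weighted conditional entropies
gain = global_entropy - sum((sum(c) / n) * entropy(c) for c in groups.values())

# SplitInfo is just an entropy computed over the partition sizes (5, 4, 5)
split_info = entropy([sum(c) for c in groups.values()])
gain_ratio = gain / split_info

print(round(gain, 3), round(split_info, 3), round(gain_ratio, 3))
# 0.247 1.577 0.156
```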
More work needs to be done. This is the Gain Ratio for just one of the attributes. We have to calculate the Gain Ratio for the other variables too so that we can compare them at the end.
II. Gain & Gain Ratio for Wind Variable:
This is also a nominal variable. It has 2 values: Weak & Strong.
Gain (Decision, Wind) = Entropy(Decision) – ∑ [ p(Decision|Wind) * Entropy(Decision|Wind) ]
Let’s call this Equation 2.
1. Entropy for Wind = Weak
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | 85 | 85 | Weak | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
We have 6 Yes and 2 No decisions.
Entropy(Decision|Wind=Weak)
= – p(No) * log2p(No) – p(Yes) * log2p(Yes)
= – (2/8) * log2(2/8) – (6/8) * log2(6/8)
= 0.811
2. Entropy for Wind = Strong
Day | Outlook | Temp. | Humidity | Wind | Decision |
2 | Sunny | 80 | 90 | Strong | No |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
14 | Rain | 71 | 80 | Strong | No |
We have 3 Yes and 3 No decisions.
Entropy(Decision|Wind=Strong)
= – (3/6) * log2(3/6) – (3/6) * log2(3/6)
= 1
3. Gain for Wind variable:
Gain(Decision, Wind)
= 0.940 – (8/14)*(0.811) – (6/14)*(1)
= 0.940 – 0.463 – 0.428
= 0.049
4. SplitInfo for Wind variable:
Weak: 8 cases
Strong: 6 cases
SplitInfo(Decision, Wind)
= -(6/14)*log2(6/14) -(8/14)*log2(8/14)
= 0.524 + 0.461
= 0.985
5. Finally, Gain Ratio for Wind variable:
GainRatio(Decision, Wind)
= Gain(Decision, Wind)/SplitInfo(Decision, Wind)
= 0.049 / 0.985
= 0.049
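The same sketch works for Wind (helper names are ours). Note that exact arithmetic gives a Gain of about 0.048; the 0.049 above comes from rounding the intermediate terms:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits); zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

groups = {"Weak": (6, 2), "Strong": (3, 3)}  # (Yes, No) counts per Wind value
n = 14
gain = entropy([9, 5]) - sum((sum(c) / n) * entropy(c) for c in groups.values())
gain_ratio = gain / entropy([8, 6])  # SplitInfo over the 8/6 partition
print(round(gain, 3), round(gain_ratio, 3))  # 0.048 0.049
```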
III. Gain & Gain Ratio for Humidity Variable:
This is where things get interesting, because Humidity is a continuous variable. How do we deal with it?
Step 1. Arrange the values in ascending order.
Step 2. Convert them to nominal values by performing a binary split on a threshold value.
[The threshold is the value at which the Gain for this variable is maximized.]
Step 3. The gain at this threshold value will be used for comparison of gains and gain ratios of all the variables.
1. Let’s arrange it in ascending order of values of Humidity:
Day | Humidity | Decision |
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |
Now, we need to calculate the gains and gain ratios for every value of Humidity. The value which maximizes the gain would be the threshold. Here, we will separate our dataset in 2 parts: (i) values less than or equal to the current value, and (ii) values greater than the current value.
2. Calculating Gains and Gain Ratios for all values:
2.a. For Humidity = 65
We have 1 Yes & 0 No decisions at <= 65 and 8 Yes & 5 No decisions at > 65
Entropy(Decision|Humidity<=65)
= – p(No) . log2p(No) – p(Yes) . log2p(Yes)
= -(0/1).log2(0/1) – (1/1).log2(1/1)
= 0
Entropy(Decision|Humidity>65)
= -(5/13).log2(5/13) – (8/13).log2(8/13)
=0.530 + 0.431
= 0.961
Gain(Decision, Humidity<> 65)
= 0.940 – (1/14).0 – (13/14).(0.961)
= 0.048
SplitInfo(Decision, Humidity<> 65) =
-(1/14).log2(1/14) -(13/14).log2(13/14)
= 0.371
GainRatio(Decision, Humidity<> 65)
= 0.048/0.371
= 0.129
2.b. For Humidity = 70
We have 3 Yes & 1 No decisions at <= 70 and 6 Yes & 4 No decisions at > 70
Entropy(Decision|Humidity<=70)
= – p(No) . log2p(No) – p(Yes) . log2p(Yes)
= -(1/4).log2(1/4) – (3/4).log2(3/4)
= 0.811
Entropy(Decision|Humidity>70)
= -(4/10).log2(4/10) – (6/10).log2(6/10)
= 0.971
Gain(Decision, Humidity<> 70)
= 0.940 – (4/14).(0.811) – (10/14).(0.971)
= 0.014
SplitInfo(Decision, Humidity<> 70)
= -(4/14).log2(4/14) -(10/14).log2(10/14)
= 0.863
GainRatio(Decision, Humidity<> 70)
= 0.014/0.863
= 0.016
Similarly, calculate the Gains and Gain Ratios for all other values of Humidity.
We found out that the Gain was maximum for Humidity = 80
[Note: Here is something interesting. You can use either Gain or Gain Ratio as the criterion for picking the threshold. The two choices can result in different decision trees. We are using Gain.]
Gain(Decision, Humidity <> 80) = 0.101
GainRatio(Decision, Humidity <> 80) = 0.107
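The whole threshold search can be sketched in Python (the function and variable names are ours; the lists restate the 14-row table). Exact arithmetic gives a Gain of about 0.102 at the threshold 80; the 0.101 above comes from rounding the intermediate terms:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits); zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels):
    """Try each distinct value v as a binary split (<= v vs > v) and
    return the (threshold, gain) pair that maximizes information gain."""
    n = len(values)
    base = entropy([labels.count("Yes"), labels.count("No")])
    best = (None, -1.0)
    for v in sorted(set(values)):
        left = [l for x, l in zip(values, labels) if x <= v]
        right = [l for x, l in zip(values, labels) if x > v]
        if not right:  # splitting at the maximum value leaves one side empty
            continue
        gain = base - sum(
            (len(side) / n) * entropy([side.count("Yes"), side.count("No")])
            for side in (left, right)
        )
        if gain > best[1]:
            best = (v, gain)
    return best

# Humidity and Decision for days 1..14, from the table above
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
decision = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
t, gain = best_threshold(humidity, decision)
print(t, round(gain, 3))  # 80 0.102
```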
IV. Gain & Gain Ratio for Temp. Variable:
This is also a continuous variable. We will repeat the steps we did for Humidity variable.
1. Let’s arrange it in ascending order of values of Temp:
Day | Temp. | Decision |
7 | 64 | Yes |
6 | 65 | No |
5 | 68 | Yes |
9 | 69 | Yes |
4 | 70 | Yes |
14 | 71 | No |
8 | 72 | No |
12 | 72 | Yes |
10 | 75 | Yes |
11 | 75 | Yes |
2 | 80 | No |
13 | 81 | Yes |
3 | 83 | Yes |
1 | 85 | No |
2. Calculating Gains and Gain Ratios for all values:
2.a. For Temp = 64
We have 1 Yes & 0 No decisions at <= 64 and 8 Yes & 5 No decisions at > 64
Entropy(Decision|Temp<=64)
= – p(No) . log2p(No) – p(Yes) . log2p(Yes)
= -(0/1).log2(0/1) – (1/1).log2(1/1)
= 0
Entropy(Decision|Temp>64)
= -(5/13).log2(5/13) – (8/13).log2(8/13)
=0.530 + 0.431
= 0.961
Gain(Decision, Temp <> 64)
= 0.940 – (1/14).0 – (13/14).(0.961)
= 0.048
SplitInfo(Decision, Temp <> 64) =
-(1/14).log2(1/14) -(13/14).log2(13/14)
= 0.371
GainRatio(Decision, Temp <> 64)
= 0.048/0.371
= 0.129
2.b. For Temp = 65
We have 1 Yes & 1 No decisions at <= 65 and 8 Yes & 4 No decisions at > 65
Entropy(Decision|Temp<=65)
= – p(No) . log2p(No) – p(Yes) . log2p(Yes)
= -(1/2).log2(1/2) – (1/2).log2(1/2)
= 1
Entropy(Decision|Temp>65)
= -(4/12).log2(4/12) – (8/12).log2(8/12)
= 0.918
Gain(Decision, Temp<> 65)
= 0.940 – (2/14).1 – (12/14).(0.918)
= 0.010
SplitInfo(Decision, Temp<> 65)
= -(2/14).log2(2/14) -(12/14).log2(12/14)
= 0.591
GainRatio(Decision, Temp<> 65)
= 0.010/0.591
= 0.017
Similarly, calculate the Gains and Gain Ratios for all other values of Temp.
We found out that the Gain was maximum for Temp = 83
Gain(Decision, Temp <> 83) = 0.113
GainRatio(Decision, Temp <> 83) = 0.305
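These two values can be verified directly at the threshold 83 (helper names are ours). Of the 14 rows, 13 have Temp <= 83 (9 Yes, 4 No) and 1 has Temp > 83 (0 Yes, 1 No):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits); zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

n = 14
gain = entropy([9, 5]) - (13 / n) * entropy([9, 4]) - (1 / n) * entropy([0, 1])
gain_ratio = gain / entropy([13, 1])  # SplitInfo over the 13/1 partition
print(round(gain, 3), round(gain_ratio, 3))  # 0.113 0.305
```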
Comparison of Gains and Gain Ratios
Attribute | Gain | Gain Ratio |
Wind | 0.049 | 0.049 |
Outlook | 0.247 | 0.156 |
Humidity <> 80 | 0.101 | 0.107 |
Temp <> 83 | 0.113 | 0.305 |
If we use Gain, Outlook will be the root node. (Because it has the highest Gain value)
Similarly, if we use Gain Ratio, Temp will be the root node.
We will proceed using the Gain.
Outlook = Sunny
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
If humidity > 80, decision is ‘No’
If humidity <= 80, decision is ‘Yes’
Outlook = Overcast
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | 83 | 78 | Weak | Yes |
7 | Overcast | 64 | 65 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
All decisions are ‘Yes’
Outlook = Rain
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
10 | Rain | 75 | 80 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
If Wind = Weak, decision is 'Yes'
If Wind = Strong, decision is 'No'
So, this is our final Decision Tree using C4.5 algorithm.
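As a sanity check, the finished tree can be encoded as plain if/else rules and run over the 14 training rows (a sketch; the row tuples restate the table above, and for the Rain branch the Rain subset shows Weak → Yes and Strong → No):

```python
def predict(outlook, humidity, wind):
    """The final C4.5 tree from this post, written as plain rules."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity <= 80 else "No"
    # outlook == "Rain": split on Wind
    return "Yes" if wind == "Weak" else "No"

# The 14 training rows: (Outlook, Humidity, Wind, Decision)
rows = [
    ("Sunny", 85, "Weak", "No"),     ("Sunny", 90, "Strong", "No"),
    ("Overcast", 78, "Weak", "Yes"), ("Rain", 96, "Weak", "Yes"),
    ("Rain", 80, "Weak", "Yes"),     ("Rain", 70, "Strong", "No"),
    ("Overcast", 65, "Strong", "Yes"), ("Sunny", 95, "Weak", "No"),
    ("Sunny", 70, "Weak", "Yes"),    ("Rain", 80, "Weak", "Yes"),
    ("Sunny", 70, "Strong", "Yes"),  ("Overcast", 90, "Strong", "Yes"),
    ("Overcast", 75, "Weak", "Yes"), ("Rain", 80, "Strong", "No"),
]
correct = sum(predict(o, h, w) == d for o, h, w, d in rows)
print(correct, "/", len(rows))  # 14 / 14
```

The tree classifies all 14 training rows correctly, as expected for an unpruned tree grown on this data.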
Advantages of C4.5 over ID3
C4.5 is an evolution of ID3 by the same author (Quinlan). He made sure that the bottlenecks present in ID3 were not repeated in C4.5. Following are the improvements made in C4.5:
1. It can handle both continuous and discrete variables.
2. It can handle missing values by marking them as ‘?’. They are not used in Gain and Entropy calculations.
3. Prunes the tree and thereby avoids ‘overfitting’.
C5.0 Algorithm:
The C5.0 algorithm is an evolved version of the C4.5 algorithm, also created by Ross Quinlan. It has many advantages over C4.5:
1. It is much faster than C4.5 in computation.
2. It supports more data types than the previous algorithms, e.g., dates and timestamps.
3. It is more memory-efficient.
4. It improves the trees by Boosting which gives more accuracy to the trees.
The C5.0 algorithm works the same way as C4.5 does. The underlying basic concepts are the same: it also makes use of Information Gain and Gain Ratio to build decision trees, with just the few modifications we have mentioned above.
For the coding related to CART, please visit the previous blogs, the links to which are as follows:
CART – Regression Trees – https://insightimi.wordpress.com/2020/03/15/cart-regression-tree-from-scratch-with-a-hands-on-examplein-r/
CART – Classification – https://insightimi.wordpress.com/2020/03/08/classification-and-regression-tree/
Conclusion
So, the Gini Index is used in CART, mainly for classification. ID3 was the first algorithm in its family and was superseded by its successor, C4.5. C4.5 was a major improvement over ID3, and these are among the most commonly used algorithms in the industry. C5.0 was yet another improvement in this family of algorithms, bringing some more modifications to C4.5.
You should try making a C4.5 decision tree using Gain Ratio instead of Gain. Try it and let us know in the comments if you face any issue.