Decision Tree Classifier and Cost Computation Pruning using Python

A complete hands on guide towards building, visualizing, and fine tuning a decision tree using cost computation pruning in Python

Photo by Jens Lelie on Unsplash

Introduction

Decision tree classifiers are supervised learning models that are useful when we care about interpretability. Think of it like, breaking down the data by making decisions based on multiple questions at each level. This is one of the widely used algorithms for handling classification problems. To understand it better let’s take a look at the example below.

The sample decision tree used to decide upon the activity of a particular day. Adapted from Python Deeper Insights into Machine Learning (Raschka, Julian and Hearty, 2016, pp.83, 88, 89)

Decision Tree usually consists of:

Root Node — Represents the sample or the population that gets divided into further homogeneous groups
Splitting — Process of dividing the nodes into two sub-nodes
Decision Node — When a sub-node splits into further sub-nodes based on a certain condition, it is called a decision node
Leaf or Terminal Node — Sub nodes that don’t split further
Information Gain — To split the nodes using a condition (say most informative feature) we need to define an objective function that can be optimized. In the decision tree algorithm, we tend to maximize the information gain at each split. Three impurity measures are used commonly in measuring the information gain. They are the Gini impurity, Entropy, and the Classification error

Example of a Decision Tree with leaves and branches. Reference — Developed by the author using Lucid Chart

Understanding the Mathematics

To understand how decision trees are developed, we need to have a deeper understanding of how information gain is maximized at each step using one of the impurity measures at each step. Let’s take an example where we have training data comprising student information like gender, grade, and a dependent variable or a classifier variable the identifies if a student is a foodie or not. We have with us the following information outlined below.

Total Students — 20
Total Students Categorized as Foodie — 10
Total Student not Categorized as Foodie — 10
P(Foodie), probability of a student being foodie = (10/20) = 0.5
Q(Not Foodie, or 1-P), probability of a student not being foodie = (10/20) = 0.5

Let’s divide the students based on their gender into two nodes and recalculate the above metrics.

Male Students (Node A)

Total Students — 10
Total Students Categorized as Foodie — 8
Total Student not Categorized as Foodie — 2
P(Foodie), probability of a student being foodie = (8/10) = 0.8
Q(Not Foodie, or 1-P), probability of a student not being foodie = (2/10) = 0.2

Female Students (Node B)

Total Students — 10
Total Students Categorized as Foodie — 4
Total Student not Categorized as Foodie — 6
P(Foodie), probability of a student being foodie = (4/10) = 0.4
Q(Not Foodie, or 1-P), probability of a student not being foodie = (6/10) = 0.6

Gini Index (GIn)for Node A or the Male Students = P² + Q², where P and Q are the probability of a student being a foodie and a non-foodie. GIn (Node A) = 0.8² + 0.2² = 0.68

Gini Impurity (GIp)for Node A = 1-Gini Index = 1–0.68 = 0.32

Gini Index (GIn)for Node B or the Female Students = P² + Q², where P and Q are the probability of a student being a foodie and a non-foodie. GIn (Node B) = 0.4² + 0.6²= 0.52

Gini Impurity (GIp)for Node B= 1-Gini Index = 1–0.52 = 0.48

What we observe above is that when we split the students based on their gender (Male and Female) into Nodes A and B respectively, we have a Gini impurity score for two nodes separately. Now to decide if gender is the right variable to split the students into foodie and non-foodie we need a weighted Gini Impurity Score which is calculated using the formula outlined below.

Weighted Gini Impurity = (Total samples in Node A / Total Samples in the data set) * Gini Impurity (Node A) + (Total samples in Node B/ Total Samples in the data set) * Gini Impurity (Node B)

Using this formulation to calculate the Weighted Gini Impurity Score of the above example, Weighted Gini Impurity Score when Students are divided based on Gender = (10/20)*0.32 + (10/20)*0.48 = 0.4

A classification problem works with multiple independent variables. The variables can be categorical or continuous. Decision trees are well adapted to handle variables of different data types. A decision tree algorithm takes into consideration all possible variables while deciding the split of each node. Variables using which maximum Weighted Impurity Gain can be achieved, is used as a decision variable for a particular node .

In the example above the Weighted Impurity Gain using “gender” as the decision variable is 0.4, however, say using “grade” as the decision variable we manage to achieve a Weighted Impurity Gain of 0.56, the algorithm will use “grade” as the decision variable for creating the first split. A similar methodology is followed for all subsequent steps until every node is homogeneous.

Quick facts about the Decision Tree Algorithm

Decision trees are prone to overfitting as the algorithm continues to split nodes into sub-nodes till each node becomes homogeneous
The accuracy of training data is much higher when compared to the test set, hence decision trees should be pruned to prevent the model from overfitting. Pruning can be achieved by controlling the depth of the tree, maximum/minimum number of samples in each node, minimum impurity gain for a node to split, and the maximum leaf nodes
Python allows users to develop a decision tree using the Gini Impurity or Entropy as the Information Gain Criterion
A decision tree can be fine-tuned using a Grid Search or a Randomized Search CV. CV stands for cross-validation

Visual Comparison of Three Different Impurity Criteria

The code snippet outlined below provides a visual comparison of different impurity criterion and how they change with different probability values. Note the code below is adapted from Python: Deeper Insights into Machine Learning by S.Raschka, D.Julian, and J.Hearty, 2016.

import matplotlib.pyplot as pltimport numpy as np#-----Calculating Gini Index
def gini(p):
    return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))#-----Calculating Entropy
def entropy(p):
    return - p*np.log2(p) - (1 - p)*np.log2((1 - p))#-----Calculating Classification Error
def classification_error(p):
    return 1 - np.max([p, 1 - p])#----Creating a Numpy Array of probability values from 0 to 1, with an increment of 0.01
x = np.arange(0.0, 1.0, 0.01)#---Obtaining Entropy for different values of p
ent = [entropy(p) if p != 0 else None for p in x]#---Obtaining scaled entropy
sc_ent = [e*0.5 if e else None for e in ent]#--Classification Error
err = [classification_error(i) for i in x]#--Plottingfig = plt.figure();
plt.figure(figsize=(10,8));
ax = plt.subplot(111);for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err], ['Entropy', 'Entropy (scaled)','Gini Impurity',
                                                        'Misclassification Error'],['-', '-', '--', '-.'],
                          ['black', 'darkgray','blue', 'brown', 'cyan']):
    line = ax.plot(x, i, label=lab,
    linestyle=ls, lw=2, color=c)

ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=3, fancybox=True, shadow=False)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.ylim([0, 1.1])
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.show()

Change in impurity for different probability values. Reference — Output of the above code snippet

Hands On Exercise

The problem statement aims at developing a classification model to predict the quality of red wine. Details about the problem statement can be found here . This is a classic example of a multi-class classification problem. Note that all machine learning models are sensitive to Outliers, hence features/independent variables consisting of outliers should be treated before building the tree.

An important aspect of different features/independent variables is how they interact. Pearson correlation can be used to determine the degree of association between two features from a data set. However, for a decision-based algorithm like a decision tree, we won’t be dropping highly correlated variables.

#---------------------------------------------Importing Required Libraries-----------------------------------
%matplotlib inlineimport numpy as np
import pandas as pdfrom sklearn.tree import DecisionTreeClassifierimport numpy as np
import pandas as pd
import seaborn as snssns.set(color_codes=True)from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split #--------------splitting data into test and train
from sklearn.tree import DecisionTreeClassifier #-----------Building decision tree modelfrom sklearn import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score, confusion_matrix #-----model validation scores
%matplotlib inlinefrom IPython.display import display #---------------------for displaying multiple data frames in one outputfrom sklearn.feature_extraction.text import CountVectorizer  #DT does not take strings as input for the model fit stepimport missingno as msno_plot #--------------plotting missing valueswine_df = pd.read_csv('winequality-red.csv',sep=';')

Quick descriptive statistics of the data

wine_df.describe().transpose().round(2)

Output of the .describe() function

Checking Missing Values

#-------------------------------------------Barplot of non-missing values--------------------------------
plt.title('#Non-missing Values by Columns')
msno_plot.bar(wine_df);

Plot illustrating the count of non-missing values in the dataset

Outlier check & Treatment

#--Checking Outliers
plt.figure(figsize=(15,15))
pos = 1
for i in wine_df.columns:
    plt.subplot(3, 4, pos)
    sns.boxplot(wine_df[i])
    pos += 1

Box plot illustrating the presence of outlier in the data

col_names=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']display(col_names)for i in col_names:
    q1, q2, q3 = wine_df[i].quantile([0.25,0.5,0.75])
    IQR = q3 - q1
    lower_cap=q1-1.5*IQR
    upper_cap=q3+1.5*IQR
    wine_df[i]=wine_df[i].apply(lambda x: upper_cap if x>(upper_cap) else (lower_cap if x<(lower_cap) else x))

Outliers above are winsorized using Q1–1.5*IQR and Q3+1.5*IQR values. Q1, Q3, and IQR stand for Quartile 1, Quartile 3, and Inter Quartile Range respectively.

sns.pairplot(wine_df);

Pairplot illustrating the interaction of different variables

Understanding the relationship between different variables. Note — In Decision Trees, we need not remove highly correlated variables as nodes are divided into sub-nodes using one independent variable only, hence even if two or more variables are highly correlated, the variable producing the highest information gain will be used for the analysis.

plt.figure(figsize=(10,8))
sns.heatmap(wine_df.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")
plt.show()

Heat map illustrating the correlation of different attributes

Classification problems are sensitive to class imbalance. A class imbalance is a scenario when the dependent attribute has a higher proportion of ones than zeros or vice versa. In a multiclass problem, class imbalance occurs when the proportion of one of the class values is much higher. Class balance is induced by combining values of the attribute “quality” which is the dependent variable in this problem statement.

plt.figure(figsize=(10,8))
sns.countplot(wine_df['quality']);

Image illustrating the count of records across different wine quality

wine_df['quality'] = wine_df['quality'].replace(8,7)
wine_df['quality'] = wine_df['quality'].replace(3,5)
wine_df['quality'] = wine_df['quality'].replace(4,5)
wine_df['quality'].value_counts(normalize=True)

Data is divided into train and test set to check the accuracy of the model and look for overfitting or under fitting if any.

# splitting data into training and test set for independent attributesfrom sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test =train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.3,
                                                   random_state=22)
X_train.shape,X_test.shape

A decision tree model is developed using the Gini criterion. Note that for the sake of simplicity we have pruned the tree to a maximum depth of three. This will help us visualize the tree and tie it back to the concepts we covered in the initial segments.

clf_pruned = DecisionTreeClassifier(criterion = "gini", random_state = 100,
                               max_depth=3, min_samples_leaf=5)
clf_pruned.fit(X_train, y_train)

Note that the following parameters can be tuned to improve the model output (Scikit Learn, 2019).

criterion — Gini impurity is used to decide the variables based on which root node and following decision nodes should be split
class_weight — None; All classes are assigned weight 1
max_depth — 3; Pruning is done. When “None”, it signifies that nodes will be expanded till all leaves are homogeneous
max_features — None; All features or independent variables are considered while deciding split of a node
max_leaf_nodes — None;
min_impurity_decrease — 0.0; A node is split only when the split ensures a decrease in the impurity of greater than or equal to zero
min_impurity_split — None;
min_samples_leaf — 1; Minimum number of samples required for a leaf to exists
min_samples_split — 2; If min_samples_leaf =1, it signifies that the right and the left node should have 1 sample each, i.e. the parent node or the root node should have at least two samples
splitter — ‘best’; Strategy used to choose the split at each node. Best ensure that all features are considered while deciding the split

from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO  
from IPython.display import Image  
import pydotplus
import graphvizxvar = wine_df.drop('quality', axis=1)
feature_cols = xvar.columnsdot_data = StringIO()
export_graphviz(clf_pruned, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1','2'])from pydot import graph_from_dot_data
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Image illustrating the decision tree model when the tree is pruned at depth =3

preds_pruned = clf_pruned.predict(X_test)
preds_pruned_train = clf_pruned.predict(X_train)print(accuracy_score(y_test,preds_pruned))
print(accuracy_score(y_train,preds_pruned_train))

The accuracy score of the model on training and test data turned out to be 0.60 and 0.62 respectively.

Feature importance refers to a class of techniques for assigning scores to input features of a predictive model that indicates the relative importance of each feature when making a prediction.

## Calculating feature importancefeat_importance = clf_pruned.tree_.compute_feature_importances(normalize=False)feat_imp_dict = dict(zip(feature_cols, clf_pruned.feature_importances_))
feat_imp = pd.DataFrame.from_dict(feat_imp_dict, orient='index')
feat_imp.rename(columns = {0:'FeatureImportance'}, inplace = True)
feat_imp.sort_values(by=['FeatureImportance'], ascending=False).head()

Top 5 features impacting the decision tree splits

The DecisionTreeClassifier() provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Think of it as a scenario where we explicitly define the depth and the maximum leaves in the tree. However, the biggest challenge is to determine the optimum depth and leaves a tree should contain. In the example above we used max_depth=3, min_samples_leaf=5. These numbers are just example figures to see how the tree behaves. But if in reality we were asked to work on this model and come up with an optimum value for the model parameters, it is challenging but not impossible (decision tree models can be fine-tuned using GridSearchCV algorithm).

The other way of doing it is by using the Cost Complexity Pruning (CCP).

Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned (Scikit Learn, n.d.).

In simpler terms, cost complexity is a threshold value. The model split a node further into its child node only when the overall impurity of the model is improved by a value greater than this threshold else it stops.

Lower the CCP, lower is the impurity. How?

When the value of CCP is lower the model will split a node into its child nodes even if the impurity doesn’t decrease much. This is evident as the depth of the tree increases, i.e. as we go down a decision tree, we will find that split doesn’t contribute much to the change in the overall impurity of the model. However higher split ensures that the classes are categorized correctly, i.e. accuracy is more.

When CCP values are low, a higher number of nodes are created. Higher the nodes, the higher is the depth of the tree as well.

The code below (Scikit Learn, n.d.) illustrates how alpha can be tuned to obtain a model with an improved accuracy score.

path = model_gini.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impuritiesfig, ax = plt.subplots(figsize=(16,8));
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post");
ax.set_xlabel("effective alpha");
ax.set_ylabel("total impurity of leaves");
ax.set_title("Total Impurity vs effective alpha for training set");

Image illustrating the variation of impurity with change in alpha

Let’s understand the variation of depth and the number of nodes with change in alpha.

clfs = clfs[:-1]ccp_alphas = ccp_alphas[:-1]node_counts = [clf.tree_.node_count for clf in clfs]depth = [clf.tree_.max_depth for clf in clfs]fig, ax = plt.subplots(2, 1,figsize=(16,8))ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Image illustrating the change of depth and #nodes with change in alpha

Understanding the variation in accuracy when alpha is increased.

fig, ax = plt.subplots(figsize=(16,8)); #-----------------Setting size of the canvas
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()

Image illustrating the variation in accuracy scores when alpha is increased

i = np.arange(len(ccp_alphas))
ccp = pd.DataFrame({'Depth': pd.Series(depth,index=i),'Node' : pd.Series(node_counts, index=i),\
                    'ccp' : pd.Series(ccp_alphas, index = i),'train_scores' : pd.Series(train_scores, index = i),
                   'test_scores' : pd.Series(test_scores, index = i)})ccp.tail()ccp[ccp['test_scores']==ccp['test_scores'].max()]

The above code provides the cost computation pruning value that produces the highest accuracy in the test data.

Reference

Raschka, S., Julian, D. and Hearty, J. (2016). Python : deeper insights into machine learning : leverage benefits of machine learning techniques using Python : a course in three modules . Birmingham, Uk: Packt Publishing, pp.83, 88, 89.
‌ Scikit-learn: Machine Learning in Python , Pedregosa et al. , JMLR 12, pp. 2825–2830, 2011.
Scikit Learn (2019). sklearn.tree.DecisionTreeClassifier — scikit-learn 0.22.1 documentation . [online] Scikit-learn.org. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
Scikit Learn (n.d.). Post pruning decision trees with cost complexity pruning . [online] Available at: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py.

Decision Tree Classifier and Cost Computation Pruning using Python