Decision Tree Model for Classification

Decision Trees [DTrees] are non-linear ML models that can be used for supervised ML, i.e. both Classification and Regression.
- They are non-parametric models, i.e. unlike Linear Regression they don't learn a fixed set of weights [intercept or coefficients/slopes].
- They provide good explainability and are easy to interpret.
- They are prone to overfitting.

Types of Tree Model Algorithms
1. ID3
2. C4.5
3. C5.0
4. CART [classification and regression trees]*
5. CHAID

Split Criterion for the Models (To Build the Tree)
1. Gini Impurity*
2. Entropy [state of disorder/uncertainty]
3. Information Gain = Entropy [parent node] - weighted average of Entropy [child nodes], where each child is weighted by its share of the parent's samples
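
To make the three criteria concrete, here is a minimal sketch in plain Python. The helper names (gini, entropy, information_gain) and the toy labels are made up for illustration; scikit-learn computes these internally and does not expose functions with these names.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the sample-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child)
                                 for child in children)
    return entropy(parent) - weighted_child_entropy

# Toy node with a 50/50 class mix, split into two partly pure children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print("Gini (parent)    :", gini(parent))     # 0.5, the maximum for two classes
print("Entropy (parent) :", entropy(parent))  # 1.0 bit, the maximum for two classes
print("Information gain :", information_gain(parent, [left, right]))
```

A split is better when it lowers impurity more, so the tree picks the split with the lowest weighted Gini impurity (or the highest information gain) at each node.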

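Early Stopping Criterion (To Avoid Overfitting by Pre-Pruning the Tree)
1. Max Depth [maximum number of levels in the tree]
2. Min Split [minimum number of samples a node must have before it can be split]
3. Min Bucket [minimum number of samples allowed in a leaf node]
4. Minimum Impurity Decrease [a split is made only if it reduces impurity by at least this much]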
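
In scikit-learn these four criteria correspond to the DecisionTreeClassifier parameters max_depth, min_samples_split, min_samples_leaf and min_impurity_decrease. The sketch below uses synthetic data from make_classification and arbitrary parameter values, both chosen only for illustration; it is not the diabetes data used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, used only to keep the example self-contained)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tree grown with all four early stopping criteria switched on
pruned = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,                  # Max Depth
    min_samples_split=20,         # Min Split
    min_samples_leaf=10,          # Min Bucket
    min_impurity_decrease=0.001,  # Minimum Impurity Decrease
    random_state=0,
).fit(X, y)

# Tree grown with the defaults, i.e. with no early stopping
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

print("Pruned       : depth", pruned.get_depth(), "leaves", pruned.get_n_leaves())
print("Unrestricted : depth", unrestricted.get_depth(), "leaves", unrestricted.get_n_leaves())
```

Comparing the depth and leaf count of the two trees shows how much the stopping criteria limit tree growth.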

Python Program

EDA + Feature Engineering Phase

#read the data


import pandas as pd

df=pd.read_csv('https://raw.githubusercontent.com/rktrojan/DataSciencePython/main/DataFiles/diabetes.csv', header=None)

df.columns = ['feat-1', 'feat-2' , 'feat-3', 'feat-4', 'feat-5', 'feat-6','feat-7', 'feat-8' , 'target']

df


df_X = df.drop('target',axis=1)

df_Y = df.target


from sklearn.model_selection import train_test_split


# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(df_X, df_Y, test_size=0.3,random_state=99)

Training Phase

# import the classifier model class

from sklearn.tree import DecisionTreeClassifier  


  
# create a classifier object - for CART Model


classifier = DecisionTreeClassifier(criterion='gini', max_depth=7, random_state=11)

#max_depth is maximum number of levels in the tree

#train the model
classifier.fit(X_train, y_train)

#training accuracy

print('Train Accuracy : ', classifier.score(X_train, y_train))

Train Accuracy : 0.9385474860335196

#Here you can also check the relative importance of features used to train the Model

classifier.feature_importances_

array([0.01711065, 0.58418702, 0.01197044, 0.01900979, 0.        ,
       0.15963914, 0.07359983, 0.13448312])
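
The raw array is easier to read once each value is paired with its feature name. A small sketch, using the importance values printed above and the column names assigned earlier; in the notebook itself you would pass df_X.columns and classifier.feature_importances_ instead of the hard-coded lists.

```python
# Column names and importances copied from the output above
feature_names = ['feat-1', 'feat-2', 'feat-3', 'feat-4',
                 'feat-5', 'feat-6', 'feat-7', 'feat-8']
importances = [0.01711065, 0.58418702, 0.01197044, 0.01900979,
               0.0, 0.15963914, 0.07359983, 0.13448312]

# Sort the (name, importance) pairs from most to least important
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Here feat-2 dominates with about 58% of the total importance, while feat-5 has an importance of 0, meaning this tree never splits on it.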

Plot Decision Trees

#import these for the tree plot

import matplotlib.pyplot as plt
from sklearn import tree


plt.figure(figsize=(55,40))

info = tree.plot_tree(classifier, 
                      filled=True, 
                      rounded=True,
                      precision=3,
                      fontsize=25
                    )

# filled=True means: paint/color each node to indicate its majority class (for classification)

So here, 
Orange color indicates Class 0
Blue color indicates Class 1

Testing Phase

#prediction on test data

y_pred = classifier.predict(X_test)
y_pred



#Gini impurity should be low: the lower the impurity, the purer the node

#CART models build binary trees
#CHAID models build multiway trees

Model Evaluation

from sklearn import metrics
print("Test Accuracy:", metrics.accuracy_score(y_test, y_pred))

Test Accuracy: 0.6796536796536796

# We can also use a confusion matrix for further analysis
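
A minimal sketch of that confusion matrix analysis with sklearn.metrics.confusion_matrix. The two label lists below are tiny made-up examples, not results from this model; in the notebook you would pass y_test and y_pred instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# For a binary problem, ravel() unpacks the matrix in this order
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Unlike accuracy alone, the matrix shows which class the model gets wrong and in which direction (false positives versus false negatives).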

CONCLUSION:

Train_Accuracy = .9385
Test_Accuracy  = .6796


With such a large gap between train and test accuracy (about 0.26), we can see that the DTree model here is overfitted.
We can try tree pruning and hyperparameter tuning approaches here.
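
One possible way to do this is a cross-validated grid search over the pre-pruning parameters together with ccp_alpha (scikit-learn's cost-complexity post-pruning parameter). The sketch below is a starting point, not a tuned solution: the data is synthetic (make_classification) and the grid values are arbitrary assumptions, so swap in the diabetes X_train / y_train and adjust the grid for real use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, used only to keep the example self-contained)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=99)

# Candidate values are illustrative, not recommendations
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.005, 0.02],  # larger alpha prunes more aggressively
}

# 5-fold cross-validation on the training set picks the best combination
search = GridSearchCV(DecisionTreeClassifier(random_state=11),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters :", search.best_params_)
print("Train Accuracy  :", search.score(X_train, y_train))
print("Test Accuracy   :", search.score(X_test, y_test))
```

If tuning helps, the gap between train and test accuracy should shrink compared with the max_depth=7 tree above.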


Rahul Aggarwal

http://guardiancoder.in
