Decision Tree Model for Classification

  • DTrees [Decision Trees] are non-linear ML models used for supervised learning, i.e. both Classification and Regression.
    • They are non-parametric models: unlike Linear Regression, they do not learn a fixed set of weights [i.e. intercept or coefficients/slopes]; the tree structure itself is learned from the data.
    • They provide good explainability and are easy to interpret.
    • They are prone to overfitting.



  • Types of Tree Model Algorithms
    1. ID3
    2. C4.5
    3. C5.0
    4. CART [classification and regression trees]*
    5. CHAID
  • Split Criterion for the Models (To Build the Tree)
    1. Gini Impurity*
    2. Entropy [state of disorder/uncertainty]
    3. Information Gain = Entropy [parent node] – weighted average of Entropy [child nodes]
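  • Early Stopping Criterion (pre-pruning: limit tree growth to avoid overfitting)
    1. Max Depth [max_depth in scikit-learn]
    2. Min Split [min_samples_split: minimum samples needed to split a node]
    3. Min Bucket [min_samples_leaf: minimum samples allowed in a leaf]
    4. Minimum Impurity Decrease [min_impurity_decrease]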
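
To make the split criteria listed above concrete, here is a minimal pure-Python sketch of Gini impurity, entropy, and information gain. The helper names (gini, entropy, information_gain) and the toy labels are illustrative assumptions, not part of scikit-learn or the original post.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy example (hypothetical labels): a balanced parent split into two children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print("Gini (parent):   ", gini(parent))      # 0.5
print("Entropy (parent):", entropy(parent))   # 1.0
print("Information gain:", information_gain(parent, [left, right]))  # about 0.189
```

A split is preferred when it lowers impurity the most, i.e. when the children are purer than the parent.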

Python Program

EDA + Feature Engineering Phase

#read the data


import pandas as pd

df=pd.read_csv('https://raw.githubusercontent.com/rktrojan/DataSciencePython/main/DataFiles/diabetes.csv', header=None)

df.columns = ['feat-1', 'feat-2' , 'feat-3', 'feat-4', 'feat-5', 'feat-6','feat-7', 'feat-8' , 'target']

df

Dataset

df_X = df.drop('target',axis=1)

df_Y = df.target

from sklearn.model_selection import train_test_split


# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(df_X, df_Y, test_size=0.3,random_state=99)

Training Phase

# import the classifier model class

from sklearn.tree import DecisionTreeClassifier  


  
# create a classifier object - for CART Model


classifier = DecisionTreeClassifier(criterion='gini', max_depth=7, random_state=11)

# max_depth is the maximum number of levels in the tree

# train the model
classifier.fit(X_train, y_train)

# training accuracy
print('Train Accuracy : ', classifier.score(X_train, y_train))

Train Accuracy : 0.9385474860335196
#Here you can also check the relative importance of features used to train the Model

classifier.feature_importances_

array([0.01711065, 0.58418702, 0.01197044, 0.01900979, 0.        ,
       0.15963914, 0.07359983, 0.13448312])
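
The raw importances array is easier to read when each value is paired with its column name. The following is a minimal sketch, assuming synthetic data from make_classification as an offline stand-in for the diabetes CSV; with the objects from this post you would pass classifier.feature_importances_ and df_X.columns instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 8 numeric features, named like the post's columns
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=4, random_state=0)
cols = [f"feat-{i}" for i in range(1, 9)]
X_demo = pd.DataFrame(X_arr, columns=cols)

clf = DecisionTreeClassifier(criterion="gini", max_depth=7, random_state=11)
clf.fit(X_demo, y_demo)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, and a value of 0 means the feature was never used for a split.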

Plot Decision Trees

# imports needed for the tree plot

import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(55,40))

info = tree.plot_tree(classifier, 
                      filled=True, 
                      rounded=True,
                      precision=3,
                      fontsize=25
                    )
# filled=True colors each node to indicate its majority class (for classification)

So here, 
Orange color indicates Class 0
Blue color indicates Class 1
Decision Tree Graph
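
For deep trees the plotted figure can be hard to read, so a text dump of the rules is sometimes handy. The following is a minimal sketch using sklearn.tree.export_text, assuming synthetic stand-in data and a shallow tree (max_depth=2) purely to keep the printout short; with this post's model you would pass classifier and list(df_X.columns).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: synthetic stand-in data with 8 features named like the post's columns
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)
names = [f"feat-{i}" for i in range(1, 9)]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=11).fit(X_demo, y_demo)

# One line per node: split conditions on the branches, predicted class at the leaves
rules = export_text(small_tree, feature_names=names)
print(rules)
```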

Testing Phase

#prediction on test data

y_pred = classifier.predict(X_test)
y_pred



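# Gini impurity: lower is better (0 means the node is pure)

# Binary tree   ---- CART models (every split has exactly two branches)
# Multiway tree ---- CHAID models (a split can have more than two branches)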
predicted output

Model Evaluation

from sklearn import metrics
print("Test Accuracy:", metrics.accuracy_score(y_test, y_pred))

Test Accuracy: 0.6796536796536796

# We can also use a confusion matrix for further analysis
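
The confusion-matrix analysis mentioned above can be sketched as follows. The label lists here are small hand-made examples (an assumption for illustration); in this post's workflow you would pass y_test and y_pred instead.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and y_pred
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Per-class precision, recall and F1
print(classification_report(y_true_demo, y_pred_demo))
```

Unlike a single accuracy number, this shows which class the model confuses, which matters when the classes are imbalanced.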
CONCLUSION:

Train_Accuracy = .9385
Test_Accuracy  = .6796


With such a large gap between train and test accuracy, we can see that the DTree model here is overfitted.
We can try tree pruning and hyperparameter tuning to reduce the overfitting.
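
One possible way to try this is a cross-validated grid search over the early-stopping parameters plus ccp_alpha (scikit-learn's cost-complexity post-pruning parameter). This is a minimal sketch, not a tuned recipe: the synthetic data and the grid values are assumptions chosen only so the example runs offline and quickly. With this post's data you would fit on X_train, y_train instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: noisy synthetic stand-in for the diabetes data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=99)

# Illustrative grid: pre-pruning limits plus cost-complexity post-pruning
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 20],
    "min_samples_leaf": [1, 10],
    "min_impurity_decrease": [0.0, 0.01],
    "ccp_alpha": [0.0, 0.01],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=11),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_tree = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Train accuracy :", best_tree.score(X_tr, y_tr))
print("Test accuracy  :", best_tree.score(X_te, y_te))
```

A smaller gap between the two printed accuracies than with the unconstrained tree would suggest that the pruning settings are helping.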

Rahul Aggarwal
http://guardiancoder.in

Senior Data Scientist and Gen-AI Engineer #DataScience #AI #RNN #CNN #GenAI #ChatGPT #LLMs
