Early Stopping Criteria (Pre-Pruning to Avoid Overfitting)
Max Depth
Min Split
Min Bucket
Minimum Impurity Decrease
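In scikit-learn, these four criteria correspond roughly to the max_depth, min_samples_split, min_samples_leaf and min_impurity_decrease arguments of DecisionTreeClassifier. Below is a minimal sketch of how they might be set. The synthetic data from make_classification and the specific threshold values are assumptions chosen only for illustration; they are not the diabetes data or the settings used later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                  # Max Depth: at most 4 levels of splits
    min_samples_split=20,         # Min Split: a node needs >= 20 samples to be split
    min_samples_leaf=10,          # Min Bucket: every leaf keeps >= 10 samples
    min_impurity_decrease=0.001,  # split only if impurity drops by at least this much
    random_state=0,
)
pre_pruned.fit(X, y)

# Compare against a tree grown without any stopping criteria
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
print("pre-pruned depth / leaves:", pre_pruned.get_depth(), pre_pruned.get_n_leaves())
print("unpruned depth / leaves:", unpruned.get_depth(), unpruned.get_n_leaves())
```

Each criterion stops growth early, so the pre-pruned tree can never exceed 4 levels or 16 leaves here, whatever the data looks like.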
Python Program
EDA + Feature Engineering Phase
#read the data
import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/rktrojan/DataSciencePython/main/DataFiles/diabetes.csv', header=None)
df.columns = ['feat-1', 'feat-2' , 'feat-3', 'feat-4', 'feat-5', 'feat-6','feat-7', 'feat-8' , 'target']
df
Dataset
df_X = df.drop('target',axis=1)
df_Y = df.target
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df_X, df_Y, test_size=0.3,random_state=99)
Training Phase
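# import the classifier model class
from sklearn.tree import DecisionTreeClassifier
# create a classifier object - for CART Model
# max_depth is the maximum number of levels in the tree
classifier = DecisionTreeClassifier(criterion='gini', max_depth=7, random_state=11)
# fit the classifier on the training data (needed before feature_importances_ or predict can be used)
classifier.fit(X_train, y_train)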
# Here you can also check the relative importance of the features used to train the model
classifier.feature_importances_
array([0.01711065, 0.58418702, 0.01197044, 0.01900979, 0.        ,
       0.15963914, 0.07359983, 0.13448312])
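The raw array is easier to read when each value is paired with its column name. The sketch below shows one way to do that with a pandas Series. It uses synthetic data from make_classification as a stand-in (an assumption, so the numbers will differ from the array above), and the names clf and importances are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same column naming scheme (assumption)
cols = [f"feat-{i}" for i in range(1, 9)]
X_arr, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = pd.DataFrame(X_arr, columns=cols)

clf = DecisionTreeClassifier(criterion="gini", max_depth=7, random_state=11).fit(X, y)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, and a value of 0 means the feature was never used for a split.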
Plot Decision Trees
import pandas as pd
import numpy as np
# import this for the tree plot
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(55,40))
info = tree.plot_tree(classifier,
filled=True,
rounded=True,
precision=3,
fontsize=25
)
# filled=True means: color the nodes to indicate the majority class (for classification).
# So here, orange indicates Class 0 and blue indicates Class 1.
Decision Tree Graph
Testing Phase
#prediction on test data
y_pred = classifier.predict(X_test)
y_pred
# Gini impurity: lower is better (a pure node has impurity 0)
# Binary trees: CART models
# Multiway trees: CHAID models
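To make the Gini impurity note concrete: for a node, it is 1 minus the sum of the squared class proportions. The helper below is a small illustrative sketch (gini_impurity is a hypothetical name, not a scikit-learn function).

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([0, 0, 0, 0]))  # pure node -> 0.0
print(gini_impurity([0, 0, 1, 1]))  # evenly mixed binary node -> 0.5
```

For two classes the impurity ranges from 0 (pure) to 0.5 (evenly mixed), which is why lower values are preferred when choosing a split.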
predicted output
Model Evaluation
from sklearn import metrics
print("Test Accuracy:", metrics.accuracy_score(y_test, y_pred))
Test Accuracy: 0.6796536796536796
# We can also use a confusion matrix for further analysis
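As a sketch of that further analysis, sklearn.metrics.confusion_matrix breaks the predictions down into true negatives, false positives, false negatives and true positives. The label lists below are hypothetical and exist only to illustrate the call; they are not the predictions from this post.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (not this post's y_test / y_pred)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 3 1 1 3 for these labels
```

With the real data, the same call would be confusion_matrix(y_test, y_pred).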
CONCLUSION:
Train_Accuracy = .9385
Test_Accuracy = .6796
With such a large gap between train accuracy (about 0.94) and test accuracy (about 0.68), we can see that the decision tree model here is overfitted.
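The train accuracy quoted above is obtained the same way as the test accuracy, by scoring predictions on the training split. A minimal sketch of that comparison follows. It uses noisy synthetic data as a stand-in (an assumption), so the printed numbers will not match the figures in this post.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (assumption: not the diabetes dataset)
X, y = make_classification(n_samples=768, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=99)

clf = DecisionTreeClassifier(criterion="gini", max_depth=7, random_state=11).fit(X_tr, y_tr)

# Score the same fitted model on both splits
train_acc = metrics.accuracy_score(y_tr, clf.predict(X_tr))
test_acc = metrics.accuracy_score(y_te, clf.predict(X_te))
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("Gap:", train_acc - test_acc)
```

A train score far above the test score is the usual sign that the tree has memorized noise in the training data.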
We can try tree pruning and hyperparameter tuning here to reduce the overfitting.
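One possible way to combine both ideas is a cross-validated grid search over the pre-pruning parameters together with ccp_alpha, scikit-learn's cost-complexity (post-)pruning parameter. The sketch below uses synthetic stand-in data, and the grid values are assumptions picked for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=99)

# Illustrative grid: pre-pruning limits plus cost-complexity pruning strength
param_grid = {
    "max_depth": [2, 3, 4, 5, 7],
    "min_samples_leaf": [1, 5, 10, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=11),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training split only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

The test split is touched only once at the end, so its accuracy remains an honest estimate of how the tuned tree generalizes.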