Decision Tree

A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks.

It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.


Note: Parametric supervised learning refers to a type of machine learning where the model assumes a specific functional form and estimates a fixed number of parameters from the training data.
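
As a quick aside to make the note concrete, here is a minimal sketch (the toy data and variable names are made up purely for illustration): a parametric model such as LinearRegression estimates a fixed number of coefficients, while a non-parametric model such as a decision tree lets its structure grow with the data.

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.tree import DecisionTreeRegressor

  # toy data, for illustration only
  X_demo = np.arange(10).reshape(-1, 1)
  y_demo = X_demo.ravel() * 2.0

  lin = LinearRegression().fit(X_demo, y_demo)        # parametric: one slope + one intercept
  tree = DecisionTreeRegressor().fit(X_demo, y_demo)  # non-parametric: structure depends on the data

  print(lin.coef_, lin.intercept_)  # the fixed set of learned parameters
  print(tree.get_n_leaves())        # tree size is determined by the training data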

  import pandas as pd

We import pandas as pd, then we read the CSV file called '500hits.csv'.

  df = pd.read_csv('500hits.csv', encoding='latin-1')
  df.head()

Next we drop the columns 'PLAYER' and 'CS'.

  df = df.drop(columns = ['PLAYER', 'CS'])

Here we select all rows and columns 0 through 12 (column 13 is excluded) and assign them to X, the feature matrix.

  X = df.iloc[:, 0:13]

Here we select all rows of column index 13 (the target) and assign it to y.

  y = df.iloc[:,13]

Next we import train_test_split from sklearn.model_selection

  from sklearn.model_selection import train_test_split

Then we split the data into training and test sets, holding out 20% of the rows for testing.

  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17, test_size=0.2)

Next we check the shape (dimensions) of the data; .shape returns (rows, columns).

  X_train.shape
  X_test.shape
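
Note that when both lines are in the same notebook cell, only the last expression is displayed, so you may want to print both shapes explicitly; a small sketch:

  print(X_train.shape)  # 80% of the rows, all 13 feature columns
  print(X_test.shape)   # the remaining 20% of the rows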

Next we import DecisionTreeClassifier, create an instance assigned to the variable "dtc", and inspect its default parameters with .get_params().

  from sklearn.tree import DecisionTreeClassifier
  dtc = DecisionTreeClassifier()
  dtc.get_params()

Next we train the model on the training data using the .fit() method.

  dtc.fit(X_train, y_train)
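
Once the model is fitted, one way to see the hierarchical structure described earlier (root node, internal nodes, leaf nodes) is scikit-learn's plot_tree. A minimal sketch, assuming matplotlib is installed; max_depth here just limits the drawing for readability:

  import matplotlib.pyplot as plt
  from sklearn.tree import plot_tree

  plt.figure(figsize=(20, 10))
  plot_tree(dtc, feature_names=list(X.columns), filled=True, max_depth=2)
  plt.show()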

After training, we predict on the test data to evaluate the model's performance.

  y_pred = dtc.predict(X_test)

Here, we import the confusion_matrix and classification_report functions from scikit-learn.

They are used to evaluate classification models by comparing the predicted labels with the actual labels.

  from sklearn.metrics import confusion_matrix
  print(confusion_matrix(y_test, y_pred))
  from sklearn.metrics import classification_report
  print(classification_report(y_test, y_pred))
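
Assuming the target here is a binary 0/1 label, the confusion matrix rows are the actual classes and the columns are the predicted ones, i.e. [[TN, FP], [FN, TP]]. If you also want a single summary number, accuracy_score gives the fraction of correct predictions; a small sketch:

  from sklearn.metrics import accuracy_score

  print(accuracy_score(y_test, y_pred))  # fraction of test samples predicted correctly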

Here we get an array of numbers showing how important each feature was in building the decision tree "dtc".

  dtc.feature_importances_
  features = pd.DataFrame(dtc.feature_importances_, index = X.columns)
  features.head(15)
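
The raw array is easier to read once the single column gets a name and the rows are sorted, most important feature first; a small sketch (the 'importance' column name is just a label chosen here):

  features.columns = ['importance']
  features.sort_values('importance', ascending=False)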

Higher values of ccp_alpha lead to more pruning in the decision tree.
This is part of cost-complexity pruning, which removes less important branches to help prevent overfitting.


ccp_alpha (Cost Complexity Pruning Alpha) is a parameter that controls how much the tree should be simplified.

  dtc2 = DecisionTreeClassifier(criterion='entropy', ccp_alpha= 0.04)

Next, after adding ccp_alpha, we train, predict, and evaluate again to see if there's an improvement.

  dtc2.fit(X_train, y_train)
  y_pred2 = dtc2.predict(X_test)
  print(confusion_matrix(y_test, y_pred2))
  print(classification_report(y_test, y_pred2))
  features2 = pd.DataFrame(dtc2.feature_importances_, index = X.columns)
  features2.head(15)
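
The value 0.04 above is just one choice. If you want to see which alpha values are worth trying for this training set, scikit-learn's cost_complexity_pruning_path lists the candidate alphas; a minimal sketch that scores each one on the test set (a quick check only, cross-validation would be the more rigorous way to pick ccp_alpha):

  base = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train)
  path = base.cost_complexity_pruning_path(X_train, y_train)

  for a in path.ccp_alphas:
      clf = DecisionTreeClassifier(criterion='entropy', ccp_alpha=a).fit(X_train, y_train)
      print(round(a, 4), clf.score(X_test, y_test))  # alpha and test accuracy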

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
