Extra Trees Classifier

The Extra Trees Classifier is an ensemble machine learning method that combines predictions from many individual decision trees.

It aggregates the results from a group of decision trees (like a random forest). The key differences:
1. ETC selects split values for features at random, unlike a decision tree classifier (DTC), which searches for the best split.
2. This makes ETC more random and faster, which can help with noisy data (see the sketch just below).
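
To make the contrast concrete, here is a minimal illustrative sketch (not part of the walkthrough below; the dataset parameters are arbitrary) that cross-validates both ensembles side by side:

				
					from sklearn.datasets import make_classification
					from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
					from sklearn.model_selection import cross_val_score

					# Synthetic two-class dataset; sizes chosen only for illustration
					X, y = make_classification(n_samples=1000, n_features=11, random_state=21)

					for Model in (RandomForestClassifier, ExtraTreesClassifier):
					    acc = cross_val_score(Model(random_state=0), X, y, cv=5).mean()
					    print(Model.__name__, round(acc, 3))
				
			

Note that the two ensembles also differ in sampling: RandomForestClassifier bootstraps training samples by default (bootstrap=True), while ExtraTreesClassifier grows each tree on the whole training set (bootstrap=False).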

Now let's walk through the example step by step. First, we import make_classification from the sklearn.datasets module.

make_classification is a utility function in scikit-learn used to generate synthetic datasets for classification tasks.

				
					from sklearn.datasets import make_classification
				
			

Here, we generate a synthetic classification dataset using the make_classification function and assign the result to X and y.

X is the NumPy array representing the features (input data).

y is the NumPy array representing the target labels, i.e., the classes our model will try to predict.

n_features=11 specifies that X will have 11 columns.

				
					X, y = make_classification(n_features=11, random_state=21)
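					# Quick check (assuming the defaults: n_samples=100, two classes)
					print(X.shape)  # (100, 11) -> 100 rows, 11 feature columns
					print(y.shape)  # (100,)    -> one label per row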
				
			

Here, we import the train_test_split function from the sklearn.model_selection module in scikit-learn.

It is used to split datasets into training and testing subsets.

				
					from sklearn.model_selection import train_test_split
				
			
				
					X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)
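					# With test_size=0.2 and the 100 samples generated above,
					# this is an 80/20 split
					print(X_train.shape, X_test.shape)  # (80, 11) (20, 11)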
				
			

Here, we import the ExtraTreesClassifier class from the sklearn.ensemble module in scikit-learn.

				
					from sklearn.ensemble import ExtraTreesClassifier
				
			

Here, we create an instance of the ExtraTreesClassifier model and assign it to the variable ETC.

				
					ETC = ExtraTreesClassifier(random_state=0)
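					# random_state=0 makes the randomized split thresholds reproducible;
					# all other settings are scikit-learn defaults (e.g. 100 trees in
					# recent versions)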
				
			

This is the training step for our model (ExtraTreesClassifier).

We use the .fit() method to train the model.

				
					ETC.fit(X_train, y_train)
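					# Extra sanity check (not in the original steps): mean accuracy
					# on the held-out test set
					print(ETC.score(X_test, y_test))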
				
			
				
					from sklearn.model_selection import cross_val_score
				
			

Here, we perform cross-validation on our ExtraTreesClassifier model and calculate the average accuracy across the folds.

  • cv=5: This sets the number of folds for cross-validation to 5. The data will be split into 5 equal parts. The model will be trained 5 times, each time using 4 parts for training and 1 part for testing.

  • n_jobs=-1: This parameter tells scikit-learn to use all available CPU cores for parallel processing. This can significantly speed up the cross-validation process, especially for large datasets or computationally intensive models.

				
					cross_val_score(ETC, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1).mean()
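					# Illustrative extra step: keep the per-fold scores to see the
					# spread before averaging
					scores = cross_val_score(ETC, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1)
					print(scores, scores.mean())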
				
			

Here, we perform some hyperparameter tuning.

We define a dictionary named param_grid.

This dictionary specifies the hyperparameters and the candidate values we want to test during hyperparameter tuning.

				
					param_grid = {
					    'n_estimators': [100, 300, 500],
					    'min_samples_leaf': [5, 10, 25],
					    'max_features': [2, 3, 4, 6]
					}
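					# n_estimators: number of trees in the ensemble
					# min_samples_leaf: minimum samples required at a leaf node
					# max_features: number of features considered at each split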
				
			

Here, we import the GridSearchCV class from the sklearn.model_selection module.

GridSearchCV is a powerful tool used for hyperparameter tuning. It systematically searches through a predefined set of hyperparameter values in our param_grid.

				
					from sklearn.model_selection import GridSearchCV
				
			

This line initializes a GridSearchCV object, preparing it to perform a systematic search for the best hyperparameters for our ExtraTreesClassifier.

				
					ETC2 = GridSearchCV(ETC, param_grid, cv=3, n_jobs=-1)
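					# This grid has 3 * 3 * 4 = 36 parameter combinations; with cv=3,
					# GridSearchCV performs 108 fits, plus one final refit on the full
					# training set (refit=True is the default)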
				
			

Then we train using the .fit() method.

The purpose of training a machine learning model is to teach it to recognize patterns and relationships within a dataset so it can make accurate predictions or decisions on new, unseen data.

				
					ETC2.fit(X_train, y_train)
				
			

Next, we check the best parameters, i.e., the exact combination of hyperparameters that resulted in the highest cross-validated accuracy.

We also check the best score.

				
					ETC2.best_params_
				
			
				
					ETC2.best_score_
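					# Extra step (not in the original walkthrough): since refit=True by
					# default, ETC2 holds the refitted best model, so we can score it
					# on the held-out test set
					ETC2.score(X_test, y_test)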
				
			

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
