Gradient boosting classifier
Gradient Boosting is an ensemble technique that builds a strong model by combining multiple weak decision trees. While it may seem similar to a Random Forest, there’s a key difference: in Random Forests, each tree is built independently, whereas in Gradient Boosting, trees are built sequentially, with each new tree correcting the errors of the previous ones.
The goal is to minimize the loss function at each stage, gradually improving the model’s performance. Gradient Boosting is versatile and can be used for both regression and classification tasks—but in this example, we focus specifically on classification.
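To make the sequential idea concrete, here is a minimal, simplified sketch of boosting with a squared-error loss, where each new tree is fit to the residuals (the errors) of the current model. This is an illustration of the mechanism only, not how GradientBoostingClassifier works internally (classification boosts log-odds with a different loss); the toy data and names below are made up for the example.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 5, size=(200, 1))
y_toy = np.sin(X_toy.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
trees = []

for _ in range(50):
    residuals = y_toy - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # each new tree corrects the previous errors
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y_toy - prediction) ** 2))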
We import pandas as pd, and we also import datasets from sklearn.
import pandas as pd
from sklearn import datasets
Next we load the wine dataset from scikit-learn.
With as_frame=True, the data and target are returned as pandas objects.
wine = datasets.load_wine(as_frame=True)
Next we assign the ‘data’ entry to X, which holds the feature matrix.
X = wine['data']
Here we assign the target labels from the wine dataset to the variable y.
y = wine['target']
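As an optional sanity check, we can inspect what we just loaded; the wine dataset has 178 samples, 13 numeric features, and three target classes.
print(X.shape)             # (178, 13): 178 samples, 13 chemical features
print(X.columns.tolist())  # feature names such as 'alcohol' and 'malic_acid'
print(y.value_counts())    # three classes: 0, 1, 2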
Next we import train_test_split to split our data into training and test sets.
from sklearn.model_selection import train_test_split
Here we split the data, holding out 20% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
Here, we import the cross_val_score function from scikit-learn, which is used to evaluate the performance of a model using cross-validation.
from sklearn.model_selection import cross_val_score
Next we import GradientBoostingClassifier from scikit-learn’s ensemble module and create an instance of it.
from sklearn.ensemble import GradientBoostingClassifier
gbr = GradientBoostingClassifier()
Next we train the model with the training data using the .fit() method.
gbr.fit(X_train, y_train)

We then evaluate the baseline model with 5-fold cross-validation on the training data, averaging the accuracy across folds.
cross_val_score(gbr, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1).mean()
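Cross-validation scores the model on held-out folds of the training data; as an additional optional check, we can also score the fitted model on the test set we set aside earlier.
test_accuracy = gbr.score(X_test, y_test)  # accuracy on the held-out test set
print(test_accuracy)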

Hyperparameters in GradientBoostingClassifier (a short configuration sketch follows the list):
learning_rate: Controls the contribution of each tree to the final model. Lower values reduce overfitting risk by slowing down the learning process, often requiring more trees to compensate.
criterion: The function used to evaluate and determine the best feature and threshold to split the data at each node.
max_depth: Specifies the maximum depth of each individual decision tree. Shallower trees help prevent overfitting but may underfit.
n_estimators: The total number of trees (iterations) used in the boosting process. More estimators usually improve performance but increase computation time.
init: An initial estimator used to make the first predictions before boosting begins. By default, this is based on the log-odds of the target class (converted to probabilities).
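For illustration, here is what setting a few of these hyperparameters looks like; the values below are arbitrary examples, not tuned recommendations, and the variable name is just for this sketch.
example_gbc = GradientBoostingClassifier(
    n_estimators=200,          # more trees, typically paired with a smaller learning_rate
    learning_rate=0.05,        # each tree contributes less, so learning is slower but steadier
    max_depth=3,               # shallow trees act as weak learners
    criterion='friedman_mse',  # default criterion for choosing splits
)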
Here, we define the range of hyperparameters for a grid search to find the best combination of values for our GradientBoostingClassifier.
param_grid = {
    'n_estimators': [10, 50, 100, 500],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0],
    'max_depth': [3, 7, 9],
}
Next we import GridSearchCV, which is used to systematically search through a specified set of hyperparameter combinations to find the best one for a machine learning model.
from sklearn.model_selection import GridSearchCV
gbr2 = GridSearchCV(gbr, param_grid, cv=3, n_jobs=-1)
Next we fit the grid search to the training data.
gbr2.fit(X_train, y_train)

The best hyperparameter combination found by the grid search:
gbr2.best_params_

And the corresponding mean cross-validated accuracy:
gbr2.best_score_
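Since GridSearchCV refits the best estimator on the full training set by default, we can also check how the tuned model performs on the held-out test set.
best_model = gbr2.best_estimator_          # the refit best model
print(best_model.score(X_test, y_test))    # accuracy on the test set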

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.