Optuna Hyperparameter Tuning

June 27, 2024 Ryan Nolan

Optuna is a hyperparameter optimization framework for machine learning models. It can help automate and streamline the process of tuning hyperparameters.

It’s quite popular among Kaggle users and you’ll see it used within competitions.

In this article, we will go over an example of using it on a basic dataset. There is also a YouTube video if you would rather watch than read.

Before we start, we are going to have to import a few things.

  import seaborn as sns
  import pandas as pd
  import matplotlib.pyplot as plt
  import optuna
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Now it’s time to create a dataframe. We are going to load the healthexp dataset from seaborn.

  healthexp = sns.load_dataset('healthexp')
  healthexp.head(100)

Next, we need to convert our categorical data (Country) into a format that can be used with machine learning models. We can do this with get_dummies in pandas.

  healthexp = pd.get_dummies(healthexp)
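To see what get_dummies does to a text column, here is a minimal sketch on a made-up three-row frame. The values are illustrative only and are not taken from the healthexp dataset.

```python
import pandas as pd

# Toy frame with made-up values: one text column, one numeric column
toy = pd.DataFrame({
    "Country": ["Canada", "Japan", "Canada"],
    "Spending_USD": [100.0, 150.0, 120.0],
})

# The text column is replaced by one indicator column per category;
# the numeric column passes through unchanged
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

Each country becomes its own column, which is why the healthexp frame gets wider after this step.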

Now it’s time to split our data into X and y. For X, drop the target column, Life_Expectancy. For y, keep only the Life_Expectancy column.

  X = healthexp.drop(['Life_Expectancy'], axis=1)

  y = healthexp['Life_Expectancy']

Now set up train test split with a test_size of 0.2. Feel free to choose a random_state of your liking.

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)
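As a quick check of what test_size=0.2 and random_state do, here is a small sketch on ten made-up rows. The toy arrays and variable names are assumptions for illustration and are separate from the healthexp data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, plus a made-up target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows for testing
a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=19
)
print(len(a_train), len(a_test))

# Re-running with the same random_state reproduces the same split
_, a_test_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=19
)
print((a_test == a_test_again).all())
```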

For this article, we are going to use a basic Random Forest Regressor.

  rfr = RandomForestRegressor(random_state=13)

  rfr.fit(X_train, y_train)

  y_pred = rfr.predict(X_test)

Now that we have fit the model and made predictions, let’s take a look at a few metrics: mean absolute error, mean squared error, and the R2 score.

  mean_absolute_error(y_test, y_pred)

MAE: 0.25916363636361917

  mean_squared_error(y_test, y_pred)

MSE: 0.10221141818181628

  r2_score(y_test, y_pred)

R2: 0.9910457602615238
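If you want to see what these three metrics measure, they can be computed by hand. The sketch below uses four made-up targets and predictions, not the model output above, and applies the standard definitions in plain Python.

```python
# Made-up targets and predictions, purely for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 8.0, 9.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# Mean absolute error: average size of the errors
mae = sum(abs(e) for e in errors) / n

# Mean squared error: average of the squared errors
mse = sum(e ** 2 for e in errors) / n

# R2: 1 minus (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

Lower is better for MAE and MSE, while R2 is better the closer it gets to 1.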

With these benchmarks in hand, let’s take a look at using Optuna.

Optuna Example

The first thing we need to do is define an objective function. Inside the function create the hyperparameters you want to test.

  def objective(trial):
      # Search space for the random forest's hyperparameters
      n_estimators = trial.suggest_int('n_estimators', 100, 1000)
      max_depth = trial.suggest_int('max_depth', 10, 50)
      min_samples_split = trial.suggest_int('min_samples_split', 2, 32)
      min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 32)

      model = RandomForestRegressor(n_estimators=n_estimators,
                                    max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    min_samples_leaf=min_samples_leaf)

      # Mean 5-fold score; scikit-learn negates MSE so that higher is better
      score = cross_val_score(model, X, y, n_jobs=-1, cv=5, scoring='neg_mean_squared_error').mean()

      # The objective must return the value that Optuna optimizes
      return score
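The 'neg_mean_squared_error' scorer is worth a closer look, because it explains why the study maximizes rather than minimizes. Scikit-learn scorers treat higher as better, so MSE is reported with its sign flipped. The sketch below shows this on synthetic data. The dataset and the small forest are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (not the healthexp dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# One score per fold; every score is a negated MSE, so each is <= 0
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
print(scores)

# Flip the sign to recover the usual (positive) mean squared error
mean_mse = -scores.mean()
print(mean_mse)
```

A score closer to zero means a smaller error, which is why maximizing the score is the same as minimizing MSE.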

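Then we create a study. The objective returns negative MSE, where higher is better, so the direction is 'maximize'. I’m going to use seed=42 for the sampler.

  # RandomSampler performs random search; Optuna's default sampler is TPESampler
  study = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=42))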
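Random search itself is a simple idea: on every trial, draw each hyperparameter independently from its range, score the combination, and remember the best one. The sketch below illustrates that loop in plain Python. It is not Optuna's implementation, and toy_objective and random_search are hypothetical names with a made-up scoring function standing in for cross-validation.

```python
import random

def toy_objective(params):
    # Hypothetical stand-in for a cross-validated score (higher is better);
    # it peaks at n_estimators=500 and max_depth=30
    return (-((params["n_estimators"] - 500) / 100) ** 2
            - ((params["max_depth"] - 30) / 10) ** 2)

def random_search(objective, n_trials, seed):
    rng = random.Random(seed)  # seeded so the search is reproducible
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # Draw each hyperparameter uniformly from its range (both ends inclusive)
        params = {
            "n_estimators": rng.randint(100, 1000),
            "max_depth": rng.randint(10, 50),
        }
        value = objective(params)
        if value > best_value:  # keep the highest score seen so far
            best_params, best_value = params, value
    return best_params, best_value

best_params_demo, best_value_demo = random_search(toy_objective, n_trials=100, seed=42)
print(best_params_demo, best_value_demo)
```

Because each draw ignores earlier results, random search never uses what previous trials learned. That is the main difference from samplers that model past trials.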

Run the optimization by passing the objective function and the number of trials to study.optimize.

  study.optimize(objective, n_trials=100)

Next, grab the best hyperparameters and the best value.

  best_params = study.best_params
  print(f"Best Hyperparameters: {best_params}")

Best Hyperparameters: {'n_estimators': 358, 'max_depth': 34, 'min_samples_split': 2, 'min_samples_leaf': 2}

  best_score = study.best_value
  print(f"Best Score (negative MSE): {best_score:.3f}")

Best Score (negative MSE): -1.860
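Keep in mind that the study's best value is the mean negative MSE from cross-validation, not an accuracy. To read it as an error, flip the sign and, if you like, take the square root to get an RMSE in the same units as the target. A short sketch using the value reported above:

```python
import math

best_score = -1.860          # best value reported by the study (mean negative MSE)
cv_mse = -best_score         # flip the sign to get the mean squared error
cv_rmse = math.sqrt(cv_mse)  # square root gives an error in the target's units

print(f"CV MSE: {cv_mse:.3f}, CV RMSE: {cv_rmse:.3f}")
```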

Visualizing the Optuna Results

  optuna.visualization.plot_optimization_history(study)

  optuna.visualization.plot_parallel_coordinate(study)

  optuna.visualization.plot_slice(study, params=['n_estimators', 'max_depth', 'min_samples_leaf', 'min_samples_split'])

  optuna.visualization.plot_param_importances(study)
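The optimization history plot shows each trial's objective value together with the best value found so far. For a 'maximize' study, that best-so-far line is just a running maximum. The sketch below uses made-up trial values rather than values from the study above.

```python
from itertools import accumulate

# Made-up trial values for a 'maximize' study (negative MSE, so closer to 0 is better)
trial_values = [-3.2, -2.5, -2.9, -1.9, -2.1, -1.86]

# Best value seen up to and including each trial
best_so_far = list(accumulate(trial_values, max))
print(best_so_far)
```

The line only moves when a trial beats everything before it, so long flat stretches mean the search has stopped finding improvements.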

Let’s grab each hyperparameter from the dictionary.

  best_n_estimators = best_params['n_estimators']
  best_max_depth = best_params['max_depth']
  best_min_samples_split = best_params['min_samples_split']
  best_min_samples_leaf = best_params['min_samples_leaf']

Then let’s rebuild our random forest regressor with the parameters from Optuna.

  best_model = RandomForestRegressor(n_estimators=best_n_estimators,
                                     max_depth=best_max_depth,
                                     min_samples_split=best_min_samples_split,
                                     min_samples_leaf=best_min_samples_leaf)
  best_model.fit(X_train, y_train)

  y_pred = best_model.predict(X_test)
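Because the keys in best_params match the constructor's argument names, the dictionary can also be unpacked directly with **. The sketch below uses the best hyperparameters reported above. The name compact_model and the random_state=13 argument are assumptions added here for illustration and reproducibility.

```python
from sklearn.ensemble import RandomForestRegressor

# Best hyperparameters as reported by the study in this article
best_params = {
    "n_estimators": 358,
    "max_depth": 34,
    "min_samples_split": 2,
    "min_samples_leaf": 2,
}

# ** unpacks the dict into keyword arguments;
# random_state=13 is an added assumption so repeated fits give the same forest
compact_model = RandomForestRegressor(**best_params, random_state=13)
print(compact_model.get_params()["n_estimators"])
```

This avoids the intermediate variables, at the cost of being a little less explicit about which parameters are being set.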

Let’s compare the results to what we had earlier.

  mean_absolute_error(y_test, y_pred)

MAE: 0.3089099265527784

  mean_squared_error(y_test, y_pred)

MSE: 0.13917245887029073

  r2_score(y_test, y_pred)

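R2: 0.9878077852368601

On this particular train/test split, the tuned model scores slightly worse than the untuned baseline (MAE 0.309 vs. 0.259, R2 0.988 vs. 0.991), so it is worth checking tuned hyperparameters against a baseline rather than assuming they help.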

Ryan Nolan

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
