
In the realm of machine learning and data science, preparing your data is often as crucial as the modeling itself. One of the essential preprocessing steps when working with categorical data is one-hot encoding.

This technique transforms categorical variables into a numeric format that machine learning algorithms can work with, since most scikit-learn estimators expect numbers rather than text labels.

One-hot encoding converts categorical labels into a binary matrix in which each column represents one unique category. For every row, the column matching that row's category is set to 1 and all other columns are set to 0.
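To make the idea concrete before bringing in any libraries, here is a minimal pure-Python sketch of the same transformation. The list `sizes` is a made-up sample, not the article's dataset, and the sorted column order is a choice made for this illustration.

```python
# Minimal pure-Python sketch of one-hot encoding (illustrative only).
sizes = ["Small", "Medium", "Large", "Small"]  # hypothetical sample values

# One column per unique category, in sorted order.
categories = sorted(set(sizes))

# For each value, put 1 in the column of its category and 0 everywhere else.
encoded = [
    [1 if value == category else 0 for category in categories]
    for value in sizes
]

print(categories)
for value, row in zip(sizes, encoded):
    print(value, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.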

One Hot Encoding with Pandas and scikit-learn

Start by importing pandas and the OneHotEncoder class from scikit-learn.

				
					
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
				
			

The code below defines the example data to encode. It’s a dictionary that we will convert into a pandas DataFrame.

				
					
# Example data: one numeric column and two categorical columns
d = {
    'sales': [100000, 222000, 1000000, 522000, 111111, 222222, 1111111, 20000, 75000, 90000, 1000000, 10000],
    'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami', 'Jacksonville', 'Miami', 'Miami', 'Orlando', 'Orlando', 'Orlando', 'Orlando'],
    'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small'],
}

				
			

Create the DataFrame df with the following code, then take a look at the first 5 rows with df.head().

				
					
df = pd.DataFrame(data=d)
df.head()

				
			

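The goal in this article is to one-hot encode the values in the size column.

The next line of code creates and configures an instance of the OneHotEncoder class from the sklearn.preprocessing module in scikit-learn. Setting handle_unknown='ignore' encodes any category not seen during fitting as all zeros instead of raising an error, sparse_output=False returns a dense result rather than a sparse matrix, and set_output(transform="pandas") makes the encoder return a DataFrame. Both sparse_output and set_output require scikit-learn 1.2 or later.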

				
					
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform="pandas")

				
			

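After creating the instance, apply it to the ‘size’ column of the DataFrame (df). The fit_transform method learns the categories and encodes the column in a single call. The double brackets in df[['size']] select a one-column DataFrame, because the encoder expects two-dimensional input.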

				
ohetransform = ohe.fit_transform(df[['size']])
				
			

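If you print out the new ohetransform DataFrame, you will see one column per category: size_Large, size_Medium, and size_Small. Each row holds 1.0 in the column for its size and 0.0 in the others.

On its own that is not very helpful, so we still need to attach it back to the original DataFrame df.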

To create our final DataFrame, let’s concatenate the original DataFrame df and the new one we just created, ohetransform, side by side (axis=1). Since the size column is now one-hot encoded, we can drop the original size column from the result.

				
df = pd.concat([df, ohetransform], axis=1).drop(columns=["size"])
				
			

Finally, let’s print out the new DataFrame with print(df).
