scikit-learn

Simple Imputer

When working with data in Python, especially using pandas, handling missing values is a crucial step in data cleaning. Missing values can occur in both categorical and numeric columns. There are several common strategies to address them: you can choose to ignore them (though this is rarely recommended), remove the rows that contain them using dropna(), or replace them with meaningful values. For numeric data, it’s common to use the column mean or median as a replacement, while for categorical data, you might use the most frequent value (mode) or a placeholder like “Unknown”. Selecting the right approach depends on the nature of the data and the goals of your analysis.
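
The pandas-only options described above can be sketched as follows. This is a minimal illustration on a made-up DataFrame; the column names `distance` and `surface` are assumptions for the example and do not come from the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the tutorial): one numeric and one categorical column
df = pd.DataFrame({
    "distance": [5.0, np.nan, 10.0, 3.0],
    "surface": ["road", "trail", None, "road"],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: replace missing values with a summary of each column
filled = df.copy()
filled["distance"] = filled["distance"].fillna(filled["distance"].median())
filled["surface"] = filled["surface"].fillna(filled["surface"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is the simplest choice but discards the other values in those rows, which is why filling is often preferred on small datasets.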

Let’s start by importing pandas as pd and NumPy as np.

				
					import pandas as pd

				
					import numpy as np

Here we have a pandas DataFrame with a column named farthest_run_mi that contains some running distances, including a missing value, np.nan.

				
					miles = pd.DataFrame({'farthest_run_mi' :[50,62,np.nan,100,26,13,31,50]})

				
					miles

Next we check the number of missing values in each column.

				
					miles.isna().sum()

Here we import the SimpleImputer as a convenient tool to handle missing data.

				
					from sklearn.impute import SimpleImputer

Next we set the strategy to 'mean', which replaces missing numeric values with the mean of each column.

				
					imp_mean = SimpleImputer(strategy='mean')

				
					imp_mean.fit_transform(miles)
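
To see what the imputer actually learned, one option is to inspect its `statistics_` attribute after fitting. The sketch below rebuilds the same `miles` DataFrame so it runs on its own; note that `fit_transform` returns a NumPy array by default rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

imp_mean = SimpleImputer(strategy='mean')
filled = imp_mean.fit_transform(miles)  # NumPy array of shape (8, 1)

# The mean is computed from the 7 non-missing values only
print(imp_mean.statistics_)

# Row 2 held the np.nan; it now holds the learned mean
print(filled[2, 0])
```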

Here we set the strategy to 'median', which tells the imputer to fill in missing values with the median of the column.

				
					imp_median = SimpleImputer(strategy='median')

				
					imp_median.fit_transform(miles)

Here, we replace the missing value with the most frequent value in the column.

				
					imp_most_frequent = SimpleImputer(strategy='most_frequent')

				
					imp_most_frequent.fit_transform(miles)

Here we are instructing the imputer to fill all missing values with a constant value, in this case 13.

				
					imp_constant = SimpleImputer(strategy='constant', fill_value = 13)

				
					imp_constant.fit_transform(miles)
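
The four strategies shown so far can be compared side by side. The following sketch, which assumes the same `miles` data, fits one imputer per strategy and records the value that ends up in the row that was originally missing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'constant': SimpleImputer(strategy='constant', fill_value=13),
}

# Row 2 is the missing entry, so read back whatever each imputer put there
results = {name: imp.fit_transform(miles)[2, 0] for name, imp in imputers.items()}

for name, value in results.items():
    print(f"{name}: {value}")
```

On this column the median and the most frequent value are both 50, while the mean is pulled slightly lower by the short runs.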

Here we create another pandas DataFrame.

				
					names = pd.DataFrame({'names' :['ryan', 'nolan', 'honus', 'wagner', np.nan, 'ruth']})

				
					names

Next we tell the imputer to fill the missing value with ‘missing_name’.

				
					imp_constant_cat = SimpleImputer(strategy='constant', fill_value = 'missing_name')

				
					imp_constant_cat.fit_transform(names)
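
Because the imputer returns a NumPy array, it can be handy to wrap the result back into a DataFrame. Here is a small self-contained sketch of that step; the explicit `dtype=object` is an assumption added here so the column behaves the same way across pandas versions, and it is not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# dtype=object is an assumption for portability, not in the original tutorial
names = pd.DataFrame(
    {'names': ['ryan', 'nolan', 'honus', 'wagner', np.nan, 'ruth']},
    dtype=object,
)

imp_constant_cat = SimpleImputer(strategy='constant', fill_value='missing_name')
filled = imp_constant_cat.fit_transform(names)

# Restore the column label that the NumPy array dropped
names_filled = pd.DataFrame(filled, columns=names.columns)
print(names_filled)
```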

Next we create an imputer that not only fills in missing values using the mean, but also adds an extra indicator column marking the rows where values were originally missing.

				
					imp_add_indicator = SimpleImputer(strategy='mean', add_indicator = True)

Please note that here we are using the miles data again.

				
					imp_add_indicator.fit_transform(miles)
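
The shape of the result is worth a look. In this sketch, again on the same `miles` data, the output has two columns: the imputed values and a 0/1 flag for rows that were missing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

imp_add_indicator = SimpleImputer(strategy='mean', add_indicator=True)
out = imp_add_indicator.fit_transform(miles)

imputed_values = out[:, 0]  # original column with the gap filled by the mean
was_missing = out[:, 1]     # 1.0 where the value was np.nan, else 0.0

print(out.shape)
print(was_missing)
```

Keeping the indicator lets a downstream model learn from the fact that a value was missing, which a plain mean fill would hide.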

Here’s a more advanced concept: applying different imputers to different columns of the same dataset.

First we read the dataset.

				
					df = pd.read_csv('simple_imputer_csv.csv')

We import make_column_transformer, which lets us apply different preprocessing steps to specific columns in the dataset.

				
					from sklearn.compose import make_column_transformer

				
					ct = make_column_transformer(
    (imp_constant_cat, ['Name']),
    (imp_mean, ['farthest_run_mi']),
    remainder='drop')

				
					ct.set_output(transform="pandas")

				
					df_pandas = ct.fit_transform(df)

Ryan Nolan

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
