Simple Imputer
When working with data in Python, especially using pandas, handling missing values is a crucial step in data cleaning. Missing values can occur in both categorical and numeric columns. There are several common strategies to address them: you can choose to ignore them (though this is rarely recommended), remove the rows that contain them using dropna(), or replace them with meaningful values. For numeric data, it’s common to use the column mean or median as a replacement, while for categorical data, you might use the most frequent value (mode) or a placeholder like “Unknown”. Selecting the right approach depends on the nature of the data and the goals of your analysis.
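As a rough illustration of these strategies in plain pandas, here is a minimal sketch. The column names and values are made up for this example and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "distance": [5.0, np.nan, 10.0, 3.0],
    "surface": ["road", "trail", None, "road"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill the numeric gap with the column median
# and the categorical gap with the most frequent value (mode).
filled = df.copy()
filled["distance"] = filled["distance"].fillna(filled["distance"].median())
filled["surface"] = filled["surface"].fillna(filled["surface"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is the simplest option, but it throws away half of this tiny table, which is why filling is often preferred when data is scarce.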
Let’s start by importing pandas as pd and NumPy as np.
import pandas as pd
import numpy as np
Here we have a pandas DataFrame with a column named farthest_run_mi that contains some running distances, including a missing value, np.nan.
miles = pd.DataFrame({'farthest_run_mi' :[50,62,np.nan,100,26,13,31,50]})
miles

Next we check the number of missing values in each column.
miles.isna().sum()

Here we import SimpleImputer from scikit-learn, a convenient tool for handling missing data.
from sklearn.impute import SimpleImputer
Next we set the strategy to ‘mean’, which replaces missing numeric values with the mean of their column.
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit_transform(miles)

Here we set the strategy to ‘median’, instructing the imputer to fill in missing values with the median of the column.
imp_median = SimpleImputer(strategy='median')
imp_median.fit_transform(miles)

Here we replace the missing value with the most frequent value in the column.
imp_most_frequent = SimpleImputer(strategy='most_frequent')
imp_most_frequent.fit_transform(miles)

Here we are instructing the imputer to fill all missing values with a constant value, in this case 13.
imp_constant = SimpleImputer(strategy='constant', fill_value = 13)
imp_constant.fit_transform(miles)
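To see which value each strategy actually uses, one option is to fit the imputer and inspect its statistics_ attribute, which holds one learned fill value per column. The sketch below rebuilds the same miles DataFrame so it can run on its own; the fill_values dictionary is just a helper name for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

# Collect the value each strategy would use to fill the gap.
fill_values = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy).fit(miles)
    # statistics_ has one entry per column; this DataFrame has a single column.
    fill_values[strategy] = float(imputer.statistics_[0])

print(fill_values)
```

For this data the median and the most frequent value both happen to be 50, while the mean is roughly 47.43, so the choice of strategy only changes the result when the column is skewed or has repeated values elsewhere.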

Here we create another pandas DataFrame, this time with a categorical column of names that includes a missing value.
names = pd.DataFrame({'names' :['ryan', 'nolan', 'honus', 'wagner', np.nan, 'ruth']})
names

Next we tell the imputer to fill the missing value with the constant string ‘missing_name’.
imp_constant_cat = SimpleImputer(strategy='constant', fill_value = 'missing_name')
imp_constant_cat.fit_transform(names)

This creates an imputer that not only fills in missing values using the mean, but also adds an extra column indicating where the missing values originally occurred. Note that here we are using the miles data again.
imp_add_indicator = SimpleImputer(strategy='mean', add_indicator = True)
imp_add_indicator.fit_transform(miles)
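The result of add_indicator is easier to read with column labels. The following is a minimal sketch that wraps the returned array in a DataFrame; the label was_missing is a name chosen for this example, not something scikit-learn produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})

imp = SimpleImputer(strategy='mean', add_indicator=True)
result = imp.fit_transform(miles)

# The first column is the imputed data; the second flags rows that were missing.
labelled = pd.DataFrame(result, columns=['farthest_run_mi', 'was_missing'])
print(labelled)
```

Only the row at index 2 is flagged, which lets a downstream model learn whether the fact that a value was missing carries information of its own.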

Here’s a more advanced concept: applying different imputers to different columns of the same dataset.
First we read the dataset.
df = pd.read_csv('simple_imputer_csv.csv')
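Next we import make_column_transformer, which helps us apply different preprocessing steps to specific columns in the dataset. Below, the constant imputer fills the Name column, the mean imputer fills farthest_run_mi, remainder='drop' discards every other column, and set_output(transform="pandas") makes the result a DataFrame rather than a NumPy array.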
from sklearn.compose import make_column_transformer
ct = make_column_transformer( (imp_constant_cat, ['Name']), (imp_mean, ['farthest_run_mi']), remainder='drop')
ct.set_output(transform="pandas")
df_pandas = ct.fit_transform(df)
Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.