Python Pandas Data Cleaning

July 19, 2025 Ryan Nolan No comments yet

https://www.espncricinfo.com/records/highest-career-batting-average-282910

				
					import pandas as pd

Here, we read the CSV file names ‘CricketTestMatchData.csv’ into a DataFrame called df using the read_csv.

				
					df = pd.read_csv('CricketTestMatchData.csv')

Here, we check for missing null values in the DataFrame df.

It returns a Boolean result for each column.

It returns True if the colun has any missing values and False if it doesn’t.

				
					df.isnull().any()

This line filters the DataFrame df to show only the rows where the ‘Balls_Faced’ column has missing values.

isna() function returns True for missing values,

In Python, True is equivalent to 1 and False to 0.

				
					df[df['Balls_Faced'].isna()==1]

Here, we replace all missing (NaN) values in the ‘Balls_Faced’ column with 0.

This ensures that the column has no blanks.

				
					df['Balls_Faced'] = df['Balls_Faced'].fillna(0)

Here we also fill nay missing(NaN) values in the ‘Batting_Strike_Rate’ column with 0.

				
					df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].fillna(0)

Here we filter the DataFrame df to show only the rows where the ‘Player’ column is equal to ‘CL Walcott (WI)’

				
					df[df['Player']=='CL Walcott (WI)']

Here we use .duplicated() to return a Boolean Series where True means a row is duplicate of a previous one, and False means otherwise.

				
					df.duplicated()

Here we show only the rows where the ‘Player’ column contains duplicated values.

				
					df[df['Player'].duplicated()==1]

Here, we filter the DataFrame df to show only the rows where the ‘Player’ column matches one of the four specified names :

'GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', or 'V Kohli (IND)'

				
					df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

Then we also drop the duplicates.

				
					df = df.drop_duplicates()

Then we check if the listed players are in ‘Player column’.

				
					df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

This line splits the values in the ‘Span’ into two parts using the hyphen – as the seperator.

				
					df['Span'].str.split(pat = '-')

Here, we first split each value in the ‘Span’ column by the hyphen –

Then we select the second part (index 1), which represents the end year of the player’s career.

				
					df['Span'].str.split(pat = '-').str[1]

This line creates a new column called ‘Rookie_Year’, by extracting the first part (index 0) of the ‘Span’ column after splitting by the hypen –

				
					df['Rookie_Year'] = df['Span'].str.split(pat = '-').str[0]

This line creates a new column called ‘Rookie_Year’, by extracting the second part (index 0) of the ‘Span’ column after splitting by the hypen –

				
					df['Final_Year'] = df['Span'].str.split(pat = '-').str[1]

Drop Span Columns

Here we remove the ‘Span’ column from the DataFrame df using the drop method with axis=1, which specifies that a column (not a row) is being dropped.

				
					df = df.drop(['Span'], axis = 1)

				
					df.head()

QUESTION -> Split up the Country from the player

				
					df['Player'].str.split(pat = '(')

				
					df['Country'] = df['Player'].str.split(pat = '(').str[1]

				
					df['Country'] = df['Country'].str.split(pat = ')').str[0]

				
					df['Country']

				
					df['Player'] = df['Player'].str.split(pat = '(').str[0]

We use the .dtypes to check the data types of each column.

				
					df.dtypes

Here we splt the ‘Highest_Inns_Score’ column by the asterisk (“*”) and extract the first part (before the asterisk).

				
					df['Highest_Inns_Score'].str.split(pat = '*').str[0]

				
					df['Highest_Inns_Score'] = df['Highest_Inns_Score'].str.split(pat = '*').str[0]

Here we convert the ‘Highest_Inns_Score’ column from string type to integer type using .astype(‘int’)

				
					df['Highest_Inns_Score'] = df['Highest_Inns_Score'].astype('int')

				
					df.dtypes

QUESTION SEE IF YOU CAN FIX Rookie Year and Final Year

Here, we convert multiple columns to int.

				
					df = df.astype({'Rookie_Year':'int', 'Final_Year':'int'})

				
					df.dtypes

				
					df.head()

Here, we clean the ‘Matches’ column by removing asterisks if present, splitting the string at * and keeping only the first part (before the *)

				
					df['Matches'] = df['Matches'].str.split(pat = '*').str[0]

Here, we convert the ‘Matches’ column to type int.

				
					df['Matches'] = df['Matches'].astype('int')

				
					df.dtypes

Here, we clean the ‘Balls_Faced’ column by splitting each value at the + symbol and keeping only the part before it.

				
					df['Balls_Faced'] = df['Balls_Faced'].str.split(pat = '+').str[0]

Here, we check for null values in the df.

				
					df.isnull().any()

Here we also check if any null values is present.

As stated earlier on, isna()==1 is same as True in python.

				
					df[df['Balls_Faced'].isna()==1]

Next we fill the null values with 0.

				
					df['Balls_Faced'] = df['Balls_Faced'].fillna(0)

Then we check for na once again.

				
					df[df['Balls_Faced'].isna()==1]

Here we check for ‘Players’ eqauls to ‘ED Weeks’

				
					df[df['Player']=='ED Weekes ']

Then we check for any null values.

				
					df.isnull().any()

Here, we tell panda to display all rows of a DataFrame when we print it, by setting the ‘display.max_rows’ option to None.

				
					pd.set_option('display.max_rows', None)

df

Here we remove the row at index 56 from the DataFrame df using the drop method with axis=0, which specifies row deletion.

REMOVE FS Jackson Row

				
					df = df.drop(56, axis = 0)

df

				
					df.isnull().any()

				
					df.dtypes

				
					df['Balls_Faced'] = df['Balls_Faced'].astype('int')

This is converting the ‘Batting_Strike_Rate’ to float.

				
					df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].astype('float')

				
					df.dtypes

QUESTION Build out Career_length Column

hint convert rookie/final year and subtract

This line creates a new column called ‘career_length’ by subtracting the ‘Rookie_Year’ from the ‘Final_Year’

				
					df['career_length'] = df['Final_Year'] - df['Rookie_Year']

df

Calculations with the dataframe

Question 1 -> Cricketers in this DF what is the Average Career Length

				
					df['career_length'].mean()

Question 2 AVG Batting_Strike_Rate for cricketers who played over 10 years

				
					df[df['career_length'] > 10]['Batting_Strike_Rate'].mean()

Question 3 find number of cricketers who played before 1960

				
					df[df['Rookie_Year'] < 1960]['Player'].count()

Question 4 Max Highest Inns Score by country

				
					df.groupby('Country')['Highest_Inns_Score'].max().to_frame('Highinncountry').reset_index().sort_values('Highinncountry', ascending = False)

Question 5 Hundreds, Fifties, ducks (0) AVG by Country

				
					df.groupby('Country')[['100', '50', '0']].mean()

Ryan Nolan

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Python Pandas Data Cleaning

Ryan Nolan

Leave a Reply Cancel reply

Important Links

LinkedIn

Get in touch