Python Pandas Data Cleaning
https://www.espncricinfo.com/records/highest-career-batting-average-282910
import pandas as pd
Here, we read the CSV file names ‘CricketTestMatchData.csv’ into a DataFrame called df using the read_csv.
df = pd.read_csv('CricketTestMatchData.csv')

Here, we check for missing null values in the DataFrame df.
It returns a Boolean result for each column.
It returns True if the colun has any missing values and False if it doesn’t.
df.isnull().any()

This line filters the DataFrame df to show only the rows where the ‘Balls_Faced’ column has missing values.
isna() function returns True for missing values,
In Python, True is equivalent to 1 and False to 0.
df[df['Balls_Faced'].isna()==1]

Here, we replace all missing (NaN) values in the ‘Balls_Faced’ column with 0.
This ensures that the column has no blanks.
df['Balls_Faced'] = df['Balls_Faced'].fillna(0)
Here we also fill nay missing(NaN) values in the ‘Batting_Strike_Rate’ column with 0.
df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].fillna(0)
Here we filter the DataFrame df to show only the rows where the ‘Player’ column is equal to ‘CL Walcott (WI)’
df[df['Player']=='CL Walcott (WI)']

Here we use .duplicated() to return a Boolean Series where True means a row is duplicate of a previous one, and False means otherwise.
df.duplicated()

Here we show only the rows where the ‘Player’ column contains duplicated values.
df[df['Player'].duplicated()==1]

Here, we filter the DataFrame df to show only the rows where the ‘Player’ column matches one of the four specified names :
'GA Headley (WI)'
, 'GS Sobers (WI)'
, 'JB Hobbs (ENG)'
, or 'V Kohli (IND)'
df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

Then we also drop the duplicates.
df = df.drop_duplicates()
Then we check if the listed players are in ‘Player column’.
df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

This line splits the values in the ‘Span’ into two parts using the hyphen – as the seperator.
df['Span'].str.split(pat = '-')

Here, we first split each value in the ‘Span’ column by the hyphen –Â
Then we select the second part (index 1), which represents the end year of the player’s career.
df['Span'].str.split(pat = '-').str[1]

This line creates a new column called ‘Rookie_Year’, by extracting the first part (index 0) of the ‘Span’ column after splitting by the hypen –
df['Rookie_Year'] = df['Span'].str.split(pat = '-').str[0]
This line creates a new column called ‘Rookie_Year’, by extracting the second part (index 0) of the ‘Span’ column after splitting by the hypen –
df['Final_Year'] = df['Span'].str.split(pat = '-').str[1]

Here we remove the ‘Span’ column from the DataFrame df using the drop method with axis=1, which specifies that a column (not a row) is being dropped.
df = df.drop(['Span'], axis = 1)
df.head()

df['Player'].str.split(pat = '(')

df['Country'] = df['Player'].str.split(pat = '(').str[1]
df['Country'] = df['Country'].str.split(pat = ')').str[0]
df['Country']

df['Player'] = df['Player'].str.split(pat = '(').str[0]

We use the .dtypes to check the data types of each column.
df.dtypes

Here we splt the ‘Highest_Inns_Score’ column by the asterisk (“*”) and extract the first part (before the asterisk).
df['Highest_Inns_Score'].str.split(pat = '*').str[0]

df['Highest_Inns_Score'] = df['Highest_Inns_Score'].str.split(pat = '*').str[0]
Here we convert the ‘Highest_Inns_Score’ column from string type to integer type using .astype(‘int’)
df['Highest_Inns_Score'] = df['Highest_Inns_Score'].astype('int')
df.dtypes

Here, we convert multiple columns to int.
df = df.astype({'Rookie_Year':'int', 'Final_Year':'int'})
df.dtypes

df.head()

Here, we clean the ‘Matches’ column by removing asterisks if present, splitting the string at * and keeping only the first part (before the *)
df['Matches'] = df['Matches'].str.split(pat = '*').str[0]
Here, we convert the ‘Matches’ column to type int.
df['Matches'] = df['Matches'].astype('int')
df.dtypes

Here, we clean the ‘Balls_Faced’ column by splitting each value at the + symbol and keeping only the part before it.
df['Balls_Faced'] = df['Balls_Faced'].str.split(pat = '+').str[0]
Here, we check for null values in the df.
df.isnull().any()

Here we also check if any null values is present.
As stated earlier on, isna()==1 is same as True in python.
df[df['Balls_Faced'].isna()==1]

Next we fill the null values with 0.
df['Balls_Faced'] = df['Balls_Faced'].fillna(0)
Then we check for na once again.
df[df['Balls_Faced'].isna()==1]

Here we check for ‘Players’ eqauls to ‘ED Weeks’
df[df['Player']=='ED Weekes ']

Then we check for any null values.
df.isnull().any()

Here, we tell panda to display all rows of a DataFrame when we print it, by setting the ‘display.max_rows’ option to None.
pd.set_option('display.max_rows', None)
df

Here we remove the row at index 56 from the DataFrame df using the drop method with axis=0, which specifies row deletion.
df = df.drop(56, axis = 0)
df

df.isnull().any()

df.dtypes

df['Balls_Faced'] = df['Balls_Faced'].astype('int')
This is converting the ‘Batting_Strike_Rate’ to float.
df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].astype('float')
df.dtypes

This line creates a new column called ‘career_length’ by subtracting the ‘Rookie_Year’ from the ‘Final_Year’
df['career_length'] = df['Final_Year'] - df['Rookie_Year']
df

df['career_length'].mean()

df[df['career_length'] > 10]['Batting_Strike_Rate'].mean()

df[df['Rookie_Year'] < 1960]['Player'].count()

df.groupby('Country')['Highest_Inns_Score'].max().to_frame('Highinncountry').reset_index().sort_values('Highinncountry', ascending = False)

df.groupby('Country')[['100', '50', '0']].mean()

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.