Python Pandas Data Cleaning

https://www.espncricinfo.com/records/highest-career-batting-average-282910

  import pandas as pd

Here, we read the CSV file names ‘CricketTestMatchData.csv’ into a DataFrame called df using the read_csv.

  df = pd.read_csv('CricketTestMatchData.csv')

Here, we check for missing null values in the DataFrame df.

It returns a Boolean result for each column.

It returns True if the colun has  any missing values and False if it doesn’t.

  df.isnull().any()

This line filters the DataFrame df to show only the rows where the ‘Balls_Faced’ column has missing values.

isna() function returns True for missing values,

In Python, True is equivalent to 1 and False to 0.

  df[df['Balls_Faced'].isna()==1]

Here, we replace all missing (NaN) values in the ‘Balls_Faced’ column with 0.

This ensures that the column has no blanks.

  df['Balls_Faced'] = df['Balls_Faced'].fillna(0)

Here we also fill nay missing(NaN) values in the ‘Batting_Strike_Rate’ column with 0.

  df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].fillna(0)

Here we filter the DataFrame df to show only the rows where the ‘Player’ column is equal to ‘CL Walcott (WI)’

  df[df['Player']=='CL Walcott (WI)']

Here we use .duplicated() to return a Boolean Series where True means a row is duplicate of a previous one, and False means otherwise.

  df.duplicated()

Here we show only the rows where the ‘Player’ column contains duplicated values.

  df[df['Player'].duplicated()==1]

Here, we filter the DataFrame df to show only the rows where the ‘Player’ column matches one of the four specified names :

'GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', or 'V Kohli (IND)'

  df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

Then we also drop the duplicates.

  df = df.drop_duplicates()

Then we check if the listed players are in ‘Player column’.

  df[df['Player'].isin(['GA Headley (WI)', 'GS Sobers (WI)', 'JB Hobbs (ENG)', 'V Kohli (IND)'])]

This line splits the values in the ‘Span’ into two parts using the hyphen – as the seperator.

  df['Span'].str.split(pat = '-')

Here, we first split each value in the ‘Span’ column by the hyphen –Â

Then we select the second part (index 1), which represents the end year of the player’s career.

  df['Span'].str.split(pat = '-').str[1]

This line creates a new column called ‘Rookie_Year’, by extracting the first part (index 0) of the ‘Span’ column after splitting by the hypen –

  df['Rookie_Year'] = df['Span'].str.split(pat = '-').str[0]

This line creates a new column called ‘Rookie_Year’, by extracting the second part (index 0) of the ‘Span’ column after splitting by the hypen –

  df['Final_Year'] = df['Span'].str.split(pat = '-').str[1]
Drop Span Columns

Here we remove the ‘Span’ column from the DataFrame df using the drop method with axis=1, which specifies that a column (not a row) is being dropped.

  df = df.drop(['Span'], axis = 1)
  df.head()
QUESTION -> Split up the Country from the player
  df['Player'].str.split(pat = '(')
  df['Country'] = df['Player'].str.split(pat = '(').str[1]
  df['Country'] = df['Country'].str.split(pat = ')').str[0]
  df['Country']
  df['Player'] = df['Player'].str.split(pat = '(').str[0]

We use the .dtypes to check the data types of each column.

  df.dtypes

Here we splt the ‘Highest_Inns_Score’ column by the asterisk (“*”) and extract the first part (before the asterisk).

  df['Highest_Inns_Score'].str.split(pat = '*').str[0]
  df['Highest_Inns_Score'] = df['Highest_Inns_Score'].str.split(pat = '*').str[0]

Here we convert the ‘Highest_Inns_Score’ column from string type to integer type using .astype(‘int’)

  df['Highest_Inns_Score'] = df['Highest_Inns_Score'].astype('int')
  df.dtypes
QUESTION SEE IF YOU CAN FIX Rookie Year and Final Year

Here, we convert multiple columns to int.

  df = df.astype({'Rookie_Year':'int', 'Final_Year':'int'})
  df.dtypes
  df.head()

Here, we clean the ‘Matches’ column by removing asterisks if present, splitting the string at * and keeping only the first part (before the *)

  df['Matches'] = df['Matches'].str.split(pat = '*').str[0]

Here, we convert the ‘Matches’ column to type int.

  df['Matches'] = df['Matches'].astype('int')
  df.dtypes

Here, we clean the ‘Balls_Faced’ column by splitting each value at the + symbol and keeping only the part before it.

  df['Balls_Faced'] = df['Balls_Faced'].str.split(pat = '+').str[0]

Here, we check for null values in the df.

  df.isnull().any()

Here we also check if any null values is present.

As stated earlier on, isna()==1 is same as True in python.

  df[df['Balls_Faced'].isna()==1]

Next we fill the null values with 0.

  df['Balls_Faced'] = df['Balls_Faced'].fillna(0)

Then we check for na once again.

  df[df['Balls_Faced'].isna()==1]

Here we check for ‘Players’ eqauls to ‘ED Weeks’

  df[df['Player']=='ED Weekes ']

Then we check for any null values.

  df.isnull().any()

Here, we tell panda to display all rows of a DataFrame when we print it, by setting the ‘display.max_rows’ option to None.

  pd.set_option('display.max_rows', None)
  df

Here we remove the row at index 56 from the DataFrame df using the drop method with axis=0, which specifies row deletion.

REMOVE FS Jackson Row
  df = df.drop(56, axis = 0)
  df
  df.isnull().any()
  df.dtypes
  df['Balls_Faced'] = df['Balls_Faced'].astype('int')

This is converting the ‘Batting_Strike_Rate’ to float.

  df['Batting_Strike_Rate'] = df['Batting_Strike_Rate'].astype('float')
  df.dtypes
QUESTION Build out Career_length Column
Â
hint convert rookie/final year and subtract

This line creates a new column called ‘career_length’ by subtracting the ‘Rookie_Year’ from the ‘Final_Year’

  df['career_length'] = df['Final_Year'] - df['Rookie_Year']
  df
Calculations with the dataframe
Question 1 -> Cricketers in this DF what is the Average Career Length
  df['career_length'].mean()
Question 2 AVG Batting_Strike_Rate for cricketers who played over 10 years
  df[df['career_length'] > 10]['Batting_Strike_Rate'].mean()
Question 3 find number of cricketers who played before 1960
  df[df['Rookie_Year'] < 1960]['Player'].count()
Question 4 Max Highest Inns Score by country
  df.groupby('Country')['Highest_Inns_Score'].max().to_frame('Highinncountry').reset_index().sort_values('Highinncountry', ascending = False)
Question 5 Hundreds, Fifties, ducks (0) AVG by Country
  df.groupby('Country')[['100', '50', '0']].mean()

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Leave a Reply

Your email address will not be published. Required fields are marked *