Pandas Columns

  import pandas as pd import numpy as np

Here we define a dataset containing records of eight runners.

Each entry includes:

The runners first and last name.

Thier finish times in hours for threee race types:

  • Marathon
  • 50-miler
  • 100-miler

The date of their last race and a running_clud_id showing which running club each runner belongs to.

  data = { 'Runner': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan', 'Fiona', 'George', 'Hannah'], 'last_name': ['Morgan', 'Thompson', 'Reed', 'Santiago', 'Kim', 'Nguyen', 'Patel', 'Brooks'], 'Marathon_Time_hrs': [3.5, 4.0, 3.2, 4.5, 3.8, 4.1, 3.9, 4.3], 'Fifty_Miler_Time_hrs': [8.0, 9.5, 7.8, 10.2, 9.0, 10.0, 9.2, 10.5], 'Hundred_Miler_Time_hrs': [20.5, 23.0, 18.7, 25.2, 21.5, 24.1, 22.3, 26.0], 'last_race_date': ['04/27/2025', '05/24/2025', '12/13/2025', '04/27/2025', '05/24/2025', '12/13/2025', '04/27/2025', '05/24/2025'], 'running_club_id': [123, 345, 123, 678, 342, 123, 345, 123], }

Next, we create a tabular structure with pandas DataFrame.

We create this so we can perform data anaylsis and manipulation using pandas.

Â

  df = pd.DataFrame(data)
  df.head()

Example 1 Access Column Version 1

Here we access the Runner column using df[‘Runner’]

  df['Runner']

Example 2 Access Column Version 2

We can also access the Runners column by using df.Runner

  df.Runner

Example 3 access multiple columns

Here we get two columns ‘Runner’ and ‘Marathon_Time_hrs’.

Note the double square bracket, we use it when we want to access multiple columns

  df[['Runner', 'Marathon_Time_hrs']]

Example 4 get all columns Version 1

We use .columns to see all the columns in the DataFrame.

  df.columns

unlike .columns whihc returns the index,Â

.columns.values returns a numpy array of the column names

  df.columns.values

Example 5 get all columns Version 2 - list

Using the list() metthod is a quick way to get a list of all column names.

  list(df)
  list(df.columns)

Example 6 get all columns Version 3 - Sorted

we use the sorted() method to sort the column names and return them as a list.

It does not chnage the DataFrame, it only gives us the sorted list of column headers.

  sorted(df)
  sorted(df.columns)

Example 6 get all columns Version 4 - Loop

We use a for loop to iterate over df.columns so that we can display each colum name individually.

   for column_name in df.columns: print(column_name)

Example 7 get all columns Version 5 - keys

.keys() returns the DataFrame’s column labels,

.tolist() converts the keys into a regular Python list.

  df.keys().tolist()

Example 8 what if a dataframe has hundreds of columns - two ways to display max columns

Here we create a dataframe with 100 columns.

We use random.rand to generate a 2D array with the shape defined by num_rows and num_cols.

Each element is a random float between 0 and 1.

  num_rows = 10 num_cols = 100
  data = np.random.rand(num_rows, num_cols)

Here, we use list comprehension  to generate names like ‘col_1’, ‘col_2’ up to the number of columns (num_cols).

  column_names = [f'col_{i+1}' for i in range(num_cols)]

Next, we create a new DataFrame using the random data and assigns the generated column names.

  df_100 = pd.DataFrame(data, columns=column_names)
  df_100
pd.set_option(‘display.max_columns’, None) tells pandas to show all columns when printing a DataFrame.
  pd.set_option('display.max_columns', None)

pd.options.display.max_columns = None sets the same option as pd.set_option('display.max_columns', None), ensuring no columns get hidden in the output view. It’s just a different syntax to control display settings.

  pd.options.display.max_columns = None
  df_100

Example 9 column types

This returns a Series listing the tyoe (e.g int, float or object) for every column.Â

  df.dtypes

Example 10 convert column types - date time

.to_datetime() transforms string dates into panda’s datetime format.

  df['last_race_date'] = pd.to_datetime(df['last_race_date'])
  df.dtypes

Example 11 convert column types - numeric - TO DO

Here, using .astyoe(‘float’) we convert interger IDs into floating-point numbers.

  df['running_club_id'] = df['running_club_id'].astype('float')
  df.dtypes

Example 12 convert column types - string

Here we chnage the data type of ‘runnung_clud_id’ to text(strings).

  df['running_club_id'] = df['running_club_id'].astype(str)
  df.dtypes

Example 13 Rename Column

Here we use .rename() to change the column name ‘Marathon_Tine_hrs’ to ‘262_Timr_hrs’.

The inplace=True updates te DataFrame directly without needing to assign it to new variables.

  df.rename(columns={'Marathon_Time_hrs': '262_Time_hrs'}, inplace=True)

Example 14 Create new column from numbers

Here, we add up the hours from different columns and create a new one called ‘total_times’.

  df['total_times'] = df['262_Time_hrs'] + df['Fifty_Miler_Time_hrs'] + df['Hundred_Miler_Time_hrs']

Example 15 create new column from strings

Here, we combine ‘Runner’ and ‘last_name’ columns with a space in between.

The full_name column stores the complete name of each person.

  df['full_name'] = df['Runner'] + ' ' + df['last_name']

Example 16 create new column from new data

  shirt_sizes = ['S', 'M', 'M', 'L', 'M', 'S', 'L', 'XL']

Here we create a new column called ‘race_shirt_size’ using the created shirt_sizes data

  df['race_shirt_size'] = shirt_sizes

Example 17 drop column

Here we delete the specified column from the DataFrame.

  df.drop(columns=['race_shirt_size'], inplace=True)

Example 18 reorder columns

Here we reorder and selec specific columns from the DataFrame.

  df = df[['full_name', 'running_club_id', 'Hundred_Miler_Time_hrs', 'Fifty_Miler_Time_hrs', '262_Time_hrs']]

Example 19 - Set Index

Here we use the .set_index() to make the full_name column the new row labels(index).

  df.set_index("full_name")

Example 20 set multiple columns as index

Here we set a multi-level index using both ‘full_name’ and ‘running_club_id’

  df = df.set_index(['full_name', 'running_club_id'])

Example 21 - reset index

Here we use .reset_index() to restore ‘full_name’ and ‘running_club_id’ from the index to normal columns, making the DataFrame flat again.

  df = df.reset_index()
  df

Example 22 - Get only numeric columns

This extracts the names of all numeric columns in the DataFrame.

  numeric_columns = df._get_numeric_data().columns
  numeric_columns

Example 23 Describe

.describe() provide key stats likeÂ

mean, standard deviation, min, max and percentile.

  df.describe()

Example 24 Set Column Names post creation

Here we build a DataFrame from a list of lists where each inner list is a row.

  df2 = pd.DataFrame([['23:12', '22:11', '21:00', '20:30', '19:55', '18:55']])
  df2

Next we assign custom column names to the DataFrame df2.

  df2.columns = ['Race_1','Race_2','Race_3','Race_4','Race_5','Race_6']

SECTION ON CLEANING UP COLUMNS Values

The data includes runners’ names, temperatures, times, distances, and locations. Some entries use np.nan to represent missing or unknown values

  data3 = { 'Runner': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'Temperature (C)': [20, 18, np.nan, 22, 21], 'Time (min)': [25.5, 30.2, np.nan, 22.1, 28.4], 'Distance (km)': [5.0, np.nan, 5.0, 4.5, 5.0], 'Location': ['Park', 'Track', 'Trail', np.nan, 'Park'], }

Using pd.DataFrame(data3) we organize the raw dictionary data into a tabular format with rows and columns,

  df3 = pd.DataFrame(data3)

Example 25 Missing Values single column

This returns a boolean Series where “True” marks rows with missing time values, helping identify incomplete data.

  df3['Time (min)'].isna()

Example 26 all columns that have a null value

df3.isna().any() checks which columns contain any NaNs, and using df3.columns[...] selects their names. Finally, .tolist() converts the result into a regular Python list.

  df3.columns[df3.isna().any()].tolist()

Example 27 Fill Value

fillna(30) replaces all NaN entries with the number 30.

  df3['Time (min)'] = df3['Time (min)'].fillna(30)

Example 27 Drop Rows will null value in column

f3.dropna(subset=['Distance (km)']) drops all rows with NaN in the 'Distance (km)' column, helping clean the data by keeping only entries with valid distances.

  df3_drop = df3.dropna(subset=['Distance (km)'])

Example 28 drop the column if null value

dropna(axis=1) drops columns with at least one NaN.

  df3_drop_column = df3_drop.dropna(axis=1)
  df3_drop_column

Example 29 Insert

  fahrenheit = df3['Temperature (C)'] * 9/5 + 32

Here, we insert the new fahrenheit  column into the DataFrame

  df3.insert(2, 'Temperature (F)', fahrenheit)
  df3

SECTION ON CLEANUP COLuMN NAMES

  df4 = pd.DataFrame({ ' Runner Name ': ['Alex', 'Bri', 'Chris'], ' Total Time ': [45, 50, 42], 'Average Pace': [5.0, 5.5, 4.8], ' Calories Burned ': [300, 320, 310] })

Example 30 remove spaces on the end

Here, we remove any leading or trailing spaces from column names using .strip() method.

  df4.columns = df4.columns.str.strip()
  df4
  df4['Runner Name']

Example 31 remove spaces and replace with underscore

This replaces spaces in all column names with underscores (_), by using .replace()

  df4.columns = df4.columns.str.replace(' ', '_')
  df4

Example 32 make lowercase

Here we use the .lower() method to make the columns lower case.

  df4.columns = df4.columns.str.lower()

Example 33 Filter Column

Here we filter the DataFrame by checking the column (avaerage_pace) thats greater than 5

  df4[df4['average_pace'] > 5]

Example 34 Filter Column AND

The & operator combines both conditions

This filters df4 to show rows where:

  • 'calories_burned' is less than 350, and

  • 'total_time' is greater than 43.

  df4[(df4['calories_burned'] < 350) & (df4['total_time'] > 43)]

Example 35 Filter Column OR

This filters df4 to show rows where either condition is true:

  • 'calories_burned' is less than 350, or

  • 'total_time' is greater than 43.

The | operator means “OR”, so rows meeting at least one of the conditions will be included in the result.

  df4[(df4['calories_burned'] < 350) | (df4['total_time'] > 43)]

Example 36 loc

df4.loc[0, 'total_time'] uses .loc[] to access data by label/index, returning a single cell’s value.

This retrieves the value from the 'total_time' column in the first row (index 0).

  df4.loc[0, 'total_time']

Example 37 loc

This selects all rows and the columns from 'total_time' to 'calories_burned', inclusive.

Using df4.loc[:, 'total_time':'calories_burned'] slices the DataFrame to return only those columns in that range, across all rows

  df4.loc[:, 'total_time':'calories_burned']

Example 38 loc

This selects all rows and only the 'total_time' and 'calories_burned' columns from df4.

df4.loc[:, ['total_time','calories_burned']] uses label-based indexing to extract specific columns across the entire dataset

  df4.loc[:, ['total_time','calories_burned']]

Example 39 iloc

This gets the value at the first row and first column of df4

df4.iloc[0, 0] uses integer-based indexing to return a single cell’s value

  df4.iloc[0, 0]

Example 40 iloc

This selects a specific range of rows and columns by position.

df4.iloc[0:2, 1:3] returns rows 0 and 1 (Python excludes the end index) and columns at positions 1 and 2.

  df4.iloc[0:2, 1:3]

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Leave a Reply

Your email address will not be published. Required fields are marked *