Pandas MultiIndex

We are going to be lloking at Pandas Multiindex

to start with, we would import pandas as pd

  import pandas as pd

Example 1

In this example, we create a Python dictionary “data_races, that stores race information, including race names, the year they occured, and a difficulty rating for each event

  data_races = { 'Race': ['Badwater', 'Barkley Marathons', 'Vero Beach Ultra', 'Forgotten Florida', 'Badwater'], 'Year': [2020, 2021, 2020, 2021, 2025], 'Difficulty': [9.7, 9.8, 8.1, 6.1, 9.9] }

Next, we convert the data_races dictionary into a pandas DataFrame named “df”.

  df = pd.DataFrame(data_races)

Then we set a multi-level index on the DataFrame using both “Race” and “Year”. This allows for more advanced selection and organization based on these two columns

  df = df.set_index(['Race', 'Year'])
  df

Here, we create a seperate copy of the df DataFrame and store it in the df2, allowing changes to be made to df2 without affecting the original df.

  df2 = df.copy()

Here we perform similar operation but we store the df DataFrame in “df3”

  df3 = df.copy()

Example 2

This returns the index of the df DataFrame, which is now a multiindex consisting of Race and Year pairs for each row

  df.index

Example 3 Index Levels values

Next we retrieve all values from the first level of the multiindex. i.e the Race names in the df DataFrame

  df.index.get_level_values(0)

This retrieves all the values from the second level of the Multiindex i.e the Year values in the df DataFrame

  df.index.get_level_values(1)

Example 4 loc

In this example , we use the “loc” to select all rows in the df DataFrame where the Race level of the index is “Badwater”, returning entries for all corresponding years.

  df.loc['Badwater'] # returns all years for Badwater

Here we access the specific row in df where the Race is “Badwater” and the Year is 2020, returning the corresponding data for that entry

  df.loc[('Badwater', 2020)] # returns a specific entry

Example 5 iloc

Note: .iloc[] doesn’t care about the index labels—just the row numbers.

Here we return the first row of the DataFrame based on it’s position, which is “Badwater” in the year 2020

  df.iloc[0] # First row (Badwater, 2020)

Example 6 rename index version 1

  df.index.names = ['Event', 'YearHeld']

This renames the MultiIndex levels in the DataFrame to 'Event' and 'YearHeld', making the index labels more descriptive.

Example 7 rename index version 2

df.rename_axis(index={‘Event’: ‘Race’, ‘YearHeld’: ‘Year’}, inplace=True)

  df.rename_axis(index={'Event': 'Race', 'YearHeld': 'Year'}, inplace=True)

Example 8 Swap Index Levels

Here, we swap the order of the MultiIndex levels in df, making Year the first level and Eventthe second, and stores the result in df_swapped

  df_swapped = df.swaplevel()

Example 9 Aggregation with Multilevel index

Here we group the DataFrame by the Race level of the Multiindex and calculate the mean of the numeric columns i.e Difficulty for each race

  df.groupby(level='Race').mean()

This groups the DataFrame by the Year level of the Multiindex and computes the average of numeric columns like “Difficulty” for each year.

  df.groupby(level='Year').mean()

Example 10 sort by multilevel index - asc

Here we sort the DataFrame based on the Race level of the Mu;ltiindex in  alphabetical order, helping to organize the data by event name

  df.sort_index(level='Race')

Example 11 sort by multilevel index - desc

Here, we sort the DataFrame by the Year in descending order, showing the most recent years first

  df.sort_index(level='Year', ascending = False)

Example 12 sort by column

Here we sort the DataFrame by the Difficulty column in ascending order, from the easiest to the most difficult race.

  df.sort_values(by='Difficulty')

Example 13 Slicing Multiindex

This assigns a special object IndexSlice from pandas to the variable idx.
Â
It’s used for slicing multi-level indexes (also known as a MultiIndex) in a clean and readable way

Â

selects all values in the first level of the MultiIndex, and only 2020 in the second level.
  idx = pd.IndexSlice
  df.loc[idx[:, 2020], :] # All races from 2020

Example 14 Filtering

Same as normal

Here we filter the DataFrame to show only the rows where the Difficulty rating is greater than 8

  df[df['Difficulty'] > 8]

Example 15 Update Values at

Here we update the Difficulty value to 10.0 for the entry where the Race is “Badwater” and the year is “2020”

  df.at[('Badwater', 2020), 'Difficulty'] = 10.0

Example 16 Add New Row

This adds a new row to the DataFrame for the race “Everglades Ultra” in the year 
2020 with a Difficulty rating of 7.1

  df.loc[('Everglades Ultra', 2020), :] = [7.1]

Example 17 Cross-section: Query along a specific level

Here we retrieve all rows from the DataFrame where the Year level of the Multiindex is “2020”, filtering races held in that year

  # Get all races in 2020 df.xs(2020, level='Year')

This retrieves all rows where the Race is “Badwater” , returning data for all years that race was held

  #Get all years for "Badwater" df.xs('Badwater', level='Race')

Example 18 reorder index levels

This reorders the MultiIndex so that Year comes first and Race second, then sorts the DataFrame by this new index order.

  df = df.reorder_levels(['Year', 'Race']).sort_index()
  df

Example 19 - unstack- Moves index level(s) to columns

That moves the inner index level (YearHeld) into columns:
  df2.unstack(level='Race')

This changes the layout of the DataFrame by turning the Year from part of the index into column headers, showing the Difficulty for each race across different years.To start we’re going to create a simple dataframe in python:

  #Remember, starts at 0! df2.unstack(level=1)

Example 20 stack

Here we create a dictionary called “data_runners” that stores runners names along with their finish times for two race types

  data_runners = { 'Runner': ['Alice', 'Bob', 'Clara', 'Dave', 'Eva'], '50 miler': [7.5, 10.2, 8.9, 15.8, 6.3], '100 miler': [22.3, 35.6, 27.8, 39.1, 18.5] }

Next we create a DataFrame from “data_runners” and set the “Runner” column as the indexÂ

  df4 = pd.DataFrame(data_runners).set_index('Runner')

Here we reshape the DataFrame by stacking the columns into a single column, creating a multiindex with Runner and race type

  df_stacked = df4.stack()
This gives a Series with a MultiIndex:

Example 21 - Reset Index / convert MultiIndex into columns

This basically unstacks the df4 DataFrame and also resets the index

  df_stacked_reset = df4.stack().reset_index()

This resets the index of df3, turning the index levels  back into regular columns, and creates a new default integer index.

  df_reset_dataframe = df3.reset_index()

Example 22 MultiIndex columns

This creates a list called data_music containing two nested lists:
The first list holds album sales and their corresponding years.

The second list contains number of album released for the same periods.

  data_music = [ [200000000, 1960, 120000000, 1962, 250000000, 1965], [8, 10, 17, 7, 26, 12] ]

This creates a MultiIndex for columns using all combinations of band names and their attributes, like “Album Sales” and “Starting Year”, to organize data in a more structured format.

  # create multi-level column index bands = ['The Beatles', 'The Rolling Stones', 'Pink Floyd'] attributes = ['Album Sales', 'Starting Year'] multi_columns = pd.MultiIndex.from_product([bands, attributes], names=['Band', 'Attribute'])

This creates a DataFrame (df5) using the data_music values, sets custom row labels for the index, and applies the multi_columns MultiIndex to organize the columns by band and attribute.

  df5 = pd.DataFrame(data_music, index=['Worldwide Sales (est.)', 'Studio Albums'], columns=multi_columns)

Example 23 build multiindex from array

Here we create three arrays.

country: This holds country names

city: This lists corresponding cities

population: contains the population for each city

  # create arrays country = ['Netherlands', 'Belgium', 'Switzerland', 'Switzerland', 'France', 'United Kingdom'] city = ['Amsterdam', 'Brussels', 'Geneva', 'Zermatt', 'Paris', 'London'] population = [821752, 185103, 203856, 5771, 2148271, 8982000] # example populations

This combines the country and city lists into a single list of lists called index_array, which can be used to create a Multiindex for rows.

  # create array of arrays index_array = [country, city]

This creates a Multiindex for rows using the country and city arrays, assigning “Country” and “City” as the index level names.

  # create multiindex from array multi_index = pd.MultiIndex.from_arrays(index_array, names=['Country', 'City'])

Next we create a data frame “df6 using multi_index

  # create dataframe using multiindex df6 = pd.DataFrame({'Population': population}, index=multi_index)

Example 24 build 3 level multi index

This defines lists representing baseball data

  # create data years = [2022, 2022, 2022, 2023, 2023, 2023] leagues = ['AL', 'AL', 'NL', 'AL', 'AL', 'NL'] teams = ['Yankees', 'Astros', 'Dodgers', 'Yankees', 'Astros', 'Dodgers'] wins = [99, 106, 111, 82, 90, 100] losses = [63, 56, 51, 80, 72, 62]

This comines the years, leagues and teams lists into one list of lists called index_array, which will be used to build a multiindex for the DataFrame rows

  # create array of arrays index_array = [years, leagues, teams]

This creates a Multiindex for rows using the yeahrs, leagues, and teams arrays, and labels the index levels as “Year”, “League”, and “Team”.Â

  # create multiindex multi_index = pd.MultiIndex.from_arrays(index_array, names=['Year', 'League', 'Team'])

This creates a DataFrame “df7” with the wins and losses data , using the previously created multi_index as the row index to organize the data by year, league, and team

  # create dataframe df7 = pd.DataFrame({'Wins': wins, 'Losses': losses}, index=multi_index)

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Leave a Reply

Your email address will not be published. Required fields are marked *