BeautifulSoup4 extract table
import requests import pandas as pd from bs4 import BeautifulSoup
Basic Example HTML Code -> Runners
html = """ Personal Running Bests Personal Running Bests
Distance Time 5k 18:30 10k 37:50 Half Marathon 1:25:11 Marathon 3:17:00 50 Miler 9:14:30 100 Miler 32:11:11
"""
Extract headers
headers = [th.get_text(strip=True) for th in table.find_all("th")]
headers

Step 5: Extract table rows
rows = []
for tr in table.find_all("tr")[1:]: # Skip header row cells = [td.get_text(strip=True) for td in tr.find_all("td")] if cells: rows.append(cells)
# Step 6: Create a pandas DataFrame df = pd.DataFrame(rows, columns=headers)
df

Example 2 Metallica Ticket Sales
url = "https://en.wikipedia.org/wiki/WorldWired_Tour"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
Only Grab the First Table We don't want that in this instance
table = soup.find("table")
table

Grab All Tables. We don't want that in this instance, we need to target a specific table
# or find all tables tables = soup.find_all("table")
tables

Target a specific table #TO DO -> 2017 Concert
table = None
for th in soup.find_all("th"): if "Date (2017)" in th.get_text(): table = th.find_parent("table") break
table

headers = [th.get_text(strip=True) for th in table.find('tr').find_all('th')]
for tr in table.find_all('tr')[1:]: # Skip header row cells = tr.find_all(['th', 'td']) row = [cell.get_text(strip=True).replace('\xa0', ' ') for cell in cells] rows.append(row)
df = pd.DataFrame(rows, columns=headers[:len(rows[0])]) # Avoid header mismatch
df

Cleaning up data, tons of ways we can start cleaning up date, some sites super easy and nice tables, this one isnt the best, should have multiple support columns etc
rename date column
df.rename(columns={'Date (2017)': 'date'}, inplace=True)
remove [] in date column
df['date'] = df['date'].str.replace(r'\[.*?\]', '', regex=True).str.strip()
df

forward fill to fix city, venue, and country issue
df['City'] = df['City'].ffill() df['Country'] = df['Country'].ffill() df['Venue'] = df['Venue'].ffill()
df

fix column names
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.lower()
df

Fix attendance issue
def is_attendance(val): if pd.isna(val): return False pattern = r'^\d{1,3}(?:,\d{3})? ?/ ?\d{1,3}(?:,\d{3})?$' return bool(pd.Series(val).str.contains(pattern, regex=True)[0])
df

Fix stadium issue
venue_keywords = r'\b(?:Arena|Center|Field|Stadium|Garden|Park|Speedway)\b'
mask = df['country'].str.contains(venue_keywords, case=False, na=False)
df.loc[mask, 'venue'] = df.loc[mask, 'country']
df.loc[mask, 'country'] = None # Clear them from 'country'
df

Export to CSV
df.to_csv("metallica.csv", index=False)
Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.