beautifulsoup pagination

  • import requests – Allows us to make HTTP requests to web pages.

  • from bs4 import BeautifulSoup – Used to parse and extract data from HTML content.

  • import pandas as pd – Used for organizing and manipulating data in table format.

  • import re – Enables pattern matching using regular expressions.

  • from time import sleep – Lets us pause the script for a set amount of time.

  • import random – Used to generate random values (e.g., for sleep delays).

				
					import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from time import sleep
import random
				
			

Example 1 - Setting up a timer

Here, we create a timer that waits for 3 seconds between each round of the loop.

				
					
for i in range(5):
    print(f"Doing something... round {i+1}")
    sleep(3)   # wait 3 seconds
				
			

Example 2 - Setting up a random timer

Here, we pick a random float between 1 and 5 seconds, then pause for that randomly chosen time.

				
					
for i in range(5):
    print(f"Doing something... round {i+1}")
    delay = random.uniform(1, 5)   # float seconds between 1 and 5
    print(f"Waiting {delay:.2f} seconds...")
    sleep(delay)
				
			

Example 3 - String formatting the URL

Here, we loop through page numbers 1 to 5, build a URL containing the current page number, and then display the URL.

				
					
for page in range(1, 6):  # Pages 1 to 5
    url = f"https://example.com/page={page}"
    print(f"Fetching URL: {url}")
				
			

Example 4 - Books to Scrape with a static timer

  • titles = [], prices = [] – Create empty lists to store book titles and prices.

  • page = 1 – Start scraping from page 1.

  • while page < 10: – Loop through pages 1 to 9.

  • url = f"http://books.toscrape.com/catalogue/page-{page}.html" – Format the URL for each page.

  • response = requests.get(url) – Fetch the page content.

  • if response.status_code == 404: break – Stop if the page doesn’t exist.

  • soup = BeautifulSoup(response.text, "html.parser") – Parse the HTML content.

  • articles = soup.find_all("article", class_="product_pod") – Get all book items on the page.

  • for article in articles: – Loop through each book item.

  • title = article.h3.a["title"] – Extract the book title.

  • price = article.find("p", class_="price_color").text.strip() – Extract and clean the book price.

  • titles.append(title), prices.append(price) – Save the title and price.

  • page += 1 – Go to the next page.

  • sleep(1) – Wait 1 second before the next request.

				
					
titles = []
prices = []

page = 1
while page < 10:
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"

    response = requests.get(url)
    if response.status_code == 404:
        break

    soup = BeautifulSoup(response.text, "html.parser")
    articles = soup.find_all("article", class_="product_pod")

    for article in articles:
        title = article.h3.a["title"]
        price = article.find("p", class_="price_color").text.strip()
        titles.append(title)
        prices.append(price)

    page += 1
    sleep(1)  # polite delay



				
			

Here, we create a pandas DataFrame with columns Title and Price, populated with the data collected in the previous step.

				
					df = pd.DataFrame({
    "Title": titles,
    "Price": prices
})
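
As a quick optional sanity check, we can preview the DataFrame:

print(df.shape)   # (number of books scraped, 2)
print(df.head())  # first few titles and prices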
				
			

Example 5 - Hockey website: tables, multiple pages, a function, and a random timer

In this function, we define scrape_pages(num_pages) to scrape multiple pages from a website with a paginated table. We start by setting the base URL template and creating empty lists to store column headers and row data. For each page from 1 to the number specified, we format the URL and make a GET request to fetch the HTML content. We parse the response using BeautifulSoup and look for a table with the class "table". If a table isn’t found, we print a message and skip to the next page. On the first page, we extract the table headers. For every page, we loop through each table row with the class "team", extract and clean the cell text, and append it to our rows list. To avoid overwhelming the server, we pause for a random interval between 1 and 5 seconds after each request. Once all pages are processed, we use the collected headers and rows to create a Pandas DataFrame and return it.

				
					
def scrape_pages(num_pages):
    base_url = "https://www.scrapethissite.com/pages/forms/?page={}"
    headers = []
    rows = []

    for page in range(1, num_pages + 1):
        url = base_url.format(page)
        response = requests.get(url)

        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find("table", {"class": "table"})
        if not table:
            print(f"No table found on page {page}")
            continue

        # Extract column headers on first page
        if page == 1:
            headers = [th.get_text(strip=True) for th in table.find_all("th")]

        # Extract rows
        for tr in table.find_all("tr", class_="team"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append(cells)

        delay = random.uniform(1, 5)
        sleep(delay)  # polite delay to avoid overloading server

    # Build DataFrame
    df = pd.DataFrame(rows, columns=headers)
    return df
				
			
				
					df = scrape_pages(5)
				
			

Let's filter based on the scraped data.

Example - The customer only wants to see winning teams

Here, we create a new DataFrame df2 by filtering df to include only the rows where the number of Wins is greater than the number of Losses.

This helps us keep only the teams with a winning record.

				
					df2 = df[df['Wins'] > df['Losses']]
				
			

Finally, we save the DataFrame as a CSV file.

				
					df2.to_csv("scrapethissite_forms_stats.csv", index=False)
				
			

web scraping with python

				
					import requests
from urllib.parse import urljoin
import urllib.robotparser
				
			

Part 1 Getting your first page

				
def response_code(response):
    if response.status_code == 200:
        print("Page fetched successfully!")
    else:
        print("Failed to retrieve page:", response.status_code)
				
			
				
					URL = "http://books.toscrape.com/"
				
			
				
					url_response = requests.get(URL)
				
			
				
					response_code(url_response)
				
			
Failed to retrieve page: 403 | a 403 means the client does not have the necessary permissions to access that page or resource on the server
				
					URL2 = "https://www.baseball-reference.com/players/s/suzukic01.shtml"
				
			
				
					url_response_2 = requests.get(URL2)
				
			
				
					response_code(url_response_2)
				
			
Failed to retrieve page: 404 | a 404 means the page doesn't exist
				
					URL3 = "https://ryanandmattdatascience.com/100miles"
				
			
				
					url_response_3 = requests.get(URL3)
				
			
				
					response_code(url_response_3)
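
As an alternative to checking status_code by hand, requests can raise an exception for error responses. A minimal sketch using raise_for_status():

try:
    check = requests.get("https://ryanandmattdatascience.com/100miles")
    check.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
    print("Page fetched successfully!")
except requests.exceptions.HTTPError as err:
    print("Failed to retrieve page:", err)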
				
			

Example 2 Checking robots.txt file

				
def check_robots(url):
    robots_url = urljoin(url, '/robots.txt')
    response = requests.get(robots_url)
    print(response.text)
				
			
				
					check_robots('https://www.amazon.com')
				
			
#Do not scrape disallowed paths — it’s unethical and can get you blocked

#User-agent: *
#Disallow: /private/
#This means scrapers should avoid any URL under /private/ on that site.

#Look for the following; the next example shows how to check it programmatically
#User-agent: *
#Crawl-delay: 10
#This means bots should wait 10 seconds between requests.

Example 3 Checking robots.txt file - Look For Delays

 
#Check rate limits
#If the site offers an official API, always use that first; the docs usually include rate limits.
#Example (from the GitHub API docs):
#"You can make up to 60 requests per hour for unauthenticated requests."
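#GitHub also exposes its limits through a dedicated endpoint; a tiny optional check:

api_response = requests.get("https://api.github.com/rate_limit")
print(api_response.json()["rate"])  # shows limit, remaining, and reset time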
				
					rp = urllib.robotparser.RobotFileParser()
				
			
				
rp.set_url("https://www.amazon.com/robots.txt")
rp.read()
				
			
				
					delay = rp.crawl_delay("*")
				
			
				
					print(delay)  # Might return something like 10
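
If a crawl delay is published, it can be fed straight into the pause between requests. A small sketch; the 10-second fallback is just an assumed default for sites that don't set one:

from time import sleep

wait = delay if delay is not None else 10  # fall back to a conservative default
print(f"Waiting {wait} seconds between requests...")
sleep(wait)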
				
			

Example 4 - Checking if a page can be scraped

#Checking robots.txt file if you can scrape a page
#Use urllib.robotparser to respect rules
#Check for Disallow, Crawl-delay
#Handle 429 or 503 errors with backoff
#Be sure to add delays between pages.
				
					rp = urllib.robotparser.RobotFileParser()
				
			
				
					rp.set_url('https://www.amazon.com/robots.txt')
				
			
				
					rp.read()
				
			
				
					print(rp.can_fetch("*", "https://www.amazon.com/CELSIUS-Fitness-Energy-Standard-Variety/dp/B06X6J5266/?_encoding=UTF8&pd_rd_w=U0HVD&content-id=amzn1.sym.9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_p=9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_r=VXNM0SX2W4PCFA2KB7FJ&pd_rd_wg=KJlEC&pd_rd_r=8a75547a-24ff-4998-bdea-f875d0f05448&ref_=pd_hp_d_btf_crs_zg_bs_16310101"))
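
The notes above also mention handling 429 or 503 errors with backoff. Here is a minimal sketch of that idea; the retry count and base delay are arbitrary assumptions:

from time import sleep

def get_with_backoff(url, retries=3, base_delay=5):
    # Retry on 429 (Too Many Requests) and 503 (Service Unavailable), doubling the wait each time
    response = None
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code in (429, 503):
            wait = base_delay * (2 ** attempt)
            print(f"Got {response.status_code}, waiting {wait} seconds before retrying...")
            sleep(wait)
            continue
        return response
    return response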
				
			

Example 5 Headers

In a later video we will go over rotating headers; a small sketch of the idea is included at the end of this example.
				
					URL = "http://books.toscrape.com/"
				
			
				
					headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36'
}
				
			
				
					requests.get(URL, headers=headers)
				
			
				
					url = "https://ultrasignup.com/results_event.aspx?did=96529"
				
			
				
					headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}
				
			
				
					# Fetch the page
response = requests.get(url, headers=headers)
				
			
				
					response
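
As a preview of rotating headers, here is a minimal sketch that picks a random User-Agent for each request. The agent strings below are just illustrative examples, not a vetted list:

import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:118.0) Gecko/20100101 Firefox/118.0",
]

rotating_headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("http://books.toscrape.com/", headers=rotating_headers)
print(response.status_code, rotating_headers["User-Agent"])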
				
			

BeautifulSoup4 find vs find_all

				
					import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
				
			
				
					html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Ultra Running Events</title>
</head>
<body class="site-body">
    <header class="site-header">
        <h1 class="site-title">Ultra Running Events</h1>
        <nav class="main-nav">
            <ul class="nav-list">
                <li><a class="nav-link" href="#races-50">50 Mile Races</a></li>
                <li><a class="nav-link" href="#races-100">100 Mile Races</a></li>
            </ul>
        </nav>
    </header>

    <section id="races-50" class="race-section race-50">
        <h2 class="section-title-50">50 Mile Races</h2>
        <ul class="race-list-50">
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/rocky-mountain-50">Rocky Mountain 50</a></h3>
                <p class="race-date highlighted">Date: August 10, 2025</p>
                <p class="race-location">Location: Boulder, Colorado</p>
            </li>
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/desert-dash-50">Desert Dash 50</a></h3>
                <p class="race-date">Date: September 14, 2025</p>
                <p class="race-location">Location: Moab, Utah</p>
            </li>
        </ul>
    </section>

    <section id="races-100" class="race-section race-100">
        <h2 class="section-title-100">100 Mile Races</h2>
        <ul class="race-list-100">
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/mountain-madness-100">Mountain Madness 100</a></h3>
                <p class="race-date">Date: July 5, 2025</p>
                <p class="race-location">Location: Lake Tahoe, California</p>
            </li>
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/endurance-beast-100">Endurance Beast 100</a></h3>
                <p class="race-date">Date: October 3, 2025</p>
                <p class="race-location">Location: Asheville, North Carolina</p>
            </li>
        </ul>
    </section>

    <footer class="site-footer">
        <p>&copy; 2025 Ultra Running Events</p>
    </footer>
</body>
</html>

"""
				
			
				
					URL = 'https://books.toscrape.com/'
				
			

Part 1 - Parsing the HTML from your first page (a snapshot of the HTML at that time)

				
					response = requests.get(URL)
				
			
				
					if response.status_code == 200:
    print("Page fetched successfully!")
else:
    print("Failed to retrieve page:", response.status_code)
				
			
				
					soup = BeautifulSoup(response.text, 'html.parser')
				
			
				
					soup_html = BeautifulSoup(html, 'html.parser')
				
			
				
					soup
				
			

Example 3 use soup.prettify()

				
					print(soup.prettify())
				
			

Example 4 - Grab Page title

				
					soup_html.title
				
			

Example 5 - Grab Page H2 (This only grabs the first one...)

				
					soup_html.h2
				
			

Example 6 - Grab Page title text

				
					soup_html.title.get_text()
				
			
				
					soup_html.h2['class']
				
			
#The next two examples take a look at find vs find_all
#| Method       | Returns                        | Use When                                 |
#| ------------ | ------------------------------ | ---------------------------------------- |
#| `find()`     | The **first** matching element | You want a single element                |
#| `find_all()` | A **list** of all matches      | You want to loop through multiple items  |

Example 8 - Find

find() only returns the first match — it doesn’t let you directly access the second, third, etc.
				
					soup_html.find('h2')
				
			

Example 10 Find with Class

				
					soup_html.find('h2', class_ = 'section-title-50').get_text()
				
			
				
					soup_html.find('h2', class_ = 'section-title-100').get_text()
				
			

Example 11 - Chaining find() calls

				
					#Finds the first <li> (list item) element in the document
#From that <li> element, it then finds the first <a> (anchor) tag inside that <li>.
soup_html.find('li').find('a')
				
			

Example 12 - Separating out chained find() calls

				
					list_item = soup_html.find('li')
				
			
				
					list_item_a = list_item.find('a')
				
			
				
					list_item_a
				
			
				
					#Example 13 Find All Races
soup_html.find_all('h2')
				
			

Example 14 - Find the first or second race

				
					soup_html.find_all('h2')[0]
				
			
				
					soup_html.find_all('h2')[0].get_text()
				
			
				
					soup_html.find_all('h2')[1]
				
			
				
					soup_html.find_all('h2')[1].get_text()
				
			

Example 15 - find_all and print out the text

				
					race_types = soup_html.find_all('h2')
				
			
				
					for race in race_types:
        print(race.get_text())
				
			

Example 16 find all with a class race dates

				
					soup_html.find_all('p', class_ = 'race-date')
				
			

Example 17 find all with Either class

				
					soup_html.find_all("p", class_=["race-date", "race-location"])
				
			

Example 18 - Find with OR on attributes (href, title, id, class, src, alt, type)

				
					soup_html.find_all("p", attrs={"class": ["race-date", "race-location"]})
				
			
				
					soup_html.find_all("a", attrs={"href": ["#races-50", "#races-100"]})
				
			
				
					#Example 19 Search for Strings
soup_html.find_all("a", string='Mountain Madness 100')
				
			
				
					#Example 20 Search for Strings with regex
soup_html.find_all("a", string=re.compile('Madness'))
				
			

Example 21 Parent Sibling Child

				
					h3_races = soup_html.find_all("h3")
				
			
				
					h3_races
				
			
				
for h3 in h3_races:
    print("Race Name:", h3.get_text())

    # Get next siblings that are <p> tags
    for sibling in h3.find_next_siblings('p'):
        print("  ", sibling.get_text())
				
			
Now we work with a real site instead of basic HTML: https://books.toscrape.com/
				
					print(soup.prettify())
				
			

Example 22 Find all books on a page and print them out

				
					#searches for all <article> elements in the HTML that have the class "product_pod".
books = soup.find_all("article", class_="product_pod")
				
			
				
					#.h3: accesses the <h3> tag inside the article.
#.a: accesses the <a> tag inside the <h3>, which contains the link to the book's detail page.
#["title"]: extracts the title attribute of the <a> tag, which holds the title of the book.
for book in books:
    print(book.h3.a["title"])
				
			

Example 23 Grab multiple things at once

				
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').get_text()
    relative_url = book.h3.a['href']
    book_url = URL + relative_url
    print(f"Title: {title} | Price: {price} | URL: {book_url}")
				
			

Example 24 Save Multiple Things to a Dataframe

				
					data = []
				
			
				
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    relative_url = book.h3.a['href']
    book_url = URL + relative_url

    data.append({
        'Title': title,
        'Price': price,
        'URL': book_url
    })
				
			
				
					df = pd.DataFrame(data)
				
			
				
					df
				
			

Example 25 - Clean the DataFrame price column

				
					df['price_clean'] = df['Price'].str.replace('£', '', regex=False).astype(float)
				
			
				
#Convert GBP to USD using an example rate (1 GBP = 1.0737 USD); check the current exchange rate before relying on this
exchange_rate = 1.0737
				
			
				
					df['price_usd'] = df['price_clean'] * exchange_rate
				
			
				
df_final = df[['Title', 'price_usd', 'URL']]
				
			
				
					df_final
				
			

Example 26 Export as a CSV File

				
					df_final.to_csv('scrapped_book_data.csv')
				
			
				
					df_final.to_excel('scrapped_book_data.xlsx')
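
Note that to_excel needs an Excel engine such as openpyxl installed. A small optional variant that also drops the index column:

df_final.to_excel('scrapped_book_data.xlsx', index=False)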
				
			

beautifulsoup4 Selectors

				
					import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
				
			
				
					html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Ultra Running Events</title>
</head>
<body class="site-body">
    <header class="site-header">
        <h1 class="site-title">Ultra Running Events</h1>
        <nav class="main-nav">
            <ul class="nav-list">
                <li><a class="nav-link" href="#races-50">50 Mile Races</a></li>
                <li><a class="nav-link" href="#races-100">100 Mile Races</a></li>
            </ul>
        </nav>
    </header>

    <section id="races-50" class="race-section race-50">
        <h2 class="section-title-50">50 Mile Races</h2>
        <ul class="race-list-50">
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/rocky-mountain-50">Rocky Mountain 50</a></h3>
                <p class="race-date">Date: August 10, 2025</p>
                <p class="race-location">Location: Boulder, Colorado</p>
            </li>
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/desert-dash-50">Desert Dash 50</a></h3>
                <p class="race-date">Date: September 14, 2025</p>
                <p class="race-location">Location: Moab, Utah</p>
            </li>
        </ul>
    </section>

    <section id="races-100" class="race-section race-100">
        <h2 class="section-title-100">100 Mile Races</h2>
        <ul class="race-list-100">
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/mountain-madness-100">Mountain Madness 100</a></h3>
                <p class="race-date">Date: July 5, 2025</p>
                <p class="race-location">Location: Lake Tahoe, California</p>
            </li>
            <li class="race-item">
                <h3 class="race-name"><a href="https://www.ryandataraces.com/endurance-beast-100">Endurance Beast 100</a></h3>
                <p class="race-date">Date: October 3, 2025</p>
                <p class="race-location">Location: Asheville, North Carolina</p>
            </li>
        </ul>
    </section>

    <section id="important-notes">
        <h2>Important Notes</h2>
        <p><strong>All races start at 6:00 AM sharp.</strong></p>
        <p><strong>Mandatory pre-race check-in the evening before.</strong></p>
    </section>

    <footer class="site-footer">
        <p>&copy; 2025 Ultra Running Events</p>
    </footer>
</body>
</html>

"""
				
			
				
					soup_html = BeautifulSoup(html, 'html.parser')
				
			
				
					soup_html.select_one('h2')
				
			
				
					soup_html.select_one('h2').get_text()
				
			
				
					soup_html.select_one('strong').get_text()
				
			
				
					soup_html.select('p') #grabs all p tags
				
			
				
					soup_html.select('p')[0].get_text()
				
			
				
					soup_html.select('p')[1].get_text()
				
			
				
					all_p = soup_html.select('p') #grabs all p tags
				
			
				
					for p in all_p:
  print(p.get_text())
				
			
				
					soup_html.select('a')
				
			
				
					a_element = soup_html.select('a')
				
			
				
					for link in a_element:
  print(link['href'])
				
			
				
					a_element[2]['href']
				
			
				
					soup_html.select('#races-100')
				
			
				
					#Example 10 This selector finds any <a> descendant of an <h3> element with the class race-name
#"Descendant" means any level deep inside the h3, not just direct children
soup_html.select('h3.race-name a')
				
			
				
					#Example 11 Direct Descendents
#Let’s say you want to extract the <a> tag directly under each <h3 class="race-name"> (but only if it's a direct child)
#It will not match nested links.
#Omit > when you want any level of nesting
soup_html.select('h3.race-name > a')
				
			
				
					#Example 12 After an Elements Siblings
#Paragraphs after h3 tag
#Have the same parent as an <h3>, and
#Appear after that <h3> in the HTML, regardless of how many elements are in between
#It's useful when you want to grab siblings after a specific element, but not necessarily immediately after
soup_html.select("h3 ~p")
				
			
				
					#Example 13  Element in one of two classes (e.g., .race-50 OR .race-100)
sections = soup_html.select('section.race-50, section.race-100')
				
			
				
					for section in sections:
        title = section.find('h2').text.strip()
        print(f"Section Title: {title}")
				
			
				
					#Example 14 Element in both classes (e.g., .race-section AND .race-50) (Order doesnt matter)
race50_section = soup_html.select('section.race-section.race-50')
				
			
				
					race50_section_v2 = soup_html.select('section.race-50.race-section')
				
			
				
					for section in race50_section:
        print(f"Found section: {section['id']}")
				
			
				
					for section in race50_section_v2:
        print(f"Found section: {section['id']}")
				
			
				
					URL = "http://books.toscrape.com/"
				
			
				
					response = requests.get(URL)
				
			
				
					soup = BeautifulSoup(response.text, 'html.parser')
				
			
				
					#Example 15 Select category links from sidebar
category_links = soup.select("ul.nav-list ul a")
				
			
				
					for a in category_links:
        print(a.text.strip(), URL + a['href'])
				
			
				
					books = soup.select("article.product_pod h3 a")
				
			
				
					books
				
			
				
					books = soup.select("article.product_pod h3 a")
				
			
				
					for book in books:
      print(book.get_text())
				
			
				
					books = soup.select("article.product_pod")
				
			
				
book_data = []  # list to collect one dict per book

for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating = book.select_one("p.star-rating")["class"][-1]  # e.g. 'Three'
    book_data.append({"title": title, "price": price, "rating": rating})
				
			
				
					df = pd.DataFrame(book_data)
				
			
				
					df.head(10)
				
			
				
					df['price_clean'] = df['price'].str.replace('£', '', regex=False).astype(float)
				
			
				
#Convert GBP to USD using an example rate (1 GBP = 1.0737 USD); check the current exchange rate before relying on this
exchange_rate = 1.0737
				
			
				
					df['price_usd'] = df['price_clean'] * exchange_rate
				
			
				
					df['Price_usd'] = df['price_usd'].apply(lambda x: f"${x:.2f}")
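
Before filtering to five-star books, a quick optional check of which rating values the scrape produced:

print(df['rating'].value_counts())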
				
			
				
					df_five_star = df.loc[df['rating'] == 'Five', ['title', 'Price_usd']]
				
			
				
					df_five_star
				
			
				
					df_five_star.to_csv('scrapped_book_data.csv')
				
			
				
					df_five_star.to_excel('scrapped_book_data.xlsx')
				
			

BeautifulSoup4 extract table

				
					import requests
import pandas as pd
from bs4 import BeautifulSoup
				
			

Basic Example HTML Code -> Runners

				
					html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Personal Running Bests</title>
</head>
<body>

  <h1>Personal Running Bests</h1>

  <table>
    <thead>
      <tr>
        <th>Distance</th>
        <th>Time</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>5k</td>
        <td>18:30</td>
      </tr>
      <tr>
        <td>10k</td>
        <td>37:50</td>
      </tr>
      <tr>
        <td>Half Marathon</td>
        <td>1:25:11</td>
      </tr>
      <tr>
        <td>Marathon</td>
        <td>3:17:00</td>
      </tr>
      <tr>
        <td>50 Miler</td>
        <td>9:14:30</td>
      </tr>
      <tr>
        <td>100 Miler</td>
        <td>32:11:11</td>
      </tr>
    </tbody>
  </table>

</body>
</html>
"""
				
			

Extract headers

				
					headers = [th.get_text(strip=True) for th in table.find_all("th")]
				
			
				
					headers
				
			

Extract table rows

				
					rows = []
				
			
				
					for tr in table.find_all("tr")[1:]:  # Skip header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
				
			
				
					# Step 6: Create a pandas DataFrame
df = pd.DataFrame(rows, columns=headers)
				
			
				
					df
				
			

Example 2 Metallica Ticket Sales

				
					url = "https://en.wikipedia.org/wiki/WorldWired_Tour"
				
			
				
					response = requests.get(url)
				
			
				
					html = response.text
				
			
				
					soup = BeautifulSoup(html, "html.parser")
				
			

Only grab the first table - we don't want that in this instance.

				
					table = soup.find("table")
				
			
				
					table
				
			

Grab all tables - we don't want that in this instance either; we need to target a specific table.

				
					# or find all tables
tables = soup.find_all("table")
				
			
				
					tables
				
			

Target a specific table - the one whose header contains "Date (2017)"

				
					table = None
				
			
				
					for th in soup.find_all("th"):
        if "Date (2017)" in th.get_text():
            table = th.find_parent("table")
            break
				
			
				
					table
				
			
				
					headers = [th.get_text(strip=True) for th in table.find('tr').find_all('th')]
				
			
				
rows = []  # reset the rows list so Example 1's data isn't carried over

for tr in table.find_all('tr')[1:]:  # Skip header row
    cells = tr.find_all(['th', 'td'])
    row = [cell.get_text(strip=True).replace('\xa0', ' ') for cell in cells]
    rows.append(row)
				
			
				
					df = pd.DataFrame(rows, columns=headers[:len(rows[0])])  # Avoid header mismatch
				
			
				
					df
				
			

Cleaning up the data - there are plenty of ways to start. Some sites have nice, easy tables; this one isn't the best, so it needs several supporting cleanup steps.

rename date column

				
					df.rename(columns={'Date (2017)': 'date'}, inplace=True)
				
			

remove [] in date column

				
					df['date'] = df['date'].str.replace(r'\[.*?\]', '', regex=True).str.strip()

				
			
				
					df
				
			

forward fill to fix city, venue, and country issue

				
					df['City'] = df['City'].ffill()
df['Country'] = df['Country'].ffill()
df['Venue'] = df['Venue'].ffill()
				
			
				
					df
				
			

fix column names

				
					df.columns = df.columns.str.strip()
				
			
				
					df.columns = df.columns.str.replace(' ', '_')
				
			
				
					df.columns = df.columns.str.lower()
				
			
				
					df
				
			

Fix attendance issue

				
# Helper to detect values formatted like "54,227 / 54,227" (attendance / capacity)
def is_attendance(val):
    if pd.isna(val):
        return False
    pattern = r'^\d{1,3}(?:,\d{3})? ?/ ?\d{1,3}(?:,\d{3})?$'
    return bool(pd.Series(val).str.contains(pattern, regex=True)[0])
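
A quick optional check of the helper against a couple of literal values, just to confirm the pattern matches the "sold / capacity" format:

print(is_attendance("54,227 / 54,227"))     # True
print(is_attendance("Copenhagen, Denmark")) # False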
				
			
				
					df

				
			

Fix stadium issue

				
					venue_keywords = r'\b(?:Arena|Center|Field|Stadium|Garden|Park|Speedway)\b'
				
			
				
					mask = df['country'].str.contains(venue_keywords, case=False, na=False)
				
			
				
					df.loc[mask, 'venue'] = df.loc[mask, 'country']
				
			
				
					df.loc[mask, 'country'] = None  # Clear them from 'country'
				
			
				
					df
				
			

Export to CSV

				
					df.to_csv("metallica.csv", index=False)

				
			
