BeautifulSoup Pagination

  • import requests – Allows us to make HTTP requests to web pages.

  • from bs4 import BeautifulSoup – It is used to parse and extract data from HTML content.

  • import pandas as pd – It is used for organizing and manipulating data in table format.

  • import re – It enables pattern matching using regular expressions.

  • from time import sleep – It lets us pause the script for a set amount of time.

  • import random – It is used to generate random values (e.g., for sleep delays).

  import requests
  from bs4 import BeautifulSoup
  import pandas as pd
  import re
  from time import sleep
  import random

Example 1 - Setting up a timer

Here, we create a timer that waits 3 seconds between each round of the loop.

   for i in range(5):
       print(f"Doing something... round {i+1}")
       sleep(3)  # wait 3 seconds

Example 2 - Setting up a random timer

Here we pick a random float between 1 and 5 seconds, then pause for the randomly chosen time.

   for i in range(5):
       print(f"Doing something... round {i+1}")
       delay = random.uniform(1, 5)  # float seconds between 1 and 5
       print(f"Waiting {delay:.2f} seconds...")
       sleep(delay)

Example 3 - Formatting a URL string

Here, we loop through page numbers 1 to 5. For each one, we build a URL containing the current page number and then display it.

   for page in range(1, 6):  # Pages 1 to 5
       url = f"https://example.com/page={page}"
       print(f"Fetching URL: {url}")
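Many sites put the page number in a query string (for example ?page=2) rather than in the path. As a minimal sketch of that variation, the helper below builds the URLs with urlencode from the standard library. The function name build_page_urls, the example.com endpoint, and the sort parameter are all made up for illustration.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, **extra_params):
    """Return one URL per page, with the page number in the query string."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode escapes special characters in parameter values
        query = urlencode({"page": page, **extra_params})
        urls.append(f"{base_url}?{query}")
    return urls


# Hypothetical endpoint and parameter names, for illustration only
for url in build_page_urls("https://example.com/items", 1, 3, sort="name asc"):
    print(url)
```

Building the query with urlencode instead of an f-string means values containing spaces or symbols are escaped for us.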

Example 4 - Books to Scrape with a static timer

  • titles = [], prices = [] – Create empty lists to store book titles and prices.

  • page = 1 – Start scraping from page 1.

  • while page < 10: – Loop through pages 1 to 9.

  • url = f"http://books.toscrape.com/catalogue/page-{page}.html" – Format the URL for each page.

  • response = requests.get(url) – Fetch the page content.

  • if response.status_code == 404: break – Stop if the page doesn’t exist.

  • soup = BeautifulSoup(response.text, "html.parser") – Parse the HTML content.

  • articles = soup.find_all("article", class_="product_pod") – Get all book items on the page.

  • for article in articles: – Loop through each book item.

  • title = article.h3.a["title"] – Extract the book title.

  • price = article.find("p", class_="price_color").text.strip() – Extract and clean the book price.

  • titles.append(title), prices.append(price) – Save the title and price.

  • page += 1 – Go to the next page.

  • sleep(1) – Wait 1 second before the next request.

   titles = []
   prices = []
   page = 1
   while page < 10:
       url = f"http://books.toscrape.com/catalogue/page-{page}.html"
       response = requests.get(url)
       if response.status_code == 404:
           break
       soup = BeautifulSoup(response.text, "html.parser")
       articles = soup.find_all("article", class_="product_pod")
       for article in articles:
           title = article.h3.a["title"]
           price = article.find("p", class_="price_color").text.strip()
           titles.append(title)
           prices.append(price)
       page += 1
       sleep(1)  # polite delay
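The loop above relies on a fixed page limit and a 404 check to know when to stop. Another common approach is to follow the page's own "next" link until there isn't one. The sketch below shows only the link-finding step, run against a small inline HTML snippet so it works offline. It assumes the pager marks its link as an li element with class "next" wrapping an a tag, which is how Books to Scrape appears to do it; the helper name find_next_url is our own.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed markup: <li class="next"><a href="page-2.html">next</a></li>
    link = soup.select_one("li.next a")
    if link is None or not link.get("href"):
        return None
    # The href is relative, so resolve it against the page we are on
    return urljoin(current_url, link["href"])


# Inline sample standing in for a downloaded page
sample_html = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
next_url = find_next_url(sample_html, "http://books.toscrape.com/catalogue/page-1.html")
print(next_url)
```

In a real scraper, we would call this on response.text after each request and stop the loop as soon as it returns None.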

Here, we create a pandas DataFrame with the columns Title and Price, populated from the titles and prices lists collected in the loop above.

  df = pd.DataFrame({ "Title": titles, "Price": prices })
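At this point the Price column holds text such as "£51.77", not numbers, so it can't be sorted or averaged correctly yet. This is one place where the re module we imported earlier can help. Here is a minimal sketch, assuming prices are plain decimals with no thousands separators; the helper name parse_price is our own.

```python
import re


def parse_price(text):
    """Pull the first decimal number out of a price string, or return None."""
    # Matches digits with an optional fractional part, ignoring any symbols around them
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(parse_price("£51.77"))
```

To apply it to the whole column, something like df["Price"] = df["Price"].map(parse_price) should work. Because the pattern skips everything before the first digit, it also copes with stray characters that can show up ahead of the currency symbol when a page's encoding is misdetected.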

Example 5 - Hockey Website: tables, multiple pages, a function, and a random timer

In this function, we define scrape_pages(num_pages) to scrape multiple pages from a website with a paginated table. We start by setting the base URL template and creating empty lists to store column headers and row data. For each page from 1 to the number specified, we format the URL and make a GET request to fetch the HTML content. We parse the response using BeautifulSoup and look for a table with the class "table". If a table isn’t found, we print a message and skip to the next page. On the first page, we extract the table headers. For every page, we loop through each table row with the class "team", extract and clean the cell text, and append it to our rows list. To avoid overwhelming the server, we pause for a random interval between 1 and 5 seconds after each request. Once all pages are processed, we use the collected headers and rows to create a Pandas DataFrame and return it.

   def scrape_pages(num_pages):
       base_url = "https://www.scrapethissite.com/pages/forms/?page={}"
       headers = []
       rows = []
       for page in range(1, num_pages + 1):
           url = base_url.format(page)
           response = requests.get(url)
           soup = BeautifulSoup(response.text, "html.parser")
           table = soup.find("table", {"class": "table"})
           if not table:
               print(f"No table found on page {page}")
               continue
           # Extract column headers on first page
           if page == 1:
               headers = [th.get_text(strip=True) for th in table.find_all("th")]
           # Extract rows
           for tr in table.find_all("tr", class_="team"):
               cells = [td.get_text(strip=True) for td in tr.find_all("td")]
               if cells:
                   rows.append(cells)
           delay = random.uniform(1, 5)
           sleep(delay)  # polite delay to avoid overloading server
       # Build DataFrame
       df = pd.DataFrame(rows, columns=headers)
       return df

   df = scrape_pages(5)
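If we want to check the table-parsing logic without hitting the site on every run, one option is to pull it into its own function and feed it a small HTML string. The sketch below does that with the same selectors used in scrape_pages. The helper name parse_team_table and the sample rows (team names and numbers) are made up for illustration and are not real data from the site.

```python
from bs4 import BeautifulSoup


def parse_team_table(html):
    """Return (headers, rows) from the first table with class "table"."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    if table is None:
        return [], []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr", class_="team")
    ]
    return headers, rows


# Made-up sample markup mimicking the structure the scraper expects
sample_html = """
<table class="table">
  <tr><th> Team Name </th><th> Wins </th><th> Losses </th></tr>
  <tr class="team"><td> Example Team A </td><td> 44 </td><td> 24 </td></tr>
  <tr class="team"><td> Example Team B </td><td> 9 </td><td> 10 </td></tr>
</table>
"""

headers, rows = parse_team_table(sample_html)
print(headers)
print(rows)
```

Note that every cell comes back as a string, including the numeric columns.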

Let's filter based on the data we scraped.

Example - The customer only wants to see winning teams

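Here, we create a new DataFrame, df2, by filtering df to include only the rows where the number of Wins is greater than the number of Losses. Because the scraped cell values are strings, we convert both columns to numbers with pd.to_numeric before comparing them.

This keeps only the teams with a winning record.

  df2 = df[pd.to_numeric(df['Wins']) > pd.to_numeric(df['Losses'])]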
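One thing to watch with a filter like this: values scraped with get_text are strings, and comparing strings is alphabetical, not numeric. The small sketch below uses made-up rows (not real scraped data) to show the difference between comparing the raw text and comparing the values after converting them with pd.to_numeric.

```python
import pandas as pd

# Made-up rows; the values are strings, just as they come out of get_text
sample = pd.DataFrame({
    "Team Name": ["Example Team A", "Example Team B"],
    "Wins": ["44", "9"],
    "Losses": ["24", "10"],
})

# Text comparison: "9" > "10" is True because "9" sorts after "1"
as_text = sample[sample["Wins"] > sample["Losses"]]

# Numeric comparison: convert first, turning unparseable cells into NaN
numeric = sample.copy()
for col in ["Wins", "Losses"]:
    numeric[col] = pd.to_numeric(numeric[col], errors="coerce")
winners = numeric[numeric["Wins"] > numeric["Losses"]]

print(as_text["Team Name"].tolist())
print(winners["Team Name"].tolist())
```

With the text comparison, Example Team B slips through even though it has more losses than wins; after conversion, only Example Team A remains.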

Finally, we save the filtered DataFrame to a CSV file.

  df2.to_csv("scrapethissite_forms_stats.csv", index=False)

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
