Web Scraping with Python

  import requests
  from urllib.parse import urljoin
  import urllib.robotparser

Part 1 Getting your first page

  def response_code(response):
      if response.status_code == 200:
          print("Page fetched successfully!")
      else:
          print("Failed to retrieve page:", response.status_code)
  URL = "http://books.toscrape.com/"
  url_response = requests.get(URL)
  response_code(url_response)
Failed to retrieve page: 403 | a 403 means the client does not have permission to access the requested page or resource on the server
  URL2 = "https://www.baseball-reference.com/players/s/suzukic01.shtml"
  url_response_2 = requests.get(URL2)
  response_code(url_response_2)
Failed to retrieve page: 404 | a 404 means the requested page does not exist
  URL3 = "https://ryanandmattdatascience.com/100miles"
  url_response_3 = requests.get(URL3)
  response_code(url_response_3)
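If you prefer exceptions over checking status codes by hand, requests can raise one for you. Below is a minimal sketch using raise_for_status(); the fetch_page helper is just an illustration and is not used in the rest of the examples.
  def fetch_page(url):
      # raise_for_status() raises requests.exceptions.HTTPError for any 4xx/5xx response
      response = requests.get(url)
      try:
          response.raise_for_status()
          print("Page fetched successfully!")
          return response
      except requests.exceptions.HTTPError as err:
          print("Failed to retrieve page:", err)
          return None

  fetch_page("http://books.toscrape.com/")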

Example 2 Checking robots.txt file

  def check_robots(url):
      robots_url = urljoin(url, '/robots.txt')
      response = requests.get(robots_url)
      print(response.text)
  check_robots('https://www.amazon.com')
#Do not scrape disallowed paths — it’s unethical and can get you blocked


#User-agent: *
#Disallow: /private/
#This means scrapers should avoid any path under https://www.compreoalquile.com/private/.


#Also look for the following directive; the next example shows how to check for it programmatically.
#User-agent: *
#Crawl-delay: 10
#This means bots should wait 10 seconds between requests.
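To actually honor a Crawl-delay like this, sleep between requests. A minimal sketch assuming a 10-second delay and a short, made-up list of pages to fetch:
  import time

  # Hypothetical list of pages to scrape politely
  page_urls = [
      "http://books.toscrape.com/catalogue/page-1.html",
      "http://books.toscrape.com/catalogue/page-2.html",
  ]

  for page_url in page_urls:
      response = requests.get(page_url)
      response_code(response)
      time.sleep(10)  # wait the 10 seconds the site asked for before the next request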

Example 3 Checking robots.txt file - Look For Delays

#Check rate limits
#If the site offers an official API, always use that first; the docs usually include the rate limits.
#Example (from the GitHub API docs):
#"You can make up to 60 requests per hour for unauthenticated requests."
#The sketch after this example shows how to read these limits from the response headers.
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://www.amazon.com/robots.txt")
  rp.read()  # download and parse the robots.txt file
  delay = rp.crawl_delay("*")
  print(delay)  # Might return something like 10, or None if no Crawl-delay is set
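For official APIs, the remaining quota is usually reported in the response headers rather than in robots.txt. A minimal sketch against the public GitHub API (header names taken from GitHub's documentation; the octocat endpoint is just an example):
  api_response = requests.get("https://api.github.com/users/octocat")
  print(api_response.headers.get("X-RateLimit-Limit"))      # e.g. 60 for unauthenticated requests
  print(api_response.headers.get("X-RateLimit-Remaining"))  # requests left in the current window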

Example 4 Checking If You Can Scrape a Page

#Check robots.txt to see whether you can scrape a specific page
#Use urllib.robotparser to respect the rules
#Check for Disallow and Crawl-delay directives
#Handle 429 or 503 errors with backoff (see the sketch after this example)
#Be sure to add delays between pages.
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url('https://www.amazon.com/robots.txt')
  rp.read()
  print(rp.can_fetch("*", "https://www.amazon.com/CELSIUS-Fitness-Energy-Standard-Variety/dp/B06X6J5266/?_encoding=UTF8&pd_rd_w=U0HVD&content-id=amzn1.sym.9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_p=9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_r=VXNM0SX2W4PCFA2KB7FJ&pd_rd_wg=KJlEC&pd_rd_r=8a75547a-24ff-4998-bdea-f875d0f05448&ref_=pd_hp_d_btf_crs_zg_bs_16310101"))
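For the 429/503 backoff mentioned above, a simple approach is to retry with an exponentially growing wait. A minimal sketch; fetch_with_backoff is a hypothetical helper, not part of requests:
  import time

  def fetch_with_backoff(url, headers=None, max_retries=5):
      wait = 1
      for attempt in range(max_retries):
          response = requests.get(url, headers=headers)
          if response.status_code not in (429, 503):
              return response
          print(f"Got {response.status_code}, retrying in {wait} seconds...")
          time.sleep(wait)
          wait *= 2  # double the wait after each throttled attempt
      return response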

Example 5 Headers

In a later video we will go over rotating headers
  URL = "http://books.toscrape.com/"
  headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36' }
  requests.get(URL, headers=headers)
  url = "https://ultrasignup.com/results_event.aspx?did=96529"
  headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" }
  # Fetch the page with a browser-like User-Agent
  response = requests.get(url, headers=headers)
  response
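As a small preview of rotating headers, one common approach is to pick a random User-Agent from a pool for each request. A minimal sketch with a hypothetical list of User-Agent strings:
  import random

  user_agents = [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  ]

  headers = {"User-Agent": random.choice(user_agents)}  # different User-Agent per request
  response = requests.get(url, headers=headers)
  response_code(response)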

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
