Web Scraping with Python
import requests
from urllib.parse import urljoin
import urllib.robotparser
Example 1 Getting Your First Page
def response_code(response):
    if response.status_code == 200:
        print("Page fetched successfully!")
    else:
        print("Failed to retrieve page:", response.status_code)
URL = "http://books.toscrape.com/"
url_response = requests.get(URL)
response_code(url_response)

Failed to retrieve page: 403 (a 403 means the client does not have the necessary permissions to access that page or resource on the server)
URL2 = "https://www.compreoalquile.com"
URL2 = "https://www.baseball-reference.com/players/s/suzukic01.shtml"
url_response_2 = requests.get(URL2)
response_code(url_response_2)

Failed to retrieve page: 404 (a 404 means the requested page does not exist)
URL3 = "https://ryanandmattdatascience.com/100miles"
url_response_3 = requests.get(URL3)
response_code(url_response_3)
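If you'd rather not check status codes by hand, requests can also raise an exception for any 4xx/5xx response. A minimal sketch using the books.toscrape.com URL from above:

# Alternative: let requests raise an exception on 4xx/5xx responses
try:
    check = requests.get(URL)
    check.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    print("Page fetched successfully!")
except requests.exceptions.HTTPError as err:
    print("Failed to retrieve page:", err)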

Example 2 Checking robots.txt file
def check_robots(url):
    robots_url = urljoin(url, '/robots.txt')
    response = requests.get(robots_url)
    print(response.text)
check_robots('https://www.amazon.com')

#Do not scrape disallowed paths — it’s unethical and can get you blocked
#User-agent: *
#Disallow: /private/
#This means scrapers should avoid paths under https://www.compreoalquile.com/private/.
Â
#Look for the following; the next example will show how to check for it in code.
#User-agent: *
#Crawl-delay: 10
#This means bots should wait 10 seconds between requests.
Example 3 Checking robots.txt file - Look For Delays
#Check rate limits
#If the site offers an official API, always use that first; rate limits are usually documented there.
#Example (from the GitHub API docs):
#"You can make up to 60 requests per hour for unauthenticated requests."
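As a quick aside, you can see those limits in practice because the GitHub API reports them in response headers. A minimal sketch (the octocat endpoint is just a convenient public example):

# Sketch: the GitHub API reports rate limits in response headers
api_response = requests.get("https://api.github.com/users/octocat")
print(api_response.headers.get("X-RateLimit-Limit"))      # e.g. 60 for unauthenticated requests
print(api_response.headers.get("X-RateLimit-Remaining"))  # requests left in the current window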
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.amazon.com/robots.txt")
rp.read()
delay = rp.crawl_delay("*")
print(delay)  # Might return something like 10, or None if no Crawl-delay is set
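To actually honor that delay, a simple time.sleep between requests is enough. A minimal sketch continuing from the crawl_delay code above (the two catalogue paths are just for illustration):

import time

wait = delay if delay is not None else 10  # fall back to a conservative default if no Crawl-delay is set
pages = ["catalogue/page-1.html", "catalogue/page-2.html"]  # illustrative paths on books.toscrape.com
for path in pages:
    page = requests.get(urljoin("http://books.toscrape.com/", path))
    print(page.url, page.status_code)
    time.sleep(wait)  # pause between requests so we don't hammer the server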

Example 4 Checking robots.txt To See If You Can Scrape a Page
#Use urllib.robotparser to respect rules
#Check for Disallow, Crawl-delay
#Handle 429 or 503 errors with backoff (see the sketch after the can_fetch example below)
#Be sure to add delays between pages.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.amazon.com/robots.txt')
rp.read()
print(rp.can_fetch("*", "https://www.amazon.com/CELSIUS-Fitness-Energy-Standard-Variety/dp/B06X6J5266/?_encoding=UTF8&pd_rd_w=U0HVD&content-id=amzn1.sym.9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_p=9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_r=VXNM0SX2W4PCFA2KB7FJ&pd_rd_wg=KJlEC&pd_rd_r=8a75547a-24ff-4998-bdea-f875d0f05448&ref_=pd_hp_d_btf_crs_zg_bs_16310101"))
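For the 429/503 case mentioned in the list above, a common pattern is exponential backoff: retry after an increasing pause, and respect a Retry-After header if the server sends one. A rough sketch, not tied to any particular site (fetch_with_backoff is a hypothetical helper name):

import time

def fetch_with_backoff(url, headers=None, max_retries=3):
    # Hypothetical helper: retry on 429 (Too Many Requests) and 503 (Service Unavailable)
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code not in (429, 503):
            return response
        # Respect Retry-After if the server sends it; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        wait = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        print(f"Got {response.status_code}, waiting {wait} seconds before retrying...")
        time.sleep(wait)
    return response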

Example 5 Headers
In a later video we will go over rotating headers; a quick preview of the idea is sketched at the end of this example.
URL = "http://books.toscrape.com/"
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36' }
requests.get(URL, headers=headers)

url = "https://ultrasignup.com/results_event.aspx?did=96529"
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" }
# Fetch the page
response = requests.get(url, headers=headers)
response
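As a preview of the rotating-headers idea mentioned above, one simple approach is to pick a random User-Agent from a small pool on each request. A minimal sketch with a hand-picked (hypothetical) list:

import random

# Hypothetical hand-picked pool of User-Agent strings, just for illustration
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
]

rotating_headers = {"User-Agent": random.choice(user_agents)}  # pick a different one for each request
requests.get("http://books.toscrape.com/", headers=rotating_headers)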

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.