Web Scraping with Python

  import requests
  from urllib.parse import urljoin
  import urllib.robotparser

Part 1 Getting your first page

  def response_code(response):
      if response.status_code == 200:
          print("Page fetched successfully!")
      else:
          print("Failed to retrieve page:", response.status_code)
  URL = "http://books.toscrape.com/"
  url_response = requests.get(URL)
  response_code(url_response)
Failed to retrieve page: 403 | a 403 means the client does not have permission to access the requested page or resource on the server
  URL2 = "https://www.baseball-reference.com/players/s/suzukic01.shtml"
  url_response_2 = requests.get(URL2)
  response_code(url_response_2)
Failed to retrieve page: 404 | a 404 means the requested page does not exist
  URL3 = "https://ryanandmattdatascience.com/100miles"
  url_response_3 = requests.get(URL3)
  response_code(url_response_3)
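If you prefer exceptions over checking status codes by hand, requests can raise one for you. Below is a minimal sketch using raise_for_status(); the fetch_page helper is just an illustration and is not used in the rest of the examples.
  def fetch_page(url):
      # raise_for_status() raises requests.exceptions.HTTPError for any 4xx/5xx response
      response = requests.get(url)
      try:
          response.raise_for_status()
          print("Page fetched successfully!")
          return response
      except requests.exceptions.HTTPError as err:
          print("Failed to retrieve page:", err)
          return None

  fetch_page("http://books.toscrape.com/")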

Example 2 Checking robots.txt file

  def check_robots(url):
      robots_url = urljoin(url, '/robots.txt')
      response = requests.get(robots_url)
      print(response.text)
  check_robots('https://www.amazon.com')
#Do not scrape disallowed paths — it’s unethical and can get you blocked


#User-agent: *
#Disallow: /private/
#This means scrapers should avoid any path under https://www.compreoalquile.com/private/.


#Also look for the following directive; the next example shows how to check for it programmatically.
#User-agent: *
#Crawl-delay: 10
#This means bots should wait 10 seconds between requests.
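To actually honor a Crawl-delay like this, sleep between requests. A minimal sketch assuming a 10-second delay and a short, made-up list of pages to fetch:
  import time

  # Hypothetical list of pages to scrape politely
  page_urls = [
      "http://books.toscrape.com/catalogue/page-1.html",
      "http://books.toscrape.com/catalogue/page-2.html",
  ]

  for page_url in page_urls:
      response = requests.get(page_url)
      response_code(response)
      time.sleep(10)  # wait the 10 seconds the site asked for before the next request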

Example 3 Checking robots.txt file - Look For Delays

#Check rate limits
#If the site offers an official API, always use that first; the docs usually include the rate limits.
#Example (from the GitHub API docs):
#"You can make up to 60 requests per hour for unauthenticated requests."
#The sketch after this example shows how to read these limits from the response headers.
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://www.amazon.com/robots.txt")
  rp.read()  # download and parse the robots.txt file
  delay = rp.crawl_delay("*")
  print(delay)  # Might return something like 10, or None if no Crawl-delay is set
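For official APIs, the remaining quota is usually reported in the response headers rather than in robots.txt. A minimal sketch against the public GitHub API (header names taken from GitHub's documentation; the octocat endpoint is just an example):
  api_response = requests.get("https://api.github.com/users/octocat")
  print(api_response.headers.get("X-RateLimit-Limit"))      # e.g. 60 for unauthenticated requests
  print(api_response.headers.get("X-RateLimit-Remaining"))  # requests left in the current window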

Example 4 Checking If You Can Scrape a Page

#Check robots.txt to see whether you can scrape a specific page
#Use urllib.robotparser to respect the rules
#Check for Disallow and Crawl-delay directives
#Handle 429 or 503 errors with backoff (see the sketch after this example)
#Be sure to add delays between pages.
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url('https://www.amazon.com/robots.txt')
  rp.read()
  print(rp.can_fetch("*", "https://www.amazon.com/CELSIUS-Fitness-Energy-Standard-Variety/dp/B06X6J5266/?_encoding=UTF8&pd_rd_w=U0HVD&content-id=amzn1.sym.9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_p=9d904e2e-b55a-4ad0-aa4c-f9665fbd0e0d&pf_rd_r=VXNM0SX2W4PCFA2KB7FJ&pd_rd_wg=KJlEC&pd_rd_r=8a75547a-24ff-4998-bdea-f875d0f05448&ref_=pd_hp_d_btf_crs_zg_bs_16310101"))
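For the 429/503 backoff mentioned above, a simple approach is to retry with an exponentially growing wait. A minimal sketch; fetch_with_backoff is a hypothetical helper, not part of requests:
  import time

  def fetch_with_backoff(url, headers=None, max_retries=5):
      wait = 1
      for attempt in range(max_retries):
          response = requests.get(url, headers=headers)
          if response.status_code not in (429, 503):
              return response
          print(f"Got {response.status_code}, retrying in {wait} seconds...")
          time.sleep(wait)
          wait *= 2  # double the wait after each throttled attempt
      return response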

Example 5 Headers

In a later video we will go over rotating headers
  URL = "http://books.toscrape.com/"
  headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36' }
  requests.get(URL, headers=headers)
  url = "https://ultrasignup.com/results_event.aspx?did=96529"
  headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" }
  # Fetch the page with a browser-like User-Agent
  response = requests.get(url, headers=headers)
  response
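As a small preview of rotating headers, one common approach is to pick a random User-Agent from a pool for each request. A minimal sketch with a hypothetical list of User-Agent strings:
  import random

  user_agents = [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  ]

  headers = {"User-Agent": random.choice(user_agents)}  # different User-Agent per request
  response = requests.get(url, headers=headers)
  response_code(response)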

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
