Python Web Scraping with BeautifulSoup and Requests
Web scraping is one of the most useful skills in a developer's toolkit. Let's build a scraper from zero to production.
Install
pip install requests beautifulsoup4
Basic Scrape
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
for quote in soup.select('.quote'):
    text = quote.select_one('.text').get_text()
    author = quote.select_one('.author').get_text()
    print(f"{author}: {text}")
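The loop above assumes every .quote block has both a .text and an .author child. select_one returns None when nothing matches, so a missing element would raise AttributeError. Here is a minimal defensive sketch. The parse_quotes helper, the record keys, and the inline sample HTML are illustrative assumptions rather than part of the scraper above, and the '.tags .tag' selector is a guess at how the demo site marks up tags.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Return one dict per quote, skipping blocks with missing parts."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for block in soup.select('.quote'):
        text_el = block.select_one('.text')
        author_el = block.select_one('.author')
        if text_el is None or author_el is None:
            continue  # select_one returns None when nothing matches
        records.append({
            'author': author_el.get_text(strip=True),
            'quote': text_el.get_text(strip=True),
            # assumed markup: each tag is an element with class "tag"
            'tags': [t.get_text(strip=True) for t in block.select('.tags .tag')],
        })
    return records


# Made-up sample markup, only to exercise the function offline
sample_html = """
<div class="quote">
  <span class="text">Sample quote text</span>
  <small class="author">Sample Author</small>
  <div class="tags"><a class="tag">one</a><a class="tag">two</a></div>
</div>
<div class="quote"><span class="text">No author here</span></div>
"""

for record in parse_quotes(sample_html):
    print(record)
```

The second sample block has no author, so it is skipped instead of crashing the loop.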
Faking a Real Browser
Many sites block the default User-Agent that requests sends (python-requests/<version>), so send browser-like headers instead:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 Chrome/120.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get(url, headers=headers, timeout=10)
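When the same headers go on every request, a requests.Session can carry them so each call does not need its own headers argument. It also reuses the underlying connection between requests to the same host. Here is a small sketch using the header values from above; the make_session name is made up for illustration.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 Chrome/120.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}


def make_session(headers=HEADERS):
    """Build a session whose default headers are sent with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session()
# No request is made here; this only shows the defaults the session will send.
print(session.headers['User-Agent'])
```

After that, session.get(url, timeout=10) sends those headers without passing them each time.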
Handling Pagination
import time
base_url = "https://quotes.toscrape.com/page/{}/"
all_quotes = []
for page in range(1, 11):  # pages 1-10
    # timeout keeps a stalled connection from hanging the whole run
    response = requests.get(base_url.format(page), headers=headers, timeout=10)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.select('.quote .text')
    if not quotes:
        break  # stop when a page has no quotes
    all_quotes.extend([q.get_text() for q in quotes])
    time.sleep(1)  # be polite, don't hammer the server
print(f"Collected {len(all_quotes)} quotes")
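A fixed range(1, 11) only works when the page count is known ahead of time. An alternative is to follow the site's "Next" link until it disappears. The sketch below assumes the link is marked up as li.next > a, which is an assumption about the demo site's HTML and worth checking in the browser's inspector. It takes the fetching function as a parameter so the loop can be tried without network access. The names crawl and next_page_url and the fake pages are hypothetical.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('li.next > a')  # assumed selector for the "Next" link
    if link is None or not link.get('href'):
        return None
    return urljoin(current_url, link['href'])


def crawl(fetch, start_url, delay=1.0, max_pages=50):
    """Collect quote texts page by page, following "Next" links.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    texts = []
    url = start_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        html = fetch(url)
        soup = BeautifulSoup(html, 'html.parser')
        texts.extend(q.get_text() for q in soup.select('.quote .text'))
        pages_seen += 1
        url = next_page_url(html, url)
        if url:
            time.sleep(delay)  # pause between requests
    return texts


# Made-up two-page site, used to run the loop offline
fake_pages = {
    'https://example.com/': '<div class="quote"><span class="text">first</span></div>'
                            '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>',
    'https://example.com/page/2/': '<div class="quote"><span class="text">second</span></div>',
}

print(crawl(fake_pages.__getitem__, 'https://example.com/', delay=0))
```

Against a live site, fetch could be something like lambda u: requests.get(u, headers=headers, timeout=10).text. The max_pages cap is a safety net in case a "Next" link ever loops back on itself.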
Saving to CSV
import csv
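# all_data is assumed to be a list of dicts, one per quote, shaped like
# {'author': str, 'quote': str, 'tags': [str, ...]}; the pagination loop
# above only collects quote text, so build these records while parsing.
with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Author', 'Quote', 'Tags'])
    for item in all_data:
        writer.writerow([item['author'], item['quote'], ','.join(item['tags'])])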
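Because each record is a dict, csv.DictWriter is a natural alternative to csv.writer: it maps keys to columns by name. The sketch below writes to an in-memory buffer and reads the rows back, which also shows that the csv module quotes fields containing commas so they survive a round trip. The sample record is made up for illustration.

```python
import csv
import io

# Made-up record in the shape the loop above expects
all_data = [
    {'author': 'Sample Author', 'quote': 'Hello, world', 'tags': ['greeting', 'demo']},
]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['author', 'quote', 'tags'])
writer.writeheader()
for item in all_data:
    # build a new dict so the original list of tags is left untouched
    writer.writerow({**item, 'tags': ','.join(item['tags'])})

# Read the text back to confirm fields with commas were quoted correctly
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```

Swapping the buffer for open('quotes.csv', 'w', newline='', encoding='utf-8') writes the same rows to disk.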
Respecting robots.txt
Always check the site's robots.txt (for example, https://example.com/robots.txt) before scraping. Python's standard library includes a parser:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/page"))
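can_fetch answers whether a given user agent may request a URL, and the same parser exposes crawl_delay for sites that declare one. The sketch below feeds the parser a made-up robots.txt through parse() so it runs without network access; against a real site you would keep set_url() and read() as above. The rules and the my-scraper agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch('my-scraper', 'https://example.com/page')
blocked = rp.can_fetch('my-scraper', 'https://example.com/private/data')
delay = rp.crawl_delay('my-scraper')  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

When crawl_delay returns a number, it can replace the fixed one-second pause between requests.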
Scrape responsibly — add delays, respect robots.txt, and don't overload servers.