Python Web Scraping with BeautifulSoup and Requests

Web scraping is one of the most useful skills in a developer's toolkit. Let's build a small scraper step by step: fetch a page, parse it, follow pagination, save the results to CSV, and check robots.txt.

Install

pip install requests beautifulsoup4

Basic Scrape

import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

# Each quote on the page sits in an element with class "quote"
for quote in soup.select('.quote'):
    text = quote.select_one('.text').get_text()
    author = quote.select_one('.author').get_text()
    print(f"{author}: {text}")
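The loop above assumes every .quote element contains both a .text and an .author child. Since select_one returns None when nothing matches, a missing element would raise AttributeError. Here is a minimal sketch of a more defensive parser. The function name parse_quotes and the inline sample markup are illustrative assumptions, not part of the original scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup using the same class names as the loop above.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"Simple is better than complex."</span>
  <small class="author">Tim Peters</small>
</div>
<div class="quote">
  <span class="text">"An entry with no author element."</span>
</div>
"""

def parse_quotes(html):
    """Return a list of (author, text) tuples, skipping incomplete entries."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for quote in soup.select(".quote"):
        text_el = quote.select_one(".text")
        author_el = quote.select_one(".author")
        # select_one returns None when nothing matches, so check before use.
        if text_el is None or author_el is None:
            continue
        results.append((author_el.get_text(strip=True),
                        text_el.get_text(strip=True)))
    return results

for author, text in parse_quotes(SAMPLE_HTML):
    print(f"{author}: {text}")
```

Keeping the parsing in a function that takes an HTML string also makes it easy to test against saved pages without touching the network.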

Faking a Real Browser

Many sites block the default User-Agent that requests sends (python-requests/<version>). Sending browser-like headers gets around that:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 Chrome/120.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get(url, headers=headers, timeout=10)
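If every request should carry the same headers, one option is to set them once on a requests.Session instead of passing headers= on each call. The sketch below is one way to arrange that. The helper name fetch is a hypothetical choice, and nothing is requested until it is called.

```python
import requests

# A Session keeps default headers (and cookies) across requests.
session = requests.Session()
session.headers.update({
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 Chrome/120.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
})

def fetch(url):
    """Fetch url with the shared headers; raise on 4xx/5xx responses."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

A Session also reuses the underlying connection, which tends to help when requesting many pages from the same host.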

Handling Pagination

import time

base_url = "https://quotes.toscrape.com/page/{}/"
all_quotes = []

for page in range(1, 11):          # pages 1-10
    response = requests.get(base_url.format(page), headers=headers, timeout=10)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.select('.quote .text')
    if not quotes:
        break
    all_quotes.extend([q.get_text() for q in quotes])
    time.sleep(1)                  # be polite, don't hammer the server

print(f"Collected {len(all_quotes)} quotes")
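The loop above keeps only the quote text. To collect author and tags as well, one approach is to build a dict per quote. The sketch below assumes tags appear as elements with class "tag" inside each .quote; the names parse_page, collect_records, and FAKE_PAGES are hypothetical. The page fetcher is passed in as a function so the example runs offline against stand-in data.

```python
import time
from bs4 import BeautifulSoup

def parse_page(html):
    """Extract one dict per quote; assumes .text, .author and .tag classes."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for quote in soup.select(".quote"):
        text_el = quote.select_one(".text")
        author_el = quote.select_one(".author")
        if text_el is None or author_el is None:
            continue  # skip incomplete entries
        records.append({
            "author": author_el.get_text(strip=True),
            "quote": text_el.get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in quote.select(".tag")],
        })
    return records

def collect_records(fetch_page, max_pages=10, delay=1.0):
    """Call fetch_page(n) for n = 1..max_pages; stop at the first empty page."""
    all_data = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if html is None:
            break
        records = parse_page(html)
        if not records:
            break
        all_data.extend(records)
        time.sleep(delay)  # pause between requests
    return all_data

# Offline demonstration with a stand-in fetcher (hypothetical data).
FAKE_PAGES = {
    1: '<div class="quote"><span class="text">Q1</span>'
       '<small class="author">A1</small>'
       '<a class="tag">life</a><a class="tag">books</a></div>',
    2: '<div class="quote"><span class="text">Q2</span>'
       '<small class="author">A2</small></div>',
}
all_data = collect_records(FAKE_PAGES.get, delay=0)
print(all_data)
```

Against the live site, fetch_page could be a small function that calls requests.get(base_url.format(page), headers=headers, timeout=10) and returns response.text, or None when the status code is not 200.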

Saving to CSV

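import csv

# all_data is assumed to be a list of dicts with 'author', 'quote', and
# 'tags' (a list of strings) keys. The pagination loop above collects only
# the quote text, so extend it to gather these fields first.
with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Author', 'Quote', 'Tags'])
    for item in all_data:
        writer.writerow([item['author'], item['quote'], ','.join(item['tags'])])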
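Because the tags are joined with commas, and quotes often contain commas or quotation marks themselves, it is worth seeing that the csv module quotes such fields so they survive a round trip. The sketch below uses csv.DictWriter and an in-memory buffer. The sample records and the name write_quotes are hypothetical.

```python
import csv
import io

# Hypothetical records in the shape the loop above expects.
sample_data = [
    {"author": "A1", "quote": "Hello, world", "tags": ["life", "books"]},
    {"author": "A2", "quote": 'She said "hi"', "tags": []},
]

def write_quotes(rows, f):
    """Write rows to the open text file f, with tags joined by commas."""
    writer = csv.DictWriter(f, fieldnames=["Author", "Quote", "Tags"])
    writer.writeheader()
    for item in rows:
        writer.writerow({
            "Author": item["author"],
            "Quote": item["quote"],
            "Tags": ",".join(item["tags"]),
        })

buffer = io.StringIO()
write_quotes(sample_data, buffer)
print(buffer.getvalue())

# Reading the text back recovers the original field values.
rows_back = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows_back)
```

To write a real file, pass a handle opened with newline='' and encoding='utf-8', as in the block above, instead of the StringIO buffer.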

Respecting robots.txt

Always check a site's robots.txt (for example, https://example.com/robots.txt) before scraping it. Python has a built-in parser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/page"))
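To see how the parser interprets rules without making a network request, the same class can be fed robots.txt lines directly through parse(). The rules below are made up for illustration, and the helper name allowed is hypothetical. crawl_delay() returns the Crawl-delay value for a user agent, or None if the file does not set one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="*"):
    """Return True if user_agent may fetch url under the parsed rules."""
    return rp.can_fetch(user_agent, url)

print(allowed("https://example.com/page"))          # True
print(allowed("https://example.com/private/data"))  # False
print(rp.crawl_delay("*"))                          # 2
```

When a crawl delay is present, it is a reasonable value to pass to time.sleep() between requests.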

Scrape responsibly: add delays between requests, respect robots.txt, and don't overload servers.