Design a Web Crawler

4 min read

A web crawler collects web pages and follows links to discover new content. Use cases: search engine indexing, web archiving, web mining, web monitoring.

Step 1 - Understand the problem and establish design scope

Requirements:

Purpose: search engine indexing
1 billion pages/month, HTML only
Consider newly added/edited pages
Store HTML pages for 5 years
Ignore duplicate content pages

Key characteristics of a good crawler:

Scalability: Parallelized crawling of billions of pages
Robustness: Handle bad HTML, unresponsive servers, crashes, malicious links
Politeness: Don't overwhelm a single host with rapid requests
Extensibility: Easy to add support for new content types (images, etc.)

Back of the envelope estimation:

QPS: 1B / 30 days / 24h / 3600s = ~400 pages/s
Peak QPS: 2× = 800 pages/s
Avg page size: 500 KB → 500 TB/month
5-year storage: 500 TB × 12 × 5 = 30 PB

Step 2 - Propose high-level design and get buy-in

Components:

Seed URLs: Starting points. Divide URL space by locality (country) or topic
URL Frontier: FIFO queue storing URLs to be downloaded
HTML Downloader: Downloads pages via HTTP; uses DNS Resolver for IP resolution
Content Parser: Validates and parses downloaded HTML pages (separate component to avoid slowing crawls)
Content Seen?: Eliminates duplicate content (~29% of web is duplicated). Uses hash comparison for efficiency
Content Storage: Disk for bulk data, memory for popular content
URL Extractor: Parses links from HTML; converts relative paths to absolute URLs
URL Filter: Excludes blacklisted sites, unwanted file types, error links
URL Seen?: Tracks already-visited URLs using Bloom filter or hash table to avoid infinite loops

Workflow:

Add seed URLs to URL Frontier
HTML Downloader fetches URLs from Frontier → DNS resolution → download
Content Parser validates HTML
Content Seen? checks for duplicates → discard if duplicate
URL Extractor extracts links → URL Filter → URL Seen? → add new URLs to Frontier

Step 3 - Design deep dive

DFS vs BFS

DFS can go infinitely deep → not suitable
BFS (FIFO queue) is standard, but has problems:
- Many links point to the same host → floods that host (impolite)
- Doesn't prioritize URLs by importance (PageRank, traffic, update frequency)

URL Frontier (solves BFS problems)

Politeness: One page at a time per host. Implemented via:

Queue router ensures each FIFO queue has URLs from only one host
Mapping table: host → queue
Each worker thread bound to one queue, downloads with a delay between requests

Priority: Prioritizer computes URL usefulness (PageRank, traffic, update frequency). High-priority queues selected with higher probability.

Full URL Frontier design:

Front queues: Manage prioritization
Back queues: Manage politeness

Freshness: Recrawl based on update history; prioritize important pages.

Storage for URL Frontier: Hybrid — majority on disk, buffers in memory for enqueue/dequeue. Hundreds of millions of URLs.

HTML Downloader

Robots.txt (Robots Exclusion Protocol): Check before crawling to see allowed/disallowed paths. Cache results to avoid repeated downloads.

Performance optimizations:

Distributed crawl: Partition URL space across multiple servers with consistent hashing
Cache DNS Resolver: DNS latency 10–200ms; maintain cached domain→IP mapping updated by cron jobs
Locality: Place crawl servers geographically close to website hosts
Short timeout: Avoid hanging on slow/unresponsive servers

Robustness

Consistent hashing: Distribute load, allow adding/removing downloaders
Save crawl states and data: Restart interrupted crawls from saved state
Exception handling: Graceful error handling without crashes
Data validation: Prevent system errors

Extensibility

Plug in new modules (PNG Downloader, Web Monitor) without rearchitecting.

Detect and avoid problematic content

Redundant content: Detect duplicates using hashes/checksums
Spider traps: Infinite loops (e.g., endless directory nesting). Mitigate with URL max length; manual verification for known traps
Data noise: Exclude ads, code snippets, spam URLs — little value for indexing

Step 4 - Wrap up

Additional talking points:

Server-side rendering: Execute JavaScript to capture dynamically generated links
Anti-spam filters: Filter low-quality/spam pages
Database replication and sharding
Horizontal scaling: Keep servers stateless
Availability, consistency, reliability (Ch.1)
Analytics: Data collection for fine-tuning

Reference materials [1] US Library of Congress: https://www.loc.gov/websites/ [2] EU Web Archive: http://data.europa.eu/webarchive [3] Digimarc: https://www.digimarc.com/products/digimarc-services/piracy-intelligence [4] Heydon A., Najork M., Mercator (1999) [5] Olston, Najork: Web Crawling survey [6] 29% duplicate content: https://tinyurl.com/y6tmh55y [7] Rabin M.O., Fingerprinting by random polynomials (1981) [8] Bloom, Space/time trade-offs in hash coding (1970) [9] Patterson, Web Crawling lecture [10] Page, Brin et al., PageRank (1998) [11] Google Dynamic Rendering [12] Urvoy et al., Tracking web spam (2006) [13] Lee et al., IRLbot: Scaling to 6 billion pages (2008)

Step 1 - Understand the problem and establish design scope #

Step 2 - Propose high-level design and get buy-in #

Step 3 - Design deep dive #

DFS vs BFS #

URL Frontier (solves BFS problems) #

HTML Downloader #

Robustness #

Extensibility #

Detect and avoid problematic content #

Step 4 - Wrap up #