Design Web Crawler — System Design Interview Guide

Design Web Crawler is a system-design interview question that asks you to build a distributed crawler that discovers and downloads pages from the public web: tens of billions of known URLs, billions of pages fetched per month, content deduplication, politeness toward target hosts, and prioritization of crawl effort. The hard parts are the URL frontier and politeness at scale.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Google
  • Microsoft (Bing)
  • Amazon
  • Meta
  • Apple

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Discover new URLs from seed pages by extracting links during crawling
  • Fetch pages over HTTP/HTTPS with appropriate retries and timeouts
  • Respect robots.txt and target-host politeness (rate limit per domain)
  • Detect and deduplicate identical or near-identical content
  • Prioritize URLs by importance (popular sites, frequently-updated pages) vs freshness needs
  • Output fetched pages and metadata into a downstream indexing pipeline

Non-functional requirements

  • Scale: ~50 billion known URLs, ~10 billion pages fetched per month
  • Throughput: ~5K pages/sec sustained, ~50K pages/sec peak
  • Politeness: never exceed 1-2 requests/sec per target domain (configurable per-domain)
  • Availability: 99.9%; some crawler downtime is tolerable; permanent backlog growth is not
  • Storage: petabyte-scale for raw page content and crawl metadata

Capacity estimation

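Scale anchors based on public web research: ~50 billion URLs in the indexable web, ~10 billion pages refreshed per month for major search engines. That translates to ~3,800 pages/sec on average (10B pages / ~2.6M seconds in a 30-day month); the ~5K pages/sec sustained target rounds this up to leave headroom.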

With fan-out: each crawled page yields ~50-100 outbound links on average, so the raw URL-discovery rate is at least ~50x the fetch rate (~200K URLs/sec discovered, before deduplication against already-known URLs). The frontier (queue of URLs known but not yet crawled) can grow even larger than the crawled corpus.

Storage: average page size ~100 KB (HTML + assets). 10B pages/month × 100 KB = 1 PB/month of new content. With historical retention of 12 months: ~12 PB hot storage; older content goes to cold storage. Metadata (URL, fetch timestamp, status code, content hash, language, parsed text) is ~1 KB/page × 10B/month = 10 TB/month.

URL frontier: known but uncrawled URLs. 50B URLs total, ~10B yet-to-be-crawled at any moment. Each URL frontier entry is ~200 bytes (URL plus priority and politeness metadata). Frontier size: sizing conservatively for all 50B known URLs, since crawled URLs re-enter the frontier for recrawl, gives 50B × 200 bytes = 10 TB; the ~10B not-yet-crawled subset alone is ~2 TB. Either figure is far too large for one machine's memory, so production systems partition the frontier across many nodes.

Domain politeness: ~100M+ active domains globally; per-domain crawl rate is bounded (typically 1-2 req/sec). Aggregating across all domains gives the headline throughput. The constraint forces broad parallelism: at 1-2 req/sec per domain, sustaining 5K pages/sec means fetching from at least 2,500-5,000 different domains at any moment.

Bandwidth: 10B pages/month × 100 KB = 1 PB/month inbound = ~3 Gbps continuous. Modest by data center standards.
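The estimates above can be re-derived in a few lines. This is a back-of-envelope sketch that assumes a 30-day month, decimal units (1 PB = 10^15 bytes), and the low end of the 50-100 links-per-page range; the variable names are illustrative.

```python
# Back-of-envelope check of the capacity estimates (30-day month, decimal units).
SECONDS_PER_MONTH = 30 * 24 * 3600           # 2,592,000 seconds

pages_per_month = 10e9                       # ~10B pages fetched per month
page_bytes = 100e3                           # ~100 KB average page
metadata_bytes = 1e3                         # ~1 KB of metadata per page
known_urls = 50e9                            # ~50B known URLs
frontier_entry_bytes = 200                   # ~200 bytes per frontier entry
links_per_page = 50                          # low end of the 50-100 range

fetch_rate = pages_per_month / SECONDS_PER_MONTH
discovery_rate = fetch_rate * links_per_page
content_pb_per_month = pages_per_month * page_bytes / 1e15
metadata_tb_per_month = pages_per_month * metadata_bytes / 1e12
frontier_tb = known_urls * frontier_entry_bytes / 1e12
bandwidth_gbps = pages_per_month * page_bytes * 8 / SECONDS_PER_MONTH / 1e9

print(f"fetch rate:     {fetch_rate:,.0f} pages/sec")
print(f"discovery rate: {discovery_rate:,.0f} URLs/sec")
print(f"content:        {content_pb_per_month:.1f} PB/month")
print(f"metadata:       {metadata_tb_per_month:.0f} TB/month")
print(f"frontier:       {frontier_tb:.0f} TB")
print(f"bandwidth:      {bandwidth_gbps:.1f} Gbps")
```

In an interview the same arithmetic is done by hand; the point is that every headline number follows from four or five stated inputs.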

High-level design

Four core components: URL frontier, fetchers, content processors, and the seen-URLs index, supported by a robots.txt cache and a DNS cache.

URL frontier: a priority-ordered queue of URLs to crawl. Each entry holds the URL, a priority score, and politeness metadata (target domain, last-fetched timestamp for that domain). The frontier is partitioned: one option is partitioning by hash of the URL, another is partitioning by target domain (every URL on a given domain lives in the same partition — simplifies politeness enforcement). Most production crawlers shard by domain.
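A minimal sketch of domain-based partitioning, assuming a fixed partition count and using the URL's host as a stand-in for the registered domain. `NUM_PARTITIONS`, `domain_of`, and `partition_for` are hypothetical names chosen for illustration.

```python
import hashlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 1024  # assumed partition count, for illustration only


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL (a stand-in for the registered domain)."""
    return (urlsplit(url).hostname or "").lower()


def partition_for(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a URL to a frontier partition using a stable hash of its host."""
    digest = hashlib.sha1(domain_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every URL on the same host lands in the same partition.
print(partition_for("https://example.com/a"))
print(partition_for("http://EXAMPLE.com/b?page=2"))
```

A cryptographic digest is used here only because it is stable across processes; Python's built-in `hash()` is randomized per process for strings and would not be.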

Fetchers: workers that pull URLs from the frontier and execute HTTP fetches. Each worker holds a connection pool, respects robots.txt and per-domain rate limits, applies retries, and emits the fetched page to the content processor.

Content processors: pipeline workers that parse fetched pages. Each page is hashed for dedup, parsed into HTML/text/metadata, scanned for outbound links, classified for spam/quality, and emitted to the indexer. Discovered links are returned to the URL frontier (after dedup against the seen-URLs index).
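The hashing and link-extraction steps can be sketched with the standard library alone. This is an assumed, simplified shape for one processing step (no spam classification, no near-duplicate check); `LinkExtractor` and `process_page` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from anchor tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop the #fragment, which is never sent to the server.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def process_page(url: str, html: str) -> dict:
    """Return the exact-dedup hash and outbound links for one fetched page."""
    extractor = LinkExtractor(url)
    extractor.feed(html)
    return {
        "url": url,
        "content_hash": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "links": extractor.links,
    }
```

The returned links would then be checked against the seen-URLs index before being offered to the frontier.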

Seen-URLs index: a massive set of URLs already known to the system (whether crawled, queued, or seen and rejected). Used to dedup discovered links before adding them to the frontier. At 50B URLs the seen-set is the largest distributed data structure in the crawler. Production options include a bloom-filter cluster (memory-efficient, accepts a small false-positive rate) or a sharded key-value store of URL hashes.
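To make the bloom-filter option concrete, here is a single-node sketch. The bit count and hash count are illustrative assumptions, and the positions are derived from one SHA-256 digest via double hashing; a production deployment would shard many such filters across a cluster.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, a tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(num_bits=10_000, num_hashes=7)
seen.add("https://example.com/")
print("https://example.com/" in seen)
```

A false positive here means a genuinely new URL is skipped, which is why the filter is sized so that this rate stays small.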

Robots.txt cache: a domain may publish a robots.txt file describing what crawlers are allowed to fetch. The crawler fetches robots.txt periodically per domain, caches the parsed rules in memory, and consults the cache for every URL it considers. Robots.txt can also carry a politeness signal: the Crawl-delay directive, a non-standard extension that not every crawler honors, requests a per-domain rate.
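Python's standard `urllib.robotparser` is enough to sketch the cache. The crawler name `ExampleCrawler`, the in-memory dict, and the sample file are assumptions for illustration; a real cache would also record when each entry was fetched so that it can expire.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# domain -> parsed rules (a real cache would also store a fetch timestamp)
robots_cache = {}

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots_cache["example.com"] = parse_robots(SAMPLE_ROBOTS)
rules = robots_cache["example.com"]

print(rules.can_fetch(USER_AGENT, "https://example.com/index.html"))  # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/x"))   # False
print(rules.crawl_delay(USER_AGENT))                                  # 2
```

Parsing from text rather than calling `RobotFileParser.read()` keeps the fetch under the crawler's own retry, timeout, and politeness logic.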

DNS cache: at 5K fetches/sec, DNS resolution would dominate latency without caching. Each domain's IP is cached with TTL.
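A minimal TTL cache illustrates the idea. It assumes a fixed TTL chosen by the crawler, because `socket.gethostbyname` does not expose the record's own TTL; `DnsCache` is a hypothetical name, and the resolver and clock are injectable so the class can be exercised without network access.

```python
import socket
import time


class DnsCache:
    """Caches host -> IP lookups for a fixed TTL (an assumed policy)."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.resolver = resolver
        self.clock = clock
        self._entries = {}  # host -> (ip, expires_at)

    def resolve(self, host: str) -> str:
        now = self.clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit, still fresh
        ip = self.resolver(host)  # cache miss or expired: do the real lookup
        self._entries[host] = (ip, now + self.ttl_seconds)
        return ip
```

Because the frontier is sharded by domain, each fetcher only needs cache entries for the domains it owns.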

Downstream: parsed pages flow to an indexer (for search), to a content store (for retrieval and replay), and to specialized pipelines (image extraction, structured-data extraction). The crawler's job ends at page delivery; downstream concerns are separate.

Deep dive — the hard problem

Two main deep dives, politeness and URL frontier prioritization, followed by shorter notes on duplicate detection, crawler traps, and distributed coordination.

Politeness is the central operational constraint. A crawler that ignores robots.txt or hammers a target domain at 1000 req/sec will get blocked, sued, or both. Four mechanisms enforce it.

Per-domain rate limit: each target domain has a maximum crawl rate (default 1-2 req/sec; some domains specify a Crawl-Delay in robots.txt; very large sites are explicitly allowlisted to higher rates). The crawler enforces this with per-domain token buckets. Before fetching a URL, the worker checks the target domain's bucket; if no tokens, the URL is deferred (sent back to the frontier with a delayed-visibility timestamp) and the worker picks a URL from a different domain.
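The per-domain token bucket can be sketched as follows. The bucket capacity of 1 (no bursting) and the class names are assumptions; the clock is injectable so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Refills at `rate_per_sec` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate_per_sec: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class DomainRateLimiter:
    """One bucket per domain; state is local to the fetcher that owns the domain."""

    def __init__(self, default_rate: float = 1.0, clock=time.monotonic):
        self.default_rate = default_rate
        self.clock = clock
        self.buckets = {}

    def try_acquire(self, domain: str, rate=None) -> bool:
        bucket = self.buckets.get(domain)
        if bucket is None:
            bucket = TokenBucket(rate or self.default_rate, clock=self.clock)
            self.buckets[domain] = bucket
        return bucket.try_acquire()
```

When `try_acquire` returns False, the worker defers the URL and moves on to a different domain instead of sleeping.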

Shard the frontier by domain: each domain's URLs live in one frontier partition. A fetcher worker handles a set of domains; the politeness state for those domains is local memory. This makes rate limiting cheap and avoids global coordination.

Robots.txt and meta-robots tags: respect the standard. Fetch robots.txt before crawling any URL on a domain and cache it for some hours. Parse the User-agent groups and the Allow/Disallow path rules that apply to your crawler. Some sites use meta tags in HTML (<meta name='robots' content='noindex'>) to opt individual pages out of indexing; the content processor must check these.

Identification: the crawler must use a clear User-Agent header identifying itself, with a contact URL. This is operational hygiene that mature sites expect.

URL frontier prioritization: not all URLs are equal. Three signal layers feed into the priority score.

PageRank-like signals: pages linked from many other pages are more important. Each URL's priority is roughly the sum of the priority shares passed to it by the pages that link to it, where each linking page splits its own priority across its outbound links. Computed via iterative graph algorithms over the link graph; updated periodically (weekly to monthly).
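A toy version of the iterative computation, assuming the conventional damping factor of 0.85 and an in-memory link graph. At crawler scale this runs as a distributed batch job; the function name and the three-page graph are illustrative only.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power iteration over a small link graph given as {page: [outbound pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the baseline (1 - damping) / n.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

In this graph "c" is linked from two pages and ranks highest, while "b" has no inbound links and ranks lowest.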

Freshness signals: pages on news sites or social media need to be re-crawled frequently (every few hours); pages on stable corporate sites can be re-crawled monthly. Per-domain or per-page freshness models predict the optimal recrawl interval based on observed change rates.
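One simple way to turn observed change rates into a recrawl interval is an adaptive rule: shorten the interval when a page changed since the last fetch, lengthen it when it did not. The multipliers and bounds below are assumptions for illustration, not a description of any production freshness model.

```python
MIN_INTERVAL_HOURS = 2.0          # assumed floor: "every few hours" for fast-changing pages
MAX_INTERVAL_HOURS = 30 * 24.0    # assumed ceiling: roughly monthly for stable pages


def next_recrawl_interval(current_hours: float, changed: bool) -> float:
    """Halve the interval after a change, grow it by 50% otherwise, within bounds."""
    proposed = current_hours / 2 if changed else current_hours * 1.5
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))


# A page that keeps changing converges toward the minimum interval.
interval = 24.0
for _ in range(5):
    interval = next_recrawl_interval(interval, changed=True)
print(interval)
```

A "changed" signal can come from comparing the new content hash with the stored one.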

Freshness vs coverage tradeoff: the crawler's total throughput is fixed; spending more on freshness for popular pages means less on coverage of long-tail pages. Production crawlers split throughput by policy: a 'fresh tier' (most popular sites, frequent recrawls), a 'broad coverage tier' (occasional pulls across a wider URL set), and a 'discovery tier' (new URLs encountered during crawling).

Duplicate detection: pages with identical or near-identical content waste bandwidth and storage. Two levels of dedup: exact (hash the page content; same hash = identical) and near-duplicate (SimHash or MinHash — generate a fingerprint such that similar pages produce similar fingerprints; compare candidates within a hash distance). Near-dup detection runs in the content processor; duplicates are recorded but not propagated to the indexer.
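A compact SimHash sketch, assuming word-level tokens with equal weights and a 64-bit fingerprint taken from MD5. Real systems typically weight tokens or shingles and index fingerprints so that candidates within a small Hamming distance can be found without comparing all pairs.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint (bits <= 64) from lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        token_hash = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "Breaking news: the crawler fetched ten billion pages this month"
mirrored = "Breaking news: the crawler fetched ten billion pages this month!"
print(hamming_distance(simhash(original), simhash(mirrored)))
```

Exact dedup would treat the two strings above as different because their content hashes differ, while their SimHash fingerprints are identical since punctuation is not tokenized.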

Crawler traps and adversarial sites: some sites generate URLs without end (for example, calendar pages with /year/month/day/event/yyyy URLs covering all of history), so the crawler can spend unbounded time on one site. Defenses: URL pattern detection (regex against known-trap patterns), per-domain depth limits, and per-domain URL count caps. Mention this in the interview; production crawlers spend significant operational effort here.
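The defenses can be combined into one admission check that runs before a discovered URL enters the frontier. In this sketch path depth stands in for crawl depth, a repeated-segment check stands in for pattern detection, and all thresholds are assumed values; `TrapGuard` is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Rejects URLs that look like trap output or exceed a per-domain budget."""

    def __init__(self, max_depth: int = 8, max_urls_per_domain: int = 100_000, max_repeats: int = 3):
        self.max_depth = max_depth
        self.max_urls_per_domain = max_urls_per_domain
        self.max_repeats = max_repeats
        self.accepted_per_domain = Counter()

    def allow(self, url: str) -> bool:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        segments = [segment for segment in parts.path.split("/") if segment]
        if len(segments) > self.max_depth:
            return False  # path is deeper than the limit
        if segments and max(Counter(segments).values()) >= self.max_repeats:
            return False  # the same segment repeats, e.g. /a/a/a/...
        if self.accepted_per_domain[host] >= self.max_urls_per_domain:
            return False  # per-domain URL budget is exhausted
        self.accepted_per_domain[host] += 1
        return True
```

Rejected URLs are worth logging per domain, since a sudden spike in rejections is itself a signal that a site has turned into a trap.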

Distributed coordination: the crawler is fundamentally distributed, with different fetcher pools and content processors all sharing the same frontier and seen-URLs index. The membership and partition map must be propagated to all workers; partitions can move when nodes are added or removed. Production crawlers use a coordination service to manage partition ownership and worker assignments.
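The text does not say how partition ownership is computed. One assumed option is rendezvous (highest-random-weight) hashing: the coordination service only has to distribute the current node list, and every worker derives the same partition map from it. The node names and function names here are hypothetical.

```python
import hashlib


def _score(node: str, partition: int) -> int:
    """Stable pseudo-random weight for a (node, partition) pair."""
    digest = hashlib.sha256(f"{node}:{partition}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of(partition: int, nodes: list) -> str:
    """The node with the highest score owns the partition."""
    return max(nodes, key=lambda node: _score(node, partition))


def assignment(num_partitions: int, nodes: list) -> dict:
    """Full partition map, derivable independently by every worker."""
    return {p: owner_of(p, nodes) for p in range(num_partitions)}


nodes = ["fetcher-1", "fetcher-2", "fetcher-3"]
before = assignment(64, nodes)
after = assignment(64, nodes[:2])  # fetcher-3 leaves the cluster
moved = [p for p in before if before[p] != after[p]]
print(len(moved), "of 64 partitions moved")
```

Only the partitions that the departed node owned change hands, which limits how much politeness state has to be rebuilt after a membership change.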

Common mistakes

  • Skipping politeness — without per-domain rate limits and robots.txt, target sites block the crawler within hours
  • Treating the URL frontier as a simple queue without priority — production crawlers must rank URLs by importance
  • Forgetting deduplication — without near-duplicate detection, the corpus is full of mirrored content wasting storage and indexer effort
  • Designing a single global frontier: at multi-terabyte scale the frontier must be sharded, and sharding by domain is the standard answer for politeness
  • Ignoring crawler traps — adversarial or buggy sites generate infinite URLs and consume the crawler's budget unfairly

Likely follow-up questions

  • How would your design handle a sudden 10x increase in crawled domains (for example, a wave of new sites going live)?
  • What changes if you have to crawl JavaScript-heavy pages that require a headless browser?
  • How would you detect when a previously-stable site changes its URL structure and the crawler is hitting 404s?
  • How would you implement focused crawling (crawl all pages on a specific topic, ignore everything else)?
  • How would you support a 'priority recrawl' API that lets external systems request immediate refresh of specific URLs?

Frequently asked questions

How long is the Design Web Crawler interview?
60 minutes typical. Senior+ rounds expect coverage of the URL frontier, politeness, deduplication, and at least one of (prioritization, traps, distributed coordination).
Do I need to know PageRank in detail?
Naming it and explaining 'pages linked from popular pages are more important; computed iteratively' is enough. Drawing the actual matrix iteration is overkill.
Should I cover headless-browser rendering for JavaScript sites?
Mention it as an extension if time permits — production crawlers have a separate 'JS-rendering' tier with dedicated resources because rendering is 10-100x more expensive than HTML fetch. Don't drill in unless asked.
What is the single most important concept for Design Web Crawler?
Politeness enforced via per-domain rate limits and frontier sharding by domain. Almost every senior signal hinges on whether the candidate raises politeness unprompted and proposes a defensible enforcement mechanism.