From "one thread chasing links" to a globally distributed pipeline that crawls 15 billion pages a month — politely, durably, and without DDoS-ing anyone
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine Googlebot waking up at midnight. It has one URL on its desk — https://www.google.com. It opens it, reads the HTML, scribbles down every link it finds (say 200 of them), and adds them to a giant TODO list. Then it picks the next URL off the list, opens that one, scribbles down its links, and keeps going. Two weeks later, that one starting URL has fanned out into 15 billion pages indexed and stored. That's a web crawler — a piece of software that systematically browses the internet by following links, downloading pages, and feeding them to a search engine's index, an archive, or a data-mining pipeline.
Crawlers power more than search engines. They're how the Internet Archive snapshots the web, how SEO tools like Ahrefs build link graphs, how price comparison sites learn that a TV dropped $50, and how academic researchers measure the spread of misinformation. Whenever a system needs a fresh, structured view of the public web, a crawler is the thing that goes and gets it.
Before drawing a single box, pin down what the crawler must do. In an interview, asking these questions out loud signals you understand why the production design looks the way it does.
robots.txt — the website's rule file that says which paths a crawler may or may not visit.

requests + BeautifulSoup is a working crawler for 100 pages. The architecture below exists because we need 15 billion pages in 4 weeks while staying off every site's blocklist.

What's in scope today, and what should the architecture leave room for tomorrow? A crawler that hard-codes "HTML over HTTP" will be unusable the day someone asks it to also index PDFs or follow FTP links. Decide upfront which axes are flexible.
Start with HTML only. But the parser is a pluggable module — adding a PDF extractor or an image OCR pass tomorrow should be a new class behind the same interface, not a fork of the codebase. Sarah on the search team will thank us when she wants to index research papers.
Start with HTTP and HTTPS only — they cover 99.9% of the public web. But the fetcher layer is also pluggable. If we ever need to mirror an FTP archive or a gRPC API surface, we add an FTPFetcher behind the same Fetcher interface.
Every site can publish https://example.com/robots.txt declaring which paths crawlers may visit. Honoring it is non-negotiable — ignore it and you get banned, sued, or both. Cache each host's robots.txt aggressively (one parse per host per day) so we don't re-fetch it on every page.
Numbers drive every architectural choice — sharding, queue depth, bandwidth provisioning. Do them out loud, even if rough. Our target: crawl 15 billion pages every 4 weeks (a refresh interval that matches Google's reported re-crawl cadence for medium-rank sites).
15B pages over 4 weeks = 15,000,000,000 / (4 × 7 × 86,400) ≈ 6,200 pages/second. That's the steady-state rate the entire pipeline must sustain — fetch, parse, dedupe, store — for a month straight.
| Estimate | Value | Basis |
|---|---|---|
| Pages per second | ~6,200 | 15B / (4 weeks) |
| Average page size | 100 KB | HTML + assets metadata |
| Metadata per page | 500 B | URL, headers, timing |
| Ingest bandwidth | ~620 MB/s | ~5 Gbps sustained |
15B pages × (100 KB page + 500 B metadata) ≈ 1.5 PB for one full crawl. Provision so a full crawl fills only ~70% of capacity (so we don't run out of disk before the next crawl finishes): 1.5 PB / 0.7 ≈ 2.14 PB. This goes to a blob store — HDFS or S3 — not a relational DB.
| Metric | Value | Why it matters |
|---|---|---|
| Pages/sec | 6,200/s | Drives fetcher pool size and DNS resolver capacity |
| Sustained bandwidth | 5 Gbps | Forces multiple network egresses; one NIC is not enough |
| Crawl storage | 1.5 PB | Mandates blob store (HDFS/S3); no SQL DB at this scale |
| URL set size | ~60 GB | 15B × 4 bytes (checksum) — fits a sharded in-memory set with disk fallback |
| Crawl duration | 4 weeks | Long enough that worker crashes will happen — checkpointing is mandatory |
Strip away the distributed-systems machinery and a crawler is just graph traversal. Pages are nodes; hyperlinks are edges; we walk the graph from a small set of seed nodes outward.
Pseudocode — the entire crawler in 8 steps:

```
while frontier not empty:
    url = frontier.pop()                          # 1. take next URL
    ip  = dns_resolver.resolve(url.host)          # 2. host → IP
    if not robots_allowed(url): continue          # 3. politeness check
    html = fetcher.get(url, ip)                   # 4. download bytes
    if doc_seen_before(checksum(html)): continue  # 5. dedupe content
    storage.save(url, html)                       # 6. persist
    for link in parser.extract_links(html):       # 7. discover new edges
        if not url_seen_before(link):
            frontier.push(link)                   # 8. enqueue
```
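The eight steps map almost line-for-line onto a toy crawler. Here is a minimal sketch, assuming Python with requests and BeautifulSoup (the "100 pages" setup mentioned earlier); the `crawl` function, `max_pages` cap, and in-memory `pages` dict are illustrative, and robots.txt, politeness, and persistence are deliberately missing, since they are what the rest of the article adds.

```python
# Toy single-threaded crawler: the 8-step loop with URL dedupe and content dedupe.
# Assumes: pip install requests beautifulsoup4. No robots.txt, no politeness, no storage.
import hashlib
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)            # the TODO list (FIFO = BFS)
    seen_urls = set(seeds)             # "have we already enqueued this URL?"
    seen_docs = set()                  # "have we already seen this content?"
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()                        # 1. take next URL
        try:
            resp = requests.get(url, timeout=10)        # 2+4. DNS + fetch (left to the OS here)
        except requests.RequestException:
            continue
        digest = hashlib.md5(resp.content).hexdigest()  # 5. content checksum
        if digest in seen_docs:
            continue
        seen_docs.add(digest)
        pages[url] = resp.text                          # 6. "persist" (in memory only)

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):         # 7. discover new edges
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)                   # 8. enqueue
    return pages

if __name__ == "__main__":
    result = crawl(["https://example.com/"], max_pages=20)
    print(f"crawled {len(result)} pages")
```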
Breadth-first — explore all of news.com's direct links before going deeper. The frontier is a FIFO queue. This is the default because it spreads load across many hosts naturally — we crawl shallow across the whole web before going deep into any one site, which keeps freshness uniform.
Depth-first within one host is useful for amortizing the cost of TCP/TLS handshake and DNS — once a connection to news.com is open, drain a bunch of its URLs through that same connection before tearing it down. We use DFS inside a host's queue while BFS is the global discipline.
The frontier is not one queue — it's a forest of per-host queues fronted by a global priority dispatcher. We'll see why in §8.
If you build the pseudocode from §5 verbatim, you'll hit five walls — fast. Every box in our production architecture exists to work around exactly one of these.
The web is roughly 50 billion pages and growing. Even at 6,200 pages/sec we need 4 weeks. We can't crawl everything; we must prioritize — high-PageRank, recently-changed, and seed-anchored URLs first.
Pages change while we're crawling them. By the time we've indexed nytimes.com, the homepage has been edited 50 times. We accept stale-by-design and re-crawl on a refresh cadence — but the system must tolerate "what I just saw is already wrong".
If 1,000 fetcher threads all pile onto blog.example.com at once, we just DDoS'd that site. They'll firewall our IP range within minutes and we lose access permanently. We must throttle per host — no more than 1 connection at a time to any one site, with a delay between requests.
Some sites generate infinite pages — a calendar that lets you click "next month" forever, search results with infinite pagination, dynamically-generated session-id URLs that all return the same content. A naive crawler will sit in example.com/calendar?date=... for the rest of its life and never reach the rest of the web.
The same page is reachable via dozens of URLs (?utm_source=..., mobile redirects, mirror sites). And many distinct URLs serve identical content (CDN replicas, syndicated articles). Without dedupe we'd index the same article 50 times and waste 95% of our storage.
Every fetch starts with "what's the IP for this hostname?". The OS DNS resolver does this synchronously and one query at a time per process — at 6,200 pages/sec hitting different hosts, DNS alone becomes the bottleneck. We need our own caching async resolver.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The numbers from §4 and the difficulties from §6 drive every decision.
Sketch the simplest possible crawler: one process, one thread, one URL queue in memory. Pop a URL → fetch → parse → push new links → repeat. The whole thing fits on one laptop.
Four concrete failures emerge the moment we try to scale:
One HTTP request takes ~500ms (DNS + TCP + TLS + download). At one fetch at a time, that's 2 pages per second. To crawl 15B pages serially: 15B / 2 ≈ 240 years. We need parallelism by a factor of 3,000× just to hit the target.
If we naively parallelize by spawning 3,000 threads on one queue, the queue's BFS order means ~50 threads will be hitting nytimes.com simultaneously. NYT's edge firewall sees that as a DoS and blocks our entire IP range. Our crawler is now blind to a major chunk of the web.
The thread enters example.com/cal?d=2024-01-01, finds links to ?d=2024-01-02, ?d=2024-01-03, ... It will stay in that calendar forever, growing the queue with URLs nobody will ever click, while real content gets starved out.
The OS resolver gets ~3,000 lookups/sec funneled through a single socket. It either melts or rate-limits us. Worse, if the queue is in-memory and the process crashes 12 days into a 28-day crawl, we lose everything and start from zero.
The single most important insight in crawler design is that crawling is not one job — it's a pipeline of stages with completely different bottlenecks. Network-bound fetching, CPU-bound parsing, disk-bound storage, memory-bound dedupe — each stage scales on its own axis, and conflating them in one process means everyone waits for the slowest.
Where URLs live before they're fetched. Sharded by hostname — every URL for nytimes.com goes to the same queue, served by the same worker. That's how we enforce politeness: at most one fetcher per host means no host can ever be overwhelmed by us. The frontier is disk-backed because hundreds of millions of pending URLs won't fit in RAM.
Where bytes get pulled from the web and turned into structured data. Network-bound — we want hundreds of concurrent fetches per worker to saturate bandwidth. Parsing is CPU-bound and runs on the same workers, but in a separate thread pool so a slow parse doesn't block a fetch and vice versa. Both feed a document input stream (DIS) — a buffer that lets multiple processors (link extractor, content checksum, language detector) all read the same downloaded page without re-fetching.
Where deduped pages and their metadata land permanently. Disk-bound. Raw HTML goes to a blob store (HDFS or S3) — petabyte-scale, sequential writes. Metadata (URL, fetch time, status, checksum) goes to a sharded DB. The dedupe sets — what we've already seen — also live here, sharded by URL hash.
This three-plane split is what makes the crawler scale. Want more pages/sec? Add fetch workers — they don't touch the frontier's queue logic or the storage layer. Want to handle a flood of small sites? Shard the frontier finer — fetchers don't care. Each plane has its own knob.
Now the full picture. Every node is numbered — find its matching card below to see what it does and crucially what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
The starting set — typically the front pages of major news sites, popular blogs, and curated directories. A few thousand URLs is enough; the link graph fans out exponentially from there. Engineers update the seed list quarterly to bias the crawl toward known-fresh content.
Solves: the cold-start problem. Without seeds the crawler has no entry point — it would sit forever waiting for a URL to appear. Seeds are the initial energy injected into the graph traversal.
The crawler's TODO list — a priority queue of URLs waiting to be fetched, sharded by hostname so all of nytimes.com's pending URLs live in one shard, served by one worker. Each shard has a per-host FIFO sub-queue plus a global priority key based on PageRank, freshness, and depth. Disk-backed (Kafka or RocksDB) because hundreds of millions of pending URLs won't fit in RAM.
Solves: politeness and ordering at the same time. Sharding by hostname means one host can never be hit by two fetchers at once — that's our DDoS-prevention invariant. Priority means we crawl high-value pages first when bandwidth is scarce.
"Have we already seen this URL?" — a sharded set of URL checksums (4-byte CRC32 each). At 15B URLs × 4 bytes = ~60 GB, sharded across nodes by URL hash. Lookups are O(1) hash-set probes; disk fallback for cold shards. Every newly-extracted link goes through here before being enqueued.
Solves: infinite re-crawl. Without URL dedupe, every page that links back to the homepage would re-enqueue the homepage, and we'd loop forever. Some teams use a Bloom filter here — faster but with false positives that drop URLs we've never seen, which is unacceptable for a crawler that promises web coverage. We pay the extra memory for an exact set.
An asynchronous, in-process DNS client with a massive LRU cache. The OS resolver is synchronous and serializes through a single socket — at 6,200 pages/sec it would melt. Our resolver caches each hostname → IP for the TTL the authoritative server returns (typically 1-24 hours), and overlaps thousands of pending lookups.
Solves: the DNS-bottleneck failure from §6. Real measurement on production crawlers: switching from glibc's getaddrinfo to an async caching resolver dropped DNS time from 40% of total fetch latency to under 2%. Without this, the entire pipeline is DNS-bound.
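A minimal sketch of the same idea, assuming asyncio's built-in getaddrinfo as the underlying lookup; the CachingResolver class and its fixed TTL are illustrative, and a production resolver would honor per-record TTLs and coalesce concurrent lookups for the same host.

```python
# Async DNS cache: overlap many lookups, answer repeats from memory.
import asyncio, time

class CachingResolver:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds    # assumption: fixed TTL instead of the record's own TTL
        self.cache = {}           # host -> (ip, expires_at)

    async def resolve(self, host: str) -> str:
        entry = self.cache.get(host)
        if entry and entry[1] > time.monotonic():
            return entry[0]                        # cache hit: no network round trip
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 443)  # non-blocking lookup
        ip = infos[0][4][0]                        # first resolved address
        self.cache[host] = (ip, time.monotonic() + self.ttl)
        return ip

async def main():
    resolver = CachingResolver()
    for host in ["example.com", "python.org", "example.com"]:
        print(host, await resolver.resolve(host))  # the repeat lookup is a cache hit

asyncio.run(main())
```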
Stateless services that take a URL + IP from the frontier and execute the HTTP fetch. Each worker runs hundreds of concurrent fetches via async I/O (Netty / Tokio / asyncio). Reuses TCP connections per host (HTTP keep-alive) to amortize TLS handshake. Honors the crawl-delay from robots.txt — if a host says "wait 5 seconds between hits", the worker enforces that on its sub-queue.
Solves: the throughput target. With 50 workers × 200 concurrent fetches × ~8 fetches/sec each = ~80K fetches/sec headroom for a 6,200/sec target — comfortable margin for the 10× variance in real-world fetch latency.
An in-memory buffer that holds the bytes of a freshly-downloaded page so multiple processors can read it without re-fetching. The link extractor reads it. The content checksum reads it. The language detector reads it. The size-and-type metadata extractor reads it. All from the same buffer, in parallel.
Solves: redundant downloads. Without a DIS, each downstream consumer would either re-fetch the URL (catastrophic — 4× the bandwidth) or queue serially through the parser (catastrophic — 4× the latency). The DIS is the moral equivalent of a tee — one fetch, many readers.
CPU-bound stage that reads from the DIS, parses the HTML into a DOM, and extracts every href, src, and canonical tag. Normalizes URLs (resolves relative paths, lowercases hostnames, strips fragments and tracking params) so News.com/Article?utm=foo#top and news.com/article hash to the same key. Outputs the list of new URLs and the structured metadata for the page.
Solves: the graph-traversal step itself. Without link extraction, the crawler has no way to discover new URLs and would only ever fetch the seeds. Robust normalization is what keeps the URL dedupe set from exploding into a hundred variations of the same article.
"Have we already seen this content?" — even if the URL is new. Compute SHA-256 (or MD5 for speed) of the normalized page body and check against a set of seen checksums. Mirror sites, syndicated articles, and CDN replicas all collapse to the same hash and only get stored once. At 15B docs × 16-byte MD5 = ~240 GB checksum set, sharded by checksum.
Solves: wasted storage and wasted indexing work. Real-world web data is reportedly 20-30% duplicate. Without doc dedupe we'd store 1.95 PB instead of 1.5 PB and index every duplicate to no benefit. Same Bloom-filter caveat as URL dedupe — we pay the memory for an exact set.
Decides whether a freshly-extracted URL is even eligible to be enqueued. Three checks, in order: (1) URL scheme is http or https; (2) host's robots.txt permits the path (cached per-host with TTL); (3) URL doesn't match our blacklist of known traps (?sessionid=..., calendar URLs deeper than N levels, etc).
Solves: politeness compliance and trap avoidance. Without the filter, the crawler would happily fetch /admin/ and /api/v1/private/ in violation of robots.txt — and would dive infinitely into example.com/calendar?d=.... The filter is the gatekeeper that keeps us a good citizen of the web.
Where the raw bytes of every successfully-fetched, deduped page get written. Append-only, immutable, optimized for sequential writes — exactly the access pattern of a crawl. At 1.5 PB per crawl, this has to be a distributed file system. S3 for cloud-native, HDFS for on-prem. Pages are batched into ~128 MB files with an offset index so downstream consumers can mmap a chunk without listing millions of tiny files.
Solves: durable, petabyte-scale storage that the indexer can later replay. A relational DB would collapse under 1.5 PB and the workload doesn't need transactions. A blob store is the right shape for "write once, read many, sequential, huge".
A sharded database (Cassandra or DynamoDB) holding one row per crawled URL: {url, fetch_time, status_code, content_length, content_checksum, blob_offset, last_modified, host}. Lets us answer "when did we last crawl X?", "which pages 404'd this week?", and "show me every page in news.com fetched yesterday". Sharded by hostname to colocate per-host queries.
Solves: the operational visibility gap. Without a metadata DB you have a blob store full of pages and no way to query it. Want to re-crawl only pages older than 7 days? Without metadata you can't even ask the question.
Every N minutes, snapshots the in-memory state of all the volatile components — the URL frontier's heads, the URL/doc dedupe sets' deltas, each fetcher's progress — to S3. On a worker crash, recovery rolls back to the last snapshot and resumes; we lose at most N minutes of work. Snapshots are incremental (deltas only) so a 60GB URL set isn't rewritten every 5 minutes.
Solves: the multi-week-crawl problem. With 1,000 workers running for 28 days, the probability of zero crashes is essentially nil — multiple workers will die. Without checkpointing, every crash means restart-from-zero, and we never finish a crawl. Checkpointing is the difference between "this works in a demo" and "this works in production".
Two flows, mapped to the numbered components. The first is a happy-path fetch; the second is a recovery from a worker crash.
A happy-path fetch of https://news.ycombinator.com at 14:00:

- The frontier pops https://news.ycombinator.com (its shard: hash("news.ycombinator.com") mod 50 = 17).
- The DNS resolver returns 209.216.230.240 in 0.1 ms.
- The cached robots.txt for ycombinator.com says "Disallow: /api/, Crawl-delay: 1". The homepage is allowed; the worker waits at most 1 second since its previous fetch to news.ycombinator.com.
- The fetcher downloads the page; checksum(html) → 9b3a.... Lookup against the seen-set: miss. We've never seen this exact content. Mark as seen.
- The link extractor normalizes the discovered URLs (stripped ?utm_source, resolved relative paths) and enqueues the ones we haven't seen.
- The raw HTML is written to the blob store at seg_2024_42.dat:0x4ab1900, and a metadata row is inserted into Metadata DB ⑪. Total elapsed: 520 ms.

The frontier is the most architecturally important component in the system. Every politeness, priority, and load-balancing concern lives or dies here.
The frontier is not a single FIFO. It's a two-level structure:
The top-level priority queue picks which host to crawl next, ranked by PageRank, freshness signals, and depth. The per-host FIFO picks which URL of that host to crawl next — usually breadth-first within the host. This separation lets us respect "no two threads on one host" while still parallelizing across hosts.
Across the cluster, the frontier is sharded by hostname using consistent hashing. worker_id = consistent_hash("news.ycombinator.com") mod N. This guarantees: (a) one worker owns one host's queue, so politeness is automatic; (b) adding/removing workers reshuffles only 1/N of hosts instead of the whole cluster.
At 6,200 pages/sec for 4 weeks, the frontier holds hundreds of millions of pending URLs at peak. That doesn't fit in RAM. Implementation: Kafka per shard as the durable queue, with a small in-memory ring buffer of the next ~10K URLs per host for hot access. Enqueue/dequeue happen on the in-memory buffer; periodic flushes sync to Kafka.
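A minimal in-memory sketch of that two-level shape, assuming a heap of hosts keyed by (earliest-allowed-fetch-time, priority) and a FIFO deque per host; the Frontier class and its default 1-second delay are illustrative, and the Kafka-backed disk layer and cross-worker sharding are omitted.

```python
# Two-level frontier: a heap picks the next eligible host, a FIFO picks that host's next URL.
import heapq, time
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self):
        self.per_host = {}      # host -> deque of URLs (FIFO, i.e. BFS within the host)
        self.heap = []          # (not_before, -priority, host): which host goes next
        self.crawl_delay = {}   # host -> seconds to wait between fetches

    def push(self, url: str, priority: float = 0.0):
        host = urlsplit(url).hostname
        if host not in self.per_host:
            self.per_host[host] = deque()
            heapq.heappush(self.heap, (time.monotonic(), -priority, host))
        self.per_host[host].append(url)

    def pop(self):
        """Next URL whose host may be hit right now, or None if nothing is eligible yet."""
        while self.heap:
            not_before, neg_prio, host = heapq.heappop(self.heap)
            queue = self.per_host.get(host)
            if not queue:                        # host drained since it was scheduled
                self.per_host.pop(host, None)
                continue
            if not_before > time.monotonic():    # earliest host is still in its cooldown
                heapq.heappush(self.heap, (not_before, neg_prio, host))
                return None
            url = queue.popleft()
            delay = self.crawl_delay.get(host, 1.0)
            heapq.heappush(self.heap, (time.monotonic() + delay, neg_prio, host))
            return url
        return None
```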
The fetcher is where bytes actually leave the website and arrive in our data center. Three operational concerns dominate.
Before fetching example.com/anything, we must check example.com/robots.txt. Re-fetching it on every page would multiply our request count by 2× and re-parse it would waste CPU. We cache the parsed robots ruleset per host with a 24-hour TTL. First page on a new host pays the lookup cost; subsequent pages reuse the cached ruleset.
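A minimal sketch of the per-host cache, assuming Python's standard urllib.robotparser; the RobotsCache class and its fixed 24-hour TTL are illustrative (a production version would honor the response's Cache-Control).

```python
# Per-host robots.txt cache: one fetch + parse per host per day, O(1) checks afterwards.
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    TTL = 24 * 3600                     # assumption: fixed 24h TTL

    def __init__(self, user_agent="MyCrawler"):
        self.user_agent = user_agent
        self.cache = {}                 # host -> (parsed ruleset, expires_at)

    def _rules(self, host: str, scheme: str = "https") -> RobotFileParser:
        entry = self.cache.get(host)
        if entry and entry[1] > time.monotonic():
            return entry[0]             # ruleset already parsed and still fresh
        rp = RobotFileParser(f"{scheme}://{host}/robots.txt")
        rp.read()                       # the one network fetch per host per TTL
        self.cache[host] = (rp, time.monotonic() + self.TTL)
        return rp

    def allowed(self, url: str) -> bool:
        return self._rules(urlsplit(url).hostname).can_fetch(self.user_agent, url)

    def crawl_delay(self, host: str) -> float:
        return self._rules(host).crawl_delay(self.user_agent) or 1.0
```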
TCP + TLS handshake costs ~150 ms. If we open a fresh connection for every page on the same host, that's 30% of fetch time wasted on handshake. Keep-alive holds the TCP connection open for ~60 seconds and reuses it for the next page — typically the same host, since per-host queues are drained DFS-style.
Some robots.txt files include Crawl-delay: 5 — wait 5 seconds between requests. The fetcher tracks last_request_time per host and sleeps if needed. Without this we'd fire 8 requests/sec at a host that asked for 0.2 — getting rate-limited or banned.
Each fetcher worker runs a fixed-size thread pool (say 200 threads) with non-blocking I/O underneath. The thread pool is logically partitioned: each thread is sticky to a small group of hosts so per-host state (last-fetch-time, robots cache) doesn't need cross-thread locking. New URLs land on the thread that owns their host.
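A minimal sketch of that stickiness, assuming the requests library: hashing the hostname picks the owning thread, and each host keeps one keep-alive Session and one last-fetch timestamp, all thread-local. The HostFetcher class and the fixed crawl delay are illustrative.

```python
# Host-sticky fetching: hash(host) picks the owning thread; each host reuses one
# keep-alive Session, and last-fetch times are tracked without cross-thread locks.
import hashlib, time
from urllib.parse import urlsplit
import requests

N_THREADS = 200                         # size of the fetcher's thread pool

def owning_thread(host: str) -> int:
    """Every URL of a host lands on the same thread, so per-host state stays thread-local."""
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % N_THREADS

class HostFetcher:
    """Per-thread state for the hosts that thread owns."""
    def __init__(self, crawl_delay: float = 1.0):
        self.sessions = {}              # host -> requests.Session (keep-alive reuse)
        self.last_fetch = {}            # host -> time of that host's last request
        self.crawl_delay = crawl_delay  # assumption: fixed; normally taken from robots.txt

    def fetch(self, url: str) -> requests.Response:
        host = urlsplit(url).hostname
        wait = self.last_fetch.get(host, 0) + self.crawl_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)            # enforce the delay between hits to the same host
        session = self.sessions.setdefault(host, requests.Session())
        resp = session.get(url, timeout=10)   # reuses the open TCP/TLS connection if possible
        self.last_fetch[host] = time.monotonic()
        return resp
```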
Many distinct URLs serve identical content — CDN mirrors, syndicated news, blogs that copy each other's articles. Without deduplication we'd store the same content many times and the search index would surface near-duplicate results.
For each downloaded page: (1) normalize — strip whitespace, scripts, tracking pixels, and dynamic ad content; (2) compute a content checksum (MD5 or SHA-1 — both are fast and 128/160 bits is plenty for collision avoidance); (3) check against the global "seen content" set; (4) if seen, skip storage and just record the URL → existing-blob mapping in metadata.
15B unique pages × 16 bytes (MD5) = 240 GB of checksums. Shard by checksum across nodes. Each node holds a fast in-memory hash set for hot checksums (recently-seen) plus a disk-backed RocksDB for cold ones. LRU pushes cold entries to disk.
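A minimal sketch of steps (1) through (3), assuming a crude regex normalizer and plain Python sets standing in for the sharded hot/cold store; normalize and is_duplicate are illustrative names.

```python
# Content dedupe: normalize -> checksum -> shard -> exact-set membership test.
import hashlib, re

N_SHARDS = 64
seen = [set() for _ in range(N_SHARDS)]     # stand-in for the sharded in-memory + RocksDB store

def normalize(html: str) -> bytes:
    """Crude normalization: drop scripts/styles and collapse whitespace so
    trivially different copies of the same article hash to the same value."""
    body = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
    body = re.sub(r"\s+", " ", body).strip().lower()
    return body.encode("utf-8")

def is_duplicate(html: str) -> bool:
    digest = hashlib.md5(normalize(html)).digest()   # 16 bytes per document
    shard = digest[0] % N_SHARDS                     # which node owns this checksum
    if digest in seen[shard]:
        return True
    seen[shard].add(digest)
    return False
```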
Different from doc dedupe — this is "have we already enqueued this URL?", checked at link-extraction time, before the fetcher even sees it.
(1) Normalize the URL: lowercase host, strip #fragment, remove ?utm_* tracking params, sort query string keys, resolve relative paths. (2) Compute a 4-byte CRC32 or 8-byte xxHash. (3) Check against the URL-seen set, sharded by checksum.
15B URLs × 4 bytes (CRC32) = 60 GB across the cluster. Same hot/cold split as doc dedupe. Same Bloom-filter caveat: false positives drop URLs we've never crawled, so we use an exact set.
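A minimal sketch of that pipeline, assuming zlib.crc32 for the 4-byte checksum and a small, illustrative blocklist of tracking parameters; canonicalize and url_seen_before are not names from the architecture above, and relative-path resolution (which needs a base URL) is left out.

```python
# URL dedupe: canonicalize, checksum with CRC32 (4 bytes), test against an exact set.
import zlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

seen_urls = set()

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    query = sorted(                                   # sort query keys, drop tracking params
        (k, v) for k, v in parse_qsl(parts.query)
        if not (k.lower().startswith("utm_") or k.lower() in {"sid", "sessionid"})
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),                         # lowercase the hostname
        parts.path or "/",
        urlencode(query),
        "",                                           # strip the #fragment
    ))

def url_seen_before(url: str) -> bool:
    key = zlib.crc32(canonicalize(url).encode("utf-8"))   # 4-byte checksum
    if key in seen_urls:
        return True
    seen_urls.add(key)
    return False

# Tracking params and fragments collapse to the same key:
assert canonicalize("https://News.com/Article?utm_source=foo#top") == \
       canonicalize("https://news.com/Article")
```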
With 1,000 fetcher workers running for 28 days, the expected number of worker failures is in the hundreds. We cannot restart a 4-week crawl from zero on every crash.
Every 5 minutes per worker. Each snapshot captures: (a) the URL frontier shard heads (which URLs are next to fetch), (b) deltas to the URL/doc dedupe sets since the previous snapshot, (c) per-host last-fetch-time and robots cache TTLs. Snapshots are written to S3 with a monotonically-increasing version number.
Orchestrator detects worker death in ~30 seconds. New worker boots, fetches the latest snapshot for its assigned shards from S3, restores in-memory state (~90 seconds for 60 GB of dedupe set), resumes from the snapshot's frontier head. Worst-case data loss: 5 minutes of progress. Re-fetched pages are deduped at storage time so we don't double-write.
A 60 GB URL set rewritten every 5 minutes would saturate our network — 200 MB/sec just for snapshots. Instead we use incremental snapshots: each snapshot only contains the deltas added since the previous one. Recovery replays the chain of deltas on top of the latest base snapshot. Bases get rewritten daily during low-traffic hours.
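A minimal sketch of the delta scheme, assuming local JSON files standing in for S3 objects and a set of integer checksums as the state; base snapshots, frontier heads, and per-host state are omitted, and CheckpointedSet is an illustrative name.

```python
# Incremental checkpointing: write only the delta since the last snapshot,
# recover by replaying deltas in version order.
import json, pathlib

class CheckpointedSet:
    def __init__(self, directory="checkpoints"):
        self.dir = pathlib.Path(directory)
        self.dir.mkdir(exist_ok=True)
        self.items = set()        # e.g. URL checksums
        self.pending = set()      # added since the last snapshot
        self.version = 0

    def add(self, item: int):
        if item not in self.items:
            self.items.add(item)
            self.pending.add(item)

    def snapshot(self):
        """Every N minutes: persist only the delta, not the whole multi-GB set."""
        self.version += 1
        path = self.dir / f"delta-{self.version:08d}.json"
        path.write_text(json.dumps(sorted(self.pending)))
        self.pending.clear()

    @classmethod
    def recover(cls, directory="checkpoints"):
        """On worker restart: replay deltas in version order; lose at most one interval."""
        restored = cls(directory)
        for path in sorted(restored.dir.glob("delta-*.json")):
            restored.items.update(json.loads(path.read_text()))
            restored.version += 1
        return restored
```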
Workers will crash. Networks will partition. Disks will fail. The architecture must shrug those off automatically.
Hostnames are mapped to workers via consistent hashing on a ring of virtual nodes. When a worker dies, only its virtual nodes are reassigned to neighbors — about 1/N of the hostname space rebalances, not the whole cluster. The replacement worker takes over those neighbors' overflow once it boots.
When Worker 17 dies, its frontier shard sits in Kafka untouched — Kafka durably stores the queue. The replacement Worker 17 reads from the same Kafka partition. No URL is lost; pending fetches just pause for ~2 minutes during failover.
Blob writes use the URL as part of the object key. If a fetch is retried after a crash, the second write either overwrites the first (S3) or is dropped (HDFS create-if-not-exists). Either way we never corrupt the dataset with a half-written page.
Where does each piece of data physically live? The partitioning strategy directly determines politeness and recoverability.
All URLs for one hostname live on one worker. This is what enforces "one connection per host". Implementation: worker = consistent_hash(host) mod N, with virtual nodes for smooth rebalancing.
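A minimal sketch of that mapping. The "consistent_hash(host) mod N" shorthand is realized here the usual way, as a ring lookup over virtual nodes, which is what limits rebalancing to roughly 1/N of hosts when a worker joins or leaves; the HashRing class and 100 virtual nodes per worker are illustrative.

```python
# Consistent hashing with virtual nodes: host -> owning worker.
import bisect, hashlib

class HashRing:
    def __init__(self, workers, vnodes=100):
        self.ring = []                       # sorted list of (ring position, worker)
        for worker in workers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{worker}#{i}"), worker))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, host: str) -> str:
        """The first virtual node clockwise from the host's hash owns that host's queue."""
        idx = bisect.bisect(self.points, self._hash(host)) % len(self.points)
        return self.ring[idx][1]

ring = HashRing([f"worker-{i}" for i in range(50)])
print(ring.owner("news.ycombinator.com"))   # all of this host's URLs go to one worker
```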
URL checksums and document checksums are sharded across nodes by shard = checksum mod N. Distribution is uniform by construction. Lookups during crawl are RPC to the right shard; sub-millisecond.
S3 / HDFS handles partitioning internally. We just pick a key naming convention: crawl-YYYY-WW/seg-NNNNN.dat with offsets into each segment. The blob store's own consistent hashing distributes segments across nodes.
Metadata for all nytimes.com pages lives on the same Cassandra shard as the frontier owner of that host. Colocates write traffic and lets per-host queries (re-crawl all of news.com) run on one node without scatter/gather.
Some sites generate infinite URLs. A crawler that walks into one and doesn't recognize it spends the rest of its life there.
Common trap shapes:

- Calendar traps: /cal?d=2024-01-01 links to /cal?d=2024-01-02, ad infinitum.
- Unbounded pagination: /search?q=foo&page=N with N unbounded.
- Session-ID URLs: every link carries a fresh ?sid=abc123 so URL dedupe never triggers (mitigated by stripping sid, session, utm_* by rule).

How robots.txt stays cheap: on first contact with a host, the fetcher requests /robots.txt, parses it once, and stores the ruleset (allowed paths, disallowed paths, crawl-delay) in an in-memory map keyed by hostname. Cache TTL is 24 hours or whatever the response's Cache-Control says. Subsequent fetches do an O(1) ruleset lookup. Net cost: 1 extra fetch per host per day, regardless of how many pages we pull from that host.

How politeness is guaranteed: every URL for nytimes.com goes to one worker, which has one thread (or a small pool) draining that host's queue serially with crawl-delay enforced between fetches. There is no path in the architecture by which two of our threads ever talk to one host simultaneously. Layer two is honoring Crawl-delay from robots.txt; layer three is exponential backoff on 429/503 responses.

How traps are caught: (1) URL pattern blacklists (sid, sessionid, utm_*). (2) Per-host caps — max depth (~30) and max pages (~1M) per hostname. (3) Content-similarity detection — SimHash of the last N pages from a host; if they're all near-duplicates, blacklist the path prefix and stop following links into it. The first two are cheap and catch ~80%; the third catches sophisticated traps like faceted search.
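A minimal sketch of the first two layers (pattern blacklist and per-host caps); the SimHash layer is omitted, and the thresholds and BANNED_PARAMS list are illustrative.

```python
# URL filter, trap edition: cheap pattern and per-host checks before enqueueing.
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

MAX_DEPTH = 30                  # path segments per URL
MAX_PAGES_PER_HOST = 1_000_000
BANNED_PARAMS = {"sid", "sessionid", "phpsessid"}   # illustrative blacklist

pages_enqueued = Counter()      # host -> URLs already accepted for that host

def eligible(url: str) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # Layer 1: pattern blacklist (session ids and friends)
    if any(k.lower() in BANNED_PARAMS for k, _ in parse_qsl(parts.query)):
        return False
    # Layer 2: per-host caps on depth and total pages
    depth = len([seg for seg in parts.path.split("/") if seg])
    if depth > MAX_DEPTH:
        return False
    host = parts.hostname or ""
    if pages_enqueued[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_enqueued[host] += 1
    return True
```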