From "one thread chasing links" to a globally distributed pipeline that crawls 15 billion pages a month — politely, durably, and without DDoS-ing anyone
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine Googlebot waking up at midnight. It has one URL on its desk — https://www.google.com. It opens it, reads the HTML, scribbles down every link it finds (say 200 of them), and adds them to a giant TODO list. Then it picks the next URL off the list, opens that one, scribbles down its links, and keeps going. Two weeks later, that one starting URL has fanned out into 15 billion pages indexed and stored. That's a web crawler — a piece of software that systematically browses the internet by following links, downloading pages, and feeding them to a search engine's index, an archive, or a data-mining pipeline.
Crawlers power more than search engines. They're how the Internet Archive snapshots the web, how SEO tools like Ahrefs build link graphs, how price comparison sites learn that a TV dropped $50, and how academic researchers measure the spread of misinformation. Whenever a system needs a fresh, structured view of the public web, a crawler is the thing that goes and gets it.
Before drawing a single box, pin down what the crawler must do. In an interview, asking these questions out loud signals you understand why the production design looks the way it does.
robots.txt — the website's rule file that says which paths a crawler may or may not visit.

requests + BeautifulSoup is a working crawler for 100 pages. The architecture below exists because we need 15 billion pages in 4 weeks while staying off every site's blocklist.

What's in scope today, and what should the architecture leave room for tomorrow? A crawler that hard-codes "HTML over HTTP" will be unusable the day someone asks it to also index PDFs or follow FTP links. Decide upfront which axes are flexible.
Start with HTML only. But the parser is a pluggable module — adding a PDF extractor or an image OCR pass tomorrow should be a new class behind the same interface, not a fork of the codebase. Sarah on the search team will thank us when she wants to index research papers.
Start with HTTP and HTTPS only — they cover 99.9% of the public web. But the fetcher layer is also pluggable. If we ever need to mirror an FTP archive or a gRPC API surface, we add an FTPFetcher behind the same Fetcher interface.
Every site can publish https://example.com/robots.txt declaring which paths crawlers may visit. Honoring it is non-negotiable — ignore it and you get banned, sued, or both. Cache each host's robots.txt aggressively (one parse per host per day) so we don't re-fetch it on every page.
Numbers drive every architectural choice — sharding, queue depth, bandwidth provisioning. Do them out loud, even if rough. Our target: crawl 15 billion pages every 4 weeks (a refresh interval that matches Google's reported re-crawl cadence for medium-rank sites).
15B pages over 4 weeks = 15,000,000,000 / (4 × 7 × 86,400) ≈ 6,200 pages/second. That's the steady-state rate the entire pipeline must sustain — fetch, parse, dedupe, store — for a month straight.
| Estimate | Value | Basis |
|---|---|---|
| Pages per second | ~6,200 | 15B / (4 weeks) |
| Average page size | 100 KB | HTML + assets metadata |
| Metadata per page | 500 B | URL, headers, timing |
| Ingest bandwidth | ~620 MB/s | ~5 Gbps sustained |
15B pages × (100 KB page + 500 B metadata) ≈ 1.5 PB for one full crawl. Provision so a full crawl fills only ~70% of capacity (so we don't run out of disk before the next crawl finishes): 1.5 PB / 0.7 ≈ 2.14 PB. This goes to a blob store — HDFS or S3 — not a relational DB.
| Metric | Value | Why it matters |
|---|---|---|
| Pages/sec | 6,200/s | Drives fetcher pool size and DNS resolver capacity |
| Sustained bandwidth | 5 Gbps | Forces multiple network egresses; one NIC is not enough |
| Crawl storage | 1.5 PB | Mandates blob store (HDFS/S3); no SQL DB at this scale |
| URL set size | ~60 GB | 15B × 4 bytes (checksum) — fits a sharded in-memory set with disk fallback |
| Crawl duration | 4 weeks | Long enough that worker crashes will happen — checkpointing is mandatory |
Strip away the distributed-systems machinery and a crawler is just graph traversal. Pages are nodes; hyperlinks are edges; we walk the graph from a small set of seed nodes outward.
Pseudocode — the entire crawler in 8 steps:

```
while frontier not empty:
    url = frontier.pop()                          # 1. take next URL
    ip  = dns_resolver.resolve(url.host)          # 2. host → IP
    if not robots_allowed(url): continue          # 3. politeness check
    html = fetcher.get(url, ip)                   # 4. download bytes
    if doc_seen_before(checksum(html)): continue  # 5. dedupe content
    storage.save(url, html)                       # 6. persist
    for link in parser.extract_links(html):       # 7. discover new edges
        if not url_seen_before(link):
            frontier.push(link)                   # 8. enqueue
```
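The eight steps map almost line-for-line onto a toy crawler. Here is a minimal sketch, assuming Python with requests and BeautifulSoup (the "100 pages" setup mentioned earlier); the `crawl` function, `max_pages` cap, and in-memory `pages` dict are illustrative, and robots.txt, politeness, and persistence are deliberately missing, since they are what the rest of the article adds.

```python
# Toy single-threaded crawler: the 8-step loop with URL dedupe and content dedupe.
# Assumes: pip install requests beautifulsoup4. No robots.txt, no politeness, no storage.
import hashlib
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)            # the TODO list (FIFO = BFS)
    seen_urls = set(seeds)             # "have we already enqueued this URL?"
    seen_docs = set()                  # "have we already seen this content?"
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()                        # 1. take next URL
        try:
            resp = requests.get(url, timeout=10)        # 2+4. DNS + fetch (left to the OS here)
        except requests.RequestException:
            continue
        digest = hashlib.md5(resp.content).hexdigest()  # 5. content checksum
        if digest in seen_docs:
            continue
        seen_docs.add(digest)
        pages[url] = resp.text                          # 6. "persist" (in memory only)

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):         # 7. discover new edges
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)                   # 8. enqueue
    return pages

if __name__ == "__main__":
    result = crawl(["https://example.com/"], max_pages=20)
    print(f"crawled {len(result)} pages")
```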
Breadth-first — explore all of news.com's direct links before going deeper. The frontier is a FIFO queue. This is the default because it spreads load across many hosts naturally — we crawl shallow across the whole web before going deep into any one site, which keeps freshness uniform.
Depth-first within one host is useful for amortizing the cost of TCP/TLS handshake and DNS — once a connection to news.com is open, drain a bunch of its URLs through that same connection before tearing it down. We use DFS inside a host's queue while BFS is the global discipline.
The frontier is not one queue — it's a forest of per-host queues fronted by a global priority dispatcher. We'll see why in §8.
If you build the pseudocode from §5 verbatim, you'll hit five walls — fast. Every box in our production architecture exists to work around exactly one of these.
The web is roughly 50 billion pages and growing. Even at 6,200 pages/sec we need 4 weeks. We can't crawl everything; we must prioritize — high-PageRank, recently-changed, and seed-anchored URLs first.
Pages change while we're crawling them. By the time we've indexed nytimes.com, the homepage has been edited 50 times. We accept stale-by-design and re-crawl on a refresh cadence — but the system must tolerate "what I just saw is already wrong".
If 1,000 fetcher threads all pile onto blog.example.com at once, we just DDoS'd that site. They'll firewall our IP range within minutes and we lose access permanently. We must throttle per host — no more than 1 connection at a time to any one site, with a delay between requests.
Some sites generate infinite pages — a calendar that lets you click "next month" forever, search results with infinite pagination, dynamically-generated session-id URLs that all return the same content. A naive crawler will sit in example.com/calendar?date=... for the rest of its life and never reach the rest of the web.
The same page is reachable via dozens of URLs (?utm_source=..., mobile redirects, mirror sites). And many distinct URLs serve identical content (CDN replicas, syndicated articles). Without dedupe we'd index the same article 50 times and waste 95% of our storage.
Every fetch starts with "what's the IP for this hostname?". The OS DNS resolver does this synchronously and one query at a time per process — at 6,200 pages/sec hitting different hosts, DNS alone becomes the bottleneck. We need our own caching async resolver.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The numbers from §4 and the difficulties from §6 drive every decision.
Sketch the simplest possible crawler: one process, one thread, one URL queue in memory. Pop a URL → fetch → parse → push new links → repeat. The whole thing fits on one laptop.
Four concrete failures emerge the moment we try to scale:
One HTTP request takes ~500ms (DNS + TCP + TLS + download). At one fetch at a time, that's 2 pages per second. To crawl 15B pages serially: 15B / 2 ≈ 240 years. We need parallelism by a factor of 3,000× just to hit the target.
If we naively parallelize by spawning 3,000 threads on one queue, the queue's BFS order means ~50 threads will be hitting nytimes.com simultaneously. NYT's edge firewall sees that as a DoS and blocks our entire IP range. Our crawler is now blind to a major chunk of the web.
The thread enters example.com/cal?d=2024-01-01, finds links to ?d=2024-01-02, ?d=2024-01-03, ... It will stay in that calendar forever, growing the queue with URLs nobody will ever click, while real content gets starved out.
The OS resolver gets ~3,000 lookups/sec funneled through a single socket. It either melts or rate-limits us. Worse, if the queue is in-memory and the process crashes 12 days into a 28-day crawl, we lose everything and start from zero.
The single most important insight in crawler design is that crawling is not one job — it's a pipeline of stages with completely different bottlenecks. Network-bound fetching, CPU-bound parsing, disk-bound storage, memory-bound dedupe — each stage scales on its own axis, and conflating them in one process means everyone waits for the slowest.
Where URLs live before they're fetched. Sharded by hostname — every URL for nytimes.com goes to the same queue, served by the same worker. That's how we enforce politeness: at most one fetcher per host means no host can ever be overwhelmed by us. The frontier is disk-backed because hundreds of millions of pending URLs won't fit in RAM.
Where bytes get pulled from the web and turned into structured data. Network-bound — we want hundreds of concurrent fetches per worker to saturate bandwidth. Parsing is CPU-bound and runs on the same workers, but in a separate thread pool so a slow parse doesn't block a fetch and vice versa. Both feed a document input stream (DIS) — a buffer that lets multiple processors (link extractor, content checksum, language detector) all read the same downloaded page without re-fetching.
Where deduped pages and their metadata land permanently. Disk-bound. Raw HTML goes to a blob store (HDFS or S3) — petabyte-scale, sequential writes. Metadata (URL, fetch time, status, checksum) goes to a sharded DB. The dedupe sets — what we've already seen — also live here, sharded by URL hash.
This three-plane split is what makes the crawler scale. Want more pages/sec? Add fetch workers — they don't touch the frontier's queue logic or the storage layer. Want to handle a flood of small sites? Shard the frontier finer — fetchers don't care. Each plane has its own knob.
Now the full picture. Every node is numbered — find its matching card below to see what it does and crucially what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
The starting set — typically the front pages of major news sites, popular blogs, and curated directories. A few thousand URLs is enough; the link graph fans out exponentially from there. Engineers update the seed list quarterly to bias the crawl toward known-fresh content.
Solves: the cold-start problem. Without seeds the crawler has no entry point — it would sit forever waiting for a URL to appear. Seeds are the initial energy injected into the graph traversal.
The crawler's TODO list — a priority queue of URLs waiting to be fetched, sharded by hostname so all of nytimes.com's pending URLs live in one shard, served by one worker. Each shard has a per-host FIFO sub-queue plus a global priority key based on PageRank, freshness, and depth. Disk-backed (Kafka or RocksDB) because hundreds of millions of pending URLs won't fit in RAM.
Solves: politeness and ordering at the same time. Sharding by hostname means one host can never be hit by two fetchers at once — that's our DDoS-prevention invariant. Priority means we crawl high-value pages first when bandwidth is scarce.
"Have we already seen this URL?" — a sharded set of URL checksums (4-byte CRC32 each). At 15B URLs × 4 bytes = ~60 GB, sharded across nodes by URL hash. Lookups are O(1) hash-set probes; disk fallback for cold shards. Every newly-extracted link goes through here before being enqueued.
Solves: infinite re-crawl. Without URL dedupe, every page that links back to the homepage would re-enqueue the homepage, and we'd loop forever. Some teams use a Bloom filter here — faster but with false positives that drop URLs we've never seen, which is unacceptable for a crawler that promises web coverage. We pay the extra memory for an exact set.
An asynchronous, in-process DNS client with a massive LRU cache. The OS resolver is synchronous and serializes through a single socket — at 6,200 pages/sec it would melt. Our resolver caches each hostname → IP for the TTL the authoritative server returns (typically 1-24 hours), and overlaps thousands of pending lookups.
Solves: the DNS-bottleneck failure from §6. Real measurement on production crawlers: switching from glibc's getaddrinfo to an async caching resolver dropped DNS time from 40% of total fetch latency to under 2%. Without this, the entire pipeline is DNS-bound.
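A minimal sketch of the same idea, assuming asyncio's built-in getaddrinfo as the underlying lookup; the CachingResolver class and its fixed TTL are illustrative, and a production resolver would honor per-record TTLs and coalesce concurrent lookups for the same host.

```python
# Async DNS cache: overlap many lookups, answer repeats from memory.
import asyncio, time

class CachingResolver:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds    # assumption: fixed TTL instead of the record's own TTL
        self.cache = {}           # host -> (ip, expires_at)

    async def resolve(self, host: str) -> str:
        entry = self.cache.get(host)
        if entry and entry[1] > time.monotonic():
            return entry[0]                        # cache hit: no network round trip
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 443)  # non-blocking lookup
        ip = infos[0][4][0]                        # first resolved address
        self.cache[host] = (ip, time.monotonic() + self.ttl)
        return ip

async def main():
    resolver = CachingResolver()
    for host in ["example.com", "python.org", "example.com"]:
        print(host, await resolver.resolve(host))  # the repeat lookup is a cache hit

asyncio.run(main())
```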
Stateless services that take a URL + IP from the frontier and execute the HTTP fetch. Each worker runs hundreds of concurrent fetches via async I/O (Netty / Tokio / asyncio). Reuses TCP connections per host (HTTP keep-alive) to amortize TLS handshake. Honors the crawl-delay from robots.txt — if a host says "wait 5 seconds between hits", the worker enforces that on its sub-queue.
Solves: the throughput target. With 50 workers × 200 concurrent fetches × ~8 fetches/sec each = ~80K fetches/sec headroom for a 6,200/sec target — comfortable margin for the 10× variance in real-world fetch latency.
An in-memory buffer that holds the bytes of a freshly-downloaded page so multiple processors can read it without re-fetching. The link extractor reads it. The content checksum reads it. The language detector reads it. The size-and-type metadata extractor reads it. All from the same buffer, in parallel.
Solves: redundant downloads. Without a DIS, each downstream consumer would either re-fetch the URL (catastrophic — 4× the bandwidth) or queue serially through the parser (catastrophic — 4× the latency). The DIS is the moral equivalent of a tee — one fetch, many readers.
CPU-bound stage that reads from the DIS, parses the HTML into a DOM, and extracts every href, src, and canonical tag. Normalizes URLs (resolves relative paths, lowercases hostnames, strips fragments and tracking params) so News.com/Article?utm=foo#top and news.com/article hash to the same key. Outputs the list of new URLs and the structured metadata for the page.
Solves: the graph-traversal step itself. Without link extraction, the crawler has no way to discover new URLs and would only ever fetch the seeds. Robust normalization is what keeps the URL dedupe set from exploding into a hundred variations of the same article.
"Have we already seen this content?" — even if the URL is new. Compute SHA-256 (or MD5 for speed) of the normalized page body and check against a set of seen checksums. Mirror sites, syndicated articles, and CDN replicas all collapse to the same hash and only get stored once. At 15B docs × 16-byte MD5 = ~240 GB checksum set, sharded by checksum.
Solves: wasted storage and wasted indexing work. Real-world web data is reportedly 20-30% duplicate. Without doc dedupe we'd store 1.95 PB instead of 1.5 PB and index every duplicate to no benefit. Same Bloom-filter caveat as URL dedupe — we pay the memory for an exact set.
Decides whether a freshly-extracted URL is even eligible to be enqueued. Three checks, in order: (1) URL scheme is http or https; (2) host's robots.txt permits the path (cached per-host with TTL); (3) URL doesn't match our blacklist of known traps (?sessionid=..., calendar URLs deeper than N levels, etc).
Solves: politeness compliance and trap avoidance. Without the filter, the crawler would happily fetch /admin/ and /api/v1/private/ in violation of robots.txt — and would dive infinitely into example.com/calendar?d=.... The filter is the gatekeeper that keeps us a good citizen of the web.
Where the raw bytes of every successfully-fetched, deduped page get written. Append-only, immutable, optimized for sequential writes — exactly the access pattern of a crawl. At 1.5 PB per crawl, this has to be a distributed file system. S3 for cloud-native, HDFS for on-prem. Pages are batched into ~128 MB files with an offset index so downstream consumers can mmap a chunk without listing millions of tiny files.
Solves: durable, petabyte-scale storage that the indexer can later replay. A relational DB would collapse under 1.5 PB and the workload doesn't need transactions. A blob store is the right shape for "write once, read many, sequential, huge".
A sharded database (Cassandra or DynamoDB) holding one row per crawled URL: {url, fetch_time, status_code, content_length, content_checksum, blob_offset, last_modified, host}. Lets us answer "when did we last crawl X?", "which pages 404'd this week?", and "show me every page in news.com fetched yesterday". Sharded by hostname to colocate per-host queries.
Solves: the operational visibility gap. Without a metadata DB you have a blob store full of pages and no way to query it. Want to re-crawl only pages older than 7 days? Without metadata you can't even ask the question.
Every N minutes, snapshots the in-memory state of all the volatile components — the URL frontier's heads, the URL/doc dedupe sets' deltas, each fetcher's progress — to S3. On a worker crash, recovery rolls back to the last snapshot and resumes; we lose at most N minutes of work. Snapshots are incremental (deltas only) so a 60GB URL set isn't rewritten every 5 minutes.
Solves: the multi-week-crawl problem. With 1,000 workers running for 28 days, the probability of zero crashes is essentially nil — multiple workers will die. Without checkpointing, every crash means restart-from-zero, and we never finish a crawl. Checkpointing is the difference between "this works in a demo" and "this works in production".
Two flows, mapped to the numbered components. The first is a happy-path fetch; the second is a recovery from a worker crash.
A happy-path fetch of https://news.ycombinator.com at 14:00:

- The frontier pops https://news.ycombinator.com (its shard: hash("news.ycombinator.com") mod 50 = 17).
- The DNS resolver returns 209.216.230.240 in 0.1 ms.
- The cached robots.txt for ycombinator.com says "Disallow: /api/, Crawl-delay: 1". The homepage is allowed; the worker waits at most 1 second since its previous fetch to news.ycombinator.com.
- The fetcher downloads the page; checksum(html) → 9b3a.... Lookup against the seen-set: miss. We've never seen this exact content. Mark as seen.
- The link extractor normalizes the discovered URLs (stripped ?utm_source, resolved relative paths) and enqueues the ones we haven't seen.
- The raw HTML is written to the blob store at seg_2024_42.dat:0x4ab1900, and a metadata row is inserted into Metadata DB ⑪. Total elapsed: 520 ms.

The frontier is the most architecturally important component in the system. Every politeness, priority, and load-balancing concern lives or dies here.
The frontier is not a single FIFO. It's a two-level structure:
The top-level priority queue picks which host to crawl next, ranked by PageRank, freshness signals, and depth. The per-host FIFO picks which URL of that host to crawl next — usually breadth-first within the host. This separation lets us respect "no two threads on one host" while still parallelizing across hosts.
Across the cluster, the frontier is sharded by hostname using consistent hashing. worker_id = consistent_hash("news.ycombinator.com") mod N. This guarantees: (a) one worker owns one host's queue, so politeness is automatic; (b) adding/removing workers reshuffles only 1/N of hosts instead of the whole cluster.
At 6,200 pages/sec for 4 weeks, the frontier holds hundreds of millions of pending URLs at peak. That doesn't fit in RAM. Implementation: Kafka per shard as the durable queue, with a small in-memory ring buffer of the next ~10K URLs per host for hot access. Enqueue/dequeue happen on the in-memory buffer; periodic flushes sync to Kafka.
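A minimal in-memory sketch of that two-level shape, assuming a heap of hosts keyed by (earliest-allowed-fetch-time, priority) and a FIFO deque per host; the Frontier class and its default 1-second delay are illustrative, and the Kafka-backed disk layer and cross-worker sharding are omitted.

```python
# Two-level frontier: a heap picks the next eligible host, a FIFO picks that host's next URL.
import heapq, time
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self):
        self.per_host = {}      # host -> deque of URLs (FIFO, i.e. BFS within the host)
        self.heap = []          # (not_before, -priority, host): which host goes next
        self.crawl_delay = {}   # host -> seconds to wait between fetches

    def push(self, url: str, priority: float = 0.0):
        host = urlsplit(url).hostname
        if host not in self.per_host:
            self.per_host[host] = deque()
            heapq.heappush(self.heap, (time.monotonic(), -priority, host))
        self.per_host[host].append(url)

    def pop(self):
        """Next URL whose host may be hit right now, or None if nothing is eligible yet."""
        while self.heap:
            not_before, neg_prio, host = heapq.heappop(self.heap)
            queue = self.per_host.get(host)
            if not queue:                        # host drained since it was scheduled
                self.per_host.pop(host, None)
                continue
            if not_before > time.monotonic():    # earliest host is still in its cooldown
                heapq.heappush(self.heap, (not_before, neg_prio, host))
                return None
            url = queue.popleft()
            delay = self.crawl_delay.get(host, 1.0)
            heapq.heappush(self.heap, (time.monotonic() + delay, neg_prio, host))
            return url
        return None
```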
The fetcher is where bytes actually leave the website and arrive in our data center. Three operational concerns dominate.
Before fetching example.com/anything, we must check example.com/robots.txt. Re-fetching it on every page would multiply our request count by 2× and re-parse it would waste CPU. We cache the parsed robots ruleset per host with a 24-hour TTL. First page on a new host pays the lookup cost; subsequent pages reuse the cached ruleset.
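A minimal sketch of the per-host cache, assuming Python's standard urllib.robotparser; the RobotsCache class and its fixed 24-hour TTL are illustrative (a production version would honor the response's Cache-Control).

```python
# Per-host robots.txt cache: one fetch + parse per host per day, O(1) checks afterwards.
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    TTL = 24 * 3600                     # assumption: fixed 24h TTL

    def __init__(self, user_agent="MyCrawler"):
        self.user_agent = user_agent
        self.cache = {}                 # host -> (parsed ruleset, expires_at)

    def _rules(self, host: str, scheme: str = "https") -> RobotFileParser:
        entry = self.cache.get(host)
        if entry and entry[1] > time.monotonic():
            return entry[0]             # ruleset already parsed and still fresh
        rp = RobotFileParser(f"{scheme}://{host}/robots.txt")
        rp.read()                       # the one network fetch per host per TTL
        self.cache[host] = (rp, time.monotonic() + self.TTL)
        return rp

    def allowed(self, url: str) -> bool:
        return self._rules(urlsplit(url).hostname).can_fetch(self.user_agent, url)

    def crawl_delay(self, host: str) -> float:
        return self._rules(host).crawl_delay(self.user_agent) or 1.0
```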
TCP + TLS handshake costs ~150 ms. If we open a fresh connection for every page on the same host, that's 30% of fetch time wasted on handshake. Keep-alive holds the TCP connection open for ~60 seconds and reuses it for the next page — typically the same host, since per-host queues are drained DFS-style.
Some robots.txt files include Crawl-delay: 5 — wait 5 seconds between requests. The fetcher tracks last_request_time per host and sleeps if needed. Without this we'd fire 8 requests/sec at a host that asked for 0.2 — getting rate-limited or banned.
Each fetcher worker runs a fixed-size thread pool (say 200 threads) with non-blocking I/O underneath. The thread pool is logically partitioned: each thread is sticky to a small group of hosts so per-host state (last-fetch-time, robots cache) doesn't need cross-thread locking. New URLs land on the thread that owns their host.
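A minimal sketch of that stickiness, assuming the requests library: hashing the hostname picks the owning thread, and each host keeps one keep-alive Session and one last-fetch timestamp, all thread-local. The HostFetcher class and the fixed crawl delay are illustrative.

```python
# Host-sticky fetching: hash(host) picks the owning thread; each host reuses one
# keep-alive Session, and last-fetch times are tracked without cross-thread locks.
import hashlib, time
from urllib.parse import urlsplit
import requests

N_THREADS = 200                         # size of the fetcher's thread pool

def owning_thread(host: str) -> int:
    """Every URL of a host lands on the same thread, so per-host state stays thread-local."""
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % N_THREADS

class HostFetcher:
    """Per-thread state for the hosts that thread owns."""
    def __init__(self, crawl_delay: float = 1.0):
        self.sessions = {}              # host -> requests.Session (keep-alive reuse)
        self.last_fetch = {}            # host -> time of that host's last request
        self.crawl_delay = crawl_delay  # assumption: fixed; normally taken from robots.txt

    def fetch(self, url: str) -> requests.Response:
        host = urlsplit(url).hostname
        wait = self.last_fetch.get(host, 0) + self.crawl_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)            # enforce the delay between hits to the same host
        session = self.sessions.setdefault(host, requests.Session())
        resp = session.get(url, timeout=10)   # reuses the open TCP/TLS connection if possible
        self.last_fetch[host] = time.monotonic()
        return resp
```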
Many distinct URLs serve identical content — CDN mirrors, syndicated news, blogs that copy each other's articles. Without deduplication we'd store the same content many times and the search index would surface near-duplicate results.
For each downloaded page: (1) normalize — strip whitespace, scripts, tracking pixels, and dynamic ad content; (2) compute a content checksum (MD5 or SHA-1 — both are fast and 128/160 bits is plenty for collision avoidance); (3) check against the global "seen content" set; (4) if seen, skip storage and just record the URL → existing-blob mapping in metadata.
15B unique pages × 16 bytes (MD5) = 240 GB of checksums. Shard by checksum across nodes. Each node holds a fast in-memory hash set for hot checksums (recently-seen) plus a disk-backed RocksDB for cold ones. LRU pushes cold entries to disk.
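A minimal sketch of steps (1) through (3), assuming a crude regex normalizer and plain Python sets standing in for the sharded hot/cold store; normalize and is_duplicate are illustrative names.

```python
# Content dedupe: normalize -> checksum -> shard -> exact-set membership test.
import hashlib, re

N_SHARDS = 64
seen = [set() for _ in range(N_SHARDS)]     # stand-in for the sharded in-memory + RocksDB store

def normalize(html: str) -> bytes:
    """Crude normalization: drop scripts/styles and collapse whitespace so
    trivially different copies of the same article hash to the same value."""
    body = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
    body = re.sub(r"\s+", " ", body).strip().lower()
    return body.encode("utf-8")

def is_duplicate(html: str) -> bool:
    digest = hashlib.md5(normalize(html)).digest()   # 16 bytes per document
    shard = digest[0] % N_SHARDS                     # which node owns this checksum
    if digest in seen[shard]:
        return True
    seen[shard].add(digest)
    return False
```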
Different from doc dedupe — this is "have we already enqueued this URL?", checked at link-extraction time, before the fetcher even sees it.
(1) Normalize the URL: lowercase host, strip #fragment, remove ?utm_* tracking params, sort query string keys, resolve relative paths. (2) Compute a 4-byte CRC32 or 8-byte xxHash. (3) Check against the URL-seen set, sharded by checksum.
15B URLs × 4 bytes (CRC32) = 60 GB across the cluster. Same hot/cold split as doc dedupe. Same Bloom-filter caveat: false positives drop URLs we've never crawled, so we use an exact set.
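A minimal sketch of that pipeline, assuming zlib.crc32 for the 4-byte checksum and a small, illustrative blocklist of tracking parameters; canonicalize and url_seen_before are not names from the architecture above, and relative-path resolution (which needs a base URL) is left out.

```python
# URL dedupe: canonicalize, checksum with CRC32 (4 bytes), test against an exact set.
import zlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

seen_urls = set()

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    query = sorted(                                   # sort query keys, drop tracking params
        (k, v) for k, v in parse_qsl(parts.query)
        if not (k.lower().startswith("utm_") or k.lower() in {"sid", "sessionid"})
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),                         # lowercase the hostname
        parts.path or "/",
        urlencode(query),
        "",                                           # strip the #fragment
    ))

def url_seen_before(url: str) -> bool:
    key = zlib.crc32(canonicalize(url).encode("utf-8"))   # 4-byte checksum
    if key in seen_urls:
        return True
    seen_urls.add(key)
    return False

# Tracking params and fragments collapse to the same key:
assert canonicalize("https://News.com/Article?utm_source=foo#top") == \
       canonicalize("https://news.com/Article")
```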
With 1,000 fetcher workers running for 28 days, the expected number of worker failures is in the hundreds. We cannot restart a 4-week crawl from zero on every crash.
Every 5 minutes per worker. Each snapshot captures: (a) the URL frontier shard heads (which URLs are next to fetch), (b) deltas to the URL/doc dedupe sets since the previous snapshot, (c) per-host last-fetch-time and robots cache TTLs. Snapshots are written to S3 with a monotonically-increasing version number.
Orchestrator detects worker death in ~30 seconds. New worker boots, fetches the latest snapshot for its assigned shards from S3, restores in-memory state (~90 seconds for 60 GB of dedupe set), resumes from the snapshot's frontier head. Worst-case data loss: 5 minutes of progress. Re-fetched pages are deduped at storage time so we don't double-write.
A 60 GB URL set rewritten every 5 minutes would saturate our network — 200 MB/sec just for snapshots. Instead we use incremental snapshots: each snapshot only contains the deltas added since the previous one. Recovery replays the chain of deltas on top of the latest base snapshot. Bases get rewritten daily during low-traffic hours.
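A minimal sketch of the delta scheme, assuming local JSON files standing in for S3 objects and a set of integer checksums as the state; base snapshots, frontier heads, and per-host state are omitted, and CheckpointedSet is an illustrative name.

```python
# Incremental checkpointing: write only the delta since the last snapshot,
# recover by replaying deltas in version order.
import json, pathlib

class CheckpointedSet:
    def __init__(self, directory="checkpoints"):
        self.dir = pathlib.Path(directory)
        self.dir.mkdir(exist_ok=True)
        self.items = set()        # e.g. URL checksums
        self.pending = set()      # added since the last snapshot
        self.version = 0

    def add(self, item: int):
        if item not in self.items:
            self.items.add(item)
            self.pending.add(item)

    def snapshot(self):
        """Every N minutes: persist only the delta, not the whole multi-GB set."""
        self.version += 1
        path = self.dir / f"delta-{self.version:08d}.json"
        path.write_text(json.dumps(sorted(self.pending)))
        self.pending.clear()

    @classmethod
    def recover(cls, directory="checkpoints"):
        """On worker restart: replay deltas in version order; lose at most one interval."""
        restored = cls(directory)
        for path in sorted(restored.dir.glob("delta-*.json")):
            restored.items.update(json.loads(path.read_text()))
            restored.version += 1
        return restored
```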
Workers will crash. Networks will partition. Disks will fail. The architecture must shrug those off automatically.
Hostnames are mapped to workers via consistent hashing on a ring of virtual nodes. When a worker dies, only its virtual nodes are reassigned to neighbors — about 1/N of the hostname space rebalances, not the whole cluster. The replacement worker takes over those neighbors' overflow once it boots.
When Worker 17 dies, its frontier shard sits in Kafka untouched — Kafka durably stores the queue. The replacement Worker 17 reads from the same Kafka partition. No URL is lost; pending fetches just pause for ~2 minutes during failover.
Blob writes use the URL as part of the object key. If a fetch is retried after a crash, the second write either overwrites the first (S3) or is dropped (HDFS create-if-not-exists). Either way we never corrupt the dataset with a half-written page.
Where does each piece of data physically live? The partitioning strategy directly determines politeness and recoverability.
All URLs for one hostname live on one worker. This is what enforces "one connection per host". Implementation: worker = consistent_hash(host) mod N, with virtual nodes for smooth rebalancing.
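A minimal sketch of that mapping. The "consistent_hash(host) mod N" shorthand is realized here the usual way, as a ring lookup over virtual nodes, which is what limits rebalancing to roughly 1/N of hosts when a worker joins or leaves; the HashRing class and 100 virtual nodes per worker are illustrative.

```python
# Consistent hashing with virtual nodes: host -> owning worker.
import bisect, hashlib

class HashRing:
    def __init__(self, workers, vnodes=100):
        self.ring = []                       # sorted list of (ring position, worker)
        for worker in workers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{worker}#{i}"), worker))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, host: str) -> str:
        """The first virtual node clockwise from the host's hash owns that host's queue."""
        idx = bisect.bisect(self.points, self._hash(host)) % len(self.points)
        return self.ring[idx][1]

ring = HashRing([f"worker-{i}" for i in range(50)])
print(ring.owner("news.ycombinator.com"))   # all of this host's URLs go to one worker
```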
URL checksums and document checksums are sharded across nodes by shard = checksum mod N. Distribution is uniform by construction. Lookups during crawl are RPC to the right shard; sub-millisecond.
S3 / HDFS handles partitioning internally. We just pick a key naming convention: crawl-YYYY-WW/seg-NNNNN.dat with offsets into each segment. The blob store's own consistent hashing distributes segments across nodes.
Metadata for all nytimes.com pages lives on the same Cassandra shard as the frontier owner of that host. Colocates write traffic and lets per-host queries (re-crawl all of news.com) run on one node without scatter/gather.
Some sites generate infinite URLs. A crawler that walks into one and doesn't recognize it spends the rest of its life there.
Common trap shapes:

- Calendar traps: /cal?d=2024-01-01 links to /cal?d=2024-01-02, ad infinitum.
- Unbounded pagination: /search?q=foo&page=N with N unbounded.
- Session-ID URLs: every link carries a fresh ?sid=abc123 so URL dedupe never triggers (mitigated by stripping sid, session, utm_* by rule).

How robots.txt stays cheap: on first contact with a host, the fetcher requests /robots.txt, parses it once, and stores the ruleset (allowed paths, disallowed paths, crawl-delay) in an in-memory map keyed by hostname. Cache TTL is 24 hours or whatever the response's Cache-Control says. Subsequent fetches do an O(1) ruleset lookup. Net cost: 1 extra fetch per host per day, regardless of how many pages we pull from that host.

How politeness is guaranteed: every URL for nytimes.com goes to one worker, which has one thread (or a small pool) draining that host's queue serially with crawl-delay enforced between fetches. There is no path in the architecture by which two of our threads ever talk to one host simultaneously. Layer two is honoring Crawl-delay from robots.txt; layer three is exponential backoff on 429/503 responses.

How traps are caught: (1) URL pattern blacklists (sid, sessionid, utm_*). (2) Per-host caps — max depth (~30) and max pages (~1M) per hostname. (3) Content-similarity detection — SimHash of the last N pages from a host; if they're all near-duplicates, blacklist the path prefix and stop following links into it. The first two are cheap and catch ~80%; the third catches sophisticated traps like faceted search.
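A minimal sketch of the first two layers (pattern blacklist and per-host caps); the SimHash layer is omitted, and the thresholds and BANNED_PARAMS list are illustrative.

```python
# URL filter, trap edition: cheap pattern and per-host checks before enqueueing.
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

MAX_DEPTH = 30                  # path segments per URL
MAX_PAGES_PER_HOST = 1_000_000
BANNED_PARAMS = {"sid", "sessionid", "phpsessid"}   # illustrative blacklist

pages_enqueued = Counter()      # host -> URLs already accepted for that host

def eligible(url: str) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # Layer 1: pattern blacklist (session ids and friends)
    if any(k.lower() in BANNED_PARAMS for k, _ in parse_qsl(parts.query)):
        return False
    # Layer 2: per-host caps on depth and total pages
    depth = len([seg for seg in parts.path.split("/") if seg])
    if depth > MAX_DEPTH:
        return False
    host = parts.hostname or ""
    if pages_enqueued[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_enqueued[host] += 1
    return True
```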