From "as you type, the right words appear" to a sharded in-memory trie that answers 60K queries/sec under 100ms — the architecture, the data structure, and the offline pipeline that keeps it fresh
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Sarah opens Google. She taps the "s" key. By the time her finger lifts, the search box already shows ten suggestions — "spotify", "starbucks near me", "stock market". She types "y", then "s", then "t" — each keypress refreshes the list in under 100 milliseconds. By the third character ("sys"), the suggestion she actually wanted ("system design interview") appears at the top. She taps it. The whole interaction took 1.2 seconds and she never finished typing. That tiny dropdown is what we're designing — and at Google scale, it serves 60,000 partial queries every second, drawing on a billion past searches.
Typeahead (also called search autocomplete or incremental search) is a deceptively simple feature: given a prefix the user has typed, return the top-K most relevant complete queries that start with that prefix. The simplicity is on the user's side — the system has to do it across a billion possible terms, ranked by popularity, in less time than the gap between two of Sarah's keypresses.
Pin down the contract before drawing any boxes. Typeahead's user-facing behavior is simple, but the latency budget is brutal — anything slower than a keypress feels broken.
Numbers drive every architectural choice — sharding, in-memory storage, cache sizing. Do them out loud, even rough. Assume Google-scale traffic.
Google sees roughly 5 billion searches per day. That averages to ~60,000 queries per second. But typeahead is hit on every keystroke, not every search — if the average user types 6 characters before tapping a suggestion, that's potentially 6× the QPS. With client-side debouncing (discussed in §14), assume one suggestion request per 2-3 effective keypresses, so plan for ~200K typeahead requests/sec at peak.
Of those 5B daily queries, about 20% are unique — that's 1B unique queries/day. We don't index every one of them; the long tail is endless and most of it is junk. Index the head of the frequency distribution that covers roughly the top 50% of search volume over a rolling window — thanks to the heavy skew, that's only ~100M unique terms worth indexing. Average query length is around 15 characters; at 2 bytes per character (covering Unicode) that's 30 bytes per term.
- ~5 billion searches/day — Google-scale baseline
- ~200K req/sec — per-keystroke typeahead QPS after debouncing
- ~100M unique terms — the head worth indexing (~top 50% of search volume)
- ~30 B per term — 15 chars × 2 bytes
Base index: 100M × 30 bytes = 3 GB. With overhead for the trie structure (parent/child pointers, top-K cache per node, frequency counters), a realistic factor is roughly 5-10×, so ~25 GB total. Add headroom for a few years of query-volume growth and you land around ~50 GB. That fits comfortably in a single server's RAM.
| Metric | Value | Why it matters |
|---|---|---|
| Typeahead QPS | 200K/s | Drives shard count + replica count |
| Index in RAM | ~25-50 GB | Fits on one server — disk never touched on read |
| Latency budget | <100 ms | Per keypress; UI feels broken above this |
| Top-K returned | 10 | Bounded — keeps payload tiny |
| Index freshness | ~hourly | Acceptable — search trends move slowly relative to keystrokes |
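To keep those numbers honest, here is the same back-of-the-envelope math as a runnable sketch. The 8× structural overhead and 2× headroom multipliers are assumptions, not measurements:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long dailySearches = 5_000_000_000L;             // ~5B searches/day
        long searchQps     = dailySearches / 86_400;     // ~58K searches/sec on average
        long typeaheadQps  = searchQps * 6 / 2;          // ~6 keypresses, ~1 request per 2-3 after debouncing
                                                         // → ~174K/s before the peak factor; plan for ~200K
        long uniqueTerms   = 100_000_000L;               // terms worth indexing
        long bytesPerTerm  = 30;                         // 15 chars × 2 bytes
        long baseIndex     = uniqueTerms * bytesPerTerm; // ~3 GB of raw strings
        long trieSize      = baseIndex * 8;              // assumed 5-10× trie overhead → ~22 GB (≈ the ~25 GB above)
        long withHeadroom  = trieSize * 2;               // assumed growth headroom → ~44 GB (≈ the ~50 GB above)
        System.out.printf("search QPS ~%d, typeahead QPS ~%d%n", searchQps, typeaheadQps);
        System.out.printf("trie ~%d GB, with headroom ~%d GB%n", trieSize >> 30, withHeadroom >> 30);
    }
}
```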
One endpoint does 99% of the work. Define it now so the architecture knows its contract.
REST API surface

```
// The only hot endpoint — fired on every (debounced) keystroke
GET /api/v1/suggestions?prefix=syst&num=10
Headers:
  X-API-Dev-Key: abc123...
  X-User-Id: sarah_42        // optional — enables personalization
  X-Geo: IN-DL               // optional — enables regional ranking

→ 200 OK
{
  "prefix": "syst",
  "suggestions": [
    "system design interview",
    "system design primer",
    "system of a down",
    "system32 folder",
    "systemctl restart",
    ...
  ]
}

// Ingestion (internal) — search clicks/completions feed back
POST /api/v1/log
{ "query": "system design interview", "user_id": "sarah_42", "ts": 1730884320, "geo": "IN-DL" }
→ 202 Accepted
```
Keep num bounded (max 10-20) so the payload stays under a single TCP packet (~1500 bytes).
The data-structure choice is the single most important decision in this entire design. Get this right and the rest falls out naturally; get it wrong and no amount of clever caching will save the latency budget.
Our access pattern is exactly: "given a prefix string, find all stored strings that start with that prefix, ranked by frequency, return top K". Three observations:
Every lookup is a prefix, never a full string. Hash maps are useless — they need exact keys. Sorted lists with binary search work but each lookup is O(log N × prefix_length) and you need a separate sort by frequency.
"system", "system design", "system design interview" share the prefix "system". A naive list of strings stores the "system" prefix three times. A tree that shares prefixes stores it once.
Walking every term under a prefix to sort by frequency on each request is too slow. The data structure needs to cache top-K at every node, so a query is one walk to the prefix node + read its top-K array.
A trie is a tree where each node represents a single character, the path from root to any node spells out a prefix, and a flag on the node marks "this prefix is a complete searchable term". Picture a phonebook where you've labeled every shelf with a starting letter, every sub-shelf with the second letter, and so on — to find names starting with "Bro" you walk three shelves deep, then read everything below.
To answer "give me top 10 terms starting with syst": walk root → s → y → s → t (4 character-hops), then read this node's precomputed top-K array. Total cost: 4 pointer dereferences plus one array read. That's microseconds, not milliseconds — easily fitting our 100ms budget with room to spare for network and serialization.
Build the architecture in three passes: simplest plausible thing, why it breaks, and the production shape where every box justifies itself with a failure it removes. This is the section that wins or loses the interview.
Sketch the simplest possible system. One app server, one SQL database with a searches(term, freq) table. To answer "suggestions for prefix syst", run a SQL LIKE query.
```sql
SELECT term FROM searches WHERE term LIKE 'syst%' ORDER BY freq DESC LIMIT 10;
```
Three concrete failures emerge the moment Sarah and her billion friends start typing:
Even with a B-tree index, a prefix-LIKE on a billion-row table reads thousands of matching rows just to sort them by frequency and pick 10. At 100M unique terms with hot prefixes like "th" or "how", a single query touches millions of rows. P99 latency: 500ms+ — five times our entire budget.
A single MySQL instance handles 5K-10K simple selects/sec. We need 200,000 — twenty times that ceiling. Replicas help, but every replica still pays the LIKE-scan cost per query. Throwing money at vertical scale tops out around 50K QPS for this access pattern.
Every search increments a row's frequency. At 60K searches/sec, that's 60K UPDATE statements/sec, all hot rows (popular queries get hammered). Lock contention pegs the DB at 100% CPU on locks alone — meanwhile reads slow to a crawl.
Two insights flip the design. Once you see them, the production architecture practically draws itself.
From §3, the entire trie is around 25-50 GB. A single modern server has 256 GB RAM. That means the read path never touches disk — every prefix lookup is a handful of pointer dereferences and an array copy. Latency drops from "hundreds of milliseconds" to "tens of microseconds" before even leaving the box.
Consequence: we don't need a database on the read path. The trie itself, sitting in process memory, IS the database.
Sarah typing "system design" right now does not need to influence the suggestions she sees on her next keystroke. Search popularity changes over hours and days, not seconds. We can collect every search into a log, run a batch job once an hour to compute fresh frequencies, and rebuild (or incrementally update) the trie offline.
Consequence: the read path and write path are completely decoupled. Reads serve from an immutable in-memory snapshot; writes append to a log and feed a background pipeline. No locks, no contention.
This split — fast read path on an in-memory trie, slow update path through an offline batch pipeline — is the central architectural idea. Everything else is plumbing to make it scale and survive failures.
Now the full picture: two clearly separated planes — the Read Plane (request path) and the Update Plane (offline pipeline). Every node in the diagram is numbered; find its matching card below. Each card answers what this is, why it is here, and what would break without it.
Sarah's browser, mobile app, or search bar widget. As she types, the client fires HTTP GET requests to /api/v1/suggestions?prefix=.... Critically, the client is not naive — it debounces keystrokes (waits ~50ms after the last keypress before firing), cancels in-flight requests when a new keypress lands, and pre-establishes the TCP/TLS connection on app open so the first request doesn't pay handshake latency.
Solves: reducing wasted backend traffic by ~70%. Without debouncing, typing "system design" would fire 13 separate suggestion requests; with it, maybe 3-5. The savings cascade — fewer requests means smaller fleets, smaller bills, and lower tail latency for everyone.
The traffic cop in front of the API server fleet. Distributes 200K req/sec across hundreds of stateless suggestion API pods, evicts unhealthy ones via 5-second health checks, terminates TLS so backends don't pay the crypto cost. Round-robin works fine because the API tier is stateless.
Solves: no single API server can handle the QPS or survive a deploy/crash. Without an LB, scaling would mean DNS round-robin (slow to react to failures) or putting one giant box in the path (single point of failure). LB gives both horizontal scaling and self-healing.
Stateless service that receives the user's prefix request and orchestrates the answer. The flow per request: parse prefix → check the cache for that exact prefix → on miss, ask the aggregator to query the right trie shards → return JSON. Also fires off an async event to the search logger when the user actually clicks a suggestion (that completion is the signal we care about).
Solves: isolating client-facing concerns (TLS, JSON, auth, rate limiting, request validation) from the actual lookup. Without this layer, every trie server would have to speak HTTP and handle malformed requests — coupling that hurts both. The API tier is also where personalization and geo-ranking get layered on top of the global trie result.
The heart of the system. Each trie server holds one shard of the trie in memory and answers prefix queries for that shard. Stateless from the client's perspective but stateful internally — the trie itself lives in process RAM, loaded from a snapshot at startup. A query is two steps: walk the trie to the prefix node (one pointer dereference per character), copy the node's pre-cached top-K array, return it. End-to-end on the trie server: under 1ms p99.
Solves: the latency budget. By holding the index in RAM and pre-caching top-K at every node, we replace "scan a billion rows and sort" with "follow 4 pointers and copy 10 strings". This is the architectural choice that makes the entire system possible.
When the trie is sharded by hash (say 8 shards), a prefix like "sy" might exist on multiple shards because different terms starting with "sy" live in different shards. The aggregator fans the prefix query out to all relevant trie shards in parallel, collects each shard's top-K, merges them by frequency, and returns the global top-K to the API server. With 8 shards × 10 results = 80 candidates merged into 10 — a tiny in-memory sort.
Solves: the cross-shard merge problem. Without an aggregator, the API server would either need to know the sharding scheme (tight coupling) or each trie shard would only see its slice of the data (incorrect results, missing popular terms from other shards). The aggregator hides sharding from upstream and guarantees a globally-correct top-K.
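A minimal sketch of that merge (class and method names are illustrative, not from the original design): collect the N × K (term, frequency) candidates and keep the K most frequent with a small heap.

```java
import java.util.*;

record Suggestion(String term, long frequency) {}

class ShardMerger {
    // Merge per-shard top-K lists (e.g. 8 shards × 10 results) into a global top-K.
    List<String> mergeTopK(List<List<Suggestion>> perShardResults, int k) {
        PriorityQueue<Suggestion> best =
                new PriorityQueue<>(Comparator.comparingLong(Suggestion::frequency)); // min-heap of size ≤ k
        for (List<Suggestion> shard : perShardResults) {
            for (Suggestion s : shard) {
                best.offer(s);
                if (best.size() > k) best.poll();   // evict the weakest candidate so far
            }
        }
        List<String> topK = new ArrayList<>();
        while (!best.isEmpty()) topK.add(best.poll().term());
        Collections.reverse(topK);                  // most frequent first
        return topK;
    }
}
```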
Every completed search (user typed a prefix and clicked a suggestion or pressed Enter) is logged to a Kafka topic as a structured event: {query, user_id, ts, geo}. Kafka is the buffer — it absorbs the firehose of 60K events/sec with bounded memory, persists them durably for days, and lets downstream consumers read at their own pace. The API server's call to Kafka is fire-and-forget — a 1ms async append, never blocking the user response.
Solves: decoupling the request path from the update path. Without a durable log, frequency updates would have to be applied synchronously, killing read latency and risking lost updates if the writer falls behind. Kafka lets us treat the firehose like water in a pipe — consumers downstream can be slow without backing up the source.
A batch job (Spark, MapReduce, Flink — pick your pipeline framework) runs every hour over the last hour of Kafka logs. It groups events by query string, counts occurrences, and emits a stream of (query, new_freq) records. To weight recent searches more heavily than ancient ones, frequencies are blended via an Exponential Moving Average: new_freq = α × this_hour + (1-α) × old_freq with α around 0.1.
Solves: turning a firehose of raw events into a small set of frequency updates. Without this aggregation, the trie builder would have to process 60K events/sec individually — overwhelming. With it, the builder gets a handful of millions of updates per hour, easily applied in a few minutes.
Takes the new frequency updates from the pipeline and applies them to a working copy of the trie. Two strategies (covered in §9): rebuild from scratch each hour (simple, slow), or incremental update (apply deltas to an existing trie, fast but trickier on top-K cache invalidation). After the update, the builder re-computes the top-K cache at every affected node — walking up from each changed leaf to the root, propagating the new frequency.
Solves: producing a fresh, queryable trie without ever locking the live trie. The builder writes to a brand-new snapshot, hands it off via S3, and trie servers atomically swap to the new copy. Old snapshot stays in memory until the swap completes — zero downtime.
Object storage holding serialized trie snapshots. Each snapshot is a compact binary representation (DFS-encoded with character + child count + recursive children — see §8) that a trie server can load directly into memory. New snapshots are versioned: trie/2026-05-07T14:00.bin. Trie servers poll for new versions; on finding one, they load it in the background, build the in-memory structure, then atomically swap pointers from old trie to new. Old snapshot purged after a few hours.
Solves: two things at once — distributing a fresh index to every trie server in the fleet, and recovering after a server crash (a restarted trie server pulls the latest snapshot and is back online in minutes). Without S3 in the middle, we'd need a bespoke push system or have every trie server rebuild its slice from raw logs after a crash — slow and complex.
An in-memory key-value cache holding prefix → top-K JSON for the hottest prefixes. Hits skip the trie entirely. The cache is sized small (a few hundred MB) because the long tail of unique prefixes is enormous — but a small set of head prefixes ("how to", "what is", "weather") accounts for a huge fraction of all queries. ML models can predict which prefixes will be hot (trending news, time-of-day patterns) and pre-warm the cache.
Solves: CPU on trie servers and tail latency. A cache hit is sub-millisecond Redis lookup; a trie miss is the full traversal + aggregator merge (still under 50ms but 50× more expensive). At 80% hit rate on hot prefixes we shed a huge fraction of trie traffic without growing the trie fleet.
Two real flows mapped to the numbered components above:
Read path — Sarah types "syst": the client fires GET /api/v1/suggestions?prefix=syst → load balancer → API server, which first checks the prefix cache; on a miss the aggregator fans the prefix out to the trie shards in parallel, merges their top-K lists, and the API server returns the global top 10 as JSON.

Update path — hourly refresh: completed searches are logged to Kafka → the aggregation job groups the last hour of events by query, counts occurrences, blends with the previous hour's EMA frequencies, and outputs ~2M (query, freq) pairs → the trie builder applies them and uploads a fresh snapshot trie/2026-05-07T03:00.bin to S3 → trie servers load it in the background and atomically swap.

Already introduced in §5; now we go inside. The crucial production trick is pre-storing top-K at every node — without it, the trie alone is not fast enough.
Each trie node holds:
TrieNode pseudocode

```java
class TrieNode {
    char ch;
    Map<Character, TrieNode> children;  // up to 26-64 entries
    boolean isCompleteTerm;
    int frequency;        // only meaningful if isCompleteTerm
    List<String> topK;    // the 10 most popular terms
                          // in the subtree rooted here
}
```
The topK list at each node is the magic. Without it, a query like "give me top 10 for prefix 'a'" would have to walk the entire subtree under 'a' (potentially millions of terms), find frequencies, partial-sort them, return 10. Way too slow. With it, the same query is one pointer walk to the 'a' node and a constant-time read of its 10-element array.
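In code, the lookup is a few lines — a minimal sketch in the same pseudocode style as the TrieNode above (the method name is illustrative):

```java
// Walk to the prefix node (one hop per character), then return its pre-cached list.
List<String> suggest(TrieNode root, String prefix) {
    TrieNode node = root;
    for (char c : prefix.toCharArray()) {
        node = node.children.get(c);
        if (node == null) return Collections.emptyList();  // nothing starts with this prefix
    }
    return node.topK;   // precomputed offline — no subtree walk, no sort on the read path
}
```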
Pre-storing 10 strings per node multiplies memory roughly 10×. With 100M unique terms and an average term length of 15 chars, the naive trie was a few GB; the top-K-cached trie is ~25-50 GB. Worth it — RAM is cheap, latency is not. Modern servers ship with 256 GB+; one box can hold the whole index.
When a term's frequency changes, you walk from its leaf node up to the root. At each ancestor, you re-compute that node's top-K from its children's top-K arrays — a merge of (children_count × 10) candidates → keep top 10. That's a tiny amortized cost per update, and crucially it's done offline in the trie builder, never on the request path.
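A hedged sketch of that offline recomputation — the builder calls something like this for every ancestor on the changed leaf's path back to the root. The frequency-lookup map and parameter names are assumptions for illustration:

```java
// Rebuild one node's topK from its children's cached lists plus its own term.
// Children's subtrees are disjoint, so the candidate lists never overlap.
void recomputeTopK(TrieNode node, String nodePrefix, Map<String, Integer> freqOf, int k) {
    List<String> candidates = new ArrayList<>();
    if (node.isCompleteTerm) candidates.add(nodePrefix);   // the node's own term, if any
    for (TrieNode child : node.children.values()) {
        candidates.addAll(child.topK);                      // children_count × K candidates
    }
    candidates.sort(Comparator.comparingInt((String t) -> freqOf.getOrDefault(t, 0)).reversed());
    node.topK = new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
}
```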
The live trie lives in RAM. When a trie server crashes (or starts up fresh), how does it get its data back? By persisting snapshots to disk and pulling the latest one from object storage on boot.
Walk the trie depth-first. For each node, write: (character, num_children, [recursive serialization of each child]). The complete-term flag and frequency get inlined into the same record. This produces a compact binary stream where deserializing is a single forward pass — no random access needed.
```java
void serialize(TrieNode node, DataOutputStream out) throws IOException {
    out.writeChar(node.ch);
    out.writeBoolean(node.isCompleteTerm);
    if (node.isCompleteTerm) out.writeInt(node.frequency);
    out.writeByte(node.children.size());
    for (TrieNode child : node.children.values()) {
        serialize(child, out);
    }
}
```
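The matching single-pass deserializer is the mirror image — a sketch under the same format assumptions (note the code above doesn't write the topK lists, so they'd either be included in a fuller record format or recomputed after load):

```java
TrieNode deserialize(DataInputStream in) throws IOException {
    TrieNode node = new TrieNode();
    node.ch = in.readChar();
    node.isCompleteTerm = in.readBoolean();
    if (node.isCompleteTerm) node.frequency = in.readInt();
    int numChildren = in.readUnsignedByte();       // mirrors writeByte()
    node.children = new HashMap<>();
    for (int i = 0; i < numChildren; i++) {
        TrieNode child = deserialize(in);          // children were written depth-first, in order
        node.children.put(child.ch, child);
    }
    return node;
}
```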
The builder uploads the fresh snapshot (e.g. trie/2026-05-07T03:00.bin) to S3 after each rebuild.

Search trends move every hour. How do we get new frequencies into the live trie without ever taking the system down?
Keep two copies of the trie: a live copy serving traffic and a shadow copy being updated by the builder. When the shadow copy is fresh, atomically swap pointers — the next request hits the new trie, the old one is GC'd. Zero downtime, zero locks. Cost: 2× memory during the swap window.
When to use: when updates are batched (every hour). Simplest correct solution.
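A minimal sketch of the double-buffer swap on a trie server (class and method names are illustrative): readers always dereference the live pointer, and publishing a freshly built trie is a single atomic set.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class ServingTrie {
    private final AtomicReference<TrieNode> liveRoot = new AtomicReference<>();

    List<String> suggest(String prefix) {
        TrieNode node = liveRoot.get();          // this request sees one consistent snapshot
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of();
        }
        return node.topK;
    }

    void publish(TrieNode freshRoot) {
        liveRoot.set(freshRoot);                 // next request uses the new trie;
    }                                            // the old one becomes garbage-collectible
}
```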
Maintain a delta queue of frequency updates. During quiet windows, take a brief read-write lock on the live trie, apply the deltas in batches (each batch is a few thousand updates), release. Top-K caches at affected ancestors get updated in place.
When to use: when freshness matters more than simplicity (sub-hour updates) and you can tolerate occasional latency blips during apply windows.
If you simply added each hour's count to the all-time count, frequencies would never reflect that "deflation" stopped trending and "ChatGPT" took over. Use an EMA so newer searches weigh more heavily:
EMA blend

```
new_freq = α × this_hour_count + (1 - α) × old_freq

// α = 0.1 means: this hour weighs 10%, the past weighs 90%
// with α = 0.1 a query's frequency halves over ~7 hours of zero searches
// — fast enough to track trends, slow enough to ignore noise
```
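A three-line check of that half-life claim (the starting frequency of 1000 is just an example value):

```java
double alpha = 0.1, freq = 1000.0;
for (int hour = 1; hour <= 7; hour++) {
    freq = alpha * 0 + (1 - alpha) * freq;               // a quiet hour: zero searches observed
}
System.out.printf("after 7 quiet hours: %.0f%n", freq);  // prints 478 — roughly half of 1000
```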
Today the index fits on one box. But running the entire 200K QPS load on one box (no matter how big) is a single point of failure and a CPU bottleneck — we shard for throughput and fault tolerance, not just for storage.
"All terms starting with a-c go to shard 1, d-f to shard 2…" Easy to implement, but distribution is grossly uneven — "s" alone is dramatically more popular than "x" or "z" combined. The "s" shard melts while the "x-z" shard sits idle.
"Fill shard 1 until it's full, then start shard 2…" In practice, the early shards hold the most-popular terms (added first), so traffic pile-up on shard 1 is inevitable. Hot-shard problem doesn't go away.
Compute shard = consistent_hash(term) % N. Terms are uniformly distributed across N shards — no hotspot by design. Cost: a prefix query like "sy" must fan out to all shards (because terms starting with "sy" live wherever the hash sends them), then the aggregator merges results.
This is exactly why component ⑤ (aggregator) exists. The cost of fan-out is low because (a) shards run the query in parallel, so total latency = single-shard latency + aggregator merge cost (~5ms), and (b) each shard returns at most K results (10), so the aggregator's input is N × K (e.g., 80) candidates merged into K — trivial.
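For concreteness, a sketch of the routing decision. String.hashCode() stands in here for whatever stable hash the builder actually uses:

```java
// Build time: every complete term is assigned to exactly one shard.
int shardFor(String term, int numShards) {
    return Math.floorMod(term.hashCode(), numShards);   // floorMod avoids negative buckets
}
// Read time: a prefix like "sy" has no single home — its completions are scattered by the
// hash — so the aggregator queries all shards in parallel and merges the N × K candidates
// (see the aggregator card above).
```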
The trie is already fast (under 1ms per query inside a server), but a Redis cache in front of the trie shaves another 10× off latency on hot prefixes and sheds CPU off the trie fleet.
Per-prefix top-K: "how to" → ["how to tie a tie", "how to screenshot", ...]. The hot head of the prefix distribution dwarfs the long tail — caching the top 10K prefixes captures most queries with very little memory.
LRU. Trending prefixes naturally heat up and stay; yesterday's news cools and rolls out. Optionally use TTL too (60-120s) so cached top-K doesn't lag the trie's hourly refresh by more than a minute.
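The eviction policy itself is a few lines if you sketch it with stdlib pieces. This is a local illustration of LRU + TTL, not the Redis deployment — in Redis the equivalent is maxmemory-policy allkeys-lru plus per-key TTLs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixCache {
    private record Entry(String topKJson, long storedAtMillis) {}

    private final int maxEntries;
    private final long ttlMillis;
    private final LinkedHashMap<String, Entry> lru;

    PrefixCache(int maxEntries, long ttlMillis) {
        this.maxEntries = maxEntries;
        this.ttlMillis = ttlMillis;
        // accessOrder = true turns the map into an LRU: every get() moves the entry to the tail.
        this.lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
                return size() > PrefixCache.this.maxEntries;
            }
        };
    }

    synchronized String get(String prefix) {
        Entry e = lru.get(prefix);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.storedAtMillis() > ttlMillis) {
            lru.remove(prefix);          // expired — caller falls through to the trie
            return null;
        }
        return e.topKJson();
    }

    synchronized void put(String prefix, String topKJson) {
        lru.put(prefix, new Entry(topKJson, System.currentTimeMillis()));
    }
}
```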
Use a model to predict which prefixes will be hot in the next hour — based on news trends, time-of-day patterns, recent rising queries. Pre-populate the cache before traffic arrives. Catches cold-start spikes (a breaking news story) that otherwise pummel the trie.
For ultra-popular prefixes ("how", "what", "the") the response is identical for every user (modulo personalization). A CDN at the edge can cache these short-prefix responses with a 30-second TTL. A user in Tokyo gets the dropdown from a Tokyo edge node in 15ms instead of a 200ms round-trip to the US data center.
Each trie shard is replicated multiple ways for both load and fault tolerance.
Each shard runs on 3+ replicas across availability zones. Reads are load-balanced across replicas (any replica can answer any query for its shard — they all have identical in-memory copies). Writes (snapshot loads) happen independently on each replica from the same S3 snapshot, so replicas converge naturally without a primary.
The aggregator routes each shard query via consistent hashing — same term always goes to the same primary replica first, falling back to secondaries if the primary is unhealthy. Consistent hashing also helps cache locality on the trie server side (if you add a per-replica memoization layer).
Three load-balancing layers: (1) Public LB from clients to API servers (round-robin, least-conn). (2) Internal LB from API servers to aggregator/cache (any healthy node). (3) Sharded LB from aggregator to trie servers (consistent hash on term + shard). Each layer uses health checks to evict unhealthy nodes within seconds.
What happens when each component dies?
One replica out of 3+ for that shard goes down. LB evicts it within 5 seconds via health checks. Remaining replicas absorb its share of traffic — a 33-50% relative bump on each, depending on replica count — easily within capacity headroom. Replacement pod boots, pulls latest snapshot from S3, registers with LB. Total time-to-recover: a few minutes; user-visible impact: zero.
Rare (cross-AZ failure), but: aggregator detects the shard is unreachable, returns a partial top-K from the surviving shards. Quality degrades for prefixes that span the dead shard, but the system stays up. Recovery: spin up new replicas, snapshot loads in minutes.
Kafka is highly replicated and rarely fully fails. If it does: the API server's "log this click" call fails async — we lose some events. Acceptable: missing an hour of click events means the next trie rebuild is slightly stale, not broken. EMA smoothing means even a multi-hour outage barely moves frequencies.
No new snapshot gets produced this hour. Trie servers continue serving the previous snapshot — slightly stale, but fully functional. Operator gets paged, fixes the build job, next hour's run catches up. Critical insight: the read path doesn't depend on the update path being healthy right now.
Easy to forget that the most impactful latency fixes happen on the client. The server can be perfect and a naive client still feels broken.
Wait 50ms after the last keypress before firing the suggestion request. If a new keypress lands within 50ms, reset the timer. Result: typing "system" at normal speed fires 1-2 requests, not 6. Backend traffic drops 70%; user-perceived latency is unchanged.
When a new keypress fires a new request, cancel the previous one. Avoids the case where a slow request for "sy" returns after the user has typed "system" — showing stale suggestions. Use AbortController in the browser.
On app open / page load, pre-fetch the top 100 most-popular short prefixes (a few KB total). The first keystroke shows suggestions instantly from local storage — zero network round-trip for the most common case.
The first request to a new server pays a TCP 3-way handshake + TLS handshake (~100ms RTT × 3-4 = 300-400ms cold). Open a connection on app boot (an unused HTTP/2 stream) so the first keystroke uses a warm connection. With HTTP/3 / QUIC, this is even better — 0-RTT for resumed sessions.
The global trie tells us "syst" → ["system design interview", "system32 folder", ...]. But Sarah is a software engineer who searched "system design interview" 12 times last month. She should see her past searches first.
Each user has a personal frequency map of their past queries — stored compactly in a user-profile cache (Redis hash, key by user_id). When the API server returns suggestions, it does a final re-rank step:
Fetch the user's personal frequency map from Redis, blend each suggestion's global popularity with the user's personal count (e.g. score = 0.7 × global + 0.3 × personal), and re-sort the 10 candidates before returning. Personal data is added to the user profile asynchronously after each completed search. Storage is tiny (a few KB per active user) and the lookup is a single Redis call.
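A sketch of that re-rank. The 0.7/0.3 weights come from above; the way the global rank and personal count are normalized into scores is an assumption for illustration:

```java
import java.util.*;

// Blend the trie's global ordering with the user's personal history and re-sort.
List<String> rerank(List<String> globalTopK, Map<String, Integer> personalCounts) {
    Map<String, Double> score = new HashMap<>();
    for (int i = 0; i < globalTopK.size(); i++) {
        String term = globalTopK.get(i);
        double global   = 1.0 - (double) i / globalTopK.size();                      // rank 0 → 1.0, decaying
        double personal = Math.min(1.0, personalCounts.getOrDefault(term, 0) / 10.0); // cap at 10 past searches
        score.put(term, 0.7 * global + 0.3 * personal);
    }
    List<String> reranked = new ArrayList<>(globalTopK);
    reranked.sort(Comparator.comparingDouble((String t) -> score.get(t)).reversed());
    return reranked;
}
```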