High-Level Design

Typeahead Suggestion

From "as you type, the right words appear" to a sharded in-memory trie that answers 60K queries/sec under 100ms — the architecture, the data structure, and the offline pipeline that keeps it fresh

Read this with the framework in mind

This deep-dive applies the HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is Typeahead?

Sarah opens Google. She taps the "s" key. By the time her finger lifts, the search box already shows ten suggestions — "spotify", "starbucks near me", "stock market". She types "y", then "s", then "t" — each keypress refreshes the list in under 100 milliseconds. By the third character ("sys"), the suggestion she actually wanted ("system design interview") appears at the top. She taps it. The whole interaction took 1.2 seconds and she never finished typing. That tiny dropdown is what we're designing — and at Google scale, it serves hundreds of thousands of partial queries every second, drawn from a corpus of billions of past searches.

Typeahead (also called search autocomplete or incremental search) is a deceptively simple feature: given a prefix the user has typed, return the top-K most relevant complete queries that start with that prefix. The simplicity is on the user's side — the system has to do it across a billion possible terms, ranked by popularity, in less time than the gap between two of Sarah's keypresses.

The two questions that drive every design decision below: (1) What data structure can return the top-K terms matching a prefix in microseconds, across 100M+ unique terms? (2) How do we update that data structure as billions of new searches happen daily, without ever taking the suggestion service offline?
Step 2

Requirements & Goals

Pin down the contract before drawing any boxes. Typeahead's user-facing behavior is simple, but the latency budget is brutal — anything slower than a keypress feels broken.

✅ Functional Requirements

  • As the user types a prefix, return the top 10 terms that start with that prefix
  • Suggestions ranked by frequency (how often that exact term has been searched)
  • Optional ranking signals — recency, user location, personalization
  • System learns continuously — newly-popular queries (a viral news event) should appear in suggestions within a reasonable window (an hour or so)

⚙️ Non-Functional Requirements

  • Real-time — suggestions must arrive in under 200ms end-to-end (which leaves roughly a 100ms server-side budget for the lookup itself)
  • Highly available — typeahead going down doesn't stop search, but a dead dropdown is a visibly broken UX for a billion users
  • Continuously fresh — search trends change daily; the index must reflect new popularity within hours
Out of scope: spell correction ("teh" → "the"), fuzzy matching, query rewriting. These are entire systems on their own; typeahead is just prefix matching against a known vocabulary.
Step 3

Capacity Estimation & Constraints

Numbers drive every architectural choice — sharding, in-memory storage, cache sizing. Do them out loud, even rough. Assume Google-scale traffic.

Traffic estimates

Google sees roughly 5 billion searches per day. That averages to ~60,000 queries per second. But typeahead is hit on every keystroke, not every search — if the average user types 6 characters before tapping a suggestion, that's potentially 6× the QPS. With client-side debouncing (discussed in §14), assume one suggestion request per 2-3 effective keypresses, so plan for ~200K typeahead requests/sec at peak.

Index size

Of those 5B daily queries, about 20% are unique — that's 1B unique queries/day. We don't index every one of them; the long tail is endless and most of it is junk. Index only the frequently-searched head of that distribution over a rolling window — call it ~100M unique terms worth indexing. Average query length is around 15 characters; at 2 bytes per character (covering Unicode) that's 30 bytes per term.

  • Searches/day: ~5 billion (Google-scale baseline)
  • Typeahead QPS: ~200K req/sec (per-keystroke, after debouncing)
  • Unique terms: ~100M (frequently-searched head worth indexing)
  • Bytes/term: ~30 B (15 chars × 2 bytes)

Total index size

Base index: 100M × 30 bytes = 3 GB. With overhead for the trie structure (parent/child pointers, top-K cache per node, frequency counters), a realistic factor is roughly 5-10×, so ~25 GB total. Add headroom for query-volume growth over a few years — call it ~50 GB. That fits comfortably in a single server's RAM.

The "so what" of these numbers: the entire index fits in memory on one box. That single fact unlocks the architecture — we never have to touch disk on the request path, latency budget is microseconds for the lookup itself. The rest of the design (sharding, replication) is purely about throughput and fault tolerance, not about storage.
Metric | Value | Why it matters
Typeahead QPS | 200K/s | Drives shard count + replica count
Index in RAM | ~25-50 GB | Fits on one server — disk never touched on read
Latency budget | <100 ms | Per keypress; UI feels broken above this
Top-K returned | 10 | Bounded — keeps payload tiny
Index freshness | ~hourly | Acceptable — search trends move slowly relative to keystrokes
Step 4

System APIs

One endpoint does 99% of the work. Define it now so the architecture knows its contract.

REST API surface
// The only hot endpoint — fired on every (debounced) keystroke
GET /api/v1/suggestions?prefix=syst&num=10

Headers:
  X-API-Dev-Key:  abc123...
  X-User-Id:      sarah_42        // optional — enables personalization
  X-Geo:          IN-DL           // optional — enables regional ranking

→ 200 OK
{
  "prefix": "syst",
  "suggestions": [
    "system design interview",
    "system design primer",
    "system of a down",
    "system32 folder",
    "systemctl restart",
    ...
  ]
}

// Ingestion (internal) — search clicks/completions feed back
POST /api/v1/log
{
  "query":   "system design interview",
  "user_id": "sarah_42",
  "ts":      1730884320,
  "geo":     "IN-DL"
}
→ 202 Accepted
API design notes. The response is a flat array of strings — no metadata per suggestion in the hot path, because every byte costs latency at this volume. We do not stream suggestions; the browser fires a fresh request per debounced keystroke and replaces the dropdown wholesale. Keep num bounded (max 10-20) so payload stays under a single TCP packet (~1500 bytes).

Step 5

Database / Data Structure — Why a Trie

The data-structure choice is the single most important decision in this entire design. Get this right and the rest falls out naturally; get it wrong and no amount of clever caching will save the latency budget.

Our access pattern is exactly: "given a prefix string, find all stored strings that start with that prefix, ranked by frequency, return top K". Three observations:

📐 Prefix-shaped queries

Every lookup is a prefix, never a full string. Hash maps are useless — they need exact keys. Sorted lists with binary search work but each lookup is O(log N × prefix_length) and you need a separate sort by frequency.

🔁 Massive prefix overlap

"system", "system design", "system design interview" share the prefix "system". A naive list-of-strings store the "system" prefix three times. A tree that shares prefixes stores it once.

⚡ Top-K must be precomputable

Walking every term under a prefix to sort by frequency on each request is too slow. The data structure needs to cache top-K at every node, so a query is one walk to the prefix node + read its top-K array.

Enter the trie (prefix tree)

A trie is a tree where each node represents a single character, the path from root to any node spells out a prefix, and a flag on the node marks "this prefix is a complete searchable term". Picture a phonebook where you've labeled every shelf with a starting letter, every sub-shelf with the second letter, and so on — to find names starting with "Bro" you walk three shelves deep, then read everything below.

flowchart TB R(("root")) S(("s")) Y(("y")) ST(("t")) STSUF(["[term: 'syst' freq=120]"]) E(("e")) EM(("m")) EMTERM(["[term: 'system' freq=8400]"]) SP((" ")) D(("d")) DE(("e")) DES(("s")) DESI(("i")) DESIG(("g")) DESIGN(["[term: 'system design' freq=2200]"]) R --> S --> Y --> ST ST --> STSUF ST --> E --> EM --> EMTERM EM --> SP --> D --> DE --> DES --> DESI --> DESIG --> DESIGN style R fill:#171d27,stroke:#7b8599,color:#d4dae5 style EMTERM fill:#171d27,stroke:#38b265,color:#38b265 style STSUF fill:#171d27,stroke:#38b265,color:#38b265 style DESIGN fill:#171d27,stroke:#38b265,color:#38b265

To answer "give me top 10 terms starting with syst": walk root → s → y → s → t (4 character-hops), then read this node's precomputed top-K array. Total cost: 4 pointer dereferences plus one array read. That's microseconds, not milliseconds — easily fitting our 100ms budget with room to spare for network and serialization.

Why the index lives in memory and not in a database: the entire trie is ~25-50 GB. A request crosses dozens of nodes per lookup. If those nodes lived on disk (or in Postgres rows), each hop would be a disk seek — 10ms × 4 hops = blown latency budget on the first character. RAM walks at nanosecond speed; we keep the live trie in memory and persist snapshots to object storage only for crash recovery.
Step 6 · CORE

High-Level Architecture — From Naive to Production

Build the architecture in three passes: simplest plausible thing, why it breaks, and the production shape where every box justifies itself with a failure it removes. This is the section that wins or loses the interview.

Pass 1 — The naive design (and why it breaks)

Sketch the simplest possible system. One app server, one SQL database with a searches(term, freq) table. To answer "suggestions for prefix syst", run a SQL LIKE query.

flowchart LR C["Client"] --> APP["App Server"] APP --> DB[("MySQL — searches table")]
The naive query
SELECT term FROM searches
WHERE term LIKE 'syst%'
ORDER BY freq DESC
LIMIT 10;

Three concrete failures emerge the moment Sarah and her billion friends start typing:

💥 LIKE 'prefix%' is slow at scale

Even with a B-tree index, a prefix-LIKE on a billion-row table reads thousands of matching rows just to sort them by frequency and pick 10. At 100M unique terms with hot prefixes like "th" or "how", a single query touches millions of rows. P99 latency: 500ms+ — five times our entire budget.

💥 200K QPS kills one DB

A single MySQL instance handles 5K-10K simple selects/sec. We need 200,000 — twenty times that ceiling. Replicas help, but every replica still pays the LIKE-scan cost per query. Throwing money at vertical scale tops out around 50K QPS for this access pattern.

💥 Updating frequencies is a write storm

Every search increments a row's frequency. At 60K searches/sec, that's 60K UPDATE statements/sec, many of them landing on the same hot rows (popular queries get hammered). Lock contention pegs the DB at 100% CPU on locks alone — meanwhile reads slow to a crawl.

Pass 2 — The mental model: in-memory trie + offline updates

Two insights flip the design. Once you see them, the production architecture practically draws itself.

💡 Insight 1 — The index is small enough to fit in RAM

From §3, the entire trie is around 25-50 GB. A single modern server has 256 GB RAM. That means the read path never touches disk — every prefix lookup is a handful of pointer dereferences and an array copy. Latency drops from "hundreds of milliseconds" to "tens of microseconds" before even leaving the box.

Consequence: we don't need a database on the read path. The trie itself, sitting in process memory, IS the database.

💡 Insight 2 — Frequencies change slowly

Sarah typing "system design" right now does not need to influence the suggestions she sees on her next keystroke. Search popularity changes over hours and days, not seconds. We can collect every search into a log, run a batch job once an hour to compute fresh frequencies, and rebuild (or incrementally update) the trie offline.

Consequence: the read path and write path are completely decoupled. Reads serve from an immutable in-memory snapshot; writes append to a log and feed a background pipeline. No locks, no contention.

This split — fast read path on an in-memory trie, slow update path through an offline batch pipeline — is the central architectural idea. Everything else is plumbing to make it scale and survive failures.

Pass 3 — The production shape

Now the full picture. Every node is numbered; find its matching card below to see what it does and what would break without it. Two clearly-separated planes: the Read Plane (request path) and the Update Plane (offline pipeline).

flowchart TB CL["① Client — debounced typing"] subgraph EDGE["Edge"] CDN["CDN"] LB["② Load Balancer"] end subgraph READ["Read Plane — request path"] API["③ Suggestion API Server"] CACHE["⑩ Cache — Redis"] TRIE1["④ Trie Server — shard A"] TRIE2["④ Trie Server — shard B"] TRIE3["④ Trie Server — shard C"] AGG["⑤ Aggregator"] end subgraph WRITE["Update Plane — offline pipeline"] LOG["⑥ Search Logger — Kafka"] PIPE["⑦ Frequency Aggregation — Spark"] BUILD["⑧ Trie Builder"] SNAP[("⑨ Trie Snapshot Storage — S3")] end CL --> CDN --> LB --> API API --> CACHE CACHE -.miss.-> AGG AGG --> TRIE1 AGG --> TRIE2 AGG --> TRIE3 API -.search clicks.-> LOG LOG --> PIPE --> BUILD BUILD --> SNAP SNAP -.periodic reload.-> TRIE1 SNAP -.periodic reload.-> TRIE2 SNAP -.periodic reload.-> TRIE3 style CL fill:#e8743b,stroke:#e8743b,color:#fff style CDN fill:#171d27,stroke:#3cbfbf,color:#d4dae5 style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5 style API fill:#171d27,stroke:#4a90d9,color:#d4dae5 style CACHE fill:#171d27,stroke:#3cbfbf,color:#d4dae5 style TRIE1 fill:#171d27,stroke:#38b265,color:#d4dae5 style TRIE2 fill:#171d27,stroke:#38b265,color:#d4dae5 style TRIE3 fill:#171d27,stroke:#38b265,color:#d4dae5 style AGG fill:#171d27,stroke:#9b72cf,color:#d4dae5 style LOG fill:#171d27,stroke:#d4a838,color:#d4dae5 style PIPE fill:#171d27,stroke:#d4a838,color:#d4dae5 style BUILD fill:#171d27,stroke:#d4a838,color:#d4dae5 style SNAP fill:#171d27,stroke:#e05252,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram to find the matching card below. Each one answers what is this, why is it here, and what would break without it.

Client — Browser / App

Sarah's browser, mobile app, or search bar widget. As she types, the client fires HTTP GET requests to /api/v1/suggestions?prefix=.... Critically, the client is not naive — it debounces keystrokes (waits ~50ms after the last keypress before firing), cancels in-flight requests when a new keypress lands, and pre-establishes the TCP/TLS connection on app open so the first request doesn't pay handshake latency.

Solves: reducing wasted backend traffic by ~70%. Without debouncing, typing "system design" would fire 13 separate suggestion requests; with it, maybe 3-5. The savings cascade — fewer requests means smaller fleets, smaller bills, and lower tail latency for everyone.

Load Balancer

The traffic cop in front of the API server fleet. Distributes 200K req/sec across hundreds of stateless suggestion API pods, evicts unhealthy ones via 5-second health checks, terminates TLS so backends don't pay the crypto cost. Round-robin works fine because the API tier is stateless.

Solves: no single API server can handle the QPS or survive a deploy/crash. Without an LB, scaling would mean DNS round-robin (slow to react to failures) or putting one giant box in the path (single point of failure). LB gives both horizontal scaling and self-healing.

Suggestion API Server

Stateless service that receives the user's prefix request and orchestrates the answer. The flow per request: parse prefix → check the cache for that exact prefix → on miss, ask the aggregator to query the right trie shards → return JSON. Also fires off an async event to the search logger when the user actually clicks a suggestion (that completion is the signal we care about).

Solves: isolating client-facing concerns (TLS, JSON, auth, rate limiting, request validation) from the actual lookup. Without this layer, every trie server would have to speak HTTP and handle malformed requests — coupling that hurts both. The API tier is also where personalization and geo-ranking get layered on top of the global trie result.

Trie Server Cluster

The heart of the system. Each trie server holds one shard of the trie in memory and answers prefix queries for that shard. Stateless from the client's perspective but stateful internally — the trie itself lives in process RAM, loaded from a snapshot at startup. A query is two steps: walk the trie to the prefix node (one pointer dereference per character), copy the node's pre-cached top-K array, return it. End-to-end on the trie server: under 1ms p99.

Solves: the latency budget. By holding the index in RAM and pre-caching top-K at every node, we replace "scan a billion rows and sort" with "follow 4 pointers and copy 10 strings". This is the architectural choice that makes the entire system possible.

Aggregator

When the trie is sharded by hash (say 8 shards), a prefix like "sy" might exist on multiple shards because different terms starting with "sy" live in different shards. The aggregator fans the prefix query out to all relevant trie shards in parallel, collects each shard's top-K, merges them by frequency, and returns the global top-K to the API server. With 8 shards × 10 results = 80 candidates merged into 10 — a tiny in-memory sort.

Solves: the cross-shard merge problem. Without an aggregator, the API server would either need to know the sharding scheme (tight coupling) or each trie shard would only see its slice of the data (incorrect results, missing popular terms from other shards). The aggregator hides sharding from upstream and guarantees a globally-correct top-K.
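
A minimal sketch of that merge, assuming each shard returns its candidates as (term, frequency) pairs — the Suggestion record is an illustrative shape, not part of the original design:

Top-K merge at the aggregator (sketch)
import java.util.*;

record Suggestion(String term, long freq) {}     // illustrative shape of a shard's candidate

// Merge N shard results (each already top-K by frequency) into the global top-K.
static List<String> mergeTopK(List<List<Suggestion>> shardResults, int k) {
  PriorityQueue<Suggestion> best =
      new PriorityQueue<>(Comparator.comparingLong(Suggestion::freq));  // min-heap of size k
  for (List<Suggestion> shard : shardResults) {
    for (Suggestion s : shard) {
      best.offer(s);
      if (best.size() > k) best.poll();          // drop the current weakest candidate
    }
  }
  List<String> out = new ArrayList<>();
  while (!best.isEmpty()) out.add(best.poll().term());
  Collections.reverse(out);                      // highest frequency first
  return out;
}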

Search Logger — Kafka

Every completed search (user typed a prefix and clicked a suggestion or pressed Enter) is logged to a Kafka topic as a structured event: {query, user_id, ts, geo}. Kafka is the buffer — it absorbs the firehose of 60K events/sec with bounded memory, persists them durably for days, and lets downstream consumers read at their own pace. The API server's call to Kafka is fire-and-forget — a 1ms async append, never blocking the user response.

Solves: decoupling the request path from the update path. Without a durable log, frequency updates would have to be applied synchronously, killing read latency and risking lost updates if the writer falls behind. Kafka lets us treat the firehose like water in a pipe — consumers downstream can be slow without backing up the source.
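
The logging call from the API server can be as small as an async producer send — a sketch using the standard Kafka Java client; the topic name ("search-clicks") and JSON payload are assumptions:

Fire-and-forget click logging (sketch)
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Fire-and-forget: send() is asynchronous; the user's response never waits on the append.
String event = "{\"query\":\"system design interview\",\"user_id\":\"sarah_42\",\"ts\":1730884320,\"geo\":\"IN-DL\"}";
producer.send(new ProducerRecord<>("search-clicks", "system design interview", event));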

Frequency Aggregation Pipeline — Spark

A batch job (Spark, MapReduce, Flink — pick your pipeline framework) runs every hour over the last hour of Kafka logs. It groups events by query string, counts occurrences, and emits a stream of (query, new_freq) records. To weight recent searches more heavily than ancient ones, frequencies are blended via an Exponential Moving Average: new_freq = α × this_hour + (1-α) × old_freq with α around 0.1.

Solves: turning a firehose of raw events into a small set of frequency updates. Without this aggregation, the trie builder would have to process 60K events/sec individually — overwhelming. With it, the builder gets a handful of millions of updates per hour, easily applied in a few minutes.
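
What the hourly job computes, sketched in plain Java rather than Spark so the logic stays visible — the variable names are illustrative and the real job would run distributed:

Hourly aggregation + EMA blend (sketch)
import java.util.*;
import java.util.stream.Collectors;

// Count this hour's events per query, then blend with the previous frequencies via EMA.
static Map<String, Double> aggregate(List<String> hourOfQueries,
                                     Map<String, Double> oldFreq,
                                     double alpha) {                  // e.g. alpha = 0.1
  Map<String, Long> thisHour = hourOfQueries.stream()
      .collect(Collectors.groupingBy(q -> q, Collectors.counting()));
  Map<String, Double> blended = new HashMap<>();
  Set<String> allTerms = new HashSet<>(oldFreq.keySet());
  allTerms.addAll(thisHour.keySet());
  for (String term : allTerms) {
    double hourCount = thisHour.getOrDefault(term, 0L);
    double previous  = oldFreq.getOrDefault(term, 0.0);
    blended.put(term, alpha * hourCount + (1 - alpha) * previous);    // new_freq = α·this_hour + (1-α)·old_freq
  }
  return blended;
}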

Trie Builder

Takes the new frequency updates from the pipeline and applies them to a working copy of the trie. Two strategies (covered in §9): rebuild from scratch each hour (simple, slow), or incremental update (apply deltas to an existing trie, fast but trickier on top-K cache invalidation). After the update, the builder re-computes the top-K cache at every affected node — walking up from each changed leaf to the root, propagating the new frequency.

Solves: producing a fresh, queryable trie without ever locking the live trie. The builder writes to a brand-new snapshot, hands it off via S3, and trie servers atomically swap to the new copy. Old snapshot stays in memory until the swap completes — zero downtime.

Trie Snapshot Storage — S3

Object storage holding serialized trie snapshots. Each snapshot is a compact binary representation (DFS-encoded with character + child count + recursive children — see §8) that a trie server can load directly into memory. New snapshots are versioned: trie/2026-05-07T14:00.bin. Trie servers poll for new versions; on finding one, they load it in the background, build the in-memory structure, then atomically swap pointers from old trie to new. Old snapshot purged after a few hours.

Solves: two things at once — distributing a fresh index to every trie server in the fleet, and recovering after a server crash (a restarted trie server pulls the latest snapshot and is back online in minutes). Without S3 in the middle, we'd need a bespoke push system or have every trie server rebuild its slice from raw logs after a crash — slow and complex.

Cache — Redis

An in-memory key-value cache holding prefix → top-K JSON for the hottest prefixes. Hits skip the trie entirely. The cache is sized small (a few hundred MB) because the long tail of unique prefixes is enormous — but a small set of head prefixes ("how to", "what is", "weather") accounts for a huge fraction of all queries. ML models can predict which prefixes will be hot (trending news, time-of-day patterns) and pre-warm the cache.

Solves: CPU on trie servers and tail latency. A cache hit is sub-millisecond Redis lookup; a trie miss is the full traversal + aggregator merge (still under 50ms but 50× more expensive). At 80% hit rate on hot prefixes we shed a huge fraction of trie traffic without growing the trie fleet.
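
The cache sits in front of the trie as a classic cache-aside lookup — a sketch assuming the Jedis client, a 60-second TTL, an assumed key format, and a placeholder queryTrieShards helper standing in for the fan-out path:

Cache-aside lookup (sketch)
import redis.clients.jedis.Jedis;

// Try Redis first, fall back to the trie fleet on a miss, then populate for the next caller.
String suggestFor(String prefix, Jedis redis) {
  String key = "ta:" + prefix;                   // assumed key format
  String cached = redis.get(key);
  if (cached != null) return cached;             // sub-millisecond hit, trie never touched
  String fresh = queryTrieShards(prefix);        // placeholder: fan-out + aggregator merge
  redis.setex(key, 60, fresh);                   // 60s TTL so the cache tracks hourly rebuilds
  return fresh;
}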

Concrete walkthrough — Sarah types, system updates

Two real flows mapped to the numbered components above:

📖 Read flow — Sarah types "syst" at 14:02:06

  1. Sarah taps "s". The client ① starts a 50ms debounce timer.
  2. Before 50ms passes, she taps "y". Timer resets.
  3. "s" again, then "t". After 50ms with no new keystroke, the client fires GET /api/v1/suggestions?prefix=syst.
  4. The load balancer ② routes to a healthy API server ③.
  5. API server checks the cache ⑩ for "syst". Hit? Yes — "syst" is hot today. Return cached top-K in 8ms total.
  6. Cache miss path: API server asks the aggregator ⑤ → fans out to all 8 trie servers ④ in parallel → each walks its in-memory trie (4 pointer hops to the "syst" node), copies its top-10 array → aggregator merges 80 candidates → returns global top-10. Total: ~25ms.
  7. API server returns JSON. The dropdown updates in Sarah's browser.
  8. If Sarah clicks "system design interview", the API server fires off a click event to the search logger ⑥ async.

🔄 Update flow — system rebuild at 03:00 UTC

  1. Cron triggers the frequency aggregation pipeline ⑦. Spark reads the last hour of Kafka ⑥ logs (~6 GB of click events).
  2. Spark groups by query, counts occurrences, blends with the previous hour's EMA frequencies. Outputs ~2M (query, freq) pairs.
  3. The trie builder ⑧ takes these updates and applies them to a working copy of the trie. For each updated term, it walks up to the root re-computing top-K caches at every ancestor node touched.
  4. Builder serializes the new trie via DFS into a binary file, uploads to S3 ⑨ as trie/2026-05-07T03:00.bin.
  5. Each trie server ④ polls S3, sees the new version, downloads its shard's slice of the snapshot (a few GB, a handful of seconds on a 10 Gbps link; a full 50 GB snapshot would take ~40 seconds), parses it into a new in-memory trie (~30 seconds), then atomically swaps the pointer from old trie to new. Old trie GC'd.
  6. From the next request onward, that trie server serves from the freshest data. Sarah gets up-to-date suggestions on her next search session.
So what: the architecture is built on three insights — (1) the index fits in RAM so reads never touch disk, (2) writes go to a log and update the trie offline so reads and writes never contend, (3) top-K is precomputed at every trie node so a query is a walk plus an array copy. Every component in the diagram exists to make one of those three things scale, stay fresh, or survive a failure.
Step 7

The Trie Data Structure — A Closer Look

Already introduced in §5; now we go inside. The crucial production trick is pre-storing top-K at every node — without it, the trie alone is not fast enough.

Node structure

Each trie node holds:

TrieNode pseudocode
class TrieNode {
  char ch;
  Map<Character, TrieNode> children;  // up to 26-64 entries
  boolean isCompleteTerm;
  int frequency;                          // only meaningful if isCompleteTerm
  List<String> topK;                       // the 10 most popular terms
                                            // in the subtree rooted here
}

The topK list at each node is the magic. Without it, a query like "give me top 10 for prefix 'a'" would have to walk the entire subtree under 'a' (potentially millions of terms), find frequencies, partial-sort them, return 10. Way too slow. With it, the same query is one pointer walk to the 'a' node and a constant-time read of its 10-element array.
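
Putting the two steps together, a lookup is just the prefix walk plus a read of the cached array — a minimal sketch against the TrieNode above:

Prefix lookup (sketch)
import java.util.List;

// Walk the prefix one character at a time, then return the node's precomputed top-K.
List<String> suggest(TrieNode root, String prefix) {
  TrieNode node = root;
  for (char c : prefix.toCharArray()) {
    node = node.children.get(c);
    if (node == null) return List.of();   // no indexed term starts with this prefix
  }
  return node.topK;                       // constant-time read; no subtree walk, no sort
}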

Memory vs. speed trade-off

Pre-storing 10 strings per node multiplies memory roughly 10×. With 100M unique terms and an average term length of 15 chars, the naive trie was a few GB; the top-K-cached trie is ~25-50 GB. Worth it — RAM is cheap, latency is not. Modern servers ship with 256 GB+; one box can hold the whole index.

Building / updating the top-K cache

When a term's frequency changes, you walk from its leaf node up to the root. At each ancestor, you re-compute that node's top-K from its children's top-K arrays — a merge of (children_count × 10) candidates → keep top 10. That's a tiny amortized cost per update, and crucially it's done offline in the trie builder, never on the request path.
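
The per-node step in that walk is a small merge of the children's cached lists plus the node's own term — a sketch assuming the builder passes in the node's full string (nodeTerm) and a frequency lookup (freqOf), both illustrative:

Top-K recompute in the builder (sketch)
import java.util.*;

// Recompute one node's top-K from its children's already-correct top-K lists.
void recomputeTopK(TrieNode node, String nodeTerm, int k, Map<String, Integer> freqOf) {
  List<String> candidates = new ArrayList<>();
  if (node.isCompleteTerm) candidates.add(nodeTerm);                  // the node's own term competes too
  for (TrieNode child : node.children.values()) candidates.addAll(child.topK);
  candidates.sort(Comparator.comparingInt((String t) -> freqOf.getOrDefault(t, 0)).reversed());
  node.topK = new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
}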

The analogy: a trie with top-K caches is like a phonebook where every page header lists the most-called-from-that-section names. To find "top callers starting with Br" you flip to the Br page and read the header — you don't scan the whole section.
Step 8

Permanent Storage of the Trie

The live trie lives in RAM. When a trie server crashes (or starts up fresh), how does it get its data back? By persisting snapshots to object storage and pulling the latest one on boot.

Serialization format

Walk the trie depth-first. For each node, write: (character, num_children, [recursive serialization of each child]). The complete-term flag and frequency get inlined into the same record. This produces a compact binary stream where deserializing is a single forward pass — no random access needed.

DFS serialization (pseudocode)
void serialize(TrieNode node, DataOutputStream out) throws IOException {
  out.writeChar(node.ch);
  out.writeBoolean(node.isCompleteTerm);
  if (node.isCompleteTerm) out.writeInt(node.frequency);
  out.writeByte(node.children.size());   // fan-out is small (≤64), fits in one byte
  for (TrieNode child : node.children.values()) {
    serialize(child, out);
  }
}
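
The matching deserializer is the same forward pass in reverse — a sketch assuming the framing above, with top-K caches rebuilt after the load rather than stored in the snapshot:

DFS deserialization (sketch)
import java.io.*;
import java.util.HashMap;

TrieNode deserialize(DataInputStream in) throws IOException {
  TrieNode node = new TrieNode();
  node.ch = in.readChar();
  node.isCompleteTerm = in.readBoolean();
  if (node.isCompleteTerm) node.frequency = in.readInt();
  int childCount = in.readUnsignedByte();
  node.children = new HashMap<>();
  for (int i = 0; i < childCount; i++) {
    TrieNode child = deserialize(in);    // children were written depth-first, so read them back in order
    node.children.put(child.ch, child);
  }
  return node;
}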

Snapshot lifecycle

  • Trie builder writes trie/2026-05-07T03:00.bin to S3 after each rebuild.
  • Trie servers poll S3 every minute for new versions. On finding one, they download in background, parse, and atomically swap.
  • On crash recovery, a restarted trie server pulls the latest snapshot from S3 and is serving traffic in minutes — no need to replay logs from the beginning of time.
  • Old snapshots are retained for 24-48 hours then garbage-collected, so we can roll back to the previous version if a bad rebuild ships.
Key property: the snapshot is the source of truth for "what trie should the fleet currently serve". The live in-memory trie is a fast cache of that snapshot. This separation means losing a trie server is a non-event — replacement comes online from the same snapshot in minutes.
Step 9

Updating the Trie

Search trends move every hour. How do we get new frequencies into the live trie without ever taking the system down?

🔄 Strategy A — Master-slave swap

Keep two copies of the trie: a live copy serving traffic and a shadow copy being updated by the builder. When the shadow copy is fresh, atomically swap pointers — the next request hits the new trie, the old one is GC'd. Zero downtime, zero locks. Cost: 2× memory during the swap window.

When to use: when updates are batched (every hour). Simplest correct solution.
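
On the trie server, the swap itself is a single atomic pointer update — a sketch using AtomicReference; the snapshot-loading helper is a placeholder:

Atomic trie swap (sketch)
import java.util.concurrent.atomic.AtomicReference;

// The live trie every request reads from. Readers never see a half-built trie.
final AtomicReference<TrieNode> liveTrie = new AtomicReference<>();

void reloadFromSnapshot(String snapshotPath) {
  TrieNode fresh = loadSnapshot(snapshotPath);   // placeholder: download + deserialize in the background
  liveTrie.set(fresh);                           // atomic swap; the very next request uses the new trie
  // the old trie becomes unreachable and is garbage-collected
}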

⚡ Strategy B — Incremental delta apply

Maintain a delta queue of frequency updates. During quiet windows, take a brief read-write lock on the live trie, apply the deltas in batches (each batch is a few thousand updates), release. Top-K caches at affected ancestors get updated in place.

When to use: when freshness matters more than simplicity (sub-hour updates) and you can tolerate occasional latency blips during apply windows.

Frequency weighting — why Exponential Moving Average

If you simply added each hour's count to the all-time count, frequencies would never reflect that "deflation" stopped trending and "ChatGPT" took over. Use an EMA so newer searches weigh more heavily:

EMA blend
new_freq = α × this_hour_count + (1 - α) × old_freq

// α = 0.1 means: this hour weighs 10%, the past weighs 90%
// with α = 0.1 a query's frequency halves over ~7 hours of zero searches
// — fast enough to track trends, slow enough to ignore noise
The interview move: mention EMA explicitly. It shows you've thought about freshness vs. stability — a flat counter would be polluted forever by a single viral spike, but EMA naturally fades old signals.
Step 10

Data Partitioning

Today the index fits on one box. But running the entire 200K QPS load on one box (no matter how big) is a single point of failure and a CPU bottleneck — we shard for throughput and fault tolerance, not just for storage.

❌ Range-based by first letter

"All terms starting with a-c go to shard 1, d-f to shard 2…" Easy to implement, but distribution is grossly uneven — "s" alone is dramatically more popular than "x" or "z" combined. The "s" shard melts while the "x-z" shard sits idle.

❌ Max-capacity-based

"Fill shard 1 until it's full, then start shard 2…" In practice, the early shards hold the most-popular terms (added first), so traffic pile-up on shard 1 is inevitable. Hot-shard problem doesn't go away.

✅ Hash-based on full term

Compute shard = hash(term) mod N (or place terms on a consistent-hash ring so adding a shard moves only a fraction of keys). Terms are uniformly distributed across N shards — no hotspot by design. Cost: a prefix query like "sy" must fan out to all shards (because terms starting with "sy" live wherever the hash sends them), then the aggregator merges results.

The aggregator's role under hash sharding

This is exactly why component ⑤ (aggregator) exists. The cost of fan-out is low because (a) shards run the query in parallel, so total latency = single-shard latency + aggregator merge cost (~5ms), and (b) each shard returns at most K results (10), so the aggregator's input is N × K (e.g., 80) candidates merged into K — trivial.
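
Hash routing plus fan-out, sketched with CompletableFuture — TrieShardClient is a placeholder RPC client, and Suggestion/mergeTopK come from the aggregator sketch in §6:

Shard routing + fan-out (sketch)
import java.util.*;
import java.util.concurrent.CompletableFuture;

interface TrieShardClient { List<Suggestion> topK(String prefix, int k); }   // placeholder RPC client

// For updates: a term maps to exactly one shard.
int shardFor(String term, int numShards) {
  return Math.floorMod(term.hashCode(), numShards);
}

// For reads: a prefix can match terms on any shard, so fan out to all of them in parallel.
List<String> globalTopK(String prefix, List<TrieShardClient> shards, int k) {
  List<CompletableFuture<List<Suggestion>>> calls = new ArrayList<>();
  for (TrieShardClient shard : shards) {
    calls.add(CompletableFuture.supplyAsync(() -> shard.topK(prefix, k)));   // parallel shard queries
  }
  List<List<Suggestion>> results = new ArrayList<>();
  for (CompletableFuture<List<Suggestion>> call : calls) results.add(call.join());
  return mergeTopK(results, k);                  // e.g. 80 candidates → global top-10
}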

Why hash-on-term beats every alternative: it gives uniform load distribution and uniform shard size by construction. The fan-out cost is fully recouped by the parallelism. Large-scale production typeahead systems typically hash-shard their indexes for exactly this reason.
Step 11

Cache

The trie is already fast (under 1ms per query inside a server), but a Redis cache in front of the trie shaves another 10× off latency on hot prefixes and sheds CPU off the trie fleet.

📦 What to cache

Per-prefix top-K: "how to" → ["how to tie a tie", "how to screenshot", ...]. The hot head of the prefix distribution dwarfs the long tail — caching the top 10K prefixes captures most queries with very little memory.

♻️ Eviction

LRU. Trending prefixes naturally heat up and stay; yesterday's news cools and rolls out. Optionally use TTL too (60-120s) so cached top-K doesn't lag the trie's hourly refresh by more than a minute.

🔮 Pre-warming with ML

Use a model to predict which prefixes will be hot in the next hour — based on news trends, time-of-day patterns, recent rising queries. Pre-populate the cache before traffic arrives. Catches cold-start spikes (a breaking news story) that otherwise pummel the trie.

CDN at the edge

For ultra-popular prefixes ("how", "what", "the") the response is identical for every user (modulo personalization). A CDN at the edge can cache these short-prefix responses with a 30-second TTL. A user in Tokyo gets the dropdown from a Tokyo edge node in 15ms instead of a 200ms round-trip to the US data center.

Caching nuance: personalization breaks shareable caching. We cache only the global top-K per prefix and apply personalization (re-rank or inject user history) at the API server layer ③ after the cache hit. That way 95% of users see the cached answer and the personalization step costs ~1ms on top.
Step 12

Replication & Load Balancing

Each trie shard is replicated multiple ways for both load and fault tolerance.

📋 Replication of trie shards

Each shard runs on 3+ replicas across availability zones. Reads are load-balanced across replicas (any replica can answer any query for its shard — they all have identical in-memory copies). Writes (snapshot loads) happen independently on each replica from the same S3 snapshot, so replicas converge naturally without a primary.

🚦 Load balancing strategy

The aggregator routes each shard query via consistent hashing — same term always goes to the same primary replica first, falling back to secondaries if the primary is unhealthy. Consistent hashing also helps cache locality on the trie server side (if you add a per-replica memoization layer).

LB at every tier

Three load-balancing layers: (1) Public LB from clients to API servers (round-robin, least-conn). (2) Internal LB from API servers to aggregator/cache (any healthy node). (3) Sharded LB from aggregator to trie servers (consistent hash on term + shard). Each layer uses health checks to evict unhealthy nodes within seconds.

Step 13

Fault Tolerance

What happens when each component dies?

💥 Trie server crash

One replica out of 3+ for that shard goes down. LB evicts it within 5 seconds via health checks. Remaining replicas absorb its share of traffic — with 3 replicas, each survivor takes roughly 50% more load, easily within capacity headroom. Replacement pod boots, pulls latest snapshot from S3, registers with LB. Total time-to-recover: a few minutes; user-visible impact: zero.

💥 All replicas of a shard die

Rare (cross-AZ failure), but: aggregator detects the shard is unreachable, returns a partial top-K from the surviving shards. Quality degrades for prefixes that span the dead shard, but the system stays up. Recovery: spin up new replicas, snapshot loads in minutes.

💥 Search logger / Kafka cluster down

Kafka is highly replicated and rarely fully fails. If it does: the API server's "log this click" call fails async — we lose some events. Acceptable: missing an hour of click events means the next trie rebuild is slightly stale, not broken. EMA smoothing means even a multi-hour outage barely moves frequencies.

💥 Trie builder fails

No new snapshot gets produced this hour. Trie servers continue serving the previous snapshot — slightly stale, but fully functional. Operator gets paged, fixes the build job, next hour's run catches up. Critical insight: the read path doesn't depend on the update path being healthy right now.

Step 14

Typeahead Client Optimizations

Easy to forget that the most impactful latency fixes happen on the client. The server can be perfect and a naive client still feels broken.

⏱️ Debouncing

Wait 50ms after the last keypress before firing the suggestion request. If a new keypress lands within 50ms, reset the timer. Result: typing "system" at normal speed fires 1-2 requests, not 6. Backend traffic drops 70%; user-perceived latency is unchanged.

🚫 Cancel in-flight requests

When a new keypress fires a new request, cancel the previous one. Avoids the case where a slow request for "sy" returns after the user has typed "system" — showing stale suggestions. Use AbortController in the browser.
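
The same two behaviors, sketched in the document's Java-flavored pseudocode — sendSuggestionRequest is a placeholder async call; in a browser the equivalent is a timer plus AbortController:

Debounce + cancel (sketch)
import java.util.concurrent.*;

ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
ScheduledFuture<?> pendingFire;          // the debounce timer
Future<?> inFlight;                      // the request currently on the wire

// Called on every keypress with the current contents of the search box.
void onKeystroke(String prefix) {
  if (pendingFire != null) pendingFire.cancel(false);      // reset the 50ms debounce window
  pendingFire = timer.schedule(() -> {
    if (inFlight != null) inFlight.cancel(true);           // drop the stale request for the old prefix
    inFlight = sendSuggestionRequest(prefix);              // placeholder: async GET /suggestions?prefix=...
  }, 50, TimeUnit.MILLISECONDS);
}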

🌡️ Pre-warm popular completions

On app open / page load, pre-fetch the top 100 most-popular short prefixes (a few KB total). The first keystroke shows suggestions instantly from local storage — zero network round-trip for the most common case.

🔌 Establish TCP/TLS early

The first request to a new server pays a TCP 3-way handshake + TLS handshake (~100ms RTT × 3-4 = 300-400ms cold). Open a connection on app boot (an unused HTTP/2 stream) so the first keystroke uses a warm connection. With HTTP/3 / QUIC, this is even better — 0-RTT for resumed sessions.

The numbers matter: a 50ms debounce + warm connection + cancel-in-flight reduces effective per-keystroke latency from "300ms cold + 50ms wasted requests" to "60ms steady-state". The user feels instant; the server fleet shrinks; cloud bills drop. Same backend, dramatically better UX.
Step 15

Personalization

The global trie tells us "syst" → ["system design interview", "system32 folder", ...]. But Sarah is a software engineer who searched "system design interview" 12 times last month. She should see her past searches first.

Per-user history

Each user has a personal frequency map of their past queries — stored compactly in a user-profile cache (Redis hash, key by user_id). When the API server returns suggestions, it does a final re-rank step:

  • Take the global top-K from the trie.
  • Look up the user's personal frequency map for matches against the prefix.
  • Blend personal scores with global scores (e.g., score = 0.7 × global + 0.3 × personal).
  • Re-sort and return.

Personal data is added to the user profile asynchronously after each completed search. Storage is tiny (a few KB per active user) and the lookup is a single Redis call.
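
The blend step at the API server, sketched in Java — the 0.7/0.3 weights mirror the example above, and the frequency maps are illustrative inputs:

Personalized re-rank (sketch)
import java.util.*;

// Re-rank the global top-K with the user's own history: score = 0.7·global + 0.3·personal.
List<String> personalize(List<String> globalTopK,
                         Map<String, Integer> globalFreq,
                         Map<String, Integer> personalFreq) {
  List<String> out = new ArrayList<>(globalTopK);
  out.sort(Comparator.comparingDouble((String term) ->
        0.7 * globalFreq.getOrDefault(term, 0)
      + 0.3 * personalFreq.getOrDefault(term, 0)).reversed());
  return out;
}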

Caching implication: personalized responses can't be cached at the CDN. We cache the global top-K (sharable across all users), and apply personalization at the API server after the cache hit. Best of both worlds: most of the work is shared, the user-specific bit is ~1ms on top.
Step 16

Interview Q&A

Why a trie over a SQL LIKE 'prefix%' query?
Latency, throughput, and the shape of the query. A SQL LIKE on a billion-row table reads thousands of matching rows just to sort by frequency — 500ms+ at scale. A trie answers the same query in microseconds: walk the prefix path (one pointer per character), read the precomputed top-K array. The trie also shares common prefixes structurally, so 100M unique terms compress to ~25 GB instead of multiple times that. SQL is a swiss-army knife optimized for joins and ad-hoc filters; this is a single hyper-specific query that wants a hyper-specific structure.
How does the system update without downtime?
Master-slave swap. The trie builder writes a brand-new snapshot to S3. Each trie server downloads the new snapshot in background, builds the new in-memory trie alongside the live one, then atomically swaps a pointer from old to new. Old trie is GC'd after the swap. From the client's perspective, one request hits the old trie and the next hits the new — zero downtime, zero locks. Cost: 2× memory during the swap window, easily affordable.
How do you handle a hot prefix like "how to"?
Three layers absorb the spike before it reaches the trie. (1) CDN caches the response at edge POPs with 30-second TTL — Tokyo users hit Tokyo edges, never the origin. (2) Redis cache at the API tier holds top-K for hot prefixes — a hit is sub-ms with no trie touch. (3) Replication of trie shards (3+ replicas each) means even a cache miss spreads load across replicas. End result: the hottest prefix on the internet costs us almost nothing on the trie servers.
What if the trie doesn't fit in memory anymore?
Shard further and possibly tier the trie. First lever: increase shard count — instead of 8 shards each holding ~5GB, go to 32 shards each holding ~1.5GB. Each shard fits trivially. Second lever: tiered trie — keep the hot subtree (popular prefixes, short prefixes) fully in RAM, push the cold long tail (rare 6+ char prefixes) to a fast SSD-backed key-value store. Reads from cold tier are slower (~5ms vs <1ms) but trigger only on rare prefixes where the user is already typing slowly. Production systems rarely need this — sharding alone scales for years.
How would you add personalization?
Re-rank globally-cached results with per-user data. Store each user's past-query frequency map in Redis (keyed by user_id). On each suggestion request, fetch the global top-K from the trie/cache, then fetch the user's personal frequencies for the prefix, blend the two scores (e.g., 70% global + 30% personal), and re-sort. The personalization step adds ~1ms per request, doesn't affect the trie or cache (which stay shareable across users), and gives a noticeably better UX.
Push vs. pull for trie updates — which and why?
Pull, via S3 snapshots. Push (broadcasting deltas to every trie server) sounds faster but creates coupling and back-pressure problems — slow servers slow the whole pipeline, schema mismatches break clusters silently, scaling out the trie fleet means provisioning more push fan-out. Pull from S3 is dead simple: each trie server polls for the latest version, downloads, swaps. Adding a new trie server requires zero pipeline changes — it just starts polling. The only cost is a 1-2 minute latency between snapshot publish and fleet refresh, which is well within our hourly freshness budget.
Why fan out a query to all shards instead of routing it to one?
Because we hash-shard by full term, not by prefix. Hash-on-term gives uniform shard load — no hotspots — but means terms starting with "sy" can live on any shard. So a prefix query must ask every shard "what's your top-K for sy?" and the aggregator merges. The fan-out cost is paid once in parallel (~5ms aggregator merge on top of single-shard latency), and the alternative (prefix-based sharding) creates catastrophic hot-shard problems where all "s" traffic crushes one box.
How do you bootstrap suggestions for a brand-new search engine with no history?
Seed with curated content + Wikipedia titles + a query log from a related corpus. Day 1: load Wikipedia article titles, common nouns from a dictionary, popular product names — gives a baseline trie of millions of terms. Day 2 onward: real user queries flow through the logger and frequencies start adjusting via EMA. Within a week the curated seed has been overwritten by actual usage patterns. The architecture doesn't change; only the initial trie content does.
The one-line summary the interviewer remembers: "It's a hash-sharded in-memory trie with top-K cached at every node, fronted by a Redis cache and CDN, fed by an hourly offline pipeline that aggregates a Kafka log of search clicks via EMA into a fresh snapshot in S3 — read path and update path fully decoupled so reads stay sub-100ms while updates roll out without downtime."