High-Level Design

Twitter Search

Searching 730 billion tweets in under 200ms — the inverted index, the shard fan-out, and the architecture that makes a 21-terabyte index over hundreds of terabytes of tweets feel instant

Read this with the framework in mind

This deep-dive applies the HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is Twitter Search?

Imagine Sarah, sitting on her couch during the World Cup final, typing "world cup final" into the Twitter search bar. Within 200 milliseconds, Twitter must dig through the 730 billion tweets ever posted, find the few hundred thousand that mention all three of those words, rank them by how recent and how popular they are, and return the top 20 to her phone. The corpus being searched runs to hundreds of terabytes; the answer must feel instant. That's the system we're designing.

The naive approach — "scan every tweet looking for the words" — would take hours per query and saturate every database in the cluster. Twitter Search instead pre-builds a special data structure called an inverted index that maps every English word to the list of tweets that contain it. When Sarah searches, we don't search the tweets — we search the index. That single architectural choice is what makes the system possible.

The two questions that drive every design decision below: (1) How do we build and continuously update an index over 730 billion tweets without falling behind the firehose of new tweets? (2) How do we serve thousands of multi-word search queries per second over a 21TB index, with under-200ms latency, when each query may need to consult dozens of machines?
Step 2

Requirements & Goals

Before drawing a single box, pin down what the system must do. In an interview, asking these questions early shows you're not just reaching for a memorized solution.

✅ Functional Requirements

  • Full-text search over the entire history of tweets (730B and counting)
  • Multi-word queries with implicit AND between terms (and explicit OR/NOT operators)
  • Sort results by recency or by relevance (default)
  • Pagination — return 20 results per page, allow loading more
  • Optional filters: language, time window, user, has-media

⚙️ Non-Functional Requirements

  • Fast — under 200ms p99 end-to-end search latency
  • Scalable — handle billions of tweets and grow at 400M new tweets/day
  • Highly available — search is one of the top 3 features on Twitter; an outage is front-page news
  • Eventually consistent — a tweet posted at 14:02:00 should be searchable by 14:02:30 (the index-lag budget is under 30 seconds)
The hard part isn't search — it's the index. Building an inverted index over 21TB of words, keeping it up-to-date with 4,600 new tweets per second, sharding it across 150+ servers, and merging top-K results back at query time — that's the architecture. The actual search at query time is then "look up a word in a hash map, intersect a few lists, rank, return".
Step 3

Capacity Estimation & Constraints

Numbers are not optional in HLD. They drive every architectural choice — sharding count, server RAM, fan-out width — so do them out loud, even if rough.

Traffic estimates

Assume 1.5 billion total users, of which 800 million are daily active. Twitter sees 400 million tweets per day (roughly the public-tweet rate at peak Twitter scale).

Metric | Value | Derivation
Tweets / sec | ~4,600 | 400M / 86,400
Searches / day | 500M | ~0.6 searches per active user per day
Searches / sec | ~5,800 QPS | 500M / 86,400
Tweet size | 300 bytes | text + metadata, no media

Storage estimate (5 years)

400M tweets/day × 300 bytes = 120 GB/day = ~1.4 MB/sec of new tweet data. Over 5 years that's ~200 TB of tweet text.

Apply an 80% capacity model (always keep 20% headroom) plus 2× replication for durability: 200 TB / 0.8 × 2 ≈ 500 TB total. Spread across servers with 4TB SSDs each, we need ~125 storage servers.

Index size estimate (the real workload)

The index is separate from the raw tweets. Assume 500K commonly-used English words plus 200K proper nouns/entities — call it ~700K total vocabulary. Each tweet contains roughly 15 indexable words (after dropping stopwords like "the", "is", "of"). For 730B historical tweets, the inverted index is roughly:

700K words × ~5 bytes per word for the dictionary (negligible), plus 730B tweets × 15 words/tweet × 5 bytes per tweet_id ≈ 55 TB of raw postings — compression and per-posting metadata bring that to roughly 21 TB (the detailed math is in §7).

To hold this in RAM (so query latency is microseconds, not milliseconds), at 144 GB RAM per server: 21 TB / 144 GB ≈ ~152 index servers.
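These estimates are easy to fumble at a whiteboard, so here is a small Python sketch that reproduces them. The constants are the assumptions from this section, not measured values; comments note where the text rounds.

# Back-of-envelope capacity math (assumptions from this section)
TWEETS_PER_DAY    = 400e6
SEARCHES_PER_DAY  = 500e6
TWEET_SIZE_BYTES  = 300
TOTAL_TWEETS      = 730e9
WORDS_PER_TWEET   = 15        # indexable words after stopword removal
ID_SIZE_BYTES     = 5
RAM_PER_SERVER_GB = 144
SECONDS_PER_DAY   = 86_400

tweets_per_sec   = TWEETS_PER_DAY / SECONDS_PER_DAY         # ~4,600
searches_per_sec = SEARCHES_PER_DAY / SECONDS_PER_DAY       # ~5,800 QPS

daily_ingest_gb  = TWEETS_PER_DAY * TWEET_SIZE_BYTES / 1e9  # ~120 GB/day
five_year_tb     = daily_ingest_gb * 365 * 5 / 1e3          # ~220 TB (the text rounds to 200 TB)
with_headroom_tb = five_year_tb / 0.8 * 2                   # 20% headroom + 2x replication ≈ 550 TB (text: ~500 TB)

raw_index_tb = TOTAL_TWEETS * WORDS_PER_TWEET * ID_SIZE_BYTES / 1e12  # ~55 TB of raw postings
index_tb     = raw_index_tb / 4 * 1.5                       # ~4x compression, +50% metadata ≈ 21 TB
index_servers = index_tb * 1e3 / RAM_PER_SERVER_GB          # ~145; the design budgets ~152

print(f"tweets/sec    ≈ {tweets_per_sec:,.0f}")
print(f"searches/sec  ≈ {searches_per_sec:,.0f}")
print(f"5-yr storage  ≈ {with_headroom_tb:,.0f} TB (replicated, with headroom)")
print(f"index size    ≈ {index_tb:,.0f} TB across ≈ {index_servers:,.0f} RAM-resident servers")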

Metric | Value | Why it matters
New tweets / sec | 4,600 | Drives index-update throughput & Kafka sizing
Search QPS | 5,800 | Drives aggregator + index-server fleet size
Storage growth | 120 GB/day | Forces sharded primary store (Cassandra)
5-yr tweet storage | 500 TB | Sized across ~125 servers
Inverted index size | ~21 TB | Sized across ~152 RAM-resident index servers
Index lag SLO | < 30 sec | Tweet → searchable latency budget
Step 4

System APIs

One endpoint carries 99% of the search traffic. Defining it early locks the contract before architecture.

REST API surface
// Search — read path, the only endpoint users hit directly
GET /api/v1/search
{
  "api_key":    "abc123...",        // rate-limit identity
  "query":      "world cup final",  // space-separated terms; AND by default
  "count":      20,                 // page size, max 100
  "sort":       "relevance",        // "relevance" | "recency"
  "page_token": "eyJzaGFyZCI6...",  // opaque cursor for pagination
  "filters": {
    "lang":      "en",
    "from_date": "2026-05-01",
    "to_date":   "2026-05-07",
    "user":      null,
    "has_media": false
  }
}

→ 200 OK
{
  "results": [
    {
      "tweet_id":     "1789234...",
      "text":         "Argentina just won the World Cup final!!!",
      "user":         { "id": 42, "handle": "@futbolfan" },
      "created_at":   "2026-05-07T14:02:31Z",
      "like_count":   12453,
      "retweet_count": 4821,
      "score":        0.94
    },
    ...
  ],
  "next_page_token": "eyJzaGFyZCI6...",
  "took_ms":         142
}
Why an opaque page_token instead of an integer offset? An offset like page=5 forces every shard to scan the first 100 results just to skip them — wasteful and slow. The token encodes "where each shard left off" — for example "shard-7 stopped at tweet_id 1789234, shard-12 at 1789201". On the next page, each shard resumes from its own cursor in O(log N) time. Same trick Google, Elasticsearch and DynamoDB all use.
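A minimal sketch of how such an opaque cursor could be built, assuming the token is simply a base64-encoded map of per-shard cursors. A real token would also be signed or encrypted; shard_cursors and the field names here are illustrative, not Twitter's actual format.

import base64
import json

def encode_page_token(shard_cursors: dict[str, int]) -> str:
    """Pack {shard_id: last_tweet_id_returned} into an opaque string."""
    raw = json.dumps(shard_cursors, sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_page_token(token: str) -> dict[str, int]:
    """Recover each shard's cursor so it can resume below that tweet_id."""
    raw = base64.urlsafe_b64decode(token.encode())
    return json.loads(raw)

# First page returned results down to these IDs on each shard:
token = encode_page_token({"shard-7": 1789234, "shard-12": 1789201})
# The next request carries the token; each shard resumes strictly below its own cursor.
assert decode_page_token(token) == {"shard-7": 1789234, "shard-12": 1789201}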
Why rate-limit by api_key? A single hot query like "the" (millions of matches) is 1000× more expensive than "giraffe poetry" (a few hundred). Without rate limits, one bot could exhaust the cluster. We rate-limit by API key (e.g., 100 searches/min for free tier) and reject overly broad queries with a 400.
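A sketch of per-key rate limiting with a token bucket. The 100-per-minute quota comes from the example above; the burst size and the in-process dict (standing in for whatever shared store, such as Redis, would hold the buckets in production) are assumptions.

import time
from collections import defaultdict

RATE = 100 / 60.0   # 100 searches per minute (free tier, per the example above)
BURST = 20          # hypothetical burst allowance

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow_search(api_key: str) -> bool:
    """Refill the caller's bucket based on elapsed time, then try to spend one token."""
    bucket = _buckets[api_key]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
    bucket["ts"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False   # caller gets HTTP 429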
Step 5

Storage — Tweets vs. Index

The single most important storage insight: we have two completely different storage systems with completely different access patterns. Confusing them is the most common mistake.

📦 Primary Tweet Store (Cassandra)

The source of truth. Every tweet ever posted lives here, keyed by tweet_id. We embed the timestamp into the tweet_id so the IDs are roughly time-sortable (think: Snowflake-style 64-bit IDs where the upper bits are epoch milliseconds).

Access pattern: "give me the full tweet object for tweet_id X" — pure key-value, no joins, no scans. Sharded by tweet_id range (or hash) across ~125 nodes.

Why Cassandra: handles wide-column key-value at petabyte scale, multi-region replication out of the box, tunable consistency.
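For context, a sketch of the Snowflake-style ID layout mentioned above, assuming the commonly cited split of 41 bits of milliseconds since a custom epoch, 10 bits of machine ID, and 12 bits of per-machine sequence.

TWITTER_EPOCH_MS = 1288834974657   # a fixed custom epoch; any constant works

def make_tweet_id(epoch_ms: int, machine_id: int, sequence: int) -> int:
    """41 bits of milliseconds since the custom epoch, 10 bits machine, 12 bits sequence."""
    return ((epoch_ms - TWITTER_EPOCH_MS) << 22) | (machine_id << 12) | sequence

def id_to_timestamp_ms(tweet_id: int) -> int:
    """The timestamp sits in the high bits, so IDs sort roughly by creation time."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS

Because the timestamp occupies the high bits, comparing or range-scanning tweet_ids is effectively comparing creation times, which is what makes time-window filters and recency sorting cheap.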

🔍 Inverted Index (custom, in-memory)

A precomputed map of word → [tweet_id, tweet_id, ...]. Built incrementally from the tweet firehose, held in RAM across ~152 servers, never queried for the actual tweet content.

Access pattern: "give me tweet_ids that contain word X" — return a sorted list. Multi-word queries do set intersection across multiple lookups.

Why a separate system: the index is queried 1000× more often than tweets are written, has very different memory/access characteristics, and updating it can be done asynchronously without blocking writes to the primary store.

So what: a search query touches both — the index tells us which tweets matched (a list of IDs), then the primary store gives us the content of those tweets to return to Sarah. Two systems, two trips, two scaling stories.
Step 6 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.

Pass 1 — The naive design (and why it breaks)

The simplest possible system: store every tweet in a SQL database. To search, run SELECT * FROM tweets WHERE text LIKE '%world cup%'. One server, one DB.

flowchart LR
    C["Sarah's App"] --> APP["App Server"]
    APP --> SQL[("SQL Database — 730B tweets")]
    SQL -->|"LIKE '%world cup%'"| APP
    APP --> C

Four concrete failures emerge the moment traffic shows up:

💥 Full-text scans take hours

LIKE '%word%' with leading wildcard cannot use any index — the database must read every row. Scanning 730 billion rows on commodity hardware takes several hours per query, while our SLO is 200 milliseconds. This isn't a tuning problem; it's an algorithmic one.

💥 LIKE has no language understanding

SQL's LIKE is dumb string matching. It won't tokenize ("world-cup" vs "world cup"), won't stem ("running" vs "run"), won't drop stopwords, won't handle hashtags or mentions. A search for "running shoes" misses every tweet that says "runners need new shoes".

💥 No relevance ranking

SQL returns rows in whatever order the storage engine pleases. We need to rank by recency, popularity, and engagement — none of which a LIKE query can express. We'd have to fetch all matches and sort in app code, which compounds the scan problem.

💥 Every search saturates the DB

Even a single search would tie up the database for hours. At our 5,800 QPS target, the request queue would grow without bound within seconds. Adding read replicas doesn't help — each replica has the same scan problem.

Pass 2 — The mental model: the Inverted Index

The single most important insight in this design: don't search the data — search a precomputed index of the data. This is the same trick used at the back of every textbook: the index lets you go from "I want to know about photosynthesis" straight to "page 247", without flipping through every page in the book.

❌ The naive way (search the data)

Read every tweet, ask "does it contain my word?", keep the matches. Cost: O(N) where N = 730 billion. Hopeless.

✅ The inverted-index way

Precompute, for every word in the language, the list of tweet IDs that contain it. To search, look up each word — an O(1) hash-map lookup against a ~700K-word vocabulary — and intersect the resulting lists. The cost per word is effectively constant, independent of the 730 billion tweets.

Here's what the index looks like, conceptually:

flowchart LR
    subgraph IDX["Inverted Index"]
        direction TB
        W1["world → [t101, t205, t309, t411, ... 300K IDs]"]
        W2["cup → [t102, t205, t308, t411, ... 280K IDs]"]
        W3["final → [t103, t205, t411, t512, ... 150K IDs]"]
        W4["argentina → [t205, t411, ... 90K IDs]"]
    end
    Q["Query: 'world cup final'"] --> IDX
    IDX -->|"intersect 3 lists"| R["Result: [t205, t411, ... ~50K IDs]"]

For a multi-word query like "world cup final", we look up each word's list and intersect them — only tweets that appear in all three lists match. Modern intersection algorithms on sorted ID lists are extremely fast (linear in the size of the smallest list, with skip pointers shrinking that further).
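To make the idea concrete, here is a toy index and AND-query in Python. It is a sketch only: none of the compression, positions, or skip pointers of a real index, just the shape of the data structure and the smallest-list-first intersection.

from collections import defaultdict

def build_index(tweets: dict[int, str]) -> dict[str, list[int]]:
    """Inverted index: word -> sorted list of tweet_ids containing it."""
    postings = defaultdict(set)
    for tweet_id, text in tweets.items():
        for word in text.lower().split():
            postings[word].add(tweet_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def search_and(index: dict[str, list[int]], query: str) -> list[int]:
    """AND query: intersect posting lists, starting from the shortest."""
    lists = sorted((index.get(w, []) for w in query.lower().split()), key=len)
    if not lists or not lists[0]:
        return []
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)   # skip pointers would avoid materialising sets
    return sorted(result)

index = build_index({
    101: "world peace", 205: "world cup final tonight", 411: "what a cup final",
})
print(search_and(index, "cup final"))   # -> [205, 411]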

Two more pieces fall out of this insight:

  • Index-build runs async from search. A separate pipeline consumes the tweet firehose, tokenizes each new tweet, and updates the index. Searches never block on writes; writes never block on searches.
  • The index is too big for one box (21 TB), so we shard it across ~152 servers and let an aggregator fan-out queries.
So what: the inverted index turns "search 730 billion tweets" into "look up 3 words in a hash map, then intersect 3 sorted lists". That single data-structure change is what makes the system viable. Every box we add from here is in service of (a) building the index, (b) keeping it fresh, (c) sharding it, (d) ranking the matches.

Pass 3 — The production shape

Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it. The architecture splits into three planes: the Query Plane (handles searches), the Index Plane (holds and updates the inverted index), and the Source Plane (stores the raw tweets and the helper data structures).

flowchart TB
    CL["① Client — App or Web"]
    subgraph QPL["Query Plane"]
        LB["② Load Balancer"]
        API["③ Search API Server"]
        AGG["④ Aggregator"]
        CACHE["⑨ Cache — Redis"]
        RANK["⑩ Ranking Service"]
        SUG["⑪ Suggestion Service"]
    end
    subgraph IPL["Index Plane"]
        IDX["⑤ Index Server Cluster"]
        BUILD["⑥ Index Builder"]
        KAFKA["Kafka — tweet firehose"]
    end
    subgraph SPL["Source Plane"]
        TDB[("⑦ Tweet Source DB — Cassandra")]
        REV[("⑧ Reverse Index — tweet→shard map")]
    end
    CL --> LB
    LB --> API
    API --> CACHE
    CACHE -. miss .-> AGG
    API --> SUG
    AGG -->|"fan-out"| IDX
    IDX -->|"top-K IDs"| AGG
    AGG --> TDB
    TDB --> AGG
    AGG --> RANK
    RANK --> API
    API --> CL
    TDB -. tweet stream .-> KAFKA
    KAFKA --> BUILD
    BUILD --> IDX
    BUILD --> REV
    style CL fill:#e8743b,stroke:#e8743b,color:#fff
    style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style API fill:#171d27,stroke:#e8743b,color:#d4dae5
    style AGG fill:#171d27,stroke:#e8743b,color:#d4dae5
    style IDX fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style BUILD fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style TDB fill:#171d27,stroke:#38b265,color:#d4dae5
    style REV fill:#171d27,stroke:#38b265,color:#d4dae5
    style CACHE fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style RANK fill:#171d27,stroke:#d4a838,color:#d4dae5
    style SUG fill:#171d27,stroke:#d4a838,color:#d4dae5
    style KAFKA fill:#171d27,stroke:#d4a838,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.

Client

Sarah's mobile app, the desktop web client, or a third-party app calling our public API. The client sends a single GET /api/v1/search with the user's query and waits for a JSON list of ranked tweets. From the client's perspective, the entire 152-server index cluster is one URL.

Solves: nothing on its own — but every design choice flows backward from "what does the client experience?" The 200ms latency budget, the page-token pagination, and the rich result objects all exist because of what the client needs.

Load Balancer

The traffic cop. Distributes incoming HTTPS across the Search API server pods, terminates TLS so backend pods don't pay the crypto cost, and yanks unhealthy pods out of rotation via 5-second health checks. Public-facing — typically AWS ALB or nginx behind a global anycast IP.

Solves: single-server bottleneck and single-server failure. Without an LB, one app crash takes down search for everyone. With it, we lose 1/N of capacity for a few seconds until health checks fail the bad pod out.

Search API Server

Stateless service that handles the public /search endpoint. Per request it: parses the query, validates it, checks the cache, on cache miss delegates to the aggregator, calls the ranking service for final scoring, calls the suggestion service for "did you mean", and returns the assembled JSON. Easy to scale because it's stateless — add pods behind the LB.

Solves: isolating user-facing concerns (auth, rate-limits, response shaping) from the heavy machinery of the aggregator and index cluster. Without this tier, every change to the JSON response shape would touch the aggregator — bad coupling.

Aggregator

The heart of the query plane. Takes a tokenized query (e.g. ["world", "cup", "final"]) and fans it out to every index shard in parallel. Each shard returns its local top-K matching tweet_ids; the aggregator merges them, takes the global top-K, fetches the full tweet objects from the Tweet Source DB, and hands them to the ranking service. Think of it as the "scatter-gather" coordinator.

Solves: the fundamental sharding problem. The index is too big for one server (21 TB), so it's split across 152 servers — but a query for "world" might match tweets that live on any of those servers. Without the aggregator, the client would have to query all 152 itself. The aggregator hides the sharding from the rest of the system.

Index Server Cluster

A fleet of ~152 servers, each holding a partition of the inverted index in RAM. Each server indexes only the tweets it owns (we shard the tweets, not the words — see §7) so it sees the full vocabulary but only its slice of tweets. On a query, each server returns the top-K matching tweet_ids it knows about, scored by a local relevance function. Stateless from the network's view but stateful in memory.

Solves: turning a 21 TB index into something that fits in RAM and serves microsecond lookups. Without this cluster, every lookup would hit SSD — roughly 50-100 microseconds per random read, which adds several milliseconds per shard per query — and because the fan-out's wall-clock time is set by the slowest shard, that disk-induced tail pushes p99 well past the 200ms budget.

Index Builder

The async pipeline that keeps the index fresh. It consumes the tweet firehose from Kafka (every new tweet shows up here within milliseconds of being posted), tokenizes the text (split on spaces, drop punctuation, lowercase), removes stopwords ("the", "and", "is"), applies stemming ("running" → "run"), and pushes the resulting (word, tweet_id) pairs to the appropriate index server. It also writes an entry to the Reverse Index so we can find the home shard of any tweet later.

Solves: the index-freshness problem. We have 4,600 new tweets per second; the index must absorb them within seconds (our SLO) without blocking searches. Without the builder, the index would be a static snapshot — useless for "live" searches like World Cup commentary. The async pipeline lets us scale the builder fleet independently of the index-server fleet.
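A sketch of one builder worker's unit of work, under the assumption of a Kafka-style event dict per tweet. The index-server call (append_posting), the stopword list, the toy stemmer, and the shard count are illustrative stand-ins.

STOPWORDS = {"the", "a", "an", "and", "is", "of", "that", "what"}
STEM = {"running": "run"}   # stand-in for a real stemmer (e.g. Porter)

def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, apply stemming."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    return [STEM.get(w, w) for w in words if w and w not in STOPWORDS]

# tokenize("What a final! Argentina deserved that World Cup.")
# -> ['final', 'argentina', 'deserved', 'world', 'cup']

def shard_for(tweet_id: int, num_shards: int = 152) -> int:
    """Shard-by-tweet_id (strategy B in §7)."""
    return hash(tweet_id) % num_shards

def handle_tweet_event(event: dict, index_shards, reverse_index: dict) -> None:
    """One firehose event: tokenize, route postings to the owning shard, record ownership."""
    tweet_id, text = event["tweet_id"], event["text"]
    shard = shard_for(tweet_id)
    for word in set(tokenize(text)):
        index_shards[shard].append_posting(word, tweet_id)   # RPC to the owning index server
    reverse_index[tweet_id] = shard                          # the recovery map used in §8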

Tweet Source DB (Cassandra)

The source of truth — every tweet ever posted, keyed by tweet_id (with the timestamp embedded so IDs are roughly time-sortable). 500 TB across ~125 nodes, sharded by tweet_id range. The aggregator hits this only after intersection narrows the result set down to a few thousand candidates — never to scan, only to fetch the body for the top results.

Solves: durable storage of the actual tweet content. The index returns IDs; we still need to show Sarah the tweet text, the author, the like count. Without the source DB, the index is a list of pointers to nowhere.

Reverse Index (tweet → shard map)

A simple HashMap: tweet_id → which index shard owns this tweet's index entries. Maintained by the Index Builder as a side-effect of indexing. Held in memory by a small replicated service, persisted to disk for restart safety.

Solves: fault recovery — the unsung hero of this design. When an index shard dies and we want to rebuild it, we don't want to re-tokenize 730B tweets. With the reverse index, the recovering shard asks "which tweets did I own?" and pulls only those (a few billion) from Cassandra — cutting recovery from days to hours. Without it, a shard loss is a multi-day rebuild.

Cache (Redis)

An in-memory key-value store holding the final ranked result list for hot queries. Cache key = hash of the (query, sort, filters) tuple; value = the final JSON response with tweet IDs and scores. Eviction: LRU. Replicated for fault tolerance; sharded by query hash for capacity.

Solves: the viral-query problem. When the World Cup final ends, "world cup final" gets queried 10,000 times per second for an hour. Without cache, every one of those queries fans out to all 152 index shards — wasted work since the answer is the same. With cache, only the first query in each TTL window does the work; the remaining ~36 million queries that hour are served from RAM in microseconds.

Ranking Service

Scores the candidate tweets by a combination of: recency (newer = higher), popularity (likes + retweets), engagement (replies, click-through), term frequency (how many query words matched), and personalization (does the user follow the author?). Each index shard returns top-K by a fast local score; the ranking service does the final, more expensive ML-based re-ranking on the top few hundred candidates.

Solves: the difference between "results that match" (easy) and "results Sarah actually wants to see" (hard). Without ranking, search would return matches in arbitrary order — and a search for "world cup" would surface a 10-year-old tweet ahead of the goal that just happened.

Suggestion / Spell-correct Service

Powers "did you mean" and the typeahead dropdown as Sarah types. Built from a separate prefix index (trie or finite-state transducer) plus a Levenshtein-distance spell-corrector trained on common search misspellings. Returns suggestions in under 30ms so the typeahead feels instant.

Solves: the "user typed it wrong" problem. Without spell-correct, a search for "wrold cup" returns zero results and a confused user. With it, we silently suggest "world cup" and the search succeeds.

Concrete walkthrough — Sarah searches at 14:02

Two real flows, mapped to the numbered components above:

🔎 Search flow — "world cup final" at 14:02:15

  1. Sarah's app ① sends GET /search?query=world+cup+final. The Load Balancer ② routes it to a Search API Server ③.
  2. API server checks Cache ⑨ — first time today, it's a miss.
  3. API server tokenizes the query → ["world", "cup", "final"], sends it to the Aggregator ④.
  4. Aggregator fans the query out to all 152 Index Servers ⑤ in parallel. Each server intersects its local lists for the three words and returns its top 100 matching tweet_ids with local relevance scores.
  5. Aggregator receives 152 × 100 = 15,200 candidate IDs, merges them by score, takes the global top 200.
  6. Aggregator fetches the full tweet objects for those 200 IDs from the Tweet Source DB ⑦ (parallel reads since IDs scatter across shards).
  7. Tweets handed to the Ranking Service ⑩ which re-ranks using ML model (engagement, personalization, freshness). Top 20 returned.
  8. API server populates Cache ⑨ with the result, returns JSON to Sarah's app. Total elapsed: 142ms.

📝 Index-update flow — a new tweet posted at 14:02:30

  1. Raj posts "What a final! Argentina deserved that World Cup." via the tweet API. The tweet is saved to the Tweet Source DB ⑦ and assigned tweet_id = 1789234.
  2. Cassandra emits a CDC event onto Kafka — the tweet firehose.
  3. An Index Builder ⑥ worker consumes the event, tokenizes the text, drops stopwords, stems → ["final", "argentina", "deserved", "world", "cup"].
  4. For each token, the builder calls the appropriate Index Server ⑤ (the one that owns tweet_id 1789234) and appends the ID to the word's list.
  5. The builder writes 1789234 → shard-12 to the Reverse Index ⑧.
  6. Elapsed: ~8 seconds from post to searchable. By 14:02:38, Sarah's next search will see Raj's tweet.
So what: the architecture is built around three insights — (1) don't search the data, search a precomputed index; (2) shard the index across many machines and use a fan-out aggregator to hide that from the client; (3) build the index asynchronously from the tweet firehose so writes and reads never block each other. Every box in the diagram earns its place by removing one of the failure modes from Pass 1.
Step 7

Detailed Component Design — The Inverted Index

The inverted index is the soul of this system. Let's open it up: how it's structured, how big it gets, and the two competing ways to shard it.

Structure

At its simplest, the inverted index is a HashMap<String, List<Long>>: keys are vocabulary words, values are sorted lists of tweet IDs that contain that word. Think of the index at the back of a textbook — except instead of "page numbers", we store tweet IDs, and instead of one book, we're indexing 730 billion of them.

flowchart LR
    subgraph CONCEPT["Conceptual Inverted Index"]
        direction TB
        K1["world"] --> V1["[t101, t205, t309, t411, t1789234, ...]"]
        K2["cup"] --> V2["[t102, t205, t308, t411, t1789234, ...]"]
        K3["final"] --> V3["[t103, t205, t411, t1789234, ...]"]
        K4["argentina"] --> V4["[t205, t1789234, ...]"]
    end

Production indexes layer on more sophistication: each posting list also stores per-word frequencies (for TF-IDF scoring), positions (for phrase queries like "world cup" as opposed to "world" AND "cup"), and skip pointers (so intersection across two lists can leap forward in O(log N) instead of O(N)). Lucene/Elasticsearch is essentially a hardened, on-disk version of this same idea.

Memory math

Let's verify our 21 TB number from §3:

  • Vocabulary: ~700K unique words × 5 bytes = 3.5 MB. Negligible.
  • Postings: 730B tweets × 15 indexable words/tweet × 5 bytes/tweet_id = 54.75 TB raw. Compress sorted ID lists (delta encoding + variable-byte) by ~4× → ~14 TB.
  • Per-posting metadata (frequency byte, optional positions): adds ~50% → ~21 TB total.

At 144 GB RAM per server (a beefy but standard box): 21 TB / 144 GB ≈ 152 servers. Add 2× replication for fault tolerance → ~300 index servers in production.
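Where does the ~4× compression come from? Sorted tweet_ids sit close together, so storing the gaps between consecutive IDs makes most entries fit in one or two bytes. A simplified sketch of delta plus variable-byte encoding, the same family of tricks Lucene-style indexes use:

def varint(n: int) -> bytes:
    """Variable-byte encode: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(tweet_ids: list[int]) -> bytes:
    """Store deltas between consecutive sorted IDs instead of full fixed-width IDs."""
    out, prev = bytearray(), 0
    for tid in tweet_ids:
        out += varint(tid - prev)
        prev = tid
    return bytes(out)

postings = [1789000123, 1789000126, 1789000210, 1789001999]
blob = compress_postings(postings)
print(len(blob), "bytes vs", 8 * len(postings), "uncompressed")   # most deltas fit in 1-2 bytes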

Two sharding strategies — pick one

This is the most interesting trade-off in the design. We have to split 21 TB across 152 servers; the question is how.

❌ (A) Shard by word

Each server owns a slice of the vocabulary. "world" and everything starting with "wo-" lives on shard-7; "cup" lives on shard-3. To answer a query, the aggregator sends one word to each owner.

Pros: a query for one word touches one server. Minimal network fan-out.

Cons: hot words destroy the design. The shard owning "the" (still indexed for some queries) gets every query for English text. Trending words like "world cup" during a final concentrate all load on one server — that server melts while 151 others sit idle. Also, intersecting two words requires shipping one of the lists across the network — for "world" with 300K IDs that's a 1.5 MB transfer per query.

✅ (B) Shard by tweet_id ← winner

Each server owns a slice of the tweet space (e.g., shard-12 owns tweet_ids 1789000000 – 1789999999). It indexes only its own tweets, so it sees the entire vocabulary but only ~5 billion tweets. Every shard has the word "world" — but each holds only its slice of "world"-containing tweets.

Pros: load is uniformly distributed by construction (tweet IDs are roughly time-ordered and randomly hashed). No hot shards. Intersection happens locally on each shard with no network transfer of large lists. Adding a new shard just changes the hash range — no global rebalance.

Cons: every query fans out to every shard (152-way fan-out). That sounds expensive but each shard does its work in parallel, so wall-clock latency is set by the slowest shard, not the sum.

The aggregator's job in strategy (B) is exactly what makes it work: it gathers the per-shard top-K lists and merges them into a global top-K. With 100 results per shard and 152 shards, the merge covers 15,200 IDs — a few milliseconds of work.
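The merge itself is a few lines of code; a sketch assuming each shard returns (score, tweet_id) pairs:

import heapq

def merge_top_k(per_shard_results: list[list[tuple[float, int]]], k: int = 200) -> list[tuple[float, int]]:
    """Global top-K across shards: one pass over ~152 x 100 candidates."""
    all_candidates = (item for shard in per_shard_results for item in shard)
    return heapq.nlargest(k, all_candidates)   # compares score first, then tweet_id

shard_a = [(0.94, 1789234), (0.71, 1789001)]
shard_b = [(0.88, 1789102), (0.40, 1788001)]
print(merge_top_k([shard_a, shard_b], k=3))
# -> [(0.94, 1789234), (0.88, 1789102), (0.71, 1789001)]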

The "before and after" of the choice: with (A), a single trending word would push p99 to seconds. With (B), every query has the same shape — a 152-way parallel fan-out finishing in 50-80ms. Twitter, Elasticsearch, and Solr all use strategy (B) for exactly this reason.
Step 8

Fault Tolerance

Index servers hold their state in RAM. RAM is volatile. Servers crash. So what happens when shard-12 dies at 14:02 with 5 billion tweets' worth of index in memory?

Active-standby replication

Each shard is replicated to a standby server in a different rack. The Index Builder writes to both primary and standby on every update. If the primary fails, the standby takes over within seconds — a brief latency bump while the newly promoted standby's caches warm up, but no data loss.

But what if both die — say a power-rail failure takes down a whole rack? Now we have to rebuild the shard's index from scratch.

The naive recovery (don't do this)

"Re-tokenize all 730 billion tweets and re-route each to its owning shard." This works, but takes days. During those days, queries that should have hit shard-12 return incomplete results — potentially missing huge swaths of recent content.

The fast recovery — using the Reverse Index ⑧

Recall: the Reverse Index is a HashMap of tweet_id → owning shard, maintained by the Index Builder as a free side-effect. When shard-12 needs to recover:

sequenceDiagram
    participant S12 as Recovering Shard-12
    participant REV as Reverse Index
    participant TDB as Tweet Source DB
    participant BUILD as Index Builder
    S12->>REV: "Which tweets did I own?"
    REV-->>S12: List of ~5 billion tweet_ids
    S12->>TDB: Bulk-fetch those tweets
    TDB-->>S12: Tweet bodies in batches
    S12->>BUILD: "Re-index these for me"
    BUILD->>S12: Stream tokenized postings
    Note over S12: Index rebuilt in hours, not days

Now we fetch only the 5 billion tweets that actually belong on shard-12 instead of all 730 billion. Recovery drops from days to hours. The Reverse Index itself is small (~3.7 TB across the cluster, sharded by tweet_id) and easily replicated — we lose at most a small region of it on a single failure, recoverable from a daily snapshot.

The lesson: fault recovery is not an afterthought — it's an explicit design decision. Adding the Reverse Index costs us 3.7 TB and a small writer-side bookkeeping cost; in exchange, an entire rack failure becomes a few-hour blip instead of a multi-day outage. The design earned its complexity.
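In code, the recovery flow from the diagram is roughly the following sketch; reverse_index, fetch_tweets, and tokenize are illustrative stand-ins for the real services, not actual APIs.

def recover_shard(shard_id: int, reverse_index: dict[int, int],
                  fetch_tweets, tokenize) -> dict[str, list[int]]:
    """Rebuild one shard's posting lists from only the tweets it owned.

    reverse_index: tweet_id -> owning shard (the map the Index Builder maintains)
    fetch_tweets:  callable(list of tweet_ids) -> iterable of (tweet_id, text) from Cassandra
    tokenize:      the same tokenizer the Index Builder uses
    """
    owned = [tid for tid, shard in reverse_index.items() if shard == shard_id]
    postings: dict[str, list[int]] = {}
    for start in range(0, len(owned), 10_000):            # bulk fetches, never a full scan
        for tweet_id, text in fetch_tweets(owned[start:start + 10_000]):
            for word in set(tokenize(text)):
                postings.setdefault(word, []).append(tweet_id)
    for ids in postings.values():                         # posting lists stay sorted for fast intersection
        ids.sort()
    return postings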
Step 9

Cache

The single biggest performance lever for hot queries. Without a cache, every search of "world cup final" during the actual final does the full 152-way fan-out — even though the answer is identical for every querier in any given second.

📦 What we cache

The final ranked result — the top-20 tweet IDs and their scores, keyed by hash(query, sort, filters). Not the per-shard intermediate lists; those vary across shards and aren't independently reusable. The whole-result cache is a single Redis GET for any cache hit.

♻️ Eviction

LRU with a short TTL (e.g., 30 seconds). Hot queries stay hot; the TTL ensures the result reflects recent tweets — Sarah doesn't get a 5-minute-old top-20 during a fast-moving event. We can drop TTL to 5 seconds for extreme freshness on trending topics.

📐 Sizing

~5,800 QPS, ~30s TTL → at most ~175K hot entries cached. Each entry is small (~1 KB of IDs + scores) → ~175 MB total. Trivially fits on a single Redis instance, but we shard across 4-6 nodes for fault tolerance and to spread the QPS.

Cache hit math: if even 30% of queries hit cache during a major event, we cut index-cluster work by 30% — roughly 1,700 fewer 152-way fan-outs every second. During a viral query like "world cup final" the hit rate can climb past 95%, effectively absorbing the spike before it reaches the index.
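A sketch of the read-through path, assuming a redis-py client. The canonicalized key is what makes repeated identical queries collapse onto one cache entry.

import hashlib
import json
import redis   # assumes the redis-py client

r = redis.Redis()
TTL_SECONDS = 30   # short TTL keeps viral-query results fresh

def cache_key(query: str, sort: str, filters: dict) -> str:
    """Canonicalize so the same (query, sort, filters) always hashes to the same key."""
    canon = json.dumps({"q": query.lower().strip(), "sort": sort, "f": filters}, sort_keys=True)
    return "search:" + hashlib.sha256(canon.encode()).hexdigest()

def search_with_cache(query: str, sort: str, filters: dict, run_search) -> dict:
    key = cache_key(query, sort, filters)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                   # hit: microseconds, no fan-out
    result = run_search(query, sort, filters)       # miss: the full 152-way fan-out
    r.setex(key, TTL_SECONDS, json.dumps(result))
    return result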
Step 10

Load Balancing

Load balancers sit at two layers in the search path, each playing a slightly different role.

① Client → Search API

Public-facing LB (AWS ALB / nginx). Distributes incoming HTTPS across the Search API pods. Health-checks every 5s, evicts unhealthy pods. Terminates TLS so backend pods don't pay the crypto cost. Round-robin works fine here — every request is roughly the same shape from the API's perspective.

② API → Aggregator

Internal LB (envoy or a service-mesh sidecar). Routes API requests to aggregator pods. Switches to least-connections once the cluster is large enough that one slow aggregator (e.g., processing a viral query) shouldn't pile up new requests behind it.

Why no LB between aggregator and index servers?

The aggregator is the load balancer for the index cluster — it deliberately sends the query to every shard (since we shard by tweet_id, every shard might have matches). There's nothing to "balance" in the traditional sense; it's a deterministic fan-out. Within a shard's replica pair (primary + standby), the aggregator simply picks the healthier one.

The LB is itself a single point of failure — solved with an active-passive pair using virtual IP failover, or natively in cloud LBs (ALB is multi-AZ by default).
Step 11

Ranking

"Find the matches" is the easy half of search. "Show Sarah the matches she wants to see" is the hard half. Without ranking, a search for "world cup" might surface a 10-year-old tweet ahead of the goal scored 2 minutes ago.

Two-stage ranking

Production search engines almost universally use a two-stage approach for cost reasons:

Stage 1 — Per-shard fast scoring

Each index server scores its local matches with a cheap formula: score = recency_boost × tf_idf × log(1 + likes). Returns top-K (say, top 100) per shard. Fast, parallel, no cross-shard data needed.

Stage 2 — Global ML re-ranking

The Ranking Service ⑩ takes the top ~200 candidates from across all shards and runs them through an ML model (gradient-boosted tree or small neural net) that scores on richer features: full engagement signals, the searcher's social graph (do they follow the author?), language quality, freshness, has-media boost. Returns the final top 20.
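A sketch of the Stage-1 score as an index server might compute it. The half-life and the exact way the factors combine are illustrative knobs, not Twitter's production values; the richer signals listed below belong to Stage 2.

import math
import time

RECENCY_HALF_LIFE_S = 3 * 3600   # illustrative: score halves every 3 hours

def stage1_score(created_at: float, tf_idf: float, likes: int, now: float) -> float:
    """Cheap per-shard score: recency decay x term relevance x log popularity."""
    age_s = max(0.0, now - created_at)
    recency_boost = 0.5 ** (age_s / RECENCY_HALF_LIFE_S)   # exponential decay
    return recency_boost * tf_idf * math.log1p(likes)

# A 2-minute-old tweet with modest likes outranks a 2-day-old one with many likes:
now = time.time()
print(stage1_score(now - 120, 1.0, 500, now) > stage1_score(now - 2 * 86400, 1.0, 50000, now))  # True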

Ranking signals — what goes into the score

  • Recency: exponential decay over hours. A tweet from 2 minutes ago beats one from 2 days ago.
  • Popularity: log-scale of likes + retweets + replies.
  • Term-match quality: how many of the query terms matched, did they appear close together, did they appear in a hashtag.
  • Personalization: does Sarah follow the tweeter? Does she follow people who liked it?
  • Account quality: verified, account age, spam-filter score.
  • Engagement prediction: ML model's estimate of "will Sarah click/like this?".
Why two stages? Running the full ML model on every candidate from every shard would mean scoring ~15,000 tweets per query. Cheap local scoring shrinks that to ~200 before the expensive model runs. Same final quality, 75× less compute.
Step 12

Interview Q&A

Inverted index vs SQL LIKE — why is the index so much faster?
The index trades preprocessing time for query time. SQL LIKE '%word%' with a leading wildcard does a full table scan — O(N) over 730B rows = hours. The inverted index pre-pays the cost of "for every word, what tweets contain it?" once, at index-build time. Then every query is a hash-map lookup → O(1) per word, plus a sorted-list intersection that's roughly O(min list size) with skip pointers. The win comes from doing the work once, in batch, instead of repeating it for every query.
Shard by word vs shard by tweet_id — pros and cons of each?
Shard by tweet_id wins for production. Shard-by-word looks attractive because a single-word query touches only one server, but it dies on hot words — every query for "the", "world", or whatever is trending all hits one server while the rest sit idle. It also forces large lists across the network for intersection. Shard-by-tweet_id distributes load uniformly (tweet IDs are random by construction), keeps intersection local on each shard, and pays only the cost of a 152-way parallel fan-out — which finishes in 50-80ms because each shard works in parallel. The aggregator's merge of 152 small top-K lists is trivial.
How do you keep the index up to date with 4,600 new tweets per second?
An async pipeline driven by Kafka. Every new tweet published to the Tweet Source DB emits a CDC event onto a Kafka topic — the "tweet firehose". A pool of Index Builder workers consumes the topic, tokenizes each tweet, and pushes (word, tweet_id) pairs to the appropriate Index Server. Because the pipeline is async, search reads never block on writes — and we can scale the builder fleet independently (more workers = faster catch-up after a backlog). End-to-end lag is typically under 10 seconds; our SLO is under 30 seconds. If a builder falls behind, Kafka buffers without dropping events.
How do you handle a viral query — say 10K/sec for "world cup final"?
The cache absorbs it almost entirely. Every viral query is by definition repeated — the same exact (query, filters) tuple over and over. With a 30-second cache TTL, only the first query in each TTL window does the 152-way fan-out; the next ~300,000 are served from Redis in microseconds. Hit rate during a viral spike commonly exceeds 95%. We can also pre-emptively warm the cache for known trending terms (the trending-topics service feeds the cache). For the small fraction that does miss, the index cluster has enough headroom to absorb 5-10× normal QPS without falling over.
How do you rank results?
Two-stage ranking. Stage 1: each index server scores its local matches with a cheap formula combining recency, popularity, and term-frequency, returns top 100. Stage 2: the Ranking Service takes ~200 candidates merged from all shards and runs them through an ML model that scores on richer signals — engagement prediction, the searcher's social graph, account quality, freshness boost. Returns the final top 20. Splitting the work this way runs the expensive model 75× less often without losing ranking quality.
What's the latency budget breakdown for 200ms p99?
Roughly:
  • Network round-trip client → API: ~30 ms
  • Cache check (Redis GET): ~2 ms (hit returns here in < 35 ms total)
  • Tokenize + send to aggregator: ~5 ms
  • Aggregator → 152-way parallel fan-out to index servers, slowest shard wins: ~40 ms
  • Aggregator merge top-K: ~5 ms
  • Fetch full tweet bodies from Cassandra (parallel): ~20 ms
  • Ranking ML model on top 200: ~30 ms
  • JSON serialization + return network hop: ~30 ms
Total: ~165 ms — leaving 35 ms of slack for the long-tail shards and GC pauses. The aggregator is the single biggest budget item, which is why we obsess over the slowest-shard tail in the index cluster.
What if an entire index shard dies? How do you recover?
Two layers. First, every shard has an active-standby replica in a different rack — the standby takes over within seconds with no data loss. Second, if both die (e.g., rack-level failure), we use the Reverse Index ⑧ — a HashMap of tweet_id → owning shard maintained as a side-effect of every index update. The recovering shard asks the Reverse Index "which tweets did I own?", bulk-fetches only those from Cassandra (a few billion tweets, not all 730 billion), and re-tokenizes them through the Index Builder. That cuts recovery from days down to hours. Without the Reverse Index, a shard loss is a multi-day outage.
Why store the index in RAM rather than on SSD?
Latency math. The fan-out aggregator's wall-clock latency is set by the slowest shard. Disk SSDs add 50-100 microseconds per random read; with multiple lookups per shard per query, that adds 5-10 ms per shard, and the long-tail shards push p99 well past our 200ms budget. RAM lookups are ~100 nanoseconds — three orders of magnitude faster. The cost is just RAM (152 servers × $5K of RAM each = ~$750K), trivial compared to the engineering cost of dealing with disk-induced latency variance.