From "poll the server every 3 seconds" to 500 million open WebSockets, HBase-backed history, and Kafka-stitched chat servers — the architecture that earns every box
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Sarah is sitting on the couch in San Francisco. She types "are you free for dinner?" into the Messenger app and taps send. Two hundred milliseconds later, Raj's phone in Bangalore buzzes with the message — already showing the little "delivered" check next to it. He thumbs back "yes — 8pm?" and Sarah sees it appear at the bottom of her chat thread before she's even finished setting her phone down. That round-trip — text in, text out, the world over — is what we're designing.
Messenger is a real-time text messaging service. Two people (or a group) carry on a conversation, with messages delivered instantly when both are online and queued for later when one isn't. The chat history is the same on every device they own — phone, laptop, tablet — so Sarah can start a conversation on her commute and finish it on her work laptop. Behind that simple experience is a system that has to: hold open hundreds of millions of network connections at the same time, route each message to the right recipient (who may or may not be online), persist every word forever, and never deliver the same message twice.
Pin the contract down before drawing boxes. Asking these questions in an interview signals you're going to design for the actual workload, not a memorized solution.
Numbers shape architecture. Doing them out loud forces the conversation about sharding, connection pooling, and storage tiering.
Assume 500 million daily active users (DAU), each sending 40 messages per day. That's 20 billion messages/day, or roughly 230,000 messages/second on average — and 2-3× that at peak.
| Quantity | Estimate | Basis |
|---|---|---|
| Messages/day | ~20 billion | 500M × 40 |
| Message size | ~100 bytes | Text only |
| Write bandwidth | ~25 MB/s | 230K × 100B |
| Read bandwidth | ~25 MB/s | Each msg goes out once |
20B messages × 100 bytes = ~2 TB/day. Over 5 years: ~3.6 PB. This rules out a single relational DB instantly — we need a horizontally-scalable wide-column store and tiered storage (hot recent messages vs. cold archived ones).
The non-obvious capacity number. To deliver messages instantly, every user needs an open WebSocket (a persistent TCP connection) to a chat server. At peak that's 500M concurrent open connections. A single chat server typically holds 25K-50K connections before its socket buffers, file descriptor limits, and CPU give up. So we need at minimum 500M / 50K = ~10,000 chat servers. This number alone is what makes the architecture different from a normal REST app.
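The back-of-envelope numbers above are worth checking in a few lines. This is a sketch of the same arithmetic; the 50K-connections-per-server figure is the upper end of the range quoted above:

```python
# Capacity math from the requirements section.
DAU = 500_000_000                # daily active users
MSGS_PER_USER_PER_DAY = 40
MSG_SIZE_BYTES = 100
CONNS_PER_CHAT_SERVER = 50_000   # upper end of the 25K-50K range

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY              # 20 billion
msgs_per_sec = msgs_per_day // 86_400                   # ~230K/s average
ingress_mb_s = msgs_per_sec * MSG_SIZE_BYTES / 1e6      # ~23 MB/s
storage_5yr_pb = msgs_per_day * MSG_SIZE_BYTES * 365 * 5 / 1e15
chat_servers = DAU // CONNS_PER_CHAT_SERVER             # minimum server count

assert msgs_per_day == 20_000_000_000
assert 230_000 <= msgs_per_sec <= 232_000
assert abs(storage_5yr_pb - 3.65) < 0.01   # ~3.6 PB over 5 years
assert chat_servers == 10_000
```

The last assertion is the headline: even at the generous 50K-connections-per-box figure, the connection plane alone needs four digits of servers.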
| Metric | Value | Why it matters |
|---|---|---|
| Messages/sec (avg) | 230K/s | Drives Kafka throughput & HBase write sizing |
| Concurrent connections | 500M | Forces a dedicated chat-server tier separate from app servers |
| 5-yr storage | 3.6 PB | Mandates HBase/HDFS — single-box DBs don't fit at this scale |
| Bandwidth | 25 MB/s in & out | Modest — single LB tier handles it; hardware not a bottleneck, software is |
Three functions cover the workload — but the protocol underneath is unusual. Unlike a typical REST API where the client always asks first, chat is a stateful, push-driven protocol built on WebSockets (or long-polling as a fallback). The client opens a connection once and keeps it open; the server pushes messages whenever they arrive.
Conceptual API surface (over WebSocket frames or REST):

```
// Send a message — client → server frame
sendMessage(api_key, sender_id, receiver_id, text)
  → returns message_id once persisted
// Server then pushes the same message to receiver_id's open WebSocket
// (or queues a push notification if receiver is offline)

// Fetch chat history — REST, paginated by timestamp cursor
getConversation(api_key, user1_id, user2_id, before_timestamp, count)
  → returns up to `count` messages older than `before_timestamp`
// Used on app start to load the last 50 messages of each open thread

// Get presence for a list of users — bulk lookup
getStatus(api_key, user_ids[])
  → returns [{user_id, status: ONLINE|OFFLINE, last_seen}]
// Used when opening the friends list to show green dots
```
The alternative is the client hitting GET /messages every few seconds. At 500M users polling every 3s, that is 166M requests/sec just to ask "anything new?" — and 99% of those return empty. A WebSocket is a single TCP connection that stays open: zero overhead when idle, instant push when a message lands. The cost shifts from "request rate" to "connection count" — which we then engineer for with a dedicated chat-server tier.

Three observations decide the database choice: (1) we have massive small writes — 230K tiny rows per second, (2) we read by conversation + time range — "give me the last 50 messages between Sarah and Raj", and (3) data volume is petabyte-scale. This points squarely at a wide-column NoSQL store like HBase (which Facebook actually uses for Messenger).
The MESSAGE table is the giant — billions of rows added per day. In HBase the row key is engineered as (conversation_id, reverse_timestamp) so all messages for one conversation live on one region server, sorted newest-first. A "give me the last 50 messages" query becomes a single short range scan — O(50), not O(billions).
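The reverse-timestamp trick is easy to demonstrate. A minimal sketch, assuming a big-endian fixed-width encoding (HBase row keys are opaque byte strings compared lexicographically; the exact byte layout here is illustrative):

```python
import struct

MAX_LONG = 2**63 - 1  # Java Long.MAX_VALUE, the convention HBase apps use

def message_row_key(conversation_id: int, timestamp_ms: int) -> bytes:
    """Newer messages get smaller reverse timestamps, so a forward scan
    starting at (conversation_id, 0) returns messages newest-first."""
    reverse_ts = MAX_LONG - timestamp_ms
    return struct.pack(">QQ", conversation_id, reverse_ts)  # big-endian keeps numeric order

older = message_row_key(12345, 1_736_400_000_000)
newer = message_row_key(12345, 1_736_400_001_000)
assert newer < older  # lexicographic byte order == newest-first within a conversation
```

Because all keys share the `conversation_id` prefix, "last 50 messages" is a scan of 50 adjacent rows on one region server.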
A MySQL server tops out around 5K-10K writes/sec on commodity hardware. We need 230K writes/sec sustained. Sharded MySQL could hit it, but we'd be reinventing what HBase ships with — region splits, replication, tombstoning. Operational nightmare.
Document-style fine for a single message, but range scans across millions of messages in one conversation are slow and memory-hungry. HBase's columnar layout + row-key sort order is purpose-built for "scan the last N rows for key prefix X".
(a) Write-optimized log-structured storage absorbs 230K writes/sec linearly. (b) Row-key range scans for chat history are O(N) on the result, not the dataset. (c) HDFS underneath gives us 3-way replication + petabyte-scale storage for free. (d) Battle-tested at Facebook for exactly this workload.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The 500M-concurrent-connection number from §3 drives every decision.
The simplest thing imaginable: clients poll the server. Every 3 seconds, Sarah's phone sends GET /messages?since=last_id. The server checks the DB and either returns new messages or returns empty. Same for Raj.
Three concrete failures show up the moment you do the math:
500M users polling every 3 seconds is 166 million requests/sec. 99% return empty because most people aren't getting messages most of the time. We'd be running a planetary-scale DDoS on ourselves to hear the answer "nothing new" 99% of the time.
Latency = polling interval. A 3-second poll means a message takes on average 1.5 seconds just to appear, and worst case 3. Sarah types "are you free?" and Raj sees it 3 seconds later — feels broken. Drop the interval to 500ms and the request rate becomes 1 billion/sec.
Even if nobody is actually chatting, server load is proportional to user count. Idle users cost the same as active users. We're paying full price 24/7 to mostly say "still nothing".
The single most important shift in this design is to flip the request/response model. Chat is fundamentally async push from server to client, not request/response. Instead of the client asking "anything new?" 1000 times, the server holds a connection open and shoves messages down it the instant they arrive.
Two ways to do this — both work, both are used in production:
HTTP long polling. Client sends GET /messages; server holds the request open for up to 30 seconds. As soon as a message arrives for that user, server responds. If nothing arrives in 30s, server returns 204 and client immediately reconnects. Works through any HTTP-aware firewall — zero protocol changes — but each "long poll" still ties up one server-side request handler per user.
WebSockets. A real persistent TCP connection upgraded from HTTP. Bidirectional — server can push, client can send, both at any time, with nearly zero per-message overhead (a few bytes of frame header). One connection per user, lives for hours. The modern default — what Messenger, WhatsApp, and Slack all use.
Both options demand a brand-new tier of servers — chat servers — whose entire job is to hold open millions of connections at once. This is a fundamentally different shape from a stateless REST app server (which handles one request, returns, frees its memory). A chat server is more like a switchboard operator with 50,000 phone lines plugged in simultaneously.
Now the full picture. Every node in the diagram is numbered; use those numbers to find the matching card below. Each card answers what the component is, why it's here, and crucially what would break without it.
The Messenger app on Sarah's iPhone, Raj's Android, or anyone's browser. On launch it authenticates, opens a single WebSocket to the load balancer, and keeps that connection alive with periodic ping frames. All sends and receives flow over this one socket. The client also caches recent messages locally for instant rendering when the user opens a thread.
Solves: the user's window into the system. The WebSocket is what makes "instant" possible — the client never has to ask "anything new?", it just listens. Without a persistent client connection, we'd be back in polling hell.
The traffic cop. Receives the initial HTTPS upgrade request and routes it to a chat server. Crucially it uses sticky sessions — once Sarah's WebSocket is bound to chat server #137, every subsequent frame from her client must reach the same server, because that server holds her open socket. Round-robin won't work; we need a hash on user_id or a cookie-based affinity.
Solves: spreading 500M connections across 10K chat servers without any one being overloaded, while preserving the connection-server binding. Without stickiness, frame #2 from Sarah might land on server #138 — which has never heard of her socket — and the connection breaks.
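Hash-based affinity can be sketched in a few lines. This is illustrative, not the LB's real algorithm; `md5` is used only because it's stable across processes, unlike Python's salted built-in `hash()`:

```python
import hashlib

NUM_CHAT_SERVERS = 10_000

def chat_server_for(user_id: str) -> int:
    """Map a user to one chat server, deterministically, so every frame
    from the same user reaches the same box."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CHAT_SERVERS

# Sticky: the same user always routes to the same server.
assert chat_server_for("sarah") == chat_server_for("sarah")
assert 0 <= chat_server_for("raj") < NUM_CHAT_SERVERS
```

Cookie-based affinity achieves the same binding without a hash: the LB records the chosen server in a cookie on the upgrade response and honors it on reconnect.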
The heart of the connection plane. Each chat server is a long-running process that holds ~50,000 open WebSockets in memory. When a client connects, the server registers the user in the routing table ④. When a message arrives from Sarah, the server looks up Raj's chat server in the routing table — if it's the same box, it's a direct in-memory push; if it's a different box (almost always), it publishes to Kafka ⑤. Stateless from the data's perspective (everything persists to HBase), but stateful from the connection's perspective.
Solves: the "where does the push come from?" problem. A normal REST app server can't do push because it has no persistent link to the client. Chat servers exist exclusively to hold those links. Without them, every message would need to wait for the recipient to ask for it.
A simple in-memory map: user_id → chat_server_id. When Sarah's client connects to chat server #137, the server writes sarah_id: 137 to Redis. When someone wants to push a message to Sarah, they read this entry and know which server to publish for. Updates on connect/disconnect/reconnect. Replicated for fault tolerance — losing it temporarily means we briefly can't deliver pushes (we fall back to push notifications), but doesn't lose messages because they're already in HBase.
Solves: the cross-server addressing problem. With 10,000 chat servers, "which server holds Raj's connection right now?" is a question we ask millions of times per second. A Redis lookup answers it in under 1ms.
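The routing table's contract is just three operations on a key-value map. A sketch using an in-memory dict; with a real Redis these would be SET/GET/DEL on hypothetical `route:{user_id}` keys:

```python
class RoutingTable:
    """user_id -> chat_server_id, updated on connect/disconnect/reconnect.
    In production this is a replicated Redis; a dict shows the contract."""

    def __init__(self) -> None:
        self._routes: dict[str, int] = {}

    def register(self, user_id: str, server_id: int) -> None:
        self._routes[user_id] = server_id      # on WebSocket connect

    def deregister(self, user_id: str) -> None:
        self._routes.pop(user_id, None)        # on disconnect

    def lookup(self, user_id: str):
        return self._routes.get(user_id)       # None => no live connection

table = RoutingTable()
table.register("sarah", 137)
assert table.lookup("sarah") == 137
table.deregister("sarah")
assert table.lookup("sarah") is None   # null => fall back to push notification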
The async bus that stitches chat servers together. When Sarah on chat server X sends a message to Raj on chat server Y, server X doesn't call server Y directly — it publishes the message to a Kafka topic partitioned by recipient_id. Server Y is a Kafka consumer for the partitions that include Raj. This decouples senders from receivers and gives us at-least-once delivery semantics for free, even if a chat server crashes mid-delivery.
Solves: reliable delivery across server boundaries. Without Kafka, server X would have to do an RPC to server Y; if server Y is down or slow, server X blocks; if the message lands but Y crashes before pushing, the message is lost. Kafka persists every message and lets the consumer retry safely.
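The decoupling can be shown with a toy bus. This is a sketch, not Kafka's API: partitions are deques, partition choice is a stand-in hash, and the point is only that publishers never touch consumers directly while per-partition order is preserved:

```python
from collections import defaultdict, deque

class MiniBus:
    """Toy stand-in for Kafka: messages partitioned by recipient_id,
    buffered in order, pulled by the consumer at its own pace."""

    def __init__(self, num_partitions: int = 16) -> None:
        self.num_partitions = num_partitions
        self.partitions: dict[int, deque] = defaultdict(deque)

    def _partition(self, recipient_id: str) -> int:
        return sum(recipient_id.encode()) % self.num_partitions

    def publish(self, recipient_id: str, message: dict) -> None:
        self.partitions[self._partition(recipient_id)].append(message)

    def consume(self, recipient_id: str):
        part = self.partitions[self._partition(recipient_id)]
        return part.popleft() if part else None

bus = MiniBus()
bus.publish("raj", {"from": "sarah", "text": "are you free for dinner?"})
bus.publish("raj", {"from": "sarah", "text": "8pm?"})
# Server Y (Raj's chat server) consumes in publish order, whenever it's ready.
assert bus.consume("raj")["text"] == "are you free for dinner?"
assert bus.consume("raj")["text"] == "8pm?"
```

If server Y is down, messages simply sit in the partition until it comes back, which is exactly the retry-safety the card above describes.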
The durable source of truth. Every message gets written here before any "delivered" ack is sent to the sender. Row key is (conversation_id, reverse_timestamp) so a thread's history is contiguous on one region server and sorted newest-first. HDFS underneath replicates each block 3× across racks. Petabyte-scale by design, write-optimized to absorb 230K writes/sec.
Solves: persistent chat history. If we only kept messages in chat-server memory, a server crash would lose everything in flight. HBase is the safety net — once a write succeeds, the message survives any combination of server failures.
Tracks who is online and broadcasts changes to interested friends. When Sarah's chat server registers her connection, it notifies the Presence Service. When she disconnects (graceful logout, or 30s of socket silence), the service marks her offline. Friends viewing her profile or thread get the green/grey dot updated via a push down their own WebSocket. Heavily debounced — see §9 for why we don't broadcast every micro-state-change.
Solves: the "is Raj online right now?" question that the UI asks constantly. Without a centralized presence service, every chat server would have to ask every other chat server "do you have user X?" every time the UI needed a status — quadratic chatter.
For when the recipient is offline. If Raj's app is closed and he has no open WebSocket, the chat server discovers (via routing table miss) that there's no active connection. It hands the message to the Push Notification Service, which sends it to APNS (Apple) or FCM (Google) — the OS-level push systems that wake up Raj's phone and pop a notification. When Raj taps it, the app opens, reconnects its WebSocket, and pulls the queued messages from HBase.
Solves: the offline-delivery problem. Without push notifications, an offline user only sees the message when they next open the app — which might be hours later. With push, the message reaches them in seconds even if their app was killed.
Manages the metadata for group conversations: group_id → [member_ids]. When Sarah sends a message to "Dinner Squad" (a group of 8), the chat server asks the Group Chat Service for the member list, then publishes one Kafka event per recipient — fanout happens at the group service, not at the sender. Membership changes (add, remove, leave) flow through this service so all members see consistent group state.
Solves: the fanout problem. Without a dedicated group service, each sender would need to know every group member — leaking the data model into every client. Centralizing fanout also lets us cap group size and enforce permissions in one place.
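The "one durable write, N delivery events" shape is worth making concrete. A sketch with plain lists standing in for HBase and Kafka; `fan_out_group_message` is an illustrative name, not a real API:

```python
def fan_out_group_message(group_members: list[str], sender: str,
                          message: dict, store: list, bus: list) -> None:
    """One write to the message store; one delivery event per other member."""
    store.append(message)                                 # single HBase write
    for member in group_members:
        if member != sender:
            bus.append({"recipient": member, **message})  # N-1 Kafka publishes

store, bus = [], []
members = ["sarah", "raj", "alice", "bob"]
fan_out_group_message(members, "sarah",
                      {"group": "dinner-squad", "text": "8pm works?"}, store, bus)
assert len(store) == 1    # the message body is persisted exactly once
assert len(bus) == 3      # a delivery event for everyone except the sender
```

This is also where a group-size cap bites: the loop is O(members) per message, which is why broadcast channels use subscription instead of fanout.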
An in-memory store holding the most recent ~15 messages of every currently-active conversation. When the chat server needs to render Sarah's open thread on her laptop reconnecting, it pulls from Redis (sub-millisecond) instead of HBase (10-30ms). On miss, it loads from HBase and populates the cache. Eviction: LRU; threads no one has touched in days roll out automatically.
Solves: the "instant scroll" feel when opening a thread. Without this cache, every thread open would hit HBase — fine functionally but adds 30ms to every UI interaction. With it, thread switches feel weightless.
Tracks the per-message delivered and read states that drive the little check-marks. When Raj's client receives a message, it sends a "delivered" ack down its WebSocket; when he scrolls the message into view, the client sends a "read" ack. Both update the message row in HBase and push a state-change frame to the original sender so Sarah's UI updates her checkmarks.
Solves: the user-trust signal. The double-blue-tick is what tells Sarah her message landed. Without it, she'd have no way to know whether Raj saw it or whether the network ate it.
Two flows, mapped to the numbered components above. First the happy path (Raj online), then the offline path.
In the happy path, Sarah's client sends a SEND frame over its WebSocket: "to=raj_id, text=are you free for dinner?". Her chat server persists the message, then looks up raj_id in the routing table → server Y. Match! Server Y has Raj's open socket and pushes the message straight down it. In the offline path, the lookup returns raj_id → null (no active connection), so the message is handed to the Push Notification Service instead.

Now zoom into what happens between the moment Sarah taps "send" and the moment Raj's UI ticks blue. Several invariants must hold simultaneously: messages must persist before any ack, must arrive in send-order within a conversation, and must never be delivered twice even if a server retries.
Two messages from Sarah to Raj must arrive in send order. Guaranteed by partitioning Kafka on conversation_id (or sender_id) — all messages for one conversation flow through one Kafka partition, which Kafka delivers in order. Within HBase, the row-key suffix is a strictly-increasing timestamp so reads also return in order.
Network burps mean the chat server may retry a Kafka publish or HBase write. Each message carries a client-assigned idempotency key (UUID generated by the client on send). HBase uses it as part of the row key — a duplicate write is a no-op. Kafka consumers also dedupe on this key before pushing to the recipient.
Sarah only sees "sent" once HBase has confirmed the write. If the chat server crashes between Kafka publish and HBase ack, the message is still in Kafka and will be re-consumed and re-written. If it crashes before either, Sarah's client times out and the user retries — same idempotency key, no duplicate.
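The retry-safety argument hinges on the write being keyed by the client's idempotency UUID. A minimal sketch with a dict standing in for HBase; `persist_message` is an illustrative name:

```python
import uuid

def persist_message(store: dict, conv_id: int, idem_key: str, text: str) -> bool:
    """Write keyed by (conv_id, idem_key); a retried write is a silent no-op.
    Returns True only when this call actually inserted the row."""
    row_key = (conv_id, idem_key)
    if row_key in store:
        return False          # duplicate: already persisted, nothing to do
    store[row_key] = text
    return True

store = {}
key = str(uuid.uuid4())       # assigned by the client when Sarah taps send
assert persist_message(store, 12345, key, "are you free?") is True
# Network burp: the chat server (or the user) retries the exact same write.
assert persist_message(store, 12345, key, "are you free?") is False
assert len(store) == 1        # no duplicate row, so no duplicate delivery
```

The same key check runs on the consumer side before pushing to Raj, which is what turns Kafka's at-least-once into effectively exactly-once for the recipient.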
Where the rubber meets the road for the storage plane. We must absorb 230K small writes/sec and serve "show me the last 50 messages of this thread" reads in under 50ms. Both at petabyte scale.
HBase is log-structured (LSM trees). Writes go to an in-memory MemStore and a write-ahead log (WAL) on HDFS — both append-only, no random disk seeks. A single region server absorbs tens of thousands of small writes/sec. Sharding across hundreds of region servers gives us the 230K/sec we need with linear scale-out.
Row key = (conversation_id, MAX_LONG - timestamp). The MAX_LONG - timestamp trick reverses sort order so the newest messages come first. Reading "last 50 messages of conv 12345" is a single short scan starting at row (12345, 0) — touches one region, returns 50 cells. O(50), not O(billions).
HDFS underneath HBase is built for multi-petabyte data with 3× replication. Adding capacity = adding nodes. Old, cold data automatically migrates to cheaper storage tiers via HDFS storage policies — recent messages on SSD, year-old messages on HDD or even S3 Glacier.
The client never asks for "page 5" — it asks "give me 50 messages older than timestamp T". Server returns 50 messages and the timestamp of the oldest one. Client uses that as the cursor for the next request. This is robust to new messages arriving mid-pagination (no shifting page numbers) and trivially cacheable.
```
// Read API
GET /conversations/{conv_id}/messages?before_ts=1736400000&limit=50
→ 200 OK
{
  "messages": [ ...50 msgs... ],
  "next_cursor": "1736399650"   // ts of oldest msg returned
}
```
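The client-side cursor loop looks like this. A sketch in which `fetch_page` stands in for the HTTP call, and the history list (newest-first) stands in for the server's HBase scan:

```python
def fetch_page(messages: list[dict], before_ts: int, limit: int = 50):
    """Stand-in for the read API: `messages` is sorted newest-first.
    Returns (page, next_cursor) where next_cursor is the oldest ts returned."""
    page = [m for m in messages if m["ts"] < before_ts][:limit]
    next_cursor = page[-1]["ts"] if page else None
    return page, next_cursor

history = [{"ts": t, "text": f"msg{t}"} for t in range(100, 0, -1)]

# The client scrolls back through the whole thread, 50 messages at a time.
cursor, total = 101, 0
while cursor is not None:
    page, cursor = fetch_page(history, before_ts=cursor)
    total += len(page)
assert total == 100   # every message seen exactly once, no shifting pages
```

Because the cursor is a timestamp rather than a page number, messages arriving mid-scroll cannot shift the window: everything older than the cursor is immutable.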
The 15 most recent messages of every active conversation live in Redis ⑩, keyed by conv_id. When Sarah opens her thread with Raj, the chat server hits Redis first; cache hit returns in under 1ms, miss falls back to HBase and warms the cache. Every new message also pushes onto the Redis list (and evicts the oldest if length > 15) so the cache stays current automatically.
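In Redis terms the pattern is an LPUSH of the new message followed by an LTRIM to 15. A sketch using `collections.deque(maxlen=15)`, which emulates that command pair in memory:

```python
from collections import deque

RECENT_LIMIT = 15
cache: dict[int, deque] = {}   # conv_id -> newest-first messages, capped at 15

def on_new_message(conv_id: int, msg: str) -> None:
    # Redis equivalent (illustrative): LPUSH conv:{id} msg ; LTRIM conv:{id} 0 14
    cache.setdefault(conv_id, deque(maxlen=RECENT_LIMIT)).appendleft(msg)

for i in range(20):
    on_new_message(12345, f"msg-{i}")

recent = list(cache[12345])
assert len(recent) == RECENT_LIMIT   # the 5 oldest were evicted automatically
assert recent[0] == "msg-19"         # newest first, ready to render
```

The cache is write-through from the message path, so a thread that is actively chatting never takes an HBase read at all.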
Every user has hundreds of friends; every friend's UI wants to show a green/grey dot. The naïve approach — broadcast every state transition to every friend — would melt at 500M users. Three techniques tame it.
When Sarah opens Messenger, the client sends one bulk request: getStatus([friend_ids]). Server returns the current status of all her friends in one round trip. No subscription, no per-friend chatter — a single read against the Presence Service's in-memory map.
Don't broadcast every micro-state-change. Push to friends only when the user transitions OFFLINE → ONLINE or ONLINE → OFFLINE. Heartbeat fluctuations don't count. Use a 5-second debounce on offline-detection so a quick subway-tunnel disconnect doesn't trigger a "left chat" broadcast.
When Sarah opens her thread with Raj specifically, the client refreshes Raj's status with a targeted call. This catches the case where Raj came online while Sarah's app was backgrounded and the broadcast didn't reach her.
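The debounce logic above can be sketched with an injected clock so it's testable; real code would run `tick` off heartbeat timeouts rather than an explicit call:

```python
class Presence:
    """Broadcast only real OFFLINE<->ONLINE transitions. A disconnect counts
    as offline only if no reconnect arrives within the debounce window."""
    DEBOUNCE_S = 5.0

    def __init__(self) -> None:
        self.online = False
        self.disconnected_at = None
        self.broadcasts = []          # what friends' UIs would receive

    def on_connect(self, now: float) -> None:
        self.disconnected_at = None
        if not self.online:
            self.online = True
            self.broadcasts.append(("ONLINE", now))

    def on_disconnect(self, now: float) -> None:
        self.disconnected_at = now    # don't broadcast yet: maybe a blip

    def tick(self, now: float) -> None:
        if (self.online and self.disconnected_at is not None
                and now - self.disconnected_at >= self.DEBOUNCE_S):
            self.online = False
            self.broadcasts.append(("OFFLINE", now))

p = Presence()
p.on_connect(0.0)
p.on_disconnect(10.0)   # subway tunnel...
p.on_connect(12.0)      # ...back within 5s: no OFFLINE broadcast fired
p.tick(20.0)
assert p.broadcasts == [("ONLINE", 0.0)]
```

One short tunnel ride therefore costs zero broadcasts to Sarah's hundreds of friends, which is the whole point of the debounce.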
3.6 PB doesn't fit on one box, and even if it did we couldn't survive its failure. Sharding strategy decides everything downstream — query latency, hot-spot risk, rebalancing cost.
Shard by message_id. Distributes individual messages randomly across shards. Each write hits one shard — perfect write distribution. But reading a chat history "last 50 messages between Sarah and Raj" now requires scanning every shard and merging — fan-out reads of O(num_shards). With 100 shards, every chat-open is 100 simultaneous queries. Unworkable.
Shard by user_id. All of one user's messages live on the same shard. Chat-history reads touch exactly one shard — O(1) shard count, O(50) result rows. This is the right answer for a read-by-conversation workload. Trade-off: a hyperactive user (a celebrity, a bot) creates a hot shard.
What if a celebrity's account gets 100K messages/min and burns down their shard? Two mitigations:
Use consistent hashing (not modulo) so adding a new shard only relocates 1/N of users instead of all of them. Spreads the "split the hot shard" operation over hours instead of days, with the cluster usable throughout.
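Consistent hashing in miniature, to make the "only 1/N of users move" claim concrete. A sketch with a small number of virtual nodes per shard; production rings tune `vnodes` much higher for smoother balance:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Stable 64-bit hash (md5 chosen only for cross-process determinism)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, shards: list[str], vnodes: int = 100) -> None:
        # Each shard owns many points on the ring via virtual nodes.
        self.points = sorted((_h(f"{s}#{v}"), s)
                             for s in shards for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def shard_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash (wrapping at the end).
        i = bisect.bisect(self.keys, _h(user_id)) % len(self.points)
        return self.points[i][1]

users = [f"user{i}" for i in range(10_000)]
before = Ring(["s1", "s2", "s3", "s4"])
after = Ring(["s1", "s2", "s3", "s4", "s5"])   # add one shard
moved = sum(before.shard_for(u) != after.shard_for(u) for u in users)
# Roughly 1/5 of users relocate; modulo hashing would move ~4/5 of them.
assert moved < len(users) * 0.35
```

That bounded relocation is what lets the hot-shard split proceed while the cluster stays fully usable.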
For known-hot users, replicate their data to multiple read replicas. Reads round-robin across replicas; writes still go to the primary. Buys 3-5× headroom on read traffic for power users without re-sharding.
HDFS replicates each block 3× across racks by default — survives single-disk and single-rack failures with zero data loss. HBase region servers are stateless against the underlying HDFS, so a region-server crash means another server picks up the regions in seconds. For multi-region disaster recovery, HBase replication streams writes to a standby cluster in another data center asynchronously.
Things will break. The hard question is which kind of break causes data loss versus a brief interruption. Walk through each failure mode.
When a chat server dies, its ~50K clients see their WebSocket drop. They auto-reconnect (with exponential backoff) and the LB routes them to a healthy chat server. The new server registers them in the routing table ④, and any messages that arrived during the gap are pulled from HBase on the first read. Connection state is replicated to a standby chat server in real time so the routing-table updates have minimal lag.
Data loss: zero. Messages already persisted to HBase are intact. In-flight messages still in Kafka get re-consumed. Messages that hadn't yet reached the chat server are retried by the client (idempotent).
HBase auto-detects via ZooKeeper, reassigns the regions to a healthy server, replays the WAL from HDFS. Recovery takes seconds to a minute. During the gap, writes to those regions queue in the chat-server WAL or Kafka and replay once the region is back. HDFS 3× replication ensures no block is lost.
Kafka topics are replicated across brokers (replication factor 3). A broker loss shifts leadership of its partitions to a replica in milliseconds. Producers retry transparently. At-least-once delivery is preserved — the dedup key catches the rare double-delivery that follows.
Redis is replicated and persisted, but in the worst case if we lose it entirely, we don't lose messages — we just lose the ability to push them in real-time. Recipients fall back to push notifications via APNS/FCM. The table rebuilds itself within seconds as clients reconnect and re-register. Brief degraded mode, no data loss.
Conceptually a group is just "1-on-1 chat with N people". Implementation-wise, the Group Chat Service ⑨ holds a GroupChat object: {group_id, name, members: [user_ids], created_at}. When Sarah sends a message to "Dinner Squad" (8 members):
One write to HBase (the message lives once in the group conversation's row family), N publishes to Kafka (one per recipient). Per-member delivered/read state is tracked separately in the Read Receipts service so Sarah can see "Raj read it, Alice didn't".
Cap on group size: typically 250-500 members for performance. Beyond that, fanout becomes a meaningful cost — broadcast-style channels (Slack, Discord) use a different architecture where members "subscribe" to a channel rather than the sender fanning out.
The Push Notification Service ⑧ is a Kafka consumer for "undeliverable" messages — the ones whose recipient routing-table lookup returned null. For each, it looks up the recipient's registered device tokens, builds a notification payload, and hands it to APNS (Apple) or FCM (Google).
Push is asynchronous and lossy by design — a missed push isn't catastrophic because the message still pulls down from HBase when the user opens the app. But the latency is meaningful: a 5-10s wake-up time is normal, vs. 200ms for a live WebSocket.
Every message moves through a small state machine driven by acks from each side:
The UI checkmarks track this state directly: clock icon = SENDING, single grey check = SENT, double grey check = DELIVERED, double blue check = READ, red exclamation = FAILED.
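The state machine is small enough to write out as a transition table. A sketch using the state names from the text; unknown or stale events are treated as no-ops, which is what makes duplicate acks harmless:

```python
TRANSITIONS = {
    "SENDING":   {"persisted": "SENT", "timeout": "FAILED"},
    "SENT":      {"delivered_ack": "DELIVERED"},
    "DELIVERED": {"read_ack": "READ"},
}

def advance(state: str, event: str) -> str:
    """Apply an ack/event; anything not in the table is a no-op."""
    return TRANSITIONS.get(state, {}).get(event, state)

state = "SENDING"
for event in ["persisted", "delivered_ack", "read_ack"]:
    state = advance(state, event)
assert state == "READ"                               # double blue check
assert advance("READ", "delivered_ack") == "READ"    # stale/duplicate ack ignored
assert advance("SENDING", "timeout") == "FAILED"     # red exclamation
```

Because the table only moves forward, a delayed "delivered" ack arriving after "read" can never downgrade Sarah's checkmarks.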
Ordering: Kafka is partitioned by conversation_id, so all messages for one conversation flow through one partition — Kafka guarantees in-order delivery within a partition. The chat server assigns a strictly-increasing timestamp before publishing, and HBase row keys use that timestamp so reads also return in order. Cross-conversation ordering isn't guaranteed (and doesn't need to be — they're independent threads).

Exactly-once delivery: every SEND frame from the client carries a UUID generated when the user first taps send. The chat server uses this UUID as part of the HBase row key — a duplicate write with the same key is a silent no-op. Kafka consumers also dedupe on this key in a small Redis set before pushing to recipients. So even if the chat server retries the publish or HBase write 5 times, the recipient sees the message exactly once.