From "poll the server every 3 seconds" to 500 million open WebSockets, HBase-backed history, and Kafka-stitched chat servers — the architecture that earns every box
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Sarah is sitting on the couch in San Francisco. She types "are you free for dinner?" into the Messenger app and taps send. Two hundred milliseconds later, Raj's phone in Bangalore buzzes with the message — already showing the little "delivered" check next to it. He thumbs back "yes — 8pm?" and Sarah sees it appear at the bottom of her chat thread before she's even finished setting her phone down. That round-trip — text in, text out, the world over — is what we're designing.
Messenger is a real-time text messaging service. Two people (or a group) carry on a conversation, with messages delivered instantly when both are online and queued for later when one isn't. The chat history is the same on every device they own — phone, laptop, tablet — so Sarah can start a conversation on her commute and finish it on her work laptop. Behind that simple experience is a system that has to: hold open hundreds of millions of network connections at the same time, route each message to the right recipient (who may or may not be online), persist every word forever, and never deliver the same message twice.
Pin the contract down before drawing boxes. Asking these questions in an interview signals you're going to design for the actual workload, not a memorized solution.
Numbers shape architecture. Doing them out loud forces the conversation about sharding, connection pooling, and storage tiering.
Assume 500 million daily active users (DAU), each sending 40 messages per day. That's 20 billion messages/day, or roughly 230,000 messages/second on average — and 2-3× that at peak.
| Quantity | Estimate | Basis |
|---|---|---|
| Messages/day | ~20 billion | 500M × 40 |
| Message size | ~100 bytes | Text only |
| Write bandwidth | ~25 MB/s | 230K × 100B |
| Read bandwidth | ~25 MB/s | Each msg goes out once |
20B messages × 100 bytes = ~2 TB/day. Over 5 years: ~3.6 PB. This rules out a single relational DB instantly — we need a horizontally-scalable wide-column store and tiered storage (hot recent messages vs. cold archived ones).
The non-obvious capacity number. To deliver messages instantly, every user needs an open WebSocket (a persistent TCP connection) to a chat server. At peak that's 500M concurrent open connections. A single chat server typically holds 25K-50K connections before its socket buffers, file descriptor limits, and CPU give up. So we need at minimum 500M / 50K = ~10,000 chat servers. This number alone is what makes the architecture different from a normal REST app.
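The back-of-envelope numbers above are worth checking in a few lines. This is a sketch of the same arithmetic; the 50K-connections-per-server figure is the upper end of the range quoted above:

```python
# Capacity math from the requirements section.
DAU = 500_000_000                # daily active users
MSGS_PER_USER_PER_DAY = 40
MSG_SIZE_BYTES = 100
CONNS_PER_CHAT_SERVER = 50_000   # upper end of the 25K-50K range

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY              # 20 billion
msgs_per_sec = msgs_per_day // 86_400                   # ~230K/s average
ingress_mb_s = msgs_per_sec * MSG_SIZE_BYTES / 1e6      # ~23 MB/s
storage_5yr_pb = msgs_per_day * MSG_SIZE_BYTES * 365 * 5 / 1e15
chat_servers = DAU // CONNS_PER_CHAT_SERVER             # minimum server count

assert msgs_per_day == 20_000_000_000
assert 230_000 <= msgs_per_sec <= 232_000
assert abs(storage_5yr_pb - 3.65) < 0.01   # ~3.6 PB over 5 years
assert chat_servers == 10_000
```

The last assertion is the headline: even at the generous 50K-connections-per-box figure, the connection plane alone needs four digits of servers.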
| Metric | Value | Why it matters |
|---|---|---|
| Messages/sec (avg) | 230K/s | Drives Kafka throughput & HBase write sizing |
| Concurrent connections | 500M | Forces a dedicated chat-server tier separate from app servers |
| 5-yr storage | 3.6 PB | Mandates HBase/HDFS — single-box DBs don't fit at this scale |
| Bandwidth | 25 MB/s in & out | Modest — single LB tier handles it; hardware not a bottleneck, software is |
Three functions cover the workload — but the protocol underneath is unusual. Unlike a typical REST API where the client always asks first, chat is a stateful, push-driven protocol built on WebSockets (or long-polling as a fallback). The client opens a connection once and keeps it open; the server pushes messages whenever they arrive.
Conceptual API surface (over WebSocket frames or REST):

```
// Send a message — client → server frame
sendMessage(api_key, sender_id, receiver_id, text)
  → returns message_id once persisted
// Server then pushes the same message to receiver_id's open WebSocket
// (or queues a push notification if receiver is offline)

// Fetch chat history — REST, paginated by timestamp cursor
getConversation(api_key, user1_id, user2_id, before_timestamp, count)
  → returns up to `count` messages older than `before_timestamp`
// Used on app start to load the last 50 messages of each open thread

// Get presence for a list of users — bulk lookup
getStatus(api_key, user_ids[])
  → returns [{user_id, status: ONLINE|OFFLINE, last_seen}]
// Used when opening the friends list to show green dots
```
The alternative is the client hitting GET /messages every few seconds. At 500M users polling every 3s, that is 166M requests/sec just to ask "anything new?" — and 99% of those return empty. A WebSocket is a single TCP connection that stays open: zero overhead when idle, instant push when a message lands. The cost shifts from "request rate" to "connection count" — which we then engineer for with a dedicated chat-server tier.

Three observations decide the database choice: (1) we have massive small writes — 230K tiny rows per second, (2) we read by conversation + time range — "give me the last 50 messages between Sarah and Raj", and (3) data volume is petabyte-scale. This points squarely at a wide-column NoSQL store like HBase (which Facebook actually uses for Messenger).
The MESSAGE table is the giant — billions of rows added per day. In HBase the row key is engineered as (conversation_id, reverse_timestamp) so all messages for one conversation live on one region server, sorted newest-first. A "give me the last 50 messages" query becomes a single short range scan — O(50), not O(billions).
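The reverse-timestamp trick is easy to demonstrate. A minimal sketch, assuming a big-endian fixed-width encoding (HBase row keys are opaque byte strings compared lexicographically; the exact byte layout here is illustrative):

```python
import struct

MAX_LONG = 2**63 - 1  # Java Long.MAX_VALUE, the convention HBase apps use

def message_row_key(conversation_id: int, timestamp_ms: int) -> bytes:
    """Newer messages get smaller reverse timestamps, so a forward scan
    starting at (conversation_id, 0) returns messages newest-first."""
    reverse_ts = MAX_LONG - timestamp_ms
    return struct.pack(">QQ", conversation_id, reverse_ts)  # big-endian keeps numeric order

older = message_row_key(12345, 1_736_400_000_000)
newer = message_row_key(12345, 1_736_400_001_000)
assert newer < older  # lexicographic byte order == newest-first within a conversation
```

Because all keys share the `conversation_id` prefix, "last 50 messages" is a scan of 50 adjacent rows on one region server.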
A MySQL server tops out around 5K-10K writes/sec on commodity hardware. We need 230K writes/sec sustained. Sharded MySQL could hit it, but we'd be reinventing what HBase ships with — region splits, replication, tombstoning. Operational nightmare.
Document-style fine for a single message, but range scans across millions of messages in one conversation are slow and memory-hungry. HBase's columnar layout + row-key sort order is purpose-built for "scan the last N rows for key prefix X".
(a) Write-optimized log-structured storage absorbs 230K writes/sec linearly. (b) Row-key range scans for chat history are O(N) on the result, not the dataset. (c) HDFS underneath gives us 3-way replication + petabyte-scale storage for free. (d) Battle-tested at Facebook for exactly this workload.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The 500M-concurrent-connection number from §3 drives every decision.
The simplest thing imaginable: clients poll the server. Every 3 seconds, Sarah's phone sends GET /messages?since=last_id. The server checks the DB and either returns new messages or returns empty. Same for Raj.
Three concrete failures show up the moment you do the math:
500M users polling every 3 seconds is 166 million requests/sec. 99% return empty because most people aren't getting messages most of the time. We'd be running a planetary-scale DDoS on ourselves to hear the answer "nothing new" 99% of the time.
Latency = polling interval. A 3-second poll means a message takes on average 1.5 seconds just to appear, and worst case 3. Sarah types "are you free?" and Raj sees it 3 seconds later — feels broken. Drop the interval to 500ms and the request rate becomes 1 billion/sec.
Even if nobody is actually chatting, server load is proportional to user count. Idle users cost the same as active users. We're paying full price 24/7 to mostly say "still nothing".
The single most important shift in this design is to flip the request/response model. Chat is fundamentally async push from server to client, not request/response. Instead of the client asking "anything new?" 1000 times, the server holds a connection open and shoves messages down it the instant they arrive.
Two ways to do this — both work, both are used in production:
HTTP long polling. Client sends GET /messages; server holds the request open for up to 30 seconds. As soon as a message arrives for that user, server responds. If nothing arrives in 30s, server returns 204 and client immediately reconnects. Works through any HTTP-aware firewall — zero protocol changes — but each "long poll" still ties up one server-side request handler per user.
WebSockets. A real persistent TCP connection upgraded from HTTP. Bidirectional — server can push, client can send, both at any time, with nearly zero per-message overhead (a few bytes of frame header). One connection per user, lives for hours. The modern default — what Messenger, WhatsApp, and Slack all use.
Both options demand a brand-new tier of servers — chat servers — whose entire job is to hold open millions of connections at once. This is a fundamentally different shape from a stateless REST app server (which handles one request, returns, frees its memory). A chat server is more like a switchboard operator with 50,000 phone lines plugged in simultaneously.
Now the full picture. Every node in the diagram is numbered; use those numbers to find the matching card below. Each card answers what the component is, why it's here, and crucially what would break without it.
The Messenger app on Sarah's iPhone, Raj's Android, or anyone's browser. On launch it authenticates, opens a single WebSocket to the load balancer, and keeps that connection alive with periodic ping frames. All sends and receives flow over this one socket. The client also caches recent messages locally for instant rendering when the user opens a thread.
Solves: the user's window into the system. The WebSocket is what makes "instant" possible — the client never has to ask "anything new?", it just listens. Without a persistent client connection, we'd be back in polling hell.
The traffic cop. Receives the initial HTTPS upgrade request and routes it to a chat server. Crucially it uses sticky sessions — once Sarah's WebSocket is bound to chat server #137, every subsequent frame from her client must reach the same server, because that server holds her open socket. Round-robin won't work; we need a hash on user_id or a cookie-based affinity.
Solves: spreading 500M connections across 10K chat servers without any one being overloaded, while preserving the connection-server binding. Without stickiness, frame #2 from Sarah might land on server #138 — which has never heard of her socket — and the connection breaks.
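Hash-based affinity can be sketched in a few lines. This is illustrative, not the LB's real algorithm; `md5` is used only because it's stable across processes, unlike Python's salted built-in `hash()`:

```python
import hashlib

NUM_CHAT_SERVERS = 10_000

def chat_server_for(user_id: str) -> int:
    """Map a user to one chat server, deterministically, so every frame
    from the same user reaches the same box."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CHAT_SERVERS

# Sticky: the same user always routes to the same server.
assert chat_server_for("sarah") == chat_server_for("sarah")
assert 0 <= chat_server_for("raj") < NUM_CHAT_SERVERS
```

Cookie-based affinity achieves the same binding without a hash: the LB records the chosen server in a cookie on the upgrade response and honors it on reconnect.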
The heart of the connection plane. Each chat server is a long-running process that holds ~50,000 open WebSockets in memory. When a client connects, the server registers the user in the routing table ④. When a message arrives from Sarah, the server looks up Raj's chat server in the routing table — if it's the same box, it's a direct in-memory push; if it's a different box (almost always), it publishes to Kafka ⑤. Stateless from the data's perspective (everything persists to HBase), but stateful from the connection's perspective.
Solves: the "where does the push come from?" problem. A normal REST app server can't do push because it has no persistent link to the client. Chat servers exist exclusively to hold those links. Without them, every message would need to wait for the recipient to ask for it.
A simple in-memory map: user_id → chat_server_id. When Sarah's client connects to chat server #137, the server writes sarah_id: 137 to Redis. When someone wants to push a message to Sarah, they read this entry and know which server to publish for. Updates on connect/disconnect/reconnect. Replicated for fault tolerance — losing it temporarily means we briefly can't deliver pushes (we fall back to push notifications), but doesn't lose messages because they're already in HBase.
Solves: the cross-server addressing problem. With 10,000 chat servers, "which server holds Raj's connection right now?" is a question we ask millions of times per second. A Redis lookup answers it in under 1ms.
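The routing table's contract is just three operations on a key-value map. A sketch using an in-memory dict; with a real Redis these would be SET/GET/DEL on hypothetical `route:{user_id}` keys:

```python
class RoutingTable:
    """user_id -> chat_server_id, updated on connect/disconnect/reconnect.
    In production this is a replicated Redis; a dict shows the contract."""

    def __init__(self) -> None:
        self._routes: dict[str, int] = {}

    def register(self, user_id: str, server_id: int) -> None:
        self._routes[user_id] = server_id      # on WebSocket connect

    def deregister(self, user_id: str) -> None:
        self._routes.pop(user_id, None)        # on disconnect

    def lookup(self, user_id: str):
        return self._routes.get(user_id)       # None => no live connection

table = RoutingTable()
table.register("sarah", 137)
assert table.lookup("sarah") == 137
table.deregister("sarah")
assert table.lookup("sarah") is None   # null => fall back to push notification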
The async bus that stitches chat servers together. When Sarah on chat server X sends a message to Raj on chat server Y, server X doesn't call server Y directly — it publishes the message to a Kafka topic partitioned by recipient_id. Server Y is a Kafka consumer for the partitions that include Raj. This decouples senders from receivers and gives us at-least-once delivery semantics for free, even if a chat server crashes mid-delivery.
Solves: reliable delivery across server boundaries. Without Kafka, server X would have to do an RPC to server Y; if server Y is down or slow, server X blocks; if the message lands but Y crashes before pushing, the message is lost. Kafka persists every message and lets the consumer retry safely.
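The decoupling can be shown with a toy bus. This is a sketch, not Kafka's API: partitions are deques, partition choice is a stand-in hash, and the point is only that publishers never touch consumers directly while per-partition order is preserved:

```python
from collections import defaultdict, deque

class MiniBus:
    """Toy stand-in for Kafka: messages partitioned by recipient_id,
    buffered in order, pulled by the consumer at its own pace."""

    def __init__(self, num_partitions: int = 16) -> None:
        self.num_partitions = num_partitions
        self.partitions: dict[int, deque] = defaultdict(deque)

    def _partition(self, recipient_id: str) -> int:
        return sum(recipient_id.encode()) % self.num_partitions

    def publish(self, recipient_id: str, message: dict) -> None:
        self.partitions[self._partition(recipient_id)].append(message)

    def consume(self, recipient_id: str):
        part = self.partitions[self._partition(recipient_id)]
        return part.popleft() if part else None

bus = MiniBus()
bus.publish("raj", {"from": "sarah", "text": "are you free for dinner?"})
bus.publish("raj", {"from": "sarah", "text": "8pm?"})
# Server Y (Raj's chat server) consumes in publish order, whenever it's ready.
assert bus.consume("raj")["text"] == "are you free for dinner?"
assert bus.consume("raj")["text"] == "8pm?"
```

If server Y is down, messages simply sit in the partition until it comes back, which is exactly the retry-safety the card above describes.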
The durable source of truth. Every message gets written here before any "delivered" ack is sent to the sender. Row key is (conversation_id, reverse_timestamp) so a thread's history is contiguous on one region server and sorted newest-first. HDFS underneath replicates each block 3× across racks. Petabyte-scale by design, write-optimized to absorb 230K writes/sec.
Solves: persistent chat history. If we only kept messages in chat-server memory, a server crash would lose everything in flight. HBase is the safety net — once a write succeeds, the message survives any combination of server failures.
Tracks who is online and broadcasts changes to interested friends. When Sarah's chat server registers her connection, it notifies the Presence Service. When she disconnects (graceful logout, or 30s of socket silence), the service marks her offline. Friends viewing her profile or thread get the green/grey dot updated via a push down their own WebSocket. Heavily debounced — see §9 for why we don't broadcast every micro-state-change.
Solves: the "is Raj online right now?" question that the UI asks constantly. Without a centralized presence service, every chat server would have to ask every other chat server "do you have user X?" every time the UI needed a status — quadratic chatter.
For when the recipient is offline. If Raj's app is closed and he has no open WebSocket, the chat server discovers (via routing table miss) that there's no active connection. It hands the message to the Push Notification Service, which sends it to APNS (Apple) or FCM (Google) — the OS-level push systems that wake up Raj's phone and pop a notification. When Raj taps it, the app opens, reconnects its WebSocket, and pulls the queued messages from HBase.
Solves: the offline-delivery problem. Without push notifications, an offline user only sees the message when they next open the app — which might be hours later. With push, the message reaches them in seconds even if their app was killed.
Manages the metadata for group conversations: group_id → [member_ids]. When Sarah sends a message to "Dinner Squad" (a group of 8), the chat server asks the Group Chat Service for the member list, then publishes one Kafka event per recipient — fanout happens at the group service, not at the sender. Membership changes (add, remove, leave) flow through this service so all members see consistent group state.
Solves: the fanout problem. Without a dedicated group service, each sender would need to know every group member — leaking the data model into every client. Centralizing fanout also lets us cap group size and enforce permissions in one place.
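The "one durable write, N delivery events" shape is worth making concrete. A sketch with plain lists standing in for HBase and Kafka; `fan_out_group_message` is an illustrative name, not a real API:

```python
def fan_out_group_message(group_members: list[str], sender: str,
                          message: dict, store: list, bus: list) -> None:
    """One write to the message store; one delivery event per other member."""
    store.append(message)                                 # single HBase write
    for member in group_members:
        if member != sender:
            bus.append({"recipient": member, **message})  # N-1 Kafka publishes

store, bus = [], []
members = ["sarah", "raj", "alice", "bob"]
fan_out_group_message(members, "sarah",
                      {"group": "dinner-squad", "text": "8pm works?"}, store, bus)
assert len(store) == 1    # the message body is persisted exactly once
assert len(bus) == 3      # a delivery event for everyone except the sender
```

This is also where a group-size cap bites: the loop is O(members) per message, which is why broadcast channels use subscription instead of fanout.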
An in-memory store holding the most recent ~15 messages of every currently-active conversation. When the chat server needs to render Sarah's open thread on her laptop reconnecting, it pulls from Redis (sub-millisecond) instead of HBase (10-30ms). On miss, it loads from HBase and populates the cache. Eviction: LRU; threads no one has touched in days roll out automatically.
Solves: the "instant scroll" feel when opening a thread. Without this cache, every thread open would hit HBase — fine functionally but adds 30ms to every UI interaction. With it, thread switches feel weightless.
Tracks the per-message delivered and read states that drive the little check-marks. When Raj's client receives a message, it sends a "delivered" ack down its WebSocket; when he scrolls the message into view, the client sends a "read" ack. Both update the message row in HBase and push a state-change frame to the original sender so Sarah's UI updates her checkmarks.
Solves: the user-trust signal. The double-blue-tick is what tells Sarah her message landed. Without it, she'd have no way to know whether Raj saw it or whether the network ate it.
Two flows, mapped to the numbered components above. First the happy path (Raj online), then the offline path.
In the happy path, Sarah's client sends a SEND frame over its WebSocket: "to=raj_id, text=are you free for dinner?". Her chat server persists the message, then looks up raj_id in the routing table → server Y. Match! Server Y has Raj's open socket and pushes the message straight down it. In the offline path, the lookup returns raj_id → null (no active connection), so the message is handed to the Push Notification Service instead.

Now zoom into what happens between the moment Sarah taps "send" and the moment Raj's UI ticks blue. Several invariants must hold simultaneously: messages must persist before any ack, must arrive in send-order within a conversation, and must never be delivered twice even if a server retries.
Two messages from Sarah to Raj must arrive in send order. Guaranteed by partitioning Kafka on conversation_id (or sender_id) — all messages for one conversation flow through one Kafka partition, which Kafka delivers in order. Within HBase, the row-key suffix is a strictly-increasing timestamp so reads also return in order.
Network burps mean the chat server may retry a Kafka publish or HBase write. Each message carries a client-assigned idempotency key (UUID generated by the client on send). HBase uses it as part of the row key — a duplicate write is a no-op. Kafka consumers also dedupe on this key before pushing to the recipient.
Sarah only sees "sent" once HBase has confirmed the write. If the chat server crashes between Kafka publish and HBase ack, the message is still in Kafka and will be re-consumed and re-written. If it crashes before either, Sarah's client times out and the user retries — same idempotency key, no duplicate.
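The retry-safety argument hinges on the write being keyed by the client's idempotency UUID. A minimal sketch with a dict standing in for HBase; `persist_message` is an illustrative name:

```python
import uuid

def persist_message(store: dict, conv_id: int, idem_key: str, text: str) -> bool:
    """Write keyed by (conv_id, idem_key); a retried write is a silent no-op.
    Returns True only when this call actually inserted the row."""
    row_key = (conv_id, idem_key)
    if row_key in store:
        return False          # duplicate: already persisted, nothing to do
    store[row_key] = text
    return True

store = {}
key = str(uuid.uuid4())       # assigned by the client when Sarah taps send
assert persist_message(store, 12345, key, "are you free?") is True
# Network burp: the chat server (or the user) retries the exact same write.
assert persist_message(store, 12345, key, "are you free?") is False
assert len(store) == 1        # no duplicate row, so no duplicate delivery
```

The same key check runs on the consumer side before pushing to Raj, which is what turns Kafka's at-least-once into effectively exactly-once for the recipient.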
Where the rubber meets the road for the storage plane. We must absorb 230K small writes/sec and serve "show me the last 50 messages of this thread" reads in under 50ms. Both at petabyte scale.
HBase is log-structured (LSM trees). Writes go to an in-memory MemStore and a write-ahead log (WAL) on HDFS — both append-only, no random disk seeks. A single region server absorbs tens of thousands of small writes/sec. Sharding across hundreds of region servers gives us the 230K/sec we need with linear scale-out.
Row key = (conversation_id, MAX_LONG - timestamp). The MAX_LONG - timestamp trick reverses sort order so the newest messages come first. Reading "last 50 messages of conv 12345" is a single short scan starting at row (12345, 0) — touches one region, returns 50 cells. O(50), not O(billions).
HDFS underneath HBase is built for multi-petabyte data with 3× replication. Adding capacity = adding nodes. Old, cold data automatically migrates to cheaper storage tiers via HDFS storage policies — recent messages on SSD, year-old messages on HDD or even S3 Glacier.
The client never asks for "page 5" — it asks "give me 50 messages older than timestamp T". Server returns 50 messages and the timestamp of the oldest one. Client uses that as the cursor for the next request. This is robust to new messages arriving mid-pagination (no shifting page numbers) and trivially cacheable.
```
// Read API
GET /conversations/{conv_id}/messages?before_ts=1736400000&limit=50
→ 200 OK
{
  "messages": [ ...50 msgs... ],
  "next_cursor": "1736399650"   // ts of oldest msg returned
}
```
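The client-side cursor loop looks like this. A sketch in which `fetch_page` stands in for the HTTP call, and the history list (newest-first) stands in for the server's HBase scan:

```python
def fetch_page(messages: list[dict], before_ts: int, limit: int = 50):
    """Stand-in for the read API: `messages` is sorted newest-first.
    Returns (page, next_cursor) where next_cursor is the oldest ts returned."""
    page = [m for m in messages if m["ts"] < before_ts][:limit]
    next_cursor = page[-1]["ts"] if page else None
    return page, next_cursor

history = [{"ts": t, "text": f"msg{t}"} for t in range(100, 0, -1)]

# The client scrolls back through the whole thread, 50 messages at a time.
cursor, total = 101, 0
while cursor is not None:
    page, cursor = fetch_page(history, before_ts=cursor)
    total += len(page)
assert total == 100   # every message seen exactly once, no shifting pages
```

Because the cursor is a timestamp rather than a page number, messages arriving mid-scroll cannot shift the window: everything older than the cursor is immutable.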
The 15 most recent messages of every active conversation live in Redis ⑩, keyed by conv_id. When Sarah opens her thread with Raj, the chat server hits Redis first; cache hit returns in under 1ms, miss falls back to HBase and warms the cache. Every new message also pushes onto the Redis list (and evicts the oldest if length > 15) so the cache stays current automatically.
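In Redis terms the pattern is an LPUSH of the new message followed by an LTRIM to 15. A sketch using `collections.deque(maxlen=15)`, which emulates that command pair in memory:

```python
from collections import deque

RECENT_LIMIT = 15
cache: dict[int, deque] = {}   # conv_id -> newest-first messages, capped at 15

def on_new_message(conv_id: int, msg: str) -> None:
    # Redis equivalent (illustrative): LPUSH conv:{id} msg ; LTRIM conv:{id} 0 14
    cache.setdefault(conv_id, deque(maxlen=RECENT_LIMIT)).appendleft(msg)

for i in range(20):
    on_new_message(12345, f"msg-{i}")

recent = list(cache[12345])
assert len(recent) == RECENT_LIMIT   # the 5 oldest were evicted automatically
assert recent[0] == "msg-19"         # newest first, ready to render
```

The cache is write-through from the message path, so a thread that is actively chatting never takes an HBase read at all.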
Every user has hundreds of friends; every friend's UI wants to show a green/grey dot. The naïve approach — broadcast every state transition to every friend — would melt at 500M users. Three techniques tame it.
When Sarah opens Messenger, the client sends one bulk request: getStatus([friend_ids]). Server returns the current status of all her friends in one round trip. No subscription, no per-friend chatter — a single read against the Presence Service's in-memory map.
Don't broadcast every micro-state-change. Push to friends only when the user transitions OFFLINE → ONLINE or ONLINE → OFFLINE. Heartbeat fluctuations don't count. Use a 5-second debounce on offline-detection so a quick subway-tunnel disconnect doesn't trigger a "left chat" broadcast.
When Sarah opens her thread with Raj specifically, the client refreshes Raj's status with a targeted call. This catches the case where Raj came online while Sarah's app was backgrounded and the broadcast didn't reach her.
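The debounce logic above can be sketched with an injected clock so it's testable; real code would run `tick` off heartbeat timeouts rather than an explicit call:

```python
class Presence:
    """Broadcast only real OFFLINE<->ONLINE transitions. A disconnect counts
    as offline only if no reconnect arrives within the debounce window."""
    DEBOUNCE_S = 5.0

    def __init__(self) -> None:
        self.online = False
        self.disconnected_at = None
        self.broadcasts = []          # what friends' UIs would receive

    def on_connect(self, now: float) -> None:
        self.disconnected_at = None
        if not self.online:
            self.online = True
            self.broadcasts.append(("ONLINE", now))

    def on_disconnect(self, now: float) -> None:
        self.disconnected_at = now    # don't broadcast yet: maybe a blip

    def tick(self, now: float) -> None:
        if (self.online and self.disconnected_at is not None
                and now - self.disconnected_at >= self.DEBOUNCE_S):
            self.online = False
            self.broadcasts.append(("OFFLINE", now))

p = Presence()
p.on_connect(0.0)
p.on_disconnect(10.0)   # subway tunnel...
p.on_connect(12.0)      # ...back within 5s: no OFFLINE broadcast fired
p.tick(20.0)
assert p.broadcasts == [("ONLINE", 0.0)]
```

One short tunnel ride therefore costs zero broadcasts to Sarah's hundreds of friends, which is the whole point of the debounce.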
3.6 PB doesn't fit on one box, and even if it did we couldn't survive its failure. Sharding strategy decides everything downstream — query latency, hot-spot risk, rebalancing cost.
Shard by message_id. Distributes individual messages randomly across shards. Each write hits one shard — perfect write distribution. But reading a chat history "last 50 messages between Sarah and Raj" now requires scanning every shard and merging — fan-out reads of O(num_shards). With 100 shards, every chat-open is 100 simultaneous queries. Unworkable.
Shard by user_id. All of one user's messages live on the same shard. Chat-history reads touch exactly one shard — O(1) shard count, O(50) result rows. This is the right answer for a read-by-conversation workload. Trade-off: a hyperactive user (a celebrity, a bot) creates a hot shard.
What if a celebrity's account gets 100K messages/min and burns down their shard? Two mitigations:
Use consistent hashing (not modulo) so adding a new shard only relocates 1/N of users instead of all of them. Spreads the "split the hot shard" operation over hours instead of days, with the cluster usable throughout.
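Consistent hashing in miniature, to make the "only 1/N of users move" claim concrete. A sketch with a small number of virtual nodes per shard; production rings tune `vnodes` much higher for smoother balance:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Stable 64-bit hash (md5 chosen only for cross-process determinism)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, shards: list[str], vnodes: int = 100) -> None:
        # Each shard owns many points on the ring via virtual nodes.
        self.points = sorted((_h(f"{s}#{v}"), s)
                             for s in shards for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def shard_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash (wrapping at the end).
        i = bisect.bisect(self.keys, _h(user_id)) % len(self.points)
        return self.points[i][1]

users = [f"user{i}" for i in range(10_000)]
before = Ring(["s1", "s2", "s3", "s4"])
after = Ring(["s1", "s2", "s3", "s4", "s5"])   # add one shard
moved = sum(before.shard_for(u) != after.shard_for(u) for u in users)
# Roughly 1/5 of users relocate; modulo hashing would move ~4/5 of them.
assert moved < len(users) * 0.35
```

That bounded relocation is what lets the hot-shard split proceed while the cluster stays fully usable.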
For known-hot users, replicate their data to multiple read replicas. Reads round-robin across replicas; writes still go to the primary. Buys 3-5× headroom on read traffic for power users without re-sharding.
HDFS replicates each block 3× across racks by default — survives single-disk and single-rack failures with zero data loss. HBase region servers are stateless against the underlying HDFS, so a region-server crash means another server picks up the regions in seconds. For multi-region disaster recovery, HBase replication streams writes to a standby cluster in another data center asynchronously.
Things will break. The hard question is which kind of break causes data loss versus a brief interruption. Walk through each failure mode.
When a chat server dies, its ~50K clients see their WebSocket drop. They auto-reconnect (with exponential backoff) and the LB routes them to a healthy chat server. The new server registers them in the routing table ④, and any messages that arrived during the gap are pulled from HBase on the first read. Connection state is replicated to a standby chat server in real time so the routing-table updates have minimal lag.
Data loss: zero. Messages already persisted to HBase are intact. In-flight messages still in Kafka get re-consumed. Messages that hadn't yet reached the chat server are retried by the client (idempotent).
HBase auto-detects via ZooKeeper, reassigns the regions to a healthy server, replays the WAL from HDFS. Recovery takes seconds to a minute. During the gap, writes to those regions queue in the chat-server WAL or Kafka and replay once the region is back. HDFS 3× replication ensures no block is lost.
Kafka topics are replicated across brokers (replication factor 3). A broker loss shifts leadership of its partitions to a replica in milliseconds. Producers retry transparently. At-least-once delivery is preserved — the dedup key catches the rare double-delivery that follows.
Redis is replicated and persisted, but in the worst case if we lose it entirely, we don't lose messages — we just lose the ability to push them in real-time. Recipients fall back to push notifications via APNS/FCM. The table rebuilds itself within seconds as clients reconnect and re-register. Brief degraded mode, no data loss.
Conceptually a group is just "1-on-1 chat with N people". Implementation-wise, the Group Chat Service ⑨ holds a GroupChat object: {group_id, name, members: [user_ids], created_at}. When Sarah sends a message to "Dinner Squad" (8 members):
One write to HBase (the message lives once in the group conversation's row family), N publishes to Kafka (one per recipient). Per-member delivered/read state is tracked separately in the Read Receipts service so Sarah can see "Raj read it, Alice didn't".
Cap on group size: typically 250-500 members for performance. Beyond that, fanout becomes a meaningful cost — broadcast-style channels (Slack, Discord) use a different architecture where members "subscribe" to a channel rather than the sender fanning out.
The Push Notification Service ⑧ is a Kafka consumer for "undeliverable" messages — the ones whose recipient routing-table lookup returned null. For each, it looks up the recipient's registered device tokens, builds a notification payload, and hands it to APNS (Apple) or FCM (Google).
Push is asynchronous and lossy by design — a missed push isn't catastrophic because the message still pulls down from HBase when the user opens the app. But the latency is meaningful: a 5-10s wake-up time is normal, vs. 200ms for a live WebSocket.
Every message moves through a small state machine driven by acks from each side:
The UI checkmarks track this state directly: clock icon = SENDING, single grey check = SENT, double grey check = DELIVERED, double blue check = READ, red exclamation = FAILED.
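The state machine is small enough to write out as a transition table. A sketch using the state names from the text; unknown or stale events are treated as no-ops, which is what makes duplicate acks harmless:

```python
TRANSITIONS = {
    "SENDING":   {"persisted": "SENT", "timeout": "FAILED"},
    "SENT":      {"delivered_ack": "DELIVERED"},
    "DELIVERED": {"read_ack": "READ"},
}

def advance(state: str, event: str) -> str:
    """Apply an ack/event; anything not in the table is a no-op."""
    return TRANSITIONS.get(state, {}).get(event, state)

state = "SENDING"
for event in ["persisted", "delivered_ack", "read_ack"]:
    state = advance(state, event)
assert state == "READ"                               # double blue check
assert advance("READ", "delivered_ack") == "READ"    # stale/duplicate ack ignored
assert advance("SENDING", "timeout") == "FAILED"     # red exclamation
```

Because the table only moves forward, a delayed "delivered" ack arriving after "read" can never downgrade Sarah's checkmarks.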
Ordering: Kafka is partitioned by conversation_id, so all messages for one conversation flow through one partition — Kafka guarantees in-order delivery within a partition. The chat server assigns a strictly-increasing timestamp before publishing, and HBase row keys use that timestamp so reads also return in order. Cross-conversation ordering isn't guaranteed (and doesn't need to be — they're independent threads).

Exactly-once delivery: every SEND frame from the client carries a UUID generated when the user first taps send. The chat server uses this UUID as part of the HBase row key — a duplicate write with the same key is a silent no-op. Kafka consumers also dedupe on this key in a small Redis set before pushing to recipients. So even if the chat server retries the publish or HBase write 5 times, the recipient sees the message exactly once.