High-Level Design

WhatsApp — End-to-End Encrypted Messaging

2 billion users, 100B messages a day, and servers that physically cannot read a single one — the architecture that turns the Signal Protocol into a global product

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is WhatsApp?

It's a Tuesday afternoon. Sarah is in New York, scrolling through photos from her trip, and she taps a picture of the Brooklyn Bridge to send to her mom in Mumbai. The instant her thumb leaves the screen, something quietly remarkable happens: that photo is wrapped — bytes scrambled into ciphertext using a key that lives only on her mom's phone — handed to a WhatsApp server that physically cannot read it, relayed across continents, and 600 milliseconds later her mom's phone unwraps it back into a picture. Nobody in between, including WhatsApp itself, ever sees the photo. That trick — running a global messenger that handles 100 billion messages a day while being mathematically blind to all of them — is what we're designing.

WhatsApp is a phone-number-identified, end-to-end encrypted messaging app: 2 billion users, 1.5 billion daily actives, around 100 billion messages per day, plus voice/video calls, status broadcasts, and now multi-device sync (your phone, web, and tablet all stay in sync without trusting the server). Think of the server as a sealed-envelope post office — it knows the from-address and the to-address, but the envelope itself is locked, and only the recipient has the key.

The two questions that drive every design decision below: (1) How does a server route a message to the right device(s) when it cannot read the message — not even the metadata of who-said-what-to-whom? (2) How do you sync messages across my four devices when each device must keep its own private key and the server holds nothing but ciphertext?
Step 2

Requirements & Goals

WhatsApp looks deceptively simple — a chat app — but every requirement below is a constraint that bends the architecture in ways an unencrypted messenger never has to deal with.

✅ Functional Requirements

  • 1-on-1 chats with text + media (photo, video, voice notes, files)
  • Group chats up to 1024 members
  • Read receipts (single tick → double tick → blue ticks) and online / last-seen presence
  • Voice and video calls, 1-on-1 and group
  • End-to-end encryption on every message and call (Signal Protocol)
  • Multi-device sync — phone + WhatsApp Web + desktop + tablet all show the same conversations
  • Status broadcasts — 24-hour ephemeral stories, also encrypted
  • Phone-number identity — no usernames, just +E.164 numbers verified via SMS OTP

⚙️ Non-Functional Requirements

  • End-to-end encryption — server is provably unable to read message contents
  • Low latency — under 500ms p99 for in-region delivery
  • High availability — 99.99%; messaging is critical infrastructure for 2B people
  • Scale — 1.5B DAU, 100B msgs/day, peaks of millions of messages per second
  • Offline delivery — recipient gets the message even if their phone was off when it was sent
  • Bandwidth-cheap — many users on slow 2G/3G networks; avoid retransmissions
The headline constraint is encryption. A non-encrypted chat app can do clever things on the server — search messages, deduplicate media, build a social graph, train a recommender. WhatsApp can do none of that, because the server only sees ciphertext. Every architectural choice from here on is shaped by "the server is blind, so where does this work happen?" The answer is almost always: on the device.
Step 3

Capacity Estimation & Constraints

The numbers below dictate sharding, fanout, and the choice of a persistent connection protocol over plain HTTP. Without them you cannot justify why we run 50K open sockets per server or why media has to live in S3 not the message store.

User & traffic estimates

Public stats: 2 billion users, 1.5B DAU, around 100 billion messages per day.

| Metric | Value | How it's derived |
|---|---|---|
| Messages/day | ~100B | Roughly 65 msgs per DAU per day |
| Messages/sec avg | ~1.16M/s | 100B / 86,400 |
| Peak msgs/sec | ~5M/s | 3-5× avg during prime-time across regions |
| Concurrent connections | ~500M+ | 1.5B DAU × ~33% online at any moment |

Bandwidth estimates

Average text message is ~100 bytes (after encryption overhead). About 10% of messages carry media — average 1MB per media message but media flows through S3, not the messaging path.

| Metric | Value | How it's derived |
|---|---|---|
| Text bandwidth (in) | ~116 MB/s | 1.16M msgs/s × 100 bytes |
| Media bandwidth (in) | ~116 GB/s | 10% of msgs × 1MB → S3 directly |
| Fanout factor | ~2.5× | Multi-device + groups multiply egress |
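These figures are simple arithmetic on the public stats; a quick sanity check (the inputs come from the estimates above, not from measured data):

```python
# Back-of-envelope check of the capacity numbers above.
MSGS_PER_DAY = 100e9          # ~100B messages/day
DAU = 1.5e9                   # daily actives
AVG_MSG_BYTES = 100           # encrypted text message
MEDIA_FRACTION = 0.10         # share of messages carrying media
MEDIA_BYTES = 1_000_000       # ~1MB per media message

msgs_per_sec = MSGS_PER_DAY / 86_400                    # average rate
peak_msgs_per_sec = msgs_per_sec * 4                    # 3-5× prime-time multiplier
text_bw = msgs_per_sec * AVG_MSG_BYTES                  # bytes/s into message path
media_bw = msgs_per_sec * MEDIA_FRACTION * MEDIA_BYTES  # bytes/s into S3
concurrent = DAU * 0.33                                 # ~33% online at once

print(f"{msgs_per_sec/1e6:.2f}M msgs/s avg")        # 1.16M msgs/s avg
print(f"{text_bw/1e6:.0f} MB/s text")               # 116 MB/s text
print(f"{media_bw/1e9:.0f} GB/s media")             # 116 GB/s media
print(f"{concurrent/1e6:.0f}M concurrent sockets")  # 495M concurrent sockets
```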

Storage estimates

WhatsApp's storage model is unusual: messages are deleted from the server once delivered to all recipient devices. Only undelivered messages are stored, plus optional encrypted backups in iCloud / Google Drive (managed outside WhatsApp's infrastructure).

| Metric | Value | Why it matters |
|---|---|---|
| Concurrent sockets | 500M+ | Drives Connection Server count and per-server tuning (50K-1M sockets each) |
| Peak msgs/sec | 5M/s | Drives sharding and Routing Service throughput |
| Media throughput | 116 GB/s | Forces S3-class blob storage; cannot live in the messaging path |
| Pending msg storage | ~5-10 TB | Only the undelivered queue — small because messages are deleted post-delivery |
| User & device registry | ~500 GB | 2B users × ~250 bytes (phone, name, public keys per device) |
Why these numbers force a non-HTTP protocol. 500M concurrent users on plain HTTPS would mean 500M TLS handshakes per minute as connections cycle — physics says no. Instead, WhatsApp keeps a persistent, multiplexed socket open per device (originally XMPP, now a custom binary protocol called Noise Pipes). One handshake, then bidirectional frames forever. We'll see this drive the Connection Server tier in §6.
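The wire details of Noise Pipes aren't public, but the shape of a persistent binary protocol — one connection, then many self-delimiting frames in both directions — can be sketched generically. The 3-byte length prefix here is an illustrative choice, not WhatsApp's actual framing:

```python
import struct

# Toy length-prefixed framing: "one handshake, then bidirectional frames
# forever." Many messages multiplex over one long-lived socket, so the
# TLS-handshake tax is paid once per connection, not once per message.

def encode_frame(payload: bytes) -> bytes:
    # 3-byte big-endian length prefix (assumed size, for illustration)
    return struct.pack(">I", len(payload))[1:] + payload

def decode_frames(stream: bytes) -> list[bytes]:
    # Split a byte stream back into individual frames.
    frames, i = [], 0
    while i < len(stream):
        length = int.from_bytes(stream[i:i + 3], "big")
        frames.append(stream[i + 3:i + 3 + length])
        i += 3 + length
    return frames

wire = encode_frame(b"ciphertext-1") + encode_frame(b"ciphertext-2")
assert decode_frames(wire) == [b"ciphertext-1", b"ciphertext-2"]
```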
Step 4

System APIs

WhatsApp's "API" is mostly a binary protocol over a persistent socket, not REST. But the conceptual operations are easy to enumerate. The signatures below are pseudocode — what each call does, who calls it, and what flows over the wire.

Conceptual API surface
// 1. REGISTER — phone-number identity, verified via SMS OTP
register(phone_number, device_metadata)
  → { user_id, otp_token }
verify_otp(otp_token, sms_code, device_public_key)
  → { auth_token, user_id, device_id }

// 2. GET KEYS — Signal Protocol pre-key fetch (X3DH)
// Sender calls this for each recipient device before sending the first message
get_keys(recipient_user_id)
  → [ { device_id, identity_key, signed_pre_key, one_time_pre_key } , ... ]

// 3. SEND MESSAGE — encrypted payload per recipient device
// Sender encrypts ONCE PER DEVICE in the recipient set
send_message({
  conversation_id,
  recipient_devices: [
    { user_id, device_id, encrypted_payload },
    { user_id, device_id, encrypted_payload },
    ...
  ],
  timestamp
})
  → { message_id, server_timestamp }

// 4. UPLOAD MEDIA — E2E-encrypted blob, key shipped inside the message
upload_media(content_type, encrypted_bytes)
  → { s3_url, content_hash }

// 5. SUBSCRIBE PRESENCE — last-seen / online indicators
subscribe_presence(user_id)
  ← { user_id, online: bool, last_seen: timestamp }

// 6. PUSH RECEIPT — message delivered / read
ack_receipt(message_id, type: "delivered" | "read")
Notice what's NOT in the API. There's no get_message_history — the server doesn't keep history, the device does. There's no search_messages — the server cannot, only the device can search its own decrypted archive. There's no get_group_members_chat — group history lives only on members' devices. The API surface is starkly minimal because the server simply does not have most of the data a non-encrypted messenger would expose.
Why a separate get_keys call? Before Sarah can send her first message to her mom, her phone needs Mom's identity key, signed pre-key, and a one-time pre-key to run the X3DH handshake (Signal's initial key agreement). Mom may be offline at that moment — that's fine, her keys were uploaded to the Key Distribution Service ⑥ when she registered. This is the trick that lets Signal work asynchronously between people who have never connected at the same time.
Step 5

Database Design

WhatsApp's data model has a quirky shape: the user/device registry is small and relational (good for PostgreSQL), the pending-message queues are huge but ephemeral (good for a Cassandra / HBase / Mnesia-style store), and media lives outside the DB entirely (S3 / blob storage). Each datastore is picked for the access pattern, not for uniformity.

erDiagram
    USER {
        string phone_number PK
        string user_id
        string display_name
        timestamp created_at
        timestamp last_seen
    }
    DEVICE {
        string device_id PK
        string user_id FK
        string identity_public_key
        string signed_pre_key
        string platform
        timestamp registered_at
    }
    ONE_TIME_PRE_KEY {
        string key_id PK
        string device_id FK
        string public_key
        bool consumed
    }
    CONVERSATION {
        string conversation_id PK
        string type
        timestamp created_at
    }
    PENDING_MESSAGE {
        string message_id PK
        string recipient_device_id FK
        string sender_device_id
        bytes encrypted_payload
        timestamp ts
    }
    MEDIA_BLOB {
        string blob_id PK
        string s3_url
        string content_hash
        int size_bytes
    }
    USER ||--o{ DEVICE : "owns"
    DEVICE ||--o{ ONE_TIME_PRE_KEY : "uploads"
    DEVICE ||--o{ PENDING_MESSAGE : "to-deliver"
    CONVERSATION ||--o{ PENDING_MESSAGE : "carries"

Datastore choices

🟢 PostgreSQL — user & device registry

Small, relational, latency-sensitive. ~500GB total. Used to look up "what devices does +1-555-... have, and what are their public identity keys?" Sharded by phone number; each shard replicated across regions.

🟡 Cassandra / HBase — pending message queues

Per-device offline message queue. Wide-column, cheap appends, sharded by device_id. Rows are deleted once the recipient's connection has acked delivery — rarely grows, even at 100B msgs/day.

🟣 S3-class object store — media blobs

Photos, videos, voice notes, files. Stored as encrypted bytes — the AES key lives only inside the per-message E2E payload, never on the server. WhatsApp can host petabytes of media it cannot decrypt.

The "to-deliver" queue is the unit of scale. When Sarah sends to a 50-person group, the message becomes 50 rows in the pending-message table — one per recipient device. Each row is the ciphertext encrypted with that specific device's session key. As soon as a recipient's phone reconnects and acks, that row is deleted. Storage stays bounded because messages don't accumulate; they flow through.
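A minimal in-memory sketch of that per-device queue — an illustrative shape, not WhatsApp's actual schema:

```python
from collections import defaultdict

# Per-device pending queue: one row per recipient device, deleted on ack,
# so storage stays bounded even at 100B msgs/day.

class PendingMessageStore:
    def __init__(self):
        self.queues = defaultdict(list)  # device_id -> [(message_id, ciphertext)]

    def fan_out(self, message_id: str, per_device_ciphertexts: dict):
        # Each value is the ciphertext encrypted for THAT specific device.
        for device_id, ct in per_device_ciphertexts.items():
            self.queues[device_id].append((message_id, ct))

    def drain(self, device_id: str) -> list:
        # Called when the device reconnects: deliver everything in order.
        return list(self.queues[device_id])

    def ack(self, device_id: str, message_id: str):
        # Delivery confirmed -> delete the row; nothing accumulates.
        self.queues[device_id] = [
            (mid, ct) for mid, ct in self.queues[device_id] if mid != message_id
        ]

store = PendingMessageStore()
store.fan_out("m1", {"mom-phone": b"ct-A", "mom-laptop": b"ct-B"})
assert store.drain("mom-phone") == [("m1", b"ct-A")]
store.ack("mom-phone", "m1")
assert store.drain("mom-phone") == []          # row gone after ack
assert store.drain("mom-laptop") == [("m1", b"ct-B")]  # laptop still pending
```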
Step 6 · CORE

High-Level Architecture — From Naive to Production

This is the section that decides whether you understand WhatsApp or just kind of know what it is. We build the system in three passes: a naive central server that a textbook would draw, the failure modes that force the production split, and the final shape where every component earns its keep.

Pass 1 — The naive design (and why it breaks)

A textbook chat server: clients open an HTTP connection, POST messages to a central server, the server stores them in a database and forwards them to the recipient. Just like SMS, but on the internet.

flowchart LR
    S["Sarah's Phone"] --> SRV["Central Chat Server"]
    SRV --> DB[("MySQL — all messages")]
    SRV --> M["Mom's Phone"]

Walk this through with two billion users, and four concrete failures fall out:

💥 The server can read every message

This is the privacy failure, and it's the one that disqualifies the design entirely. Every photo, every voice note, every message between a journalist and a source — visible to whoever runs the server, plus anyone who breaches it, plus any government that subpoenas it. WhatsApp's central promise breaks before you even get to scale.

💥 Recipient is offline → message is lost

Mom's phone is on a 2G network in Mumbai and her battery died at 3am. Sarah's message hits the server, the forward to Mom fails, and the textbook design just... drops it. We need a durable per-device offline queue that holds the message until Mom's phone comes back online, sometimes 12 hours later.

💥 1M msgs/sec crushes one server

A single beefy server handles maybe 50K open WebSocket connections and a few hundred thousand msgs/sec at best. We need 500M+ concurrent connections and 5M peak msgs/sec. That's a horizontal-scaling problem — but if we just add more chat servers, how does Sarah's server know which server holds Mom's connection right now?

💥 No multi-device — Mom only has her phone

Mom uses WhatsApp Web at her desk and the app on her phone. The naive design has one "Mom" — when Sarah sends, it goes to "Mom's connection." But Mom has two connections, both wanting the message, and each runs its own private key. The server cannot just clone the ciphertext to both — they were encrypted to different keys.

Pass 2 — The mental model: encrypt on the device, fan out per-device on the server

The single insight that makes WhatsApp possible is this: encryption happens on the devices, not on the server. Each device has its own key pair. The server is reduced to a router of opaque ciphertext. When a user has multiple devices, the message is encrypted multiple times — once per recipient device — before it leaves the sender. The server fans out the pre-encrypted ciphertexts to the right device queues.

🔐 The Encryption Plane (on devices)

Where the actual privacy happens. Each device runs the Signal Protocol — X3DH for the initial handshake, Double Ratchet for per-message forward secrecy. To send to Mom (who has 3 devices), Sarah's phone produces 3 ciphertexts, one per device. The server never sees any plaintext. Even if the server is fully compromised, past messages stay secret.

📡 The Connection & Routing Plane (on servers)

Where the scaling happens. A pool of Connection Servers each holds tens of thousands of persistent sockets. A Routing Service maps (user_id, device_id) → which connection server holds them right now. To deliver, you look up the route, push the ciphertext blob to that connection server, and let it write to the device's socket. Server has zero visibility into the payload.

What travels where: plaintext lives only on devices. Ciphertext travels both directions over persistent sockets. The Routing Service has metadata (who is connected where) but never payload. Media bytes travel as a separate flow to S3, also pre-encrypted by the sender. This split lets the encryption story stay airtight while the messaging fabric scales like a normal pub-sub system.

Pass 3 — The production shape

The full architecture, with every node numbered. Match each circled digit in the diagram to a card below to see what the component does, why it exists, and what would break without it.

flowchart TB
    CL["① Client — Phone, Web, Desktop, Tablet"]
    subgraph EDGE["Edge"]
        LB["② L4 Load Balancer"]
    end
    subgraph CONN["Connection Plane"]
        CS["③ Connection Server cluster — 50K+ sockets each"]
        RT["④ Routing Service — user_id, device_id → conn_server"]
    end
    subgraph IDENT["Identity & Keys"]
        ID["⑤ Identity Service — phone OTP, registration"]
        KDS["⑥ Key Distribution Service — pre-keys per device"]
    end
    subgraph STORE["Storage Plane"]
        MS["⑦ Pending Message Store — per-device offline queue"]
        GR["⑨ Group Chat Service — per-device fanout"]
        PR["⑩ Presence Service — online, last-seen"]
    end
    subgraph MEDIA["Media Plane"]
        MD["⑧ Media Service — presigned S3 URLs"]
        S3[("S3 — encrypted blobs")]
    end
    subgraph CALLS["Calls"]
        VC["⑪ Voice / Video Service — WebRTC signaling, TURN"]
    end
    PUSH["⑫ Push Notification Service — APNS / FCM"]
    CL <--> LB
    LB <--> CS
    CS <--> RT
    CS <--> MS
    CS <--> GR
    CS <--> PR
    CL --> ID
    CL --> KDS
    CS -.lookup keys.-> KDS
    CL --> MD
    MD --> S3
    CS --> VC
    CL <--> VC
    MS -.wake offline phones.-> PUSH
    PUSH -.APNS/FCM.-> CL
    style CL fill:#e8743b,stroke:#e8743b,color:#fff
    style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style CS fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style RT fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style ID fill:#171d27,stroke:#38b265,color:#d4dae5
    style KDS fill:#171d27,stroke:#38b265,color:#d4dae5
    style MS fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style GR fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style PR fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style MD fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style S3 fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style VC fill:#171d27,stroke:#d4a838,color:#d4dae5
    style PUSH fill:#171d27,stroke:#e05252,color:#d4dae5

Component-by-component — what each numbered box does

Each card answers the four newbie questions: what is it, why does it exist, what would break without it, and where does it sit in the flow.

Client (Phone, Web, Desktop, Tablet)

The client is where the entire encryption story lives. It runs the Signal Protocol library, holds the device's private identity key (which never leaves the device), maintains a chat database on local disk, and keeps a persistent socket open to a Connection Server. Sarah's phone, her laptop, and her tablet are three independent clients, each with its own key pair.

Solves: the privacy guarantee. Because the server holds no private key, even a full server breach cannot decrypt past messages. Every architectural decision elsewhere assumes the heavy work of encryption, decryption, and indexing happens here.

L4 Load Balancer

A layer-4 (TCP) load balancer that terminates TLS and forwards the persistent socket to a Connection Server. It is L4, not L7, because the traffic is a custom binary protocol, not HTTP — there are no URLs to inspect, just raw frames. It health-checks Connection Servers every few seconds and pulls dead ones from rotation.

Solves: the entry point at scale. Without the LB, clients would have to know exact Connection Server IPs — meaning you couldn't add or rotate servers without updating every device. The LB also gives you a stable DNS name like g.whatsapp.net behind which the server fleet can change freely.

Connection Server cluster

Each Connection Server holds a small army of persistent sockets — historically tens of thousands per box, and on FreeBSD-tuned WhatsApp Erlang servers, famously up to 2 million on one host. When Sarah's phone is online, exactly one socket on exactly one Connection Server is hers. Inbound messages from her are read off that socket; outbound messages to her are written to that socket.

Solves: the "500M concurrent users" problem. HTTP request-response cannot scale to that many open conversations because every request pays a TLS-handshake tax. Persistent multiplexed sockets pay that tax once per device-week, then send millions of frames over the same socket. This is also where presence ("online") naturally lives — a socket open means online; closed means last-seen.

What if a Connection Server dies? The 50K-1M clients on it reconnect (their phones detect the socket drop within seconds), the Load Balancer routes them to a healthy server, and the Routing Service updates their (user_id, device_id) → conn_server mapping. A few seconds of perceived latency on those devices, no message loss because pending messages are stored durably elsewhere.

Routing Service

A high-throughput key-value service whose only job is to answer "which Connection Server is currently holding the socket for (user_id, device_id)?" Backed by Redis or an in-memory distributed store like Mnesia, partitioned by user_id. Updated whenever a client connects or disconnects.

Solves: message delivery across a horizontally scaled fleet. When Sarah's Connection Server in São Paulo needs to push her message to Mom in Mumbai, it does not broadcast to all servers — it asks Routing Service "where is Mom's phone right now?" and gets back conn-mum-42. Without this lookup, you would either fanout to every server (impossibly wasteful) or pin all of a user's traffic to a fixed server (no fault tolerance).
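The Routing Service is conceptually just a key-value map with connect/disconnect hooks; a minimal sketch (the server name `conn-mum-42` echoes the article's example):

```python
# Routing Service sketch: (user_id, device_id) -> Connection Server name.
# Updated on every socket connect/disconnect; queried on every delivery.

class RoutingService:
    def __init__(self):
        self.routes = {}

    def connect(self, user_id: str, device_id: str, conn_server: str):
        self.routes[(user_id, device_id)] = conn_server

    def disconnect(self, user_id: str, device_id: str):
        self.routes.pop((user_id, device_id), None)

    def lookup(self, user_id: str, device_id: str):
        # None means "no socket" -> write to Pending Message Store + push instead
        return self.routes.get((user_id, device_id))

rt = RoutingService()
rt.connect("mom", "phone", "conn-mum-42")
assert rt.lookup("mom", "phone") == "conn-mum-42"  # push ciphertext here
rt.disconnect("mom", "phone")
assert rt.lookup("mom", "phone") is None           # offline: queue + APNS/FCM
```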

Identity Service

Phone-number registration. When you install WhatsApp, this service sends an SMS OTP to your phone, you type it back, and the service issues an auth_token tied to (phone_number, device_id). Phone numbers are the user-facing identity — there are no usernames. Stores the +E.164 phone, a stable opaque user_id, and the device list.

Solves: bootstrapping identity. Without it, anyone could register as any phone number, defeating contact-discovery (the killer feature: "all your phone contacts who are on WhatsApp appear automatically"). The OTP gates registration to the actual SIM holder.

Key Distribution Service

The Signal Protocol's X3DH handshake requires the sender to fetch three things about the recipient device: an identity public key (long-lived), a signed pre-key (rotated weekly), and a one-time pre-key (consumed per first-message). Devices upload batches of one-time pre-keys when they register and refill the pool periodically. The Key Distribution Service stores and serves these.

Solves: asynchronous handshakes. Sarah may want to send Mom a message while Mom's phone is off — Signal still needs Mom's keys to derive a session. Pre-keys uploaded ahead of time make this work. Without the pre-key system, you could not start a Signal session with someone who is not currently online — Signal would devolve into a real-time-only protocol.
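The pre-key pool's defining behavior — each one-time pre-key handed out at most once — can be sketched as follows. Key material here is fake placeholder strings, and the storage shape is an assumption:

```python
# Key Distribution Service sketch: devices upload pre-key batches; each
# get_keys() consumes one one-time pre-key, which is what gives the very
# first message its forward secrecy.

class KeyDistributionService:
    def __init__(self):
        self.bundles = {}  # device_id -> {identity, signed, one_time: [...]}

    def upload(self, device_id, identity_key, signed_pre_key, one_time_pre_keys):
        self.bundles[device_id] = {
            "identity": identity_key,
            "signed": signed_pre_key,
            "one_time": list(one_time_pre_keys),
        }

    def get_keys(self, device_id):
        b = self.bundles[device_id]
        # Pop so the same one-time key is never served twice; None when the
        # pool is drained (real clients refill it periodically).
        one_time = b["one_time"].pop(0) if b["one_time"] else None
        return {"identity_key": b["identity"],
                "signed_pre_key": b["signed"],
                "one_time_pre_key": one_time}

kds = KeyDistributionService()
kds.upload("raj-phone", "IK-raj", "SPK-raj", ["OTK-1", "OTK-2"])
assert kds.get_keys("raj-phone")["one_time_pre_key"] == "OTK-1"
assert kds.get_keys("raj-phone")["one_time_pre_key"] == "OTK-2"  # pool drains
```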

Pending Message Store

A per-device offline queue. When Sarah sends to Mom (who has 3 devices, all currently offline), the server writes 3 rows: (mom-phone-device, ciphertext-A), (mom-laptop-device, ciphertext-B), (mom-tablet-device, ciphertext-C). Each row sits there until that device reconnects and acks delivery, at which point the row is deleted. Backed by Cassandra/HBase, sharded by device_id.

Solves: "recipient is offline." Without a per-device queue, Sarah's message would be dropped any time Mom's phone was off. With it, Mom can be on a flight for 8 hours and still receive every message in order when she lands.

Media Service

Photos, videos, voice notes, and files don't travel through the messaging path — they would clog it. Instead, Sarah's phone encrypts the photo with a fresh random AES key, uploads the ciphertext to S3 via a presigned URL handed out by the Media Service, and gets back a blob URL. She then sends Mom a regular text message containing { s3_url, aes_key } — and that message is itself end-to-end encrypted with Mom's device key. Mom's phone fetches the blob from S3 and decrypts it locally.

Solves: media at scale, while preserving E2E. The server sees encrypted bytes in S3 and an encrypted little message in the queue — never the photo. Without this split, every 1MB photo would have to flow through the Connection Server fabric, multiplying messaging bandwidth by 100×.
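The whole media flow fits in a few lines. This is a toy: a SHA-256 keystream stands in for AES purely to keep the sketch stdlib-only, and the URL scheme and function names are illustrative, not WhatsApp's:

```python
import hashlib, os

# Media path sketch: encrypt locally with a fresh random key, upload only
# ciphertext, ship {url, key} inside the E2E-encrypted message.

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # XOR with a hash-derived keystream; the same call decrypts. (Toy
    # cipher for illustration — real clients use AES.)
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

blob_store = {}  # stands in for S3: url -> encrypted bytes

def send_photo(photo: bytes) -> dict:
    media_key = os.urandom(32)                         # fresh key per media
    url = f"s3://wa-media/blob-{len(blob_store)}.bin"  # illustrative URL
    blob_store[url] = keystream_xor(media_key, photo)  # S3 sees ciphertext only
    return {"s3_url": url, "media_key": media_key}     # goes inside the E2E msg

def receive_photo(envelope: dict) -> bytes:
    return keystream_xor(envelope["media_key"], blob_store[envelope["s3_url"]])

photo = b"jpeg-bytes-of-the-brooklyn-bridge"
envelope = send_photo(photo)
assert blob_store[envelope["s3_url"]] != photo  # server never holds plaintext
assert receive_photo(envelope) == photo
```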

Group Chat Service

Group fanout. When Sarah sends to a 50-member group, her client encrypts the message body once with the group's sender key (an additional Signal construct), then encrypts that sender key for each recipient device; the Group Chat Service knows the group's device list and writes one Pending Message row per recipient device. New members get re-keyed; departed members are cut off via a sender-key rotation.

Solves: efficient group encryption. Naive approach (encrypt N times for N members) works up to ~10 members but bogs down at 256+. Sender keys let Sarah encrypt the message body once with a symmetric key and only encrypt that key per recipient device — far cheaper. Without the Group Chat Service, every 1024-member group message would spawn a thousand-way client-side fanout that the sender's phone could not afford.
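A rough cost model makes the gap concrete, counting bytes the sender must encrypt per group message. The 80-byte envelope size and 2KB body are illustrative assumptions, not protocol constants:

```python
# Naive scheme vs. sender-key scheme for one group message.

def naive_cost(body_bytes: int, n_devices: int) -> int:
    # Encrypt the full body once per recipient device.
    return body_bytes * n_devices

def sender_key_cost(body_bytes: int, n_devices: int,
                    envelope_bytes: int = 80) -> int:
    # Encrypt the body once, then one small key envelope per device.
    return body_bytes + envelope_bytes * n_devices

body = 2_000  # ~2KB message body (assumed)
print(naive_cost(body, 75))       # 150000 bytes — 75 full copies
print(sender_key_cost(body, 75))  # 8000 bytes — 1 body + 75 envelopes
```

At 75 devices the sender-key scheme is already ~19× cheaper, and the ratio grows with group size — which is what makes 1024-member groups affordable for the sender's phone.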

Presence Service

Tracks "online" and "last-seen" indicators. Subscribes to socket connect/disconnect events from Connection Servers and pushes updates to subscribed clients ("Mom went online at 14:03"). Stored in a fast in-memory cache because presence is ephemeral and high-churn.

Solves: the small UX touches that make a chat app feel alive — the green dot, the typing indicator, "last seen 2m ago." Without it, you would not know whether your message was being read in the moment. Importantly, presence is opt-in and does not leak message content — it is a bit of metadata, not plaintext.

Voice / Video Service

WebRTC for the actual media (audio/video frames travel peer-to-peer when both endpoints can punch through NAT). Call setup signaling — "Sarah is calling, here are her ICE candidates" — flows through the WhatsApp servers, but is itself end-to-end encrypted. When direct P2P fails (strict corporate NATs, mobile carriers), traffic falls back to TURN relay servers run by WhatsApp, which still see only encrypted media.

Solves: low-latency real-time calls without compromising encryption. Without WebRTC + TURN, calls would have to flow through an MCU (Multipoint Control Unit) that mixes audio — which would require decryption on the server. Peer-to-peer media keeps the encryption end-to-end.

Push Notification Service

If Mom's phone has its WhatsApp socket closed (app backgrounded, OS suspended it to save battery), the server cannot push messages directly — the OS owns the radio. Instead, the Pending Message Store fires a notification through APNS (iOS) or FCM (Android). The notification wakes the app, the app reconnects its socket, and pulls the queued ciphertexts. The notification itself contains no message content — just a "you have messages" wake-up signal.

Solves: mobile reality. Phones aggressively suspend background apps; without OS-level push, your friends could message you all day and your phone would not buzz until you opened the app. APNS/FCM is the only OS-sanctioned way around this. The cost: WhatsApp depends on Apple's and Google's infrastructure for the wake-up step, but never trusts them with content.

Concrete walkthroughs — Sarah, Mom, and the Brooklyn Bridge photo

Two flows. First, a 1-on-1 photo send to Mom (single device, simple case). Second, the same photo to a 50-person group chat. Both reference the numbered components above.

📷 1-on-1 photo to Mom (1 device) at 14:02:06 NY time

  1. Sarah's phone ① picks a fresh random AES key, encrypts the JPEG with it, and uploads the ciphertext to Media Service ⑧ → returns s3://wa-media/abc123.bin.
  2. Sarah's phone has no Signal session with Mom yet today, so it fetches Mom's pre-keys from Key Distribution Service ⑥ — gets {identity_key, signed_pre_key, one_time_pre_key} for Mom's phone device.
  3. Sarah's phone runs X3DH to derive a shared session key, then uses Double Ratchet to encrypt the small message { s3_url, aes_key, "from Brooklyn ❤️" } with Mom's session key. Result: ~200 bytes of ciphertext.
  4. Sarah's phone sends the ciphertext over its persistent socket to Connection Server ③ in São Paulo (her current geo).
  5. São Paulo connection server asks Routing Service ④ "where is Mom's phone?" → answer: Mumbai connection server conn-mum-42. But Mom's phone is currently offline (asleep), so the entry says "no socket."
  6. São Paulo connection server writes the ciphertext to Pending Message Store ⑦ keyed by Mom's device_id, and triggers Push Notification Service ⑫ → APNS wakes Mom's iPhone with "1 new message."
  7. Mom's iPhone reconnects its socket to conn-mum-42. Connection Server reads the pending row, writes the ciphertext to Mom's socket.
  8. Mom's phone runs Double Ratchet decrypt → gets {s3_url, aes_key, text}. Phone fetches the encrypted blob from S3, decrypts with the AES key, displays the photo. Total elapsed ~600ms from Mom's phone waking.

👥 Same photo to a 50-person group chat (avg 1.5 devices each ≈ 75 devices)

  1. Sarah's phone ① uploads the encrypted photo to Media Service ⑧ once — same as 1-on-1. The S3 URL and AES key will be sent inside each per-device message.
  2. Sarah's phone fetches the group's device list from Group Chat Service ⑨ — 75 (user_id, device_id) pairs.
  3. Using the group's sender key (Signal's group construct), Sarah's phone encrypts the message body once with a symmetric key, then encrypts that key for each of the 75 devices. Result: 1 body ciphertext + 75 small key envelopes.
  4. Sarah's phone sends the bundle to her Connection Server ③, which iterates over the 75 envelopes:
    • For online devices → Routing Service ④ lookup → push directly to that device's Connection Server.
    • For offline devices → write to Pending Message Store ⑦ + trigger Push Notification Service ⑫.
  5. Each recipient device decrypts the small envelope to recover the sender key, then decrypts the body. Each device fetches the photo from S3 and decrypts with the embedded AES key. End-to-end encrypted to every device with one body ciphertext + 75 tiny envelopes.
So what: WhatsApp's architecture is built around three insights — (1) encryption lives on devices, so the server is just a router of opaque blobs; (2) routing is decoupled from session-holding via a Routing Service that maps users to whichever Connection Server they're on right now; (3) media flows through S3 separately from the messaging path, with the AES key shipped inside the encrypted message. Every box in Pass 3 exists to remove one of the four Pass-1 failures while keeping the server provably blind to content.
Step 7

End-to-End Encryption — The Signal Protocol

WhatsApp's encryption isn't bespoke — it's the open Signal Protocol, designed by Open Whisper Systems / Moxie Marlinspike and audited by the academic crypto community. The protocol does two jobs: set up a session when two devices first talk (X3DH), and encrypt every message with a fresh per-message key that gets thrown away after use (Double Ratchet).

X3DH — the asynchronous handshake

Imagine Sarah wants to start a chat with her friend Raj, who is offline. In a typical TLS-style handshake, both parties must be online at the same time to exchange ephemeral keys. Signal can't require that — Raj might be on a plane. X3DH (Extended Triple Diffie-Hellman) solves this by having Raj pre-upload a bundle of public keys to the Key Distribution Service ⑥ when he registers his device:

🔑 Identity Key

Long-lived. Signs the pre-key. This is the public key Sarah sees on the "verify Raj" QR code.

🔑 Signed Pre-Key

Rotated weekly. Mid-term key, signed by the identity key so Sarah knows it's authentic.

🔑 One-Time Pre-Keys

Pool of 100+. Each used at most once. The one-time-ness gives forward secrecy for the very first message.

When Sarah wants to message Raj, her phone fetches Raj's bundle and runs three Diffie-Hellman computations whose results are mixed together to derive a shared session key. Crucially, Raj never had to be online — his pre-keys were sitting on the server waiting. As soon as Raj's phone wakes, it pulls Sarah's first message + her ephemeral key, runs the matching DH on its end, and arrives at the same session key.
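The "three DH computations mixed together" step can be made concrete with a toy Diffie-Hellman over deliberately insecure small parameters — this only shows the shape of X3DH, not real cryptography, and it omits the optional fourth DH with the one-time pre-key:

```python
import hashlib

P, G = 2**127 - 1, 5  # toy group parameters (NOT cryptographically safe)

def pub(private: int) -> int:
    return pow(G, private, P)

def dh(private: int, public: int) -> int:
    # Classic DH symmetry: dh(a, pub(b)) == dh(b, pub(a))
    return pow(public, private, P)

def mix(*shared_secrets: int) -> str:
    # KDF stand-in: hash the concatenated DH outputs into one session key.
    data = b"".join(s.to_bytes(16, "big") for s in shared_secrets)
    return hashlib.sha256(data).hexdigest()

# Raj uploaded his PUBLIC keys long ago; privates never left his phone.
raj_identity_priv, raj_spk_priv = 111, 222
# Sarah, now, with her identity key and a fresh ephemeral key.
sarah_identity_priv, sarah_eph_priv = 444, 555

# Sarah's side: three DHs against Raj's uploaded public keys.
sarah_session = mix(
    dh(sarah_identity_priv, pub(raj_spk_priv)),       # IK_A × SPK_B
    dh(sarah_eph_priv, pub(raj_identity_priv)),       # EK_A × IK_B
    dh(sarah_eph_priv, pub(raj_spk_priv)),            # EK_A × SPK_B
)

# Raj's side, hours later, using Sarah's public keys from her first message.
raj_session = mix(
    dh(raj_spk_priv, pub(sarah_identity_priv)),
    dh(raj_identity_priv, pub(sarah_eph_priv)),
    dh(raj_spk_priv, pub(sarah_eph_priv)),
)
assert sarah_session == raj_session  # same key, no simultaneous online moment
```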

Double Ratchet — per-message forward secrecy

X3DH gives you a session key, but if that single key encrypted every message, a future compromise of either device would expose the whole chat history. Double Ratchet rotates the encryption key on every message. The "double" refers to two ratchets that mix together:

🔄 Symmetric ratchet (per message)

After each message, derive the next message key from a hash of the current one and immediately destroy the previous. Even if today's message key leaks, yesterday's messages cannot be re-derived from it.

🔄 Diffie-Hellman ratchet (per round-trip)

Every time a new message arrives in the other direction, a fresh DH exchange piggybacks on it, generating a new "chain key." This bounds how long a compromise can listen — at most until the next time the conversation goes the other way.
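The symmetric ratchet's forward-only derivation is a few lines of HMAC. A minimal sketch (the `0x01`/`0x02` domain-separation bytes follow the common KDF-chain convention; treat the exact constants as an assumption):

```python
import hashlib, hmac

# Symmetric ratchet: each message key is one HMAC step off the chain key,
# and the chain key immediately advances. The derivation only runs forward
# through the hash, so a leaked key cannot re-derive earlier messages.

def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain

chain = hashlib.sha256(b"session key from X3DH").digest()
keys = []
for _ in range(3):
    mk, chain = ratchet_step(chain)  # old chain key overwritten (destroyed)
    keys.append(mk)

assert len(set(keys)) == 3  # a fresh key per message, none repeated
```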

sequenceDiagram
    actor Sarah as Sarah (NY)
    participant KDS as Key Distribution ⑥
    participant CS as Connection Server ③
    participant MS as Pending Msgs ⑦
    participant Push as Push (APNS) ⑫
    actor Raj as Raj (offline → online)
    Note over Raj,KDS: Raj registered earlier — pre-keys uploaded
    Sarah->>KDS: get_keys(raj_user_id)
    KDS-->>Sarah: identity_key, signed_pre_key, one_time_pre_key
    Note over Sarah: Run X3DH → derive session key<br/>Encrypt with Double Ratchet
    Sarah->>CS: send_message(ciphertext for raj_phone)
    CS->>MS: write pending row(raj_device, ciphertext)
    CS-->>Push: notify raj
    Push-->>Raj: APNS wake-up
    Note over Raj: phone wakes, reconnects
    Raj->>CS: open socket
    CS->>MS: read pending rows
    MS-->>CS: ciphertext
    CS-->>Raj: deliver ciphertext
    Note over Raj: Run X3DH on his side → derive same session key<br/>Decrypt
    Raj->>CS: ack delivered
    CS->>MS: delete row

What the server sees vs. what the server cannot see

| Server CAN see | Server CANNOT see |
| --- | --- |
| Sender phone number | Message text or media bytes |
| Recipient phone number(s) | Whether two messages have the same content |
| Timestamp of send | Whether a media file is the same as a previously sent one |
| Message size (in ciphertext) | Group chat names, member descriptions |
| That a call happened (signaling) | The audio/video of the call |
Trade-off the server pays for being blind: backup and restore. WhatsApp cannot help you migrate to a new phone, because it has no readable copy of your history. The workaround: encrypted backups to iCloud / Google Drive, where the user holds the encryption key (a recently introduced 64-digit recovery code, or a password). The cloud provider stores the ciphertext but doesn't have the key. A weaker form of the same end-to-end principle, applied to backups.
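A sketch of that recovery-code flow, assuming a PBKDF2-style derivation. WhatsApp's real scheme (an HSM-backed key vault for the password variant) differs in detail, and the iteration count here is illustrative:

```python
import hashlib
import os
import secrets

# Hypothetical sketch: derive a backup encryption key from a 64-digit
# recovery code that only the user holds.
recovery_code = "".join(secrets.choice("0123456789") for _ in range(64))
salt = os.urandom(16)  # stored alongside the ciphertext in iCloud / Drive

backup_key = hashlib.pbkdf2_hmac("sha256", recovery_code.encode(), salt, 100_000)

# The cloud provider stores ciphertext + salt; without the 64-digit code,
# backup_key cannot be recomputed and the history stays unreadable.
assert len(backup_key) == 32
```

Losing the recovery code means losing the backup, which is exactly the trade the prose above describes.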
Step 8

Multi-Device Sync

For years, WhatsApp had a strict 1-user-1-device model: WhatsApp Web was a "mirror" of the phone, and if the phone was off the laptop did not work. In 2021 they shipped multi-device, and the architecture had to bend in interesting ways to keep encryption end-to-end.

The naive (rejected) approach: extract the private key from the phone

The simplest thing would be: when Sarah pairs her laptop, copy her phone's private identity key over to the laptop. Now the laptop "is" the phone. But this weakens the security model — the key now exists in two places, and any compromise of either device exposes the other. It also makes the key transit (over the network) a juicy target.

The actual approach: each device gets its own key pair

When Sarah pairs her laptop, the laptop generates its own identity key locally. Pairing only proves to the WhatsApp servers (and to her contacts) that "this new device is also Sarah." The laptop is treated as a peer device of Sarah's account, not a clone. From the protocol's view, Sarah is multiple devices.

```mermaid
sequenceDiagram
    actor Sarah_Phone as Sarah's Phone (already trusted)
    actor Sarah_Laptop as Sarah's Laptop (new device)
    participant KDS as Key Distribution ⑥
    participant ID as Identity Service ⑤
    Sarah_Laptop->>Sarah_Laptop: generate new identity_key pair
    Sarah_Laptop->>ID: show pairing QR
    Sarah_Phone->>Sarah_Phone: scan QR — get laptop's public key
    Sarah_Phone->>ID: sign laptop's public key with phone's identity key<br/>("Sarah's phone vouches: this laptop is Sarah")
    ID->>KDS: register laptop as a new device under sarah_user_id
    Sarah_Laptop->>KDS: upload signed_pre_key + one_time_pre_keys
```

What this means for senders

When Raj sends Sarah a message, his phone fetches the device list for Sarah from the Key Distribution Service ⑥ — say, 4 devices: phone, laptop, tablet, desktop. Raj's phone runs Double Ratchet for each Sarah-device and produces 4 ciphertexts. The server fans out 4 rows, one per device queue. Each Sarah-device decrypts independently using its own private key.

```mermaid
sequenceDiagram
    actor Raj as Raj's Phone
    participant KDS as Key Distribution ⑥
    participant CS as Connection Server ③
    participant MS as Pending Msgs ⑦
    actor S_Phone as Sarah Phone
    actor S_Laptop as Sarah Laptop
    actor S_Tablet as Sarah Tablet
    actor S_Desk as Sarah Desktop
    Raj->>KDS: get_devices(sarah_user_id)
    KDS-->>Raj: [phone, laptop, tablet, desktop]
    Note over Raj: encrypt 4 times,<br/>once per device session
    Raj->>CS: send_message(4 ciphertexts)
    par fanout
        CS->>MS: row for phone_device
        CS->>MS: row for laptop_device
        CS->>MS: row for tablet_device
        CS->>MS: row for desktop_device
    end
    MS-->>S_Phone: ciphertext_1
    MS-->>S_Laptop: ciphertext_2
    MS-->>S_Tablet: ciphertext_3
    MS-->>S_Desk: ciphertext_4
```

Self-fanout — sending FROM one device must update the OTHER devices

Sarah sends a message from her phone. Her laptop also needs to show "you sent: hi mom" in the chat thread. So when Sarah's phone sends to Mom (4 devices), it also encrypts the message to its own peer devices — laptop, tablet, desktop. That's 4 ciphertexts to Mom plus 3 ciphertexts to Sarah's other devices = 7 ciphertexts for one logical "send." The server fans them all out the same way.
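Counting ciphertexts for one logical send, with `encrypt` standing in for a per-device Double Ratchet session (the device names and session labels are made up):

```python
import hashlib

def encrypt(session_key: str, plaintext: str) -> str:
    # Stand-in for a per-device Double Ratchet encryption -- illustrative only.
    return hashlib.sha256((session_key + plaintext).encode()).hexdigest()

def fan_out(plaintext, recipient_devices, own_devices, sending_device):
    """One logical send = one ciphertext per recipient device, plus one per
    each of the sender's OTHER devices (self-fanout)."""
    targets = list(recipient_devices) + [
        d for d in own_devices if d != sending_device
    ]
    return {d: encrypt(f"session:{d}", plaintext) for d in targets}

moms_devices = ["mom_phone", "mom_laptop", "mom_tablet", "mom_desktop"]
sarahs_devices = ["sarah_phone", "sarah_laptop", "sarah_tablet", "sarah_desktop"]

envelopes = fan_out("hi mom", moms_devices, sarahs_devices, "sarah_phone")
assert len(envelopes) == 7  # 4 to Mom's devices + 3 to Sarah's other devices
```

The server just writes one pending row per entry in `envelopes`; it never learns that seven of them carry the same message.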

🆕 New device → can it read past messages?

No. A laptop paired today does not gain Sarah's phone's history. The phone can optionally re-send some recent messages to the laptop (encrypted to the laptop's new key), and that's how WhatsApp Web shows recent threads after pairing — but nothing forces the phone to do this, and old conversations remain phone-only unless explicitly synced. This is the privacy floor: a stolen-then-paired device cannot retroactively read what came before its pairing.

❌ Removed device → revoke the keys

When Sarah removes her stolen laptop from her account, the Identity Service ⑤ marks that device deleted, the Key Distribution Service ⑥ stops returning its pre-keys, and contacts' phones refresh their device lists on next send. The laptop's keys remain valid for already-sent-but-undelivered messages, but no new messages will be encrypted to it.

Step 9

Group Chat Encryption

Encrypting a message for a 256-person group naively means encrypting it 256 times — once per device session. For small groups that's fine. For 1024-member groups (with multi-device, that's 1500+ recipient devices) it's a serious CPU + bandwidth burden on the sender's phone. Signal's Sender Keys construct fixes this.

Sender Keys — a hybrid approach

For each group, every member's device generates a sender chain key: a symmetric key used to encrypt outgoing messages from that device to that group. The first time Sarah sends to "Family Group," her phone:

  1. Generates a fresh sender chain key for "Family Group."
  2. Uses each member's pairwise Signal session (the one Double-Ratchet session per device pair) to distribute the sender key to every member device — encrypted individually per device.
  3. From that point on, each new message from Sarah is encrypted once with the sender chain key, and the same ciphertext is fanned out to all members.

The sender chain key itself ratchets per-message (forward secrecy preserved). Other members who want to send to the group each have their own sender chain key, distributed once.
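A toy model of the amortization, with HMAC standing in for real AEAD encryption (member counts, labels, and the "pairwise session" keys are all illustrative):

```python
import hashlib
import hmac

def encrypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for real authenticated encryption -- illustrative only.
    return hmac.new(key, data, hashlib.sha256).digest()

members = [f"device_{i}" for i in range(384)]  # 256 members x ~1.5 devices

# One-time setup: Sarah distributes her sender chain key pairwise,
# 384 per-device encryptions via the existing Signal sessions.
sender_chain = hashlib.sha256(b"sarah family-group sender key").digest()
distribution = {
    m: encrypt(b"pairwise-session-" + m.encode(), sender_chain) for m in members
}

def send_group_message(chain: bytes, text: bytes):
    """Steady state: ONE symmetric encryption per message; the chain key
    ratchets forward per message so forward secrecy is preserved."""
    msg_key = hmac.new(chain, b"\x01", hashlib.sha256).digest()
    next_chain = hmac.new(chain, b"\x02", hashlib.sha256).digest()
    return encrypt(msg_key, text), next_chain

ciphertext, sender_chain = send_group_message(sender_chain, b"dinner at 7?")
fanout = {m: ciphertext for m in members}  # server copies, never re-encrypts
assert len(set(fanout.values())) == 1      # one ciphertext, 384 queue rows
```

The 384 setup encryptions happen once; every subsequent message costs one symmetric encryption regardless of group size.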

📥 New member joins

The new member's device gets the current sender keys from each existing member, via pairwise Signal sessions. They can decrypt messages from this point forward — never older messages.

📤 Member leaves

All remaining members rotate their sender keys so the departed member can no longer decrypt new messages. This is one of the more expensive operations in WhatsApp — N members each redistributing a new sender key to N-1 others.

📦 Server's role

Group Chat Service ⑨ stores only the device list. Server fans out the (already encrypted) message to every member device's Pending Message Store ⑦ row. It cannot read the body — just like 1-on-1.

Why this scales: for a 256-member group with avg 1.5 devices each (= 384 device sessions), naive Signal would do 384 Double Ratchet encryptions per message — heavy on a phone. With sender keys, it's one symmetric encryption per message after the initial sender-key distribution. The cost is paid once at group creation / member join, then amortized over thousands of messages.
Step 10

Voice & Video Calls

Calls are a fundamentally different beast from text — they need real-time, low-latency, peer-to-peer media at hundreds of kilobits per second, continuously. They use WebRTC, a web-standard P2P media stack, with WhatsApp servers handling only signaling and NAT-traversal fallback.

Call setup — signaling through WhatsApp

When Sarah taps "video call" on Mom's chat:

  1. Sarah's phone gathers ICE candidates — the list of network addresses she could be reached at (local LAN IP, public NAT IP, TURN-relayed IP).
  2. This candidate list is sent through WhatsApp's normal messaging path (encrypted to Mom) via the Voice/Video Service ⑪. Mom's phone gets a notification "incoming call from Sarah" + the ICE candidates.
  3. Mom's phone gathers her own ICE candidates and sends them back. Both sides now know each other's possible addresses.
  4. WebRTC tries pairs of candidates: direct LAN, then NAT-punched UDP, then TURN-relayed. Whichever connects first wins.
  5. Once a path is established, audio/video frames flow peer-to-peer, bypassing WhatsApp servers entirely.
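The fallback ladder in step 4 can be sketched as a priority walk over candidate pairs. The addresses and the reachability rule below are hypothetical; real ICE scores pairs numerically and probes them concurrently:

```python
# Best path first, fall back until one connects.
PRIORITY = ["lan", "nat_punched_udp", "turn_relay"]  # best to worst

def pick_path(caller_candidates, callee_candidates, reachable):
    for kind in PRIORITY:
        pair = (caller_candidates.get(kind), callee_candidates.get(kind))
        if all(pair) and reachable(pair):
            return kind, pair
    raise ConnectionError("no viable candidate pair")

sarah = {"lan": "192.168.1.4", "nat_punched_udp": "73.9.2.1:40121",
         "turn_relay": "turn.example.net:3478"}
mom = {"lan": "10.0.0.7", "nat_punched_udp": "103.5.8.9:51515",
       "turn_relay": "turn.example.net:3478"}

# Sarah and Mom are on different LANs, so the direct LAN pair fails and the
# NAT-punched UDP pair is the first to connect.
kind, pair = pick_path(sarah, mom, reachable=lambda p: "192.168" not in p[0])
assert kind == "nat_punched_udp"
```

Whichever pair connects first carries the encrypted SRTP media; the TURN entry only wins when everything above it is blocked.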

When P2P fails — TURN relays

Some networks (corporate firewalls, certain mobile carriers, symmetric NATs) block P2P UDP. In that case, both phones connect to a WhatsApp TURN relay server and forward their media through it. The TURN server is a dumb packet pump — it sees only the encrypted SRTP streams, not the audio. So even relayed calls remain end-to-end encrypted; the relay just shuffles bytes.

```mermaid
flowchart LR
    S["Sarah's Phone"]
    M["Mom's Phone"]
    V["Voice/Video Service ⑪<br/>(signaling only)"]
    T["TURN Relay<br/>(fallback packet pump)"]
    S <-.signaling.-> V
    V <-.signaling.-> M
    S <==direct P2P media==> M
    S -.if NAT blocks.-> T
    T -.if NAT blocks.-> M
    style S fill:#e8743b,stroke:#e8743b,color:#fff
    style M fill:#e8743b,stroke:#e8743b,color:#fff
    style V fill:#171d27,stroke:#d4a838,color:#d4dae5
    style T fill:#171d27,stroke:#d4a838,color:#d4dae5
```

Group calls — Selective Forwarding Unit (SFU)

For 1-on-1, P2P is enough. For group calls (8+ participants), full mesh P2P would mean each phone uploads its video stream N-1 times — kills mobile data plans. WhatsApp uses an SFU: each participant sends one upstream stream to the SFU, and the SFU forwards it to all other participants. The SFU does not decrypt — it only routes encrypted frames. (Compare to an MCU, which decodes and mixes — that would break E2E.)

The honest caveat on group calls: while the streams remain individually E2E-encrypted between sender and each receiver, an SFU sees enough metadata (frame sizes, timing) to do some traffic analysis. For most users this is a non-issue; for a journalist on a sensitive call, it's a known limitation. WhatsApp offsets this with calls being short-lived and ephemeral.
Step 11

Storage at Scale

Compared to other messengers, WhatsApp's server-side storage footprint is small — because most data lives on user devices, not on the server. Three storage tiers, each with very different growth curves.

🟢 User & device registry — small, durable

~500GB for 2B users × ~5 devices each × ~50 bytes per device record. PostgreSQL, sharded by phone number. Replicated across regions. Grows linearly with user count, not message count.

🟡 Pending message queues — bounded

~5-10TB. Cassandra/HBase, sharded by device_id. Each row is one undelivered message for one device. Rows are deleted on delivery ack, so the table is flow-through, not accumulating. Even at 100B msgs/day, the queue rarely exceeds ~100M rows at any given moment.
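A quick envelope check on these numbers (the 60-second average residence time for a pending row is an assumption for illustration):

```python
# Registry tier: grows with users, not messages.
users, devices_per_user, bytes_per_device = 2e9, 5, 50
registry_bytes = users * devices_per_user * bytes_per_device
assert registry_bytes == 500e9  # ~500 GB

# Message rate.
msgs_per_day = 100e9
avg_rate = msgs_per_day / 86_400  # ~1.16M msgs/sec average
assert 1.1e6 < avg_rate < 1.2e6

# Flow-through queue: rows live only until the delivery ack, so queue depth
# is rate x residence time, not rate x history.
residence_seconds = 60  # assumed average time a row sits before the ack
queue_rows = avg_rate * residence_seconds
assert queue_rows < 100e6  # consistent with the ~100M-row bound above
```

The point of the arithmetic: because the queue is rate-times-residence rather than rate-times-history, its size is bounded no matter how long the service runs.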

🟣 Media blobs — petabyte scale

S3-class object store. Petabytes of encrypted media. Lifecycle policy: media is kept for ~30 days after delivery (so latecomers / new-device-syncs can fetch), then deleted. The encryption key is per-message and lost from the server immediately — after deletion, even subpoena-recovered S3 bytes are unreadable.

Why "delete on delivery" works

The fundamental architectural choice is: the server is a pipe, not an archive. A user's full message history lives on their device(s). Once Mom's phone acks "got the photo," the server's copy is gone. This is what keeps the server storage manageable — without it, 100B msgs/day × 5 years would be 180 trillion messages the server would need to durably hold, an unmanageable storage cost.

The flip side is the trade we already saw in §7: backup-and-restore must be implemented separately, by encrypted backups to iCloud/Google Drive, where the user holds the key.

Step 12

Data Partitioning

2B users, 500M concurrent connections, 5M peak msgs/sec — there is no single-box version of any tier. Everything is sharded.

📞 Connection Servers — sharded by user_id (consistent hash)

The Routing Service maps user_id → connection_server_id using consistent hashing on a virtual ring. Adding a new Connection Server only relocates 1/N of users (a brief reconnect for them), not the entire fleet. When a server dies, its share is redistributed and the affected clients reconnect.

📥 Pending Message Store — sharded by device_id

Why device_id and not user_id? Because each device is its own delivery target — we want all of one device's pending messages co-located on one shard for fast read on reconnect. Cassandra wide-column rows keyed by (device_id, message_id), replicated 3× per shard.

🆔 User registry — sharded by phone_number

Phone numbers are uniformly distributed (modulo country-code skew), so simple hash sharding works. A user lookup is one shard hit. Replicated across regions for low-latency contact discovery.

🔑 Key Distribution — sharded by user_id

Pre-key bundles for one user live on one shard. When Sarah's phone fetches Mom's keys, that's a single shard read. When a device's one-time pre-key pool runs low, the device writes a small refill batch back to that same shard.

Why consistent hashing matters specifically for Connection Servers: the fleet scales up and down throughout the day (more capacity needed in the evening when more phones are on). With consistent hashing on the virtual ring, adding a node only forces ~1/N of users to reconnect to a different server — usually a sub-second blip. With plain modulo sharding, every scale-up event would force all users to reconnect simultaneously — a thundering herd that could saturate the LB.
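A minimal consistent-hash ring demonstrating the ~1/N property (server names, user IDs, and the vnode count are illustrative):

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def lookup(self, user_id: str) -> str:
        # First ring point clockwise of the key's hash owns the key.
        i = bisect.bisect(self.points, self._h(user_id)) % len(self.points)
        return self.ring[i][1]

users = [f"user_{i}" for i in range(10_000)]
before = Ring([f"conn-{i}" for i in range(10)])
after = Ring([f"conn-{i}" for i in range(11)])  # scale up by one server

moved = sum(before.lookup(u) != after.lookup(u) for u in users)
# Only ~1/11 of users reconnect to a new server. Plain modulo sharding
# (hash(u) % N) would have moved ~10/11 of them.
assert 0 < moved / len(users) < 0.2
```

Run against these parameters, roughly 9% of users change servers after the scale-up, which is the "sub-second blip" the prose describes, versus a near-total reshuffle under modulo sharding.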
Step 13

Fault Tolerance & Multi-Region

Messaging is critical infrastructure: 99.99% means no more than 53 minutes of downtime per year. Failure modes are handled at every tier.

🔁 Connection Server failure

50K-1M sockets drop simultaneously. Affected clients detect the TCP RST within seconds and reconnect via the Load Balancer to a healthy server. Routing Service ④ updates the new mappings. Pending messages were stored elsewhere, so no message loss — just a few seconds of perceived offline.

🗄️ Pending Message Store failure

Replicated 3× per shard via Cassandra. Quorum writes (W=2) mean a single replica failure doesn't block writes. Quorum reads (R=2) heal divergent replicas on the fly. Region-level outages handled by cross-region replication.
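The quorum arithmetic can be checked directly: with N=3 replicas, W=2 and R=2 satisfy R + W > N, so every possible read quorum shares at least one replica with every possible write quorum:

```python
from itertools import combinations

N, W, R = 3, 2, 2
assert R + W > N  # the quorum-overlap condition

replicas = ["replica_a", "replica_b", "replica_c"]

# Exhaustively verify: any 2-replica write set intersects any 2-replica
# read set, so a quorum read always sees the latest quorum write.
for write_set in combinations(replicas, W):
    for read_set in combinations(replicas, R):
        assert set(write_set) & set(read_set)
```

That overlap is what lets a quorum read detect and repair a stale replica on the fly, as described above.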

🧭 Routing Service failure

The most sensitive tier — without it, no message can be routed. Backed by ZooKeeper / etcd or Mnesia for consistent metadata. Quorum-based — tolerates minority node loss. Latency-sensitive enough that reads are usually served from in-memory Erlang / Redis caches at the Connection Server.

🌍 Multi-region active-active

Connection Servers in NA, EU, APAC, LATAM. Clients connect to the geographically nearest region. Cross-region delivery flows: Sarah (NA region) → her local Connection Server → Routing Service says Mom is in APAC → push to APAC region's Connection Server → Mom's socket. Adds ~150ms intercontinental latency, still well under 500ms p99.

Disaster scenario — entire region offline

Suppose APAC's Mumbai data center goes dark. Mom's phone immediately loses its socket. Within seconds, her phone's reconnect attempt routes (via DNS / GeoIP) to the next-closest region — say, Singapore. Routing Service updates (mom_user, mom_device) → conn-sg-7. New incoming messages for Mom from anywhere in the world now flow to Singapore. When Mumbai comes back, traffic gradually rebalances. No message loss because the Pending Message Store is multi-region replicated.

Step 14

Interview Q&A

How does the server fan out a message to a user with 5 devices without being able to read it?
The sender does the encryption work, the server does the routing. When Sarah sends to Mom, her phone first asks the Key Distribution Service ⑥ for Mom's device list — say 5 devices — and the public pre-keys for each. Sarah's phone then runs Double Ratchet 5 times, producing 5 ciphertexts (one per device session), and ships all 5 to her Connection Server in a single bundle. The server reads the routing envelope ("device A → ciphertext 1, device B → ciphertext 2, ..."), looks up each device's current Connection Server in the Routing Service ④, and writes the matching ciphertext to that server (or to the Pending Message Store ⑦ if offline). Server has no idea what's inside any of the ciphertexts — it just sees "blob for device A," "blob for device B," etc.
When Sarah adds a new device, can it read past messages?
No, not by default. The new device generates its own key pair locally; pairing with an existing device only signs the new key as "also Sarah's." Past messages were encrypted to the old devices' keys; the new device doesn't have those private keys, so the ciphertexts are unreadable to it. Optionally, an existing device may re-send recent messages to the new device by re-encrypting them to the new key — that's how WhatsApp Web shows recent threads after pairing. But this is a cooperative push from a trusted device, not a server-side capability. The server cannot offer history because it never had it in plaintext.
How does group chat encryption scale to a 1024-person group?
Sender Keys. Each member's device has a per-group symmetric "sender chain key" used to encrypt outgoing messages from that device to that group. The sender key is distributed once per pair (sender device → recipient device) via the existing pairwise Signal session, then reused for thousands of messages. Per message, that's one symmetric encryption on the sender's phone, not 1024 Double Ratchet ops. The server still fans out the same ciphertext to every recipient device's queue, and each recipient decrypts with the sender key it has on file. Member changes (joins/leaves) trigger a sender-key rotation, paid amortized over the group's lifetime.
Why a custom binary protocol over a persistent socket, not plain HTTP?
500M concurrent users. HTTP request-response forces a TLS handshake per connection, and you'd need to re-establish frequently because phones swap networks (Wi-Fi to cellular and back). Custom binary protocols (originally XMPP, now Noise Pipes) keep a single multiplexed socket open and reuse it for every frame in both directions — receive a message, send an ack, send a new message — for the lifetime of the connection. A single TLS handshake amortizes over thousands of frames. Add to that: HTTP can't easily push from server → client without long-polling or SSE hacks; persistent sockets let the server immediately push an inbound message the moment it arrives, with no client-side polling.
How does WhatsApp Web work without storing the private key on the web?
WhatsApp Web is its own device. When you scan the QR code on web.whatsapp.com, the browser generates a fresh key pair locally (in IndexedDB / browser storage) and your phone signs the new device's public key as "also yours." From then on, the web client receives messages encrypted to its own key, decrypted in-browser. The phone is no longer required to be online (post-2021 multi-device). The trade-off is that browser storage is less robust than mobile keystores — clearing site data wipes the keys, forcing a re-pair. WhatsApp mitigates by warning users before clearing storage on the web app.
How would you handle 1M concurrent voice/video calls?
P2P first, TURN/SFU only as fallback. 1-on-1 calls go peer-to-peer via WebRTC — server load is just signaling (~few KB per call setup), trivially scalable. Group calls hit an SFU that forwards encrypted streams without decoding — SFU bandwidth is the bottleneck, but each SFU box can handle thousands of streams, and SFUs are stateless / regionally deployed. TURN relays for NAT-blocked calls are pure UDP packet pumps, trivially scalable per region. The total server-side cost is tiny compared to messaging because calls are short-lived (avg 4 min) and the heavy lifting (encoding, encrypting frames) is on the phones. WhatsApp's call infrastructure regularly handles tens of millions of concurrent calls during peak New Year's Eve traffic.
If the server is blind, how do you handle abuse and spam?
Metadata heuristics, not content inspection. The server can see send-rates, recipient diversity, account age, device reputation, and "this account just messaged 5000 strangers in 10 minutes" — all without reading messages. Reactive: users can report a chat, which copies the offending messages out of their own device (decrypted by them, since they have the key) and uploads it for review. The reporting flow is the loophole that lets WhatsApp moderate without breaking E2E for everyone. Combined with phone-number-tied identity (each abusive account costs a SIM), this absorbs most spam without inspecting content.
What's the trade-off of "server cannot read messages" for backup/restore?
WhatsApp can't help you migrate, so the user has to. Historically, WhatsApp punted: phone-to-phone backups via the OS (iCloud / Google Drive) where Apple/Google held the data, often unencrypted at rest in the cloud. In 2021, WhatsApp shipped encrypted backups: backups uploaded to iCloud/Drive are encrypted with a key derived from a 64-digit recovery code (or password). The cloud provider stores the ciphertext but doesn't have the key — same E2E principle, applied to backups. Trade-off: lose your recovery code, lose your history forever. WhatsApp considers this an acceptable price for the privacy guarantee.
The one-line summary the interviewer remembers: "WhatsApp is a fleet of stateless socket servers + a Routing Service that maps users to servers, fronted by a per-device offline message queue. Encryption — Signal Protocol's X3DH and Double Ratchet — happens on the devices, so the server is just a router of opaque ciphertext. Multi-device sync works by treating each device as its own peer with its own keys. Media bypasses the messaging path entirely via S3 with the AES key shipped inside the encrypted message."