High-Level Design

Google Docs

Five teammates typing in the same paragraph at the same instant — how the system orders every keystroke, never loses an edit, and converges everyone to the exact same final document

Read this with the framework in mind

This deep-dive applies the HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is Google Docs?

Imagine a Friday afternoon at 4:53 PM. Sarah, Raj, Priya, Tom, and Maya are all racing to finish a quarterly product brief due at 5:00. Every one of them has the same Google Doc open in their browser. Sarah is rewriting the second paragraph. Raj is fixing typos in the introduction. Priya is adding bullet points at the bottom. Tom is highlighting Sarah's paragraph and leaving a comment. Maya has the doc open on her phone, watching from a meeting room. Every keystroke any of them types appears on every other person's screen within 200ms, with their cursor blinking in the right place and their name floating beside it. Nobody's edit is ever lost. Nobody's screen ever shows a different version of the document than anyone else's. That's the system we're designing.

It looks magical, but underneath there's an exact answer to a hard question: when two people type the same character at the same instant, whose edit wins, and how do we keep everyone's view consistent without anyone losing what they typed? A naive last-write-wins approach loses keystrokes; a "lock the doc while one person types" approach destroys the experience. The real answer is a beautiful algorithm called Operational Transformation (OT) — and an architecture built around the fact that edits are tiny operations, not whole-document uploads.

The two questions that drive every design decision below: (1) How do we resolve conflicts when two users edit the same position simultaneously, so everyone converges to the same final document? (2) How do we get every keystroke onto every collaborator's screen in under 200ms, even with 100 concurrent editors, while never losing a single edit even if a network drops?
Step 2

Requirements & Goals

Before drawing a single box, pin down exactly what the system must do. In an interview, asking these questions shows you understand that "real-time collaboration" hides a dozen sub-features.

✅ Functional Requirements

  • Real-time multi-user editing — multiple users type into the same doc, every keystroke streamed to everyone within 200ms
  • Cursor presence — see who else is in the doc and where their cursor is, with their name and color
  • Comments & suggestions — threaded comments anchored to specific text; suggestions a reviewer can accept or reject
  • Offline editing — keep editing if you lose Wi-Fi, then merge cleanly when you reconnect
  • Revision history — see every named version and restore the doc to any past state
  • Sharing — invite by email; permissions of view / comment / edit

⚙️ Non-Functional Requirements

  • Low keystroke latency — local echo instant, remote sync under 200ms p99
  • No edits ever lost — durability is non-negotiable; the user's contract is "if I typed it, it's saved"
  • Conflict-free convergence — every editor must end up viewing the exact same byte-for-byte final document, regardless of order of arrival
  • Scale — supports 100+ concurrent editors on a single doc; billions of docs system-wide
The non-functional requirements are the architectural ones. "Type into a textarea" is a 5-line problem. "Type into a textarea where five other people are also typing and end up with the exact same document" is the part that requires Operational Transformation, a stateful session server per doc, an op log for durability, and a presence channel separate from the edit channel.
Step 3

Capacity Estimation & Constraints

Numbers force every architectural choice. Our system is write-heavy on the hot path (every keystroke is a write) but each write is tiny (a 5-byte operation, not the 100KB document). That asymmetry is the entire reason this design works at all.

User scale

Assume 1 billion total users, 100 million daily active users, and 100 million docs created per day. Average doc size after editing: 100 KB (text + formatting + embedded objects).

  • Daily new storage: ~10 TB/day (100M docs × 100KB)
  • 5-year storage: ~18 PB (10TB/day × 365 × 5)
  • Concurrent editors per doc: 1–5 typical, up to ~100 max
  • Ops/sec per active doc: ~10 ops/sec (roughly one per keystroke)

Hot-path traffic

If only 1% of DAU are actively editing at peak — 1M concurrent editors — each emitting ~3 ops/sec, that's ~3M ops/sec system-wide. Each op is small (op type + position + payload < 100 bytes), so total ingress is ~300 MB/s. Trivial for a fleet of session servers; impossible for a single box.

Op log volume

Every op gets persisted for replay/history. At 3M ops/sec × 100 bytes × 86400s = ~25 TB/day of op log. This grows fast; we'll snapshot periodically and prune old op logs (keeping snapshots for history).

| Metric | Value | Why it matters |
| --- | --- | --- |
| DAU | 100M | Drives total connection & session-server count |
| Concurrent editors | ~1M | Determines fleet size for Doc Session Servers and presence pub/sub |
| Ops/sec system-wide | ~3M/s | Drives ingress bandwidth and op-log write throughput to Cassandra |
| Ops per editor | ~3/s | Determines per-user WebSocket framing, batching opportunities |
| Op size | ~100B | 1000× smaller than the doc — the asymmetry that makes OT viable |
| 5-year storage | ~18 PB | Forces sharding by doc_id and tiered storage (hot SSD + cold S3) |
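The arithmetic behind these estimates is easy to sanity-check; a few lines of Python reproduce every number above (the constants are the stated assumptions, not measured values):

```python
# Back-of-envelope check of the capacity numbers in this section.
KB, MB, TB, PB = 10**3, 10**6, 10**12, 10**15

docs_per_day = 100 * 10**6           # 100M docs created per day
doc_size = 100 * KB                  # average doc size after editing
daily_storage = docs_per_day * doc_size
print(daily_storage / TB)            # → 10.0 TB/day

five_year = daily_storage * 365 * 5
print(five_year / PB)                # → 18.25 PB over 5 years

concurrent_editors = 1 * 10**6       # 1% of 100M DAU editing at peak
ops_per_editor = 3                   # ops/sec per active editor
op_size = 100                        # bytes per op
ops_per_sec = concurrent_editors * ops_per_editor
print(ops_per_sec)                   # → 3000000 ops/sec system-wide
print(ops_per_sec * op_size / MB)    # → 300.0 MB/s ingress
print(ops_per_sec * op_size * 86400 / TB)  # → 25.92 TB/day of op log
```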
Step 4

System APIs

Two distinct API surfaces: a REST/HTTPS surface for one-shot operations (open the doc, read history, share with someone) and a WebSocket surface for the streaming hot path (every op flowing in both directions for the duration of the session).

REST API surface
// Open a document — happens once per session
GET /api/v1/docs/:doc_id
→ 200 OK {
    "doc_id": "abc123",
    "snapshot": { ...latest snapshot of the doc... },
    "snapshot_op_id": 1042,           // last op included in the snapshot
    "tail_ops": [ ...ops 1043..1058... ],  // to bring client up to head
    "ws_url": "wss://docs.example.com/sess/abc123",
    "session_token": "eyJ..."
  }

// Revision history — read past versions
GET /api/v1/docs/:doc_id/revisions?from=<ts>&to=<ts>
→ 200 OK { "revisions": [ {ts, op_count, author_summary}, ... ] }

// Share / change permissions
POST /api/v1/docs/:doc_id/share
{ "user_email": "raj@x.com", "role": "editor" }
→ 204 No Content
WebSocket protocol — the streaming hot path
// Client → server — apply a local edit
{
  "type": "op",
  "doc_id": "abc123",
  "op": { "type": "insert", "pos": 42, "text": "h" },
  "client_op_id": 17,         // monotonic per-client
  "based_on_server_op": 1058  // the latest server op the client has seen
}

// Server → client — broadcast a transformed op
{
  "type": "op",
  "server_op_id": 1059,
  "author_user_id": "u_sarah",
  "op": { "type": "insert", "pos": 42, "text": "h" }
}

// Client → server — cursor position update (separate channel)
{
  "type": "presence",
  "cursor_pos": 156,
  "selection_range": [156, 162]
}

// Server → client — broadcast another user's cursor
{
  "type": "presence",
  "user_id": "u_raj",
  "cursor_pos": 89,
  "color": "#4a90d9"
}
Why WebSocket and not HTTP polling? A keystroke is too small and too frequent for HTTP. Polling at 200ms would burn 5 round-trips per second per user just to ask "anything new?" — most returning empty. WebSocket gives us a single long-lived TCP connection in each direction, so an op leaving Sarah's keyboard arrives on Raj's screen in one network hop, not two. Trade-off: stateful connections complicate load balancing (sticky sessions to the right Doc Session Server), but the latency win is non-negotiable.
Step 5 · CORE

The CORE Problem — Conflict Resolution

Picture the document as the string "the cat sat". At the exact same instant, two collaborators do something different to it. Sarah inserts "big " at position 4 (between "the " and "cat"). Raj inserts "red " also at position 4. They both hit the keyboard within the same 10ms window. What should the document end up looking like to both of them?

💥 Naive: last-write-wins

Server gets Sarah's op first, then Raj's. It applies Sarah's: "the big cat sat". It applies Raj's at position 4: "the redbig cat sat" — wrong, "red" got jammed into "big". Or worse, the server overwrites: it just stores Raj's version, Sarah's "big " is gone forever. An edit was silently lost.

🔒 Naive: lock the doc

One person edits at a time while everyone else waits. Correct, but nobody would use it: the entire point of Google Docs is real-time collaboration. This was Microsoft Word's pre-cloud behavior. Rejected as a product, not just as an implementation.

✨ The right answer: transform, don't overwrite

Order the ops globally (server picks: Sarah first, then Raj). Apply Sarah's as-is: "the big cat sat". Now transform Raj's against it: Sarah inserted 4 chars before Raj's position, so shift Raj's position from 4 to 8. Apply: "the big red cat sat". Both edits preserved, both clients converge. This is Operational Transformation (OT).
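A minimal Python sketch of that transform step (an illustration of the idea, not Google's implementation; inserts only, with ties at equal positions going to the earlier-ordered op):

```python
def apply_insert(doc: str, pos: int, text: str) -> str:
    """Apply an insert op to a document string."""
    return doc[:pos] + text + doc[pos:]

def transform_insert(op, earlier):
    """Shift op's position if the earlier-ordered concurrent insert
    landed at or before it."""
    pos, text = op
    e_pos, e_text = earlier
    if pos >= e_pos:
        pos += len(e_text)
    return (pos, text)

doc = "the cat sat"
sarah = (4, "big ")                  # both ops based on the same parent state
raj = (4, "red ")

doc = apply_insert(doc, *sarah)      # server orders Sarah first
raj_t = transform_insert(raj, sarah) # Raj's position shifts 4 -> 8
doc = apply_insert(doc, *raj_t)
print(doc)                           # → "the big red cat sat"
```

Both edits survive, and any client that applies the same ordered, transformed stream lands on the same string.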

Two famous solutions — OT vs CRDT

The industry has converged on two approaches to this problem. Google Docs uses OT; newer collaborative tools (Notion, Linear via Liveblocks, Figma's text editor) often use CRDTs.

🛠️ Operational Transformation (OT)

Each op is a tiny instruction: insert(pos, text) or delete(pos, length). A central server assigns every op a global sequence number (a logical clock that respects causality). When two ops conflict, the server transforms the later one against the earlier one so they compose correctly. The server then broadcasts the (possibly-transformed) op to everyone.

Analogy: OT is like airline overbooking — when two passengers claim the same seat, a single coordinator (the gate agent) decides who got it and reroutes the other in a way that keeps the whole flight consistent. Centralized authority + transformation.

Trade-off: requires a server in the loop for every op (no peer-to-peer). Transform functions for each pair of op types are tricky to write correctly — Google has horror stories about subtle OT bugs taking years to surface.

Used by: Google Docs, Etherpad, ShareJS.

📐 Conflict-Free Replicated Data Types (CRDTs)

Each op carries enough metadata (unique ID, vector clock, parent reference) that any two ops applied in any order produce the same result — they commute by mathematical construction. Examples: RGA (Replicated Growable Array), Yjs, Automerge. The server is just a relay; it doesn't need to transform anything.

Analogy: CRDTs are like tickets where each one is self-validating — every ticket carries its own proof of where it sits in the overall order, so any kiosk in the system can scan them and end up with the same line. No coordinator needed.

Trade-off: larger metadata per op (each character may carry a 16-byte ID). The doc grows over time even if you delete content, because tombstones must be retained for correctness. Better suited to peer-to-peer / offline-first apps.

Used by: Figma multiplayer, Linear, modern collab toolkits (Yjs, Automerge).

Why Google chose OT in 2010 and stuck with it: at the time, CRDT metadata overhead made docs balloon in size, and Google already had strong server infrastructure for ordering. OT keeps the doc representation compact (just text + format runs, no per-character metadata) and pushes complexity into the transform logic on a single beefy server. The trade-off shows up in this design: every active doc is pinned to one Doc Session Server, which is a state-management story we'll dig into in §8.
Step 6 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.

Pass 1 — The naive design (and why it breaks)

Sketch the simplest possible system: each client PUTs the entire document on every keystroke. The server stores the latest version. To open a doc, a client GETs it. To see other people's edits, a client polls every 500ms.

flowchart LR
    S["Sarah"] -->|"PUT full doc on each keystroke"| APP["App Server"]
    R["Raj"] -->|"PUT full doc on each keystroke"| APP
    APP --> DB[("MySQL — single doc row")]
    APP -->|"GET — poll every 500ms"| R
    APP -->|"GET — poll every 500ms"| S

Four concrete failures emerge the moment two users start typing:

💥 Bandwidth death

The doc is 100KB. Sarah types one character. Her browser uploads 100KB to save it. At 5 keystrokes per second per user, every user burns 500 KB/s on a workload that should cost ~50 bytes/s. Multiply by 1M concurrent editors: 500 GB/s of upload traffic for an op stream that's actually 50 MB/s of real information.

💥 Last-write-wins loses edits

Sarah PUTs version with her change at 14:02:06.001. Raj PUTs his version with a different change at 14:02:06.002. Raj's PUT overwrites Sarah's — her edit is gone. There's no "merge" because the server has no idea what changed; it only sees full snapshots.

💥 No presence, no cursors

Polling for the doc gives you the text, but where is Raj's cursor? Where did he highlight? Polling for cursor positions on top of polling for content doubles the round-trips. Real-time collaboration needs both, in real-time, on different cadences.

💥 No offline support

Sarah loses Wi-Fi for 2 minutes on a flight. She types a paragraph. When the connection comes back, her browser PUTs the whole doc — but Raj has been editing the same doc on the ground. Her PUT overwrites his work. There's no concept of "merge what I did with what they did" because there's no per-edit granularity.

Pass 2 — The mental model: Operations not snapshots, with central ordering

Four insights flip this design:

⚡ 1. Edits are operations, not snapshots

Don't ship the 100KB doc — ship the 50-byte op: {type:"insert", pos:42, text:"h"}. That's a 2000× bandwidth reduction. Apply the op locally and on every other client. The doc is reconstructed by applying the op stream from scratch, or from a snapshot + recent ops.

🎯 2. One server orders all ops for a single doc

Pick one machine to be the canonical authority for each doc. Every op goes through this Doc Session Server, which assigns it a global sequence number, transforms it against any concurrent ops, and broadcasts the final form to all subscribers. With one ordering authority per doc, everyone sees the same sequence — convergence is guaranteed.

🚀 3. Clients apply optimistically, reconcile on round-trip

When Sarah types 'h', her browser shows it instantly (don't make her wait for a server round-trip — that would feel laggy). The op flies to the server, gets ordered and possibly transformed, then comes back. The client compares: if my local op was transformed, I rewrite local state to match the server's version. Local-first feel + global consistency.

🗃️ 4. Persistence = log of ops + periodic snapshots

The Doc Session Server is in-memory; if it dies, the in-flight state is gone. We persist every op to durable storage as it's accepted (Cassandra append). Every 1000 ops, we snapshot the full doc to S3. Doc load = latest snapshot + replay tail ops. Crash recovery = replay log to reconstruct the session state.

This four-way split lets us scale ops independently from cursor presence (separate Redis pub/sub), permissions (separate ACL service), and history (separate snapshot store). Each tier handles a workload it's tuned for.
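Insight 4 can be sketched in a few lines, with in-memory structures standing in for Cassandra and S3 (a toy model; the real snapshot cadence is every 1000 ops, shrunk here so the replay path is exercised):

```python
op_log = []            # stand-in for the Cassandra append-only log
snapshots = {}         # stand-in for S3: last-included op count -> doc state
SNAPSHOT_EVERY = 3     # 1000 in the real design; 3 here for illustration

def accept_op(doc, op):
    """Persist the op durably, apply it, and snapshot periodically."""
    op_log.append(op)                  # append to the log BEFORE broadcasting
    pos, text = op
    doc = doc[:pos] + text + doc[pos:]
    if len(op_log) % SNAPSHOT_EVERY == 0:
        snapshots[len(op_log)] = doc   # serialize full state to object storage
    return doc

def recover():
    """Crash recovery / doc load: latest snapshot plus replay of tail ops."""
    snap_id = max(snapshots, default=0)
    doc = snapshots.get(snap_id, "")
    for pos, text in op_log[snap_id:]:
        doc = doc[:pos] + text + doc[pos:]
    return doc

doc = ""
for i, ch in enumerate("hello"):       # five single-character insert ops
    doc = accept_op(doc, (i, ch))
assert recover() == doc == "hello"     # snapshot + 2-op replay rebuilds state
```

The same `recover()` path serves both crash recovery (rebuild the session server) and doc open (ship snapshot + tail ops to a new client).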

Pass 3 — The production shape

Now the full picture. Twelve numbered components, organized into four planes: the Edit Plane handles ops (the hot path); the Persistence Plane makes edits durable and supports history; the Presence Plane carries cursors and online users; the Permission Plane answers "are you allowed to be here?"

flowchart TB CL["① Client — browser · mobile · OT engine"] subgraph EDGE["Edge"] LB["② Load Balancer — L4 · sticky to doc-server"] end subgraph EDIT["Edit Plane"] DS["③ Doc Session Server — primary per active doc"] OTE["⑫ OT Engine — transforms incoming ops"] LOC["④ Session Locator — doc_id maps to server"] end subgraph PERSIST["Persistence Plane"] OPL[("⑤ Op Log Storage — Cassandra append-only")] SNAP[("⑥ Snapshot Store — S3 every 1000 ops")] HIST["⑩ Revision History UI"] end subgraph PRESENCE["Presence Plane"] PRES["⑧ Presence Service — Redis Pub Sub"] NOTIF["⑪ Notification Service — email · in-app"] end subgraph PERM["Permission Plane"] AUTH["⑦ Auth and Permissions — ACL store"] CMT["⑨ Comments Service"] end CL <-->|"WebSocket"| LB LB --> DS DS --> OTE DS --> LOC DS --> OPL DS --> SNAP DS --> PRES CL --> AUTH CL --> CMT CL --> HIST HIST --> OPL HIST --> SNAP CMT --> NOTIF AUTH --> NOTIF style CL fill:#e8743b,stroke:#e8743b,color:#fff style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5 style DS fill:#171d27,stroke:#e8743b,color:#d4dae5 style OTE fill:#171d27,stroke:#e8743b,color:#d4dae5 style LOC fill:#171d27,stroke:#9b72cf,color:#d4dae5 style OPL fill:#171d27,stroke:#38b265,color:#d4dae5 style SNAP fill:#171d27,stroke:#38b265,color:#d4dae5 style HIST fill:#171d27,stroke:#38b265,color:#d4dae5 style PRES fill:#171d27,stroke:#3cbfbf,color:#d4dae5 style NOTIF fill:#171d27,stroke:#3cbfbf,color:#d4dae5 style AUTH fill:#171d27,stroke:#4a90d9,color:#d4dae5 style CMT fill:#171d27,stroke:#4a90d9,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.

Client (browser / mobile)

The browser tab or mobile app where Sarah, Raj, and the rest are typing. Inside it runs a small OT engine that tracks the local state, the last-acknowledged server op, and a queue of unacknowledged local ops. When Sarah types 'h', the engine applies it locally for instant feedback, sends it to the server with a based_on_server_op stamp, and waits for the transformed echo. When a remote op arrives, the engine transforms it against any unacknowledged local ops, applies the result, and renders.

Solves: the latency problem. Without local optimistic application, every keystroke would need a 100ms round-trip before Sarah sees her own letter appear — typing would feel like a 1990s telnet session. Local-first echo with eventual reconciliation is what makes Google Docs feel snappy.
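A toy model of that client engine, assuming insert-only ops and that a remote op arriving before our ack was ordered first by the server (so it wins position ties; real engines tiebreak by client_id):

```python
class ClientEngine:
    """Toy client-side OT engine: optimistic local echo plus a queue of
    unacknowledged ops that incoming remote ops are transformed against."""
    def __init__(self, doc: str, server_op_id: int):
        self.doc = doc
        self.server_op_id = server_op_id   # last server op we've seen
        self.pending = []                  # local ops not yet acked

    def local_insert(self, pos: int, text: str) -> dict:
        self.doc = self.doc[:pos] + text + self.doc[pos:]   # instant echo
        op = {"pos": pos, "text": text, "based_on": self.server_op_id}
        self.pending.append(op)
        return op                          # the message sent over the WebSocket

    def remote_insert(self, op: dict):
        pos = op["pos"]
        for mine in self.pending:
            if mine["pos"] < pos:          # my op sits before the remote one
                pos += len(mine["text"])
            else:                          # remote (ordered first) shifts mine
                mine["pos"] += len(op["text"])
        self.doc = self.doc[:pos] + op["text"] + self.doc[pos:]
        self.server_op_id = op["server_op_id"]

    def ack(self, server_op_id: int):
        """Server echoed my op (possibly transformed): retire it."""
        self.pending.pop(0)
        self.server_op_id = server_op_id

sarah = ClientEngine("the cat", 1042)
sarah.local_insert(4, "x")                 # instant echo: "the xcat"
sarah.remote_insert({"pos": 4, "text": "y", "server_op_id": 1043})
print(sarah.doc)                           # → "the yxcat", matching the server
```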

Load Balancer (L4, sticky)

The traffic cop, but with a twist: it's a Layer-4 (TCP-level) LB because WebSockets are long-lived TCP connections, not request/response HTTP. Critically, it must be sticky — once a client's WebSocket lands on a particular Doc Session Server, every subsequent message on that connection routes to the same server. We achieve this via consistent hashing on doc_id, so all editors of doc abc123 hit the same backend.

Solves: ensuring all editors of one doc converge on one ordering authority. Without doc-affinity, two clients editing the same doc could land on different servers, neither of which has full visibility into the other's ops — convergence breaks instantly.
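One simple way to get that doc-affinity is rendezvous hashing, a cousin of the consistent-hash ring mentioned above. A sketch with a hypothetical three-server fleet:

```python
import hashlib

SERVERS = ["sess-1:9000", "sess-2:9000", "sess-3:9000"]   # hypothetical fleet

def session_server(doc_id: str, servers=SERVERS) -> str:
    """Rendezvous (highest-random-weight) hashing: every LB and locator
    instance independently maps a doc_id to the same server, and removing
    one server only remaps the docs it was hosting."""
    def weight(server):
        digest = hashlib.sha256(f"{doc_id}|{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=weight)

# Deterministic: all editors of doc abc123 reach the same backend.
assert session_server("abc123") == session_server("abc123")

# Stable: a doc keeps its server as long as that server survives.
home = session_server("abc123")
others = [s for s in SERVERS if s != home]
assert session_server("abc123", [home, others[0]]) == home
```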

Doc Session Server (the heart)

One primary server per active doc. It holds in-memory: the current document state, the last server op-id, the connected clients (their WebSockets), and a small ring buffer of recent ops for transform purposes. When an op arrives, it: (1) verifies the client's based_on_server_op; (2) calls the OT Engine to transform the op against any ops the client missed; (3) assigns the next server_op_id; (4) appends to the op log; (5) broadcasts the transformed op to every other connected client. This is the linearization point — one server, one global order per doc.

Solves: the convergence guarantee. Every collaborator's view of the doc is "the result of applying the server's op stream in order". As long as one server is the authority, everyone agrees on the order, and everyone agrees on the final state.
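The five-step accept loop compresses into a toy sketch (inserts only, with a naive linear scan standing in for the real OT engine and ring buffer):

```python
class DocSessionServer:
    """Toy linearization point: orders ops, transforms late ones,
    appends to the log, then broadcasts."""
    def __init__(self, doc: str = ""):
        self.doc = doc
        self.head = 0
        self.log = []          # (server_op_id, op) — stand-in for Cassandra

    def submit(self, op: dict) -> dict:
        pos, text = op["pos"], op["text"]
        # (1)+(2) transform against every op the client hasn't seen yet
        for sid, earlier in self.log:
            if sid > op["based_on"] and earlier["pos"] <= pos:
                pos += len(earlier["text"])
        # (3) assign the next global sequence number
        self.head += 1
        final = {"server_op_id": self.head, "pos": pos, "text": text}
        # (4) durable append before (5) broadcast
        self.log.append((self.head, final))
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        return final           # broadcast to all other connected clients

srv = DocSessionServer("the cat sat")
srv.submit({"pos": 4, "text": "big ", "based_on": 0})
srv.submit({"pos": 4, "text": "red ", "based_on": 0})   # transformed to pos 8
print(srv.doc)                 # → "the big red cat sat"
```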

Session Locator (Redis / ZooKeeper)

A small, fast lookup table: doc_id → primary_session_server_address. When a new client opens a doc, the LB or front-end service first asks the Session Locator "who's hosting abc123 right now?" If nobody, it picks a server (consistent hashing on doc_id, with availability checks), records the assignment, and routes the client there. On Doc Session Server crash, the locator entry is invalidated and a new server takes over after replaying the op log.

Solves: the "where do I send this op?" problem. Without the locator, two LB instances might pick different servers for the same doc, splitting brain. The locator is the single source of truth for doc-to-server mapping.

Op Log Storage (Cassandra)

An append-only log of every op ever applied, sharded by doc_id. Schema is essentially (doc_id, server_op_id, op_blob, author, timestamp) with the partition key on doc_id and clustering on server_op_id ascending. Cassandra is chosen because writes are pure appends (its happy path), partitioning is trivial, and replication across AZs is built-in. We write every op synchronously before broadcasting it — this is what makes "no edits lost" true even if the Doc Session Server crashes.

Solves: durability and crash recovery. If the in-memory Doc Session Server dies, a replacement reads the op log from the last snapshot to head, rebuilds state, and reconnects clients. Without the log, every crash would lose every in-flight edit.

Snapshot Store (S3 / GCS)

Every 1000 ops (or every 5 minutes, whichever comes first), the Doc Session Server serializes the current document state and uploads it to object storage with a key like docs/abc123/snap_1042.bin. The op log is then trimmable up to that point for recovery purposes (we keep older ops elsewhere for revision history). When a new client opens the doc, the server fetches the latest snapshot and tail ops, ships both to the client, and the client reconstructs the live state in two steps.

Solves: doc load time and recovery time. Without snapshots, opening a 6-month-old doc with 5M ops would mean replaying 5M ops on the client — minutes of wait. With snapshots, it's "fetch a 100KB blob + the last 50 ops" — sub-second load even on huge docs.

Auth & Permissions Service

A separate service that owns the ACL (Access Control List) — the table that says who can view, comment, or edit each doc. Schema: (doc_id, user_id, role). Called twice in the lifecycle: (1) at WebSocket-establish time, the server checks the user has at least viewer role before accepting the connection; (2) before every op write, the session server checks the user has editor role (cached for the session, invalidated on ACL change). When an owner revokes someone's access, this service emits an event that closes their WebSocket on the spot.

Solves: security at the right layer. Permissions are a horizontal concern across docs/comments/sharing — keeping them in one place prevents the bug where someone loses edit access on a doc but their existing WebSocket keeps accepting their writes.
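The role check itself is small. A sketch assuming a simple viewer < commenter < editor < owner hierarchy (the role names beyond the doc's view/comment/edit are illustrative):

```python
# Hypothetical role hierarchy: each role implies the ones below it.
ROLE_RANK = {"viewer": 1, "commenter": 2, "editor": 3, "owner": 4}

# Stand-in for the ACL table: (doc_id, user_id) -> role
acl = {("abc123", "u_raj"): "editor", ("abc123", "u_maya"): "viewer"}

def allowed(doc_id: str, user_id: str, needed: str) -> bool:
    role = acl.get((doc_id, user_id))
    return role is not None and ROLE_RANK[role] >= ROLE_RANK[needed]

assert allowed("abc123", "u_raj", "editor")       # may write ops
assert allowed("abc123", "u_maya", "viewer")      # may open the WebSocket
assert not allowed("abc123", "u_maya", "editor")  # ops rejected
```

Check (1) at connect time uses `needed="viewer"`, check (2) on every op uses `needed="editor"`, with the result cached per session and invalidated on ACL change.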

Presence Service (Redis Pub/Sub)

A separate channel for cursor positions and "who's here". Sarah's cursor blinks at position 42; her client publishes {user_id, cursor_pos:42, color:#orange} to the Redis channel presence:abc123. Every other connected client subscribes to that channel and renders the floating cursor with Sarah's name. This data is not persisted — if you reload, presence resets fresh from whoever's currently in the doc. Volume: dozens of cursor updates per second per editor, but tiny payloads.

Solves: presence without polluting the op log. Cursor positions don't need to be ordered or transformed or persisted forever — they're transient state. Mixing them into the op stream would balloon the log with throwaway data and slow down the OT engine. Separating them onto a pub/sub fabric keeps the edit plane fast.

Comments Service

Comments are a parallel feature: threaded discussions anchored to specific text spans in the doc. Stored in their own table keyed by (doc_id, comment_id) with an anchor_op_id field that points at the op the comment refers to. When the anchored text moves (because of unrelated edits before it), the OT engine emits an "anchor adjusted" event so the comment service updates its position. Comments are loaded on demand when the user opens the side panel; they don't flow through the WebSocket op stream.

Solves: separating discussion from content. Comments have a different lifecycle (resolved/unresolved, threaded replies, notification-worthy). Mixing them into the op log would make doc loading slower and complicate the OT logic.
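The anchor-adjustment rule can be sketched for the simplest case, an insert landing at or before the anchored span (deletes and overlapping edits need more cases in a real service):

```python
def adjust_anchor(anchor: tuple, op: dict) -> tuple:
    """Shift a comment's (start, end) anchor when an unrelated insert
    lands before it. Insert-only simplification of the real adjustment."""
    start, end = anchor
    if op["type"] == "insert" and op["pos"] <= start:
        shift = len(op["text"])
        return (start + shift, end + shift)
    return (start, end)

# A comment anchored on "cat" in "the cat sat" ...
anchor = (4, 7)
# ... stays on "cat" after someone inserts "big " at position 4:
anchor = adjust_anchor(anchor, {"type": "insert", "pos": 4, "text": "big "})
assert "the big cat sat"[anchor[0]:anchor[1]] == "cat"
```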

Revision History UI Service

The "see all changes" tab. When Sarah clicks "Version history", this service queries the op log for ops between two timestamps, replays them on top of the snapshot from that period, and renders a side-by-side diff. To restore an old version: it generates a new op stream that reverses subsequent edits, applied through the normal Doc Session Server so other connected editors see the restoration in real-time as a sequence of operations.

Solves: the "I deleted the wrong section" recovery story. Without revision history, an editor in a panic could erase three days of work. Because every op is immutable in the log, we can reconstruct any past state to the millisecond.

Notification Service

Sends emails and in-app notifications: "Sarah commented on your doc", "Raj shared a doc with you", "Priya replied to your comment". Listens to events from Comments Service and Auth Service via Kafka. Aggregates and de-dupes (you don't want 50 emails because Sarah typed for 50 minutes). Async — never on the critical path of an edit.

Solves: getting people back into the doc after a collaborative event. Without it, a comment Tom leaves on Friday at 5pm waits silently in the doc until somebody happens to open it — the loop never closes.

OT Engine (transformation logic)

A library that lives inside the Doc Session Server (and inside every client too — the algorithm is symmetric). Implements the transform function for every pair of op types: insert/insert, insert/delete, delete/delete, format/format, etc. Given two concurrent ops A and B that were both based on the same parent state, it produces transformed versions A' and B' such that applying A then B' equals B then A' equals the same final state. This convergence property is the entire correctness contract of the system.

Solves: the conflict-resolution problem in §5. Without transform, concurrent ops applied in different orders on different machines produce different final states — the doc forks. With transform, every ordering produces the same byte-for-byte result.

Concrete walkthrough — Sarah and Raj race the same sentence

Two scenarios traced through the production architecture above:

📝 Scenario A — Concurrent edits at the same position

The doc is currently "the cat sat", last server op-id 1042. Sarah is at position 4, Raj is at position 4. Both are about to insert.

  1. Sarah types 'a' at 14:02:06.001. Her Client ① applies it locally → her screen now reads "the acat sat". She sends {op:insert, pos:4, text:'a', based_on:1042, client_op:17} over the WebSocket to the Load Balancer ② → routes to Doc Session Server ③.
  2. Raj types 'b' at 14:02:06.002 (1ms later). His Client ① applies locally → his screen reads "the bcat sat". He sends {op:insert, pos:4, text:'b', based_on:1042, client_op:5}.
  3. Server receives Sarah's op first. No transform needed (it's based on 1042, the current head). DS ③ assigns server_op_id:1043, appends to Op Log ⑤, broadcasts to Raj. Doc state is now "the acat sat".
  4. Server receives Raj's op. It's also based on 1042, but the head is now 1043. DS ③ asks OT Engine ⑫ to transform Raj's op against op 1043 (Sarah's insert at pos 4). Result: Raj's op becomes {insert, pos:5, text:'b'} — shifted past Sarah's character. DS assigns server_op_id:1044, appends to log, broadcasts the transformed op to Sarah.
  5. Raj's client receives the echo of his own op (now transformed). His OT engine reconciles: it had locally applied insert at 4, but the server says insert at 5. It rewinds local state, replays Sarah's op at 4, then reapplies his op at 5. Final state on Raj's screen: "the abcat sat".
  6. Sarah's client receives Raj's transformed op and applies insert at 5. Final state on Sarah's screen: "the abcat sat".
  7. Both converged to the exact same string. Total elapsed: ~150ms. Neither edit was lost.

✈️ Scenario B — Sarah goes offline, types 100 chars, comes back

  1. At 14:00, Sarah's Wi-Fi drops. The Client's ① WebSocket reconnect logic kicks in but fails. She keeps typing — her local OT engine queues 100 ops in a local IndexedDB-backed buffer. Her screen shows her edits in real time (local-first); she has no idea anything's wrong.
  2. Meanwhile, Raj keeps editing on the ground. The Doc Session Server ③ has accepted 30 ops from him, advancing the head from 1042 to 1072. They're persisted to the Op Log ⑤.
  3. 14:08, Sarah's flight lands. WebSocket reconnects via Load Balancer ② → Session Locator ④ routes her back to the same Doc Session Server ③.
  4. Reconnect handshake: Sarah's client says "my last seen server_op was 1042; I have 100 buffered ops based on that". Server fetches ops 1043–1072 from the op log and ships them to Sarah's client.
  5. Sarah's client uses OT Engine ⑫ to transform her 100 buffered ops against Raj's 30. The result: 100 transformed ops that respect Raj's intervening changes.
  6. Sarah's client streams the 100 transformed ops to the server in order. Server applies each, advancing head to 1172, broadcasting each to Raj as it goes. Raj sees Sarah's offline work flow in over a few seconds — animated, but consistent.
  7. Both converged to a doc with everyone's contributions. No work lost from either side.
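Step 5 of Scenario B, rebasing the offline buffer, looks roughly like this (inserts only, and each buffered op is transformed independently; a real engine also advances the missed server ops past earlier buffered ops):

```python
def rebase(buffered, missed):
    """Transform a queue of offline insert ops (all based on the old head)
    past the ops the server accepted meanwhile. Earlier server ops win
    position ties, matching Scenario A."""
    out = []
    for op in buffered:
        pos = op["pos"]
        for srv in missed:
            if srv["pos"] <= pos:
                pos += len(srv["text"])
        out.append({"pos": pos, "text": op["text"]})
    return out

# Sarah typed "big " at 4 while offline; Raj meanwhile inserted "lazy " at 8.
sarah_offline = [{"pos": 4, "text": "big "}]
raj_missed = [{"pos": 8, "text": "lazy "}]
assert rebase(sarah_offline, raj_missed) == [{"pos": 4, "text": "big "}]

# Had Raj edited *earlier* text, Sarah's op would shift right past it:
assert rebase(sarah_offline, [{"pos": 0, "text": "## "}]) == [{"pos": 7, "text": "big "}]
```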
So what: the architecture is built around three insights — (1) edits are tiny ops, not whole-doc snapshots, so the network and storage costs are 1000× smaller than naive; (2) one server is the ordering authority per doc, which is where convergence comes from — but it's hidden behind a Session Locator and op-log replay so a crash means seconds of recovery, not lost work; (3) presence, comments, permissions, and history live on their own planes, each tuned for its own workload, so the hot edit path stays a single fast loop. Every box in the diagram earns its place by removing one of the failure modes from Pass 1.
Step 7

Operational Transformation — Deep Dive

OT's whole job is to take two ops that were both written against the same parent state and produce two transformed versions such that applying them in either order yields the same result. The proof is constructive: define a transform function for every pair of op types, and the algorithm just composes them.

The classic insert/insert collision

Two clients both think the doc is "the cat". Op_A is insert(pos=4, text='x'). Op_B is insert(pos=4, text='y'). They were sent concurrently — both based on the same parent. The server picks an order: A first, then B. After applying A, the doc is "the xcat". Now we need to transform B so it inserts after A's character: insert(pos=5, text='y'). Applying that gives "the xycat". Both clients converge.

sequenceDiagram
    participant SA as Sarah Client
    participant SV as Doc Session Server
    participant RA as Raj Client
    Note over SA,RA: Both clients see doc = "the cat" at server_op 1042
    SA->>SV: insert pos=4 'x' based_on 1042
    RA->>SV: insert pos=4 'y' based_on 1042
    Note over SV: Server receives Sarah first
    SV->>SV: Apply A as op 1043 — doc = "the xcat"
    SV-->>RA: broadcast op 1043 insert pos=4 'x'
    Note over SV: Now Raj's op arrives based_on 1042 — must transform against 1043
    SV->>SV: Transform B against A — pos shifts 4 → 5
    SV->>SV: Apply B' as op 1044 — doc = "the xycat"
    SV-->>SA: broadcast op 1044 insert pos=5 'y'
    Note over SA,RA: Both clients now show "the xycat"

The transform table — every op-pair has a rule

flowchart LR subgraph II["insert vs insert"] II1["A inserts at pos_a — B inserts at pos_b"] II2["if pos_b less than pos_a then B unchanged — if pos_b greater than pos_a then shift B by len_a — if equal then tiebreak by client_id"] II1 --> II2 end subgraph IDEL["insert vs delete"] ID1["A inserts at pos_a len_a — B deletes pos_b len_b"] ID2["A before B range then shift B pos right — A inside B range then split delete around insert — A after B then B unchanged"] ID1 --> ID2 end subgraph DD["delete vs delete"] DD1["A deletes pos_a len_a — B deletes pos_b len_b"] DD2["disjoint ranges then only shift positions — overlapping then subtract overlap from B — B fully inside A then B becomes no-op"] DD1 --> DD2 end style II fill:#171d27,stroke:#e8743b,color:#d4dae5 style IDEL fill:#171d27,stroke:#4a90d9,color:#d4dae5 style DD fill:#171d27,stroke:#e05252,color:#d4dae5

insert vs insert

If B inserts before A's position, B is unchanged. If B inserts after A's position, shift B right by A's length. At the same position, a deterministic client_id tiebreak decides which insert goes first, so all sites pick the same order.

insert vs delete

If A's insert lands before B's delete range, shift B's position. If A lands inside B's delete range, split B into two deletes around the inserted text — neither side wanted to delete what wasn't there.

delete vs delete

If ranges don't overlap, just adjust positions. If they overlap, subtract the overlap from B's length — A already deleted it. If B is fully inside A's range, B becomes a no-op. Critical: never double-delete.
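The other two rows of the table can be sketched the same way. This is a minimal illustration under simplifying assumptions (plain-text ops, no formatting); `transform_del_against_ins` returns its deletes in the order they should be applied, so the second half of a split already accounts for the first half having been applied.

```python
# Sketch of insert-vs-delete and delete-vs-delete transforms — illustrative, not a real library.
from dataclasses import dataclass

@dataclass
class Ins:
    pos: int
    text: str

@dataclass
class Del:
    pos: int
    length: int

def apply_ins(doc: str, a: Ins) -> str:
    return doc[:a.pos] + a.text + doc[a.pos:]

def apply_del(doc: str, d: Del) -> str:
    return doc[:d.pos] + doc[d.pos + d.length:]

def transform_del_against_ins(b: Del, a: Ins) -> list:
    """Rewrite concurrent delete b for application after insert a.
    Returns a list of deletes to apply in order (the split case yields two)."""
    if a.pos <= b.pos:                        # insert before the range: shift right
        return [Del(b.pos + len(a.text), b.length)]
    if a.pos >= b.pos + b.length:             # insert after the range: unchanged
        return [b]
    left = a.pos - b.pos                      # insert inside: split around the new text
    return [Del(b.pos, left),
            Del(b.pos + len(a.text), b.length - left)]  # position valid after first delete

def transform_del_against_del(b: Del, a: Del) -> Del:
    """Rewrite concurrent delete b after delete a — never double-delete."""
    if b.pos >= a.pos + a.length:             # b entirely after a: shift left
        return Del(b.pos - a.length, b.length)
    if b.pos + b.length <= a.pos:             # b entirely before a: unchanged
        return b
    overlap = min(b.pos + b.length, a.pos + a.length) - max(b.pos, a.pos)
    return Del(min(b.pos, a.pos), b.length - overlap)   # length 0 → no-op

# Split case: delete "34567" concurrent with inserting "XX" at position 5.
after_a = apply_ins("0123456789", Ins(5, "XX"))   # "01234XX56789"
for d in transform_del_against_ins(Del(3, 5), Ins(5, "XX")):
    after_a = apply_del(after_a, d)               # "012XX89" — insert survives
```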

The interview move: walk through the insert/insert example with a concrete string. Then mention "and there are similar rules for the other 8 op-pair combinations in a basic OT system, plus another layer for rich-text formatting like bold/italic". Don't try to derive the full table on the whiteboard — interviewers want to see you understand the structure, not memorize the rules.
Step 8

Doc Session Server — The Stateful Bottleneck

The cleanest way to guarantee convergence is to have one machine be the global ordering authority for each doc. That's also the most uncomfortable architectural pattern for a distributed system — stateful, sticky, single-point-of-something. Let's look at what that machine holds and how we keep it from being a brittle single point of failure.

What lives in memory on a Doc Session Server

📦 Per-doc state

  • Current document state — the live string + format runs, ~100KB per doc
  • Last server op-id — monotonic counter, increments on every accepted op
  • Recent ops ring buffer — last ~1000 ops, used for transforming late-arriving client ops
  • Connected clients — list of WebSockets, each with the user_id and last_acked_op_id
  • Pending broadcast queue — ops accepted but not yet flushed to all clients
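The list above can be given a rough shape in code. The names (`DocSession`, `ClientConn`, `accept`) are illustrative, not from any real codebase:

```python
# Sketch of the in-memory per-doc state — field names are illustrative.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ClientConn:
    user_id: str
    last_acked_op_id: int     # ops at or below this id are confirmed on the client
    # a real server would also hold the WebSocket handle here

@dataclass
class DocSession:
    doc_id: str
    state: str = ""                               # live document text (~100KB)
    last_op_id: int = 0                           # monotonic counter
    recent_ops: deque = field(default_factory=lambda: deque(maxlen=1000))  # ring buffer
    clients: dict = field(default_factory=dict)   # user_id -> ClientConn
    pending_broadcast: deque = field(default_factory=deque)

    def accept(self, op) -> int:
        """Assign the next server op-id and queue the op for broadcast."""
        self.last_op_id += 1
        self.recent_ops.append((self.last_op_id, op))
        self.pending_broadcast.append((self.last_op_id, op))
        return self.last_op_id
```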

🔢 Sizing & capacity

One commodity server with 64 GB RAM can comfortably host ~10,000 active docs at 100KB each + buffers + connection overhead. With 1M concurrent editors spread across docs (avg 3 editors/doc → ~330K active docs), we need ~33 Doc Session Servers. With headroom for skew (one viral doc with 100 editors counts more), call it 50 servers.
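The sizing arithmetic above, spelled out:

```python
# Back-of-envelope check of the capacity numbers quoted above.
concurrent_editors = 1_000_000
editors_per_doc = 3                                       # average
active_docs = concurrent_editors // editors_per_doc       # ≈ 333K active docs
docs_per_server = 10_000                                  # 64 GB box: ~100KB/doc + buffers
servers_needed = -(-active_docs // docs_per_server)       # ceiling division → ~33-34
servers_with_headroom = 50                                # slack for skew (viral docs)
```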

The single-server bottleneck — and how we mitigate

⚠️ The trade-off

Pinning each doc to one server gives us simple, correct OT. But it caps a single doc at one server's CPU. A pathological case — 500 people editing a viral document — could saturate one box. There's also a failover blast radius: if the server dies, every doc on it is offline until recovery.

✅ Mitigations

  • Shard by doc_id via consistent hashing across the fleet, so popular docs are spread evenly
  • Replicas: each doc has a hot standby that tails the op log; on primary crash, standby promotes in seconds
  • Op-log replay: any healthy server can take over by reading the log from last snapshot
  • Cap concurrent editors per doc at e.g. 100 active + unlimited viewers; viewers don't go through the OT loop

Failover sequence

sequenceDiagram participant CL as Connected Clients participant LB as Load Balancer participant DS1 as Primary Server participant DS2 as Standby Server participant LOC as Session Locator participant LOG as Op Log Note over DS1: Crash at 14:02:30 CL->>LB: WebSocket frame LB->>DS1: forward Note over LB: TCP connection broken — LB notices in 5s health check LB->>LOC: doc abc123 — who hosts now? LOC->>LOC: invalidate DS1 entry LOC->>DS2: promote you for abc123 DS2->>LOG: read snapshot + ops since LOG-->>DS2: state rebuilt LOC-->>LB: route abc123 to DS2 now CL->>LB: client reconnects WebSocket LB->>DS2: forward DS2-->>CL: handshake — last server_op = 1072 — replay anything you have based on older Note over CL,DS2: Editing resumes — total downtime ~10s
Step 9

Persistence Strategy — Op Log + Snapshots

Two-layer persistence: every op is durably appended to Cassandra synchronously before broadcast (this is what backs the "no edits lost" promise). Periodically, the doc state is snapshotted to S3 so we can load fast and prune old ops.

📜 Op log — Cassandra append-only

Every op the Doc Session Server accepts goes through INSERT INTO ops (doc_id, server_op_id, op_blob, author, ts) VALUES (...) with a partition key on doc_id and clustering on server_op_id ascending. Cassandra's LSM-tree storage makes this a pure-append workload — no random I/O, no read-modify-write. We use a quorum write (W=2 of 3 replicas) so no edit is acknowledged until two nodes have it on disk.

Why Cassandra: append-only ops are its happy path; sharding by doc_id falls out of the partition model; multi-region replication is built-in for global doc availability.
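The key invariant on this path is durable-before-broadcast. A toy in-memory stand-in (no real Cassandra driver involved) that makes the ordering explicit:

```python
# In-memory stand-in for the quorum append — illustrates the ordering, not a real driver.
class FakeReplica:
    def __init__(self):
        self.log = []
    def append(self, row) -> bool:
        self.log.append(row)    # stands in for "on disk at this node"
        return True

def quorum_append(replicas, row, w=2) -> bool:
    """Ack the write only once w of the replicas have the row."""
    acks = sum(1 for r in replicas if r.append(row))
    return acks >= w

def accept_op(replicas, broadcast_queue, doc_id, op_id, op_blob):
    row = (doc_id, op_id, op_blob)
    if not quorum_append(replicas, row):   # durable FIRST...
        raise IOError("quorum write failed — op not accepted")
    broadcast_queue.append(row)            # ...broadcast SECOND

replicas = [FakeReplica() for _ in range(3)]
queue = []
accept_op(replicas, queue, "abc123", 1043, "insert pos=4 'x'")
```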

📸 Snapshots — S3 / GCS every 1000 ops

A background task on the Doc Session Server serializes the current state and uploads docs/abc123/snap_1042.bin to object storage. Cadence: every 1000 ops or 5 minutes, whichever first. After the snapshot is durable, ops up to that point can be archived from the hot Cassandra table to a cold tier.

Why object storage: snapshots are large blobs (~100KB-10MB), cheap to store, and accessed only on doc-load. S3's 11-nines durability is exactly what we want for content that must outlive any one database.

Doc load flow — fast even on a 5-year-old document

Sarah opens a doc that's been edited for 3 years. Naïvely, replaying 5 million ops on the client would take minutes. With snapshots:

  1. Server queries object store for the latest snapshot — snap_4998000.bin (~100KB blob).
  2. Server queries Cassandra for ops with server_op_id > 4998000 — typically 50-1000 ops.
  3. Ships snapshot + tail ops to the client over WebSocket.
  4. Client reconstructs: deserialize snapshot, apply tail ops in order. Done in <200ms.
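A minimal sketch of that load path, with dicts and lists standing in for S3 and Cassandra, and ops simplified to plain text appends:

```python
# Sketch of the doc-load flow: latest snapshot + replay of the tail ops.
def load_doc(snapshots: dict, op_log: list):
    """snapshots: {snap_op_id: state}; op_log: [(op_id, appended_text), ...] in order.
    Ops here just append text — a stand-in for real transformed ops."""
    snap_id = max(snapshots)                 # latest snapshot in the object store
    state = snapshots[snap_id]
    for op_id, text in op_log:
        if op_id > snap_id:                  # replay only the tail past the snapshot
            state += text
    return state, snap_id

state, snap_id = load_doc({1000: "hello"},
                          [(999, "x"), (1001, " world"), (1002, "!")])
# op 999 is skipped: it's already folded into snapshot 1000
```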

Revision history flow

"Show me the doc as it was last Tuesday at 3pm" is a slightly different read pattern. Find the snapshot just before that timestamp, replay ops up to (but not past) the target timestamp. To restore that version: generate a fresh op stream that "diffs" from current state back to the target — apply through the normal Doc Session Server so other connected editors see the restoration as a series of natural ops, with normal undo/redo semantics.

The durability contract: a user's edit is "saved" the moment Cassandra acknowledges the quorum write — before the broadcast to other clients. So if the server crashes between "acked to Sarah's client" and "broadcast to Raj", Sarah's edit is safe in the log; Raj just gets it a few seconds later when the standby takes over. The op log is the single source of truth for "what happened in this doc".
Step 10

Offline Support

Sarah on a transatlantic flight. No Wi-Fi for 8 hours. She opens her doc, edits for 90 minutes, makes 1500 changes. When she lands, she needs all her work to merge cleanly with whatever Raj has been doing on the ground. This is OT's strongest selling point — the same algorithm that handles 50ms latency handles 8-hour latency without changes.

Client-side mechanics

📱 What the client buffers

  • Ordered queue of unacked ops in IndexedDB (survives tab refresh)
  • Last-known-good snapshot from the start of the offline session
  • Last server_op_id seen before disconnect
  • Local doc state (rendered for instant editing while offline)

🔄 What happens on reconnect

  1. WebSocket reconnects, handshake includes last_seen:1042 + count of buffered ops
  2. Server sends ops 1043..head — the missed remote ops
  3. Client transforms its 1500 buffered ops against the missed ops
  4. Client streams transformed ops to server, in order
  5. Server applies them, broadcasts each to other clients in real-time
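Step 3 is the interesting one. A simplified rebase sketch with insert-only ops, where the server-ordered remote op wins ties; it omits the symmetric adjustment a full OT client also performs on the remote side:

```python
# Sketch of the reconnect rebase: transform each buffered local op against
# every missed remote op, in order. Illustrative names, insert-only ops.
from dataclasses import dataclass

@dataclass
class Ins:
    pos: int
    text: str

def transform(local: Ins, remote: Ins) -> Ins:
    """Shift local right if the remote insert landed at or before its position
    (remote ops are already server-ordered, so they win ties)."""
    if remote.pos <= local.pos:
        return Ins(local.pos + len(remote.text), local.text)
    return local

def rebase(buffered: list, missed_remote: list) -> list:
    out = []
    for op in buffered:
        for remote in missed_remote:
            op = transform(op, remote)
        out.append(op)
    return out

# Sarah buffered an insert at pos 10; while offline, Raj inserted 3 chars at pos 2.
rebased = rebase([Ins(10, "hi")], [Ins(2, "abc")])   # position shifts 10 -> 13
```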

Edge cases & limits

⚖️ Long offline = more conflicts

The longer Sarah is gone, the more remote ops her local ops must transform against. In the worst case (8 hours offline, busy doc), the transform pass might be expensive — but each individual transform takes microseconds, so even 10K transforms is well under a second of CPU.

⚠️ Snapshot drift

If Sarah's offline long enough that the doc has been re-snapshotted, she may need to fetch a newer snapshot and rebase her ops on it. Server detects this from her stale last_seen and ships her the new snapshot first.

🛑 Hard conflicts (rare)

If Raj deleted the section Sarah was editing, her edits target a region that no longer exists. OT still converges deterministically: per the insert-vs-delete rule, her inserts land where the deleted section used to be, and any of her deletes covering already-deleted text become no-ops. The UI surfaces a "your edits to a deleted section" message so she can review the recovered text.

Step 11

Cursor Presence — Lighter-Weight Channel

Cursor positions and "who's online" data have a totally different shape from edit ops: frequent, ephemeral, never persisted, no transform needed. Forcing them through the OT pipeline would bloat the op log with throwaway data. We carry them on a separate Redis Pub/Sub channel.

📡 What flows through the presence channel

  • {user_id, cursor_pos, color} — every time the cursor moves
  • {user_id, selection_start, selection_end} — when text is highlighted
  • {user_id, joined} / {user_id, left} — connection events
  • {user_id, typing:true/false} — debounced "is typing now" indicator

📊 Volume & sizing

~10 cursor updates/sec per active editor at peak. With 1M concurrent editors that's 10M presence msgs/sec across the whole system — but each is <100 bytes and Redis Pub/Sub fans it out via doc-specific channels (only same-doc collaborators see each other's presence). One doc with 5 collaborators = 50 msgs/sec on its channel — trivial.

How positions stay accurate when text shifts

Sarah's cursor is at position 156. Raj inserts 10 characters at position 50. Sarah's cursor should now be at position 166, not 156 (it's still on the same word). The Doc Session Server, after applying Raj's op, fires a presence event to shift every other user's cursor positions accordingly. Without this, cursors would drift to incorrect positions every time someone else edited.
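The shift rule itself is tiny. A sketch (the function name `shift_cursor` is illustrative; the delete case assumed here clamps a cursor inside a deleted range to the range's start):

```python
# Sketch: after a remote edit, move every other user's cursor so it stays on the same text.
def shift_cursor(cursor: int, op_pos: int, op_len: int, is_insert: bool) -> int:
    if is_insert:
        # text inserted at or before the cursor pushes it right
        return cursor + op_len if op_pos <= cursor else cursor
    # delete: cursor past the range shifts left; inside the range clamps to its start
    if cursor >= op_pos + op_len:
        return cursor - op_len
    return min(cursor, op_pos)

# Sarah's cursor at 156; Raj inserts 10 chars at 50 → her cursor moves to 166
assert shift_cursor(156, 50, 10, True) == 166
```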

So what: separating presence from edits is the kind of decision that looks small in a diagram but saves the whole system. If cursor updates went through OT, an editor moving their cursor 10 times a second would burn 10 ops/sec of log writes — for data we throw away on disconnect. Pub/Sub on Redis is the right tool: ephemeral, fan-out, fast.
Step 12

Comments & Suggestions

Comments are a feature of the doc but a separate data model. Tom highlights a sentence and writes "needs more detail" — that comment must stick to the sentence even as Sarah keeps editing the surrounding paragraphs.

📌 Anchoring

Each comment stores an anchor_op_id — the server_op_id of the op that produced the text it refers to. As subsequent ops shift the anchored text around, the OT engine emits anchor-adjustment events that the Comments Service consumes to update the position. If the anchored text is fully deleted, the comment becomes "orphaned" and shows in a side panel without an inline marker.

💬 Threaded replies

Comments are a tree: root + replies. Stored as (doc_id, comment_id, parent_id, author, body, ts, resolved). Loaded on demand when the side panel opens — not streamed through WebSocket like ops, because they're rare and bursty.

✅ Suggestions (track changes)

A "suggestion" is an op that's marked pending instead of applied directly. Other editors see it as inline strikethrough/insertion-style markup. The author or an editor with permission accepts it (op promoted to applied) or rejects it (op discarded). The OT layer handles the math identically — the only difference is the rendering and the gating on accept.

🔔 Notifications

Comment posted → emit event to Kafka → Notification Service ⑪ aggregates and sends email/in-app. We de-dupe so Tom doesn't get 50 emails for Sarah's 50-comment review session.

Step 13

Sharing & Permissions

Every doc has an ACL — the table that says who can do what. Permission checks happen at two layers: the front door (WebSocket establish) and the inner kitchen (every write).

🚪 WebSocket establish

Before accepting a connection, the front-end checks that the user's session token + doc_id pair has at least viewer role in the ACL. No role → 403, connection refused. This is the cheap gate that keeps unauthorized users out entirely.

🍳 Per-op role check

Inside the Doc Session Server, every op is checked against the user's role (cached from the open). Viewer trying to write → reject with error frame. The cache is invalidated when the Permissions Service emits a "role changed for this user" event.

🚫 Live revocation

Owner removes Sarah's edit access while she's actively editing → Permissions Service emits an event to the Doc Session Server → server downgrades her role mid-session and starts rejecting her ops; or for "removed entirely", server closes her WebSocket. Within 1-2 seconds her tab shows "you no longer have access".

Sharing flow

Sarah clicks Share, enters Raj's email, picks "editor". Front-end POSTs to /api/v1/docs/abc123/share with body {user_email, role}. Permissions Service writes to the ACL table, then fires a notification event ("Sarah shared a doc with you"). Raj gets an email link; clicking it opens the doc with the editor role baked into his session.

Step 14

Data Partitioning

18 PB of doc data and 25 TB/day of op logs do not fit on one box. We shard everywhere — but the sharding key is identical across services to keep things simple: partition by doc_id, hashed consistently.

📚 Op log (Cassandra)

Partition key on doc_id; clustering on server_op_id. All ops for one doc land on the same partition so they can be read in order with one query. Replication factor 3 across AZs.

📦 Snapshots (S3)

Object key includes doc_id as a prefix; S3's internal partitioning handles the rest. Lifecycle rules move snapshots older than 90 days to Glacier for cold storage at 1/10 the cost.

🖥️ Doc Session Servers

Same consistent-hash ring on doc_id, so the same doc always lands on the same server (until reshuffle). The Session Locator caches the ring; on server-add or server-remove, only ~1/N of docs migrate.
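A toy ring makes the ~1/N claim concrete. This sketch uses MD5 purely for spread and 100 virtual nodes per server; real implementations differ in hash choice and vnode count:

```python
# Toy consistent-hash ring on doc_id: adding a server moves only ~1/N of docs.
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=100):
        # each server owns many points on the ring for even spread
        self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.keys = [p for p, _ in self.points]
    def owner(self, doc_id: str) -> str:
        # first ring point clockwise from the doc's hash
        i = bisect.bisect(self.keys, h(doc_id)) % len(self.points)
        return self.points[i][1]

docs = [f"doc{i}" for i in range(10_000)]
before = Ring([f"s{i}" for i in range(10)])
after = Ring([f"s{i}" for i in range(11)])           # add one server
moved = sum(before.owner(d) != after.owner(d) for d in docs)
# moved is on the order of 10000/11 ≈ 900 — only ~1/N of docs migrate
```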

🔐 ACL store (Postgres or DynamoDB)

Partition by doc_id. The hot read is "give me Sarah's role on doc abc123" — a point query, fast on either store. Permissions tables are tiny relative to op logs (one row per share, not per edit), so a single Postgres cluster shard could hold years of growth.

Why doc_id everywhere: co-locating all of a doc's data on the same shard means the Doc Session Server, the op log query, the snapshot fetch, and the ACL check all hit the same physical region. No cross-shard joins, no fan-out reads on the hot path. The price: a viral 100-editor doc puts disproportionate load on its shard — solved by the per-server caps mentioned in §8.
Step 15

Interview Q&A

OT vs CRDT — when do you pick each?
Pick OT when you control the server tier and want compact doc representation — Google Docs, Etherpad. The transform logic is hairy but the on-the-wire and at-rest size is just text + ops. Pick CRDT when you need peer-to-peer or offline-first with infrequent sync — Figma multiplayer, mobile-heavy collaborative apps. CRDT metadata bloats the doc but eliminates the central server. Modern toolkits (Yjs, Automerge) make CRDT integration easier than 5 years ago, so the gap is narrowing.
How does the server handle 100 concurrent editors on one doc?
Single server, single thread on the hot path — but OT is so cheap (a transform takes microseconds) that even 100 editors at 5 ops/sec each is only 500 ops/sec, comfortably handled on one core. The bottleneck is op-log write throughput, which Cassandra handles at >10K writes/sec per node. Above 100 editors, we'd cap new joiners or move them to a "read-only with auto-refresh" mode — a UX trick that keeps the OT loop small.
What if the Doc Session Server crashes mid-edit?
The op log is the source of truth. Every op was durably written to Cassandra before being broadcast. On crash, the Session Locator routes the doc to a standby server, which reads the latest snapshot + tail ops to reconstruct in-memory state in seconds. Connected clients reconnect via the LB, handshake with their last-seen op-id, and any ops they had pending are replayed (transformed against missed ops). Total user-visible downtime: 5-15 seconds.
How do you reconstruct yesterday's 3pm version of the doc?
Find the snapshot just before 3pm yesterday — say snap_4982000.bin from 2:55pm. Replay ops 4982001..N from Cassandra, where N is the last op-id with a timestamp ≤ 3pm. Stop. The reconstructed state is exactly what users saw at 3pm. To restore that as the live version, generate a delta op stream that takes the current doc back to that state — and feed those ops through the normal Doc Session Server so other connected editors see the restoration as a sequence of natural ops with normal animation and undo support.
How do you handle a 1500-op offline edit on reconnect?
Stream them sequentially after transforming. Client buffers ops locally with their original parent op-id. On reconnect, server ships the missed remote ops; client transforms its 1500 buffered ops against them (a single batched transform pass, microseconds per op). Client then streams the transformed ops to the server in order, server applies and broadcasts each. Other editors see the offline batch flow in over a few seconds — visually animated, but correct.
How does cursor presence not flood the network?
Three tactics. (1) Separate Redis Pub/Sub channel — cursors don't go through the OT pipeline. (2) Client-side throttle — cursor updates batched at most every 50ms, so even rapid arrow-key spam emits 20 msgs/sec max per user. (3) Channel scope — only collaborators on the same doc subscribe, so a 100M-user system never has any single channel with more than ~100 subscribers.
What if two users delete the same character?
The transform handles it. Both ops are delete(pos=42, len=1). Server applies the first, deleting the character; assigns op-id 1043, broadcasts. Server receives the second, calls transform: "this delete is fully inside an already-deleted range → becomes a no-op." Op is logged as a no-op for audit, broadcast as a no-op, both clients converge. The second user's intention (remove that character) is satisfied by the first user's action — nothing was lost, nothing was double-applied.
Why WebSocket and not gRPC streaming or HTTP/2 push?
Browser support and bidirectional simplicity. WebSocket has been universally supported in browsers since 2011 — no polyfills, no transport hacks. gRPC-web exists but is awkward for the symmetric bidirectional flow we need. HTTP/2 Server Push was deprecated in major browsers. WebSocket is just TCP with a framing protocol, which is exactly the abstraction we want for "long-lived bidirectional message stream". For our mobile native apps, we use the same WebSocket protocol over the OS's network APIs.
The one-line summary the interviewer remembers: "Each doc has one Doc Session Server that orders every op via Operational Transformation, persists it to a Cassandra op log before broadcasting to all collaborators over sticky WebSockets, and snapshots to S3 every 1000 ops — with cursor presence on a separate Redis pub/sub channel and permissions checked both at WebSocket establish and per-op. The whole architecture is built around the asymmetry that edits are 100-byte ops, not 100-KB documents."