Five teammates typing in the same paragraph at the same instant — how the system orders every keystroke, never loses an edit, and converges everyone to the exact same final document
This deep-dive follows the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the common patterns and key technologies are at play.
Imagine a Friday afternoon at 4:53 PM. Sarah, Raj, Priya, Tom, and Maya are all racing to finish a quarterly product brief due at 5:00. Every one of them has the same Google Doc open in their browser. Sarah is rewriting the second paragraph. Raj is fixing typos in the introduction. Priya is adding bullet points at the bottom. Tom is highlighting Sarah's paragraph and leaving a comment. Maya has the doc open on her phone, watching from a meeting room. Every keystroke any of them types appears on every other person's screen within 200ms, with their cursor blinking in the right place and their name floating beside it. Nobody's edit is ever lost. Nobody's screen ever shows a different version of the document than anyone else's. That's the system we're designing.
It looks magical, but underneath there's an exact answer to a hard question: when two people type the same character at the same instant, whose edit wins, and how do we keep everyone's view consistent without anyone losing what they typed? A naive last-write-wins approach loses keystrokes; a "lock the doc while one person types" approach destroys the experience. The real answer is a beautiful algorithm called Operational Transformation (OT) — and an architecture built around the fact that edits are tiny operations, not whole-document uploads.
Before drawing a single box, pin down exactly what the system must do. In an interview, asking these questions shows you understand that "real-time collaboration" hides a dozen sub-features.
Numbers force every architectural choice. Our system is write-heavy on the hot path (every keystroke is a write) but each write is tiny (a 5-byte operation, not the 100KB document). That asymmetry is the entire reason this design works at all.
Assume 1 billion total users, 100 million daily active users, and 100 million docs created per day. Average doc size after editing: 100 KB (text + formatting + embedded objects).
- New doc storage: ~10 TB/day (100M docs × 100KB)
- 5-year storage: ~18 PB (10 TB × 365 × 5)
- Concurrent editors per doc: 1–5 typical, up to ~100 max
- Ops per active editor: ~10 ops/sec at burst (roughly one per keystroke); the capacity math below assumes a ~3 ops/sec sustained average
If only 1% of DAU are actively editing at peak — 1M concurrent editors — each emitting ~3 ops/sec, that's ~3M ops/sec system-wide. Each op is small (op type + position + payload < 100 bytes), so total ingress is ~300 MB/s. Trivial for a fleet of session servers; impossible for a single box.
Every op gets persisted for replay/history. At 3M ops/sec × 100 bytes × 86400s = ~25 TB/day of op log. This grows fast; we'll snapshot periodically and prune old op logs (keeping snapshots for history).
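The back-of-envelope math above can be checked directly. A quick sketch, using the values from this section:

```python
# Capacity estimates from this section, as checkable arithmetic.
DAU = 100_000_000
docs_per_day = 100_000_000
doc_size = 100 * 1024            # 100 KB average doc
op_size = 100                    # bytes, upper bound per op

daily_doc_storage = docs_per_day * doc_size          # ~10 TB/day
concurrent_editors = DAU // 100                      # 1% of DAU at peak -> 1M
ops_per_sec = concurrent_editors * 3                 # ~3 ops/sec/editor -> 3M/s
ingress = ops_per_sec * op_size                      # ~300 MB/s
op_log_per_day = ops_per_sec * op_size * 86_400      # ~25 TB/day

print(f"{daily_doc_storage / 1e12:.1f} TB/day of new docs")   # 10.2
print(f"{ingress / 1e6:.0f} MB/s op ingress")                 # 300
print(f"{op_log_per_day / 1e12:.1f} TB/day of op log")        # 25.9
```

The op-log line is why snapshot-and-prune is non-negotiable: at ~25 TB/day, keeping every raw op hot forever would dwarf the document storage itself.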
| Metric | Value | Why it matters |
|---|---|---|
| DAU | 100M | Drives total connection & session-server count |
| Concurrent editors | ~1M | Determines fleet size for Doc Session Servers and presence pub/sub |
| Ops/sec system-wide | ~3M/s | Drives ingress bandwidth and op-log write throughput to Cassandra |
| Ops per editor | ~3/s | Determines per-user WebSocket framing, batching opportunities |
| Op size | ~100B | 1000× smaller than the doc — the asymmetry that makes OT viable |
| 5-year storage | ~18 PB | Forces sharding by doc_id and tiered storage (hot SSD + cold S3) |
Two distinct API surfaces: a REST/HTTPS surface for one-shot operations (open the doc, read history, share with someone) and a WebSocket surface for the streaming hot path (every op flowing in both directions for the duration of the session).
REST API surface

```
// Open a document — happens once per session
GET /api/v1/docs/:doc_id
→ 200 OK {
  "doc_id": "abc123",
  "snapshot": { ...latest snapshot of the doc... },
  "snapshot_op_id": 1042,                // last op included in the snapshot
  "tail_ops": [ ...ops 1043..1058... ],  // to bring client up to head
  "ws_url": "wss://docs.example.com/sess/abc123",
  "session_token": "eyJ..."
}

// Revision history — read past versions
GET /api/v1/docs/:doc_id/revisions?from=<ts>&to=<ts>
→ 200 OK { "revisions": [ {ts, op_count, author_summary}, ... ] }

// Share / change permissions
POST /api/v1/docs/:doc_id/share { "user_email": "raj@x.com", "role": "editor" }
→ 204 No Content
```

WebSocket protocol — the streaming hot path
```
// Client → server — apply a local edit
{ "type": "op", "doc_id": "abc123",
  "op": { "type": "insert", "pos": 42, "text": "h" },
  "client_op_id": 17,            // monotonic per-client
  "based_on_server_op": 1058 }   // the latest server op the client has seen

// Server → client — broadcast a transformed op
{ "type": "op", "server_op_id": 1059, "author_user_id": "u_sarah",
  "op": { "type": "insert", "pos": 42, "text": "h" } }

// Client → server — cursor position update (separate channel)
{ "type": "presence", "cursor_pos": 156, "selection_range": [156, 162] }

// Server → client — broadcast another user's cursor
{ "type": "presence", "user_id": "u_raj", "cursor_pos": 89, "color": "#4a90d9" }
```
Picture the document as the string "the cat sat". At the exact same instant, two collaborators do something different to it. Sarah inserts "big " at position 4 (between "the " and "cat"). Raj inserts "red " also at position 4. They both hit the keyboard within the same 10ms window. What should the document end up looking like to both of them?
Server gets Sarah's op first, then Raj's. It applies Sarah's: "the big cat sat". It applies Raj's at position 4: "the redbig cat sat" — wrong, "red" got jammed into "big". Or worse, the server overwrites: it just stores Raj's version, Sarah's "big " is gone forever. An edit was silently lost.
One person edits at a time while everyone else waits. Trivially correct, but nobody would use it — the entire point of Google Docs is real-time collaboration. This was Microsoft Word's pre-cloud behavior. Rejected as a product, not just as an implementation.
Order the ops globally (server picks: Sarah first, then Raj). Apply Sarah's as-is: "the big cat sat". Now transform Raj's against it: Sarah inserted 4 chars before Raj's position, so shift Raj's position from 4 to 8. Apply: "the big red cat sat". Both edits preserved, both clients converge. This is Operational Transformation (OT).
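The position shift described above is mechanical enough to write down. A minimal sketch, assuming plain insert-only ops (real OT also handles deletes and formatting):

```python
def transform_insert(op, applied):
    """Shift `op` right if `applied` — an earlier concurrent insert on
    the same parent state — landed at or before it. Position ties go
    to the op the server ordered first."""
    pos, text = op
    a_pos, a_text = applied
    if a_pos <= pos:
        pos += len(a_text)
    return (pos, text)

def apply_insert(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

doc = "the cat sat"
sarah = (4, "big ")
raj = (4, "red ")
doc = apply_insert(doc, sarah)                        # "the big cat sat"
doc = apply_insert(doc, transform_insert(raj, sarah)) # raj shifted 4 -> 8
print(doc)                                            # "the big red cat sat"
```

Both edits survive, and any client that applies the same transformed stream ends on the same string.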
The industry has converged on two approaches to this problem. Google Docs uses OT; newer collaborative tools (Notion, Linear via Liveblocks, Figma's text editor) often use CRDTs.
Each op is a tiny instruction: insert(pos, text) or delete(pos, length). A central server assigns every op a global sequence number — with a single ordering authority per doc, a plain monotonic counter suffices; no distributed clocks are needed. When two ops conflict, the server transforms the later one against the earlier one so they compose correctly, then broadcasts the (possibly-transformed) op to everyone.
Analogy: OT is like airline overbooking — when two passengers claim the same seat, a single coordinator (the gate agent) decides who got it and reroutes the other in a way that keeps the whole flight consistent. Centralized authority + transformation.
Trade-off: requires a server in the loop for every op (no peer-to-peer). Transform functions for each pair of op types are tricky to write correctly — Google has horror stories about subtle OT bugs taking years to surface.
Used by: Google Docs, Etherpad, ShareJS.
Each op carries enough metadata (unique ID, vector clock, parent reference) that any two ops applied in any order produce the same result — they commute by mathematical construction. Examples: RGA (Replicated Growable Array), Yjs, Automerge. The server is just a relay; it doesn't need to transform anything.
Analogy: CRDTs are like tickets where each one is self-validating — every ticket carries its own proof of where it sits in the overall order, so any kiosk in the system can scan them and end up with the same line. No coordinator needed.
Trade-off: larger metadata per op (each character may carry a 16-byte ID). The doc grows over time even if you delete content, because tombstones must be retained for correctness. Better suited to peer-to-peer / offline-first apps.
Used by: Figma multiplayer, Linear, modern collab toolkits (Yjs, Automerge).
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.
Sketch the simplest possible system: each client PUTs the entire document on every keystroke. The server stores the latest version. To open a doc, a client GETs it. To see other people's edits, a client polls every 500ms.
Four concrete failures emerge the moment two users start typing:
The doc is 100KB. Sarah types one character. Her browser uploads 100KB to save it. At 5 keystrokes per second per user, every user burns 500 KB/s on a workload whose real information content is a few hundred bytes per second. Multiply by 1M concurrent editors: 500 GB/s of upload traffic for an op stream that actually carries a few hundred MB/s of real information.
Sarah PUTs version with her change at 14:02:06.001. Raj PUTs his version with a different change at 14:02:06.002. Raj's PUT overwrites Sarah's — her edit is gone. There's no "merge" because the server has no idea what changed; it only sees full snapshots.
Polling for the doc gives you the text, but where is Raj's cursor? Where did he highlight? Polling for cursor positions on top of polling for content doubles the round-trips. Real-time collaboration needs both, in real-time, on different cadences.
Sarah loses Wi-Fi for 2 minutes on a flight. She types a paragraph. When the connection comes back, her browser PUTs the whole doc — but Raj has been editing the same doc on the ground. Her PUT overwrites his work. There's no concept of "merge what I did with what they did" because there's no per-edit granularity.
Four insights flip this design:
Don't ship the 100KB doc — ship the 50-byte op: {type:"insert", pos:42, text:"h"}. That's a 2000× bandwidth reduction. Apply the op locally and on every other client. The doc is reconstructed by applying the op stream from scratch, or from a snapshot + recent ops.
Pick one machine to be the canonical authority for each doc. Every op goes through this Doc Session Server, which assigns it a global sequence number, transforms it against any concurrent ops, and broadcasts the final form to all subscribers. With one ordering authority per doc, everyone sees the same sequence — convergence is guaranteed.
When Sarah types 'h', her browser shows it instantly (don't make her wait for a server round-trip — that would feel laggy). The op flies to the server, gets ordered and possibly transformed, then comes back. The client compares: if my local op was transformed, I rewrite local state to match the server's version. Local-first feel + global consistency.
The Doc Session Server is in-memory; if it dies, the in-flight state is gone. We persist every op to durable storage as it's accepted (Cassandra append). Every 1000 ops, we snapshot the full doc to S3. Doc load = latest snapshot + replay tail ops. Crash recovery = replay log to reconstruct the session state.
This four-way split lets us scale ops independently from cursor presence (separate Redis pub/sub), permissions (separate ACL service), and history (separate snapshot store). Each tier handles a workload it's tuned for.
Now the full picture. Twelve numbered components, organized into four planes: the Edit Plane handles ops (the hot path); the Persistence Plane makes edits durable and supports history; the Presence Plane carries cursors and online users; the Permission Plane answers "are you allowed to be here?"
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
The browser tab or mobile app where Sarah, Raj, and the rest are typing. Inside it runs a small OT engine that tracks the local state, the last-acknowledged server op, and a queue of unacknowledged local ops. When Sarah types 'h', the engine applies it locally for instant feedback, sends it to the server with a based_on_server_op stamp, and waits for the transformed echo. When a remote op arrives, the engine transforms it against any unacknowledged local ops, applies the result, and renders.
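The client-side bookkeeping in this card can be sketched in a few lines. The names are hypothetical, the ops are insert-only, and real engines must use the exact same tie-break rule as the server — this is only a shape, not the implementation:

```python
class ClientEngine:
    """Sketch of client OT state: optimistic local apply, a queue of
    unacknowledged ops, and transforming remote ops past them."""
    def __init__(self, doc, last_server_op):
        self.doc = doc
        self.last_server_op = last_server_op
        self.pending = []                   # local ops awaiting server ack

    def local_insert(self, pos, text):
        # Instant echo: apply locally before any server round-trip.
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        op = {"pos": pos, "text": text, "based_on": self.last_server_op}
        self.pending.append(op)
        return op                           # would go out on the WebSocket

    def remote_insert(self, server_op_id, pos, text):
        # Shift the remote op past unacked local inserts; the server-
        # ordered op wins position ties (must match the server's rule).
        for p in self.pending:
            if p["pos"] < pos:
                pos += len(p["text"])
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        # Rebase pending ops so the server's eventual echo lines up.
        for p in self.pending:
            if p["pos"] >= pos:
                p["pos"] += len(text)
        self.last_server_op = server_op_id

c = ClientEngine("the cat", last_server_op=1042)
c.local_insert(4, "big ")          # Sarah's screen: "the big cat" instantly
c.remote_insert(1043, 4, "red ")   # concurrent remote op, server-ordered first
print(c.doc)                       # "the red big cat"
```

Note the rebase step: Sarah's unacked op is shifted from 4 to 8, so when the server's transformed echo of her op eventually arrives, it matches what her screen already shows.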
Solves: the latency problem. Without local optimistic application, every keystroke would need a 100ms round-trip before Sarah sees her own letter appear — typing would feel like a 1990s telnet session. Local-first echo with eventual reconciliation is what makes Google Docs feel snappy.
The traffic cop, but with a twist: WebSockets are long-lived TCP connections, not request/response HTTP, so the balancer routes each connection exactly once — at handshake time, when the doc_id is still visible in the upgrade request's URL — and then stays sticky: every subsequent message on that connection flows to the same Doc Session Server. Consistent hashing on doc_id ensures all editors of doc abc123 hit the same backend.
Solves: ensuring all editors of one doc converge on one ordering authority. Without doc-affinity, two clients editing the same doc could land on different servers, neither of which has full visibility into the other's ops — convergence breaks instantly.
One primary server per active doc. It holds in-memory: the current document state, the last server op-id, the connected clients (their WebSockets), and a small ring buffer of recent ops for transform purposes. When an op arrives, it: (1) verifies the client's based_on_server_op; (2) calls the OT Engine to transform the op against any ops the client missed; (3) assigns the next server_op_id; (4) appends to the op log; (5) broadcasts the transformed op to every other connected client. This is the linearization point — one server, one global order per doc.
Solves: the convergence guarantee. Every collaborator's view of the doc is "the result of applying the server's op stream in order". As long as one server is the authority, everyone agrees on the order, and everyone agrees on the final state.
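The five-step accept path above can be sketched directly. A toy, insert-only version — log append and broadcast are stand-ins, and one instance serves one doc:

```python
class DocSessionServer:
    """Sketch of the per-doc ordering authority: verify based_on,
    transform past missed ops, assign server_op_id, log, broadcast."""
    def __init__(self, doc, head_op_id):
        self.doc = doc
        self.head = head_op_id
        self.recent = []        # (op_id, pos, len) ring buffer for transforms

    def accept(self, pos, text, based_on):
        # (1)+(2) Transform past every op this client hasn't seen yet.
        for op_id, a_pos, a_len in self.recent:
            if op_id > based_on and a_pos <= pos:
                pos += a_len
        # (3) Assign the next global sequence number.
        self.head += 1
        self.recent.append((self.head, pos, len(text)))
        # (4) In the real design: synchronous append to the op log here.
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        # (5) Broadcast the transformed op (returned here instead).
        return {"server_op_id": self.head, "pos": pos, "text": text}

s = DocSessionServer("the cat sat", head_op_id=1042)
s.accept(4, "a", based_on=1042)         # Sarah: applied as-is at pos 4
echo = s.accept(4, "b", based_on=1042)  # Raj: transformed to pos 5
print(echo["pos"], s.doc)               # 5 the abcat sat
```

This is the linearization point in miniature: because the second op declares based_on:1042, the server knows it missed op 1043 and must be transformed before it gets sequence number 1044.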
A small, fast lookup table: doc_id → primary_session_server_address. When a new client opens a doc, the LB or front-end service first asks the Session Locator "who's hosting abc123 right now?" If nobody, it picks a server (consistent hashing on doc_id, with availability checks), records the assignment, and routes the client there. On Doc Session Server crash, the locator entry is invalidated and a new server takes over after replaying the op log.
Solves: the "where do I send this op?" problem. Without the locator, two LB instances might pick different servers for the same doc, splitting brain. The locator is the single source of truth for doc-to-server mapping.
An append-only log of every op ever applied, sharded by doc_id. Schema is essentially (doc_id, server_op_id, op_blob, author, timestamp) with the partition key on doc_id and clustering on server_op_id ascending. Cassandra is chosen because writes are pure appends (its happy path), partitioning is trivial, and replication across AZs is built-in. We write every op synchronously before broadcasting it — this is what makes "no edits lost" true even if the Doc Session Server crashes.
Solves: durability and crash recovery. If the in-memory Doc Session Server dies, a replacement reads the op log from the last snapshot to head, rebuilds state, and reconnects clients. Without the log, every crash would lose every in-flight edit.
Every 1000 ops (or every 5 minutes, whichever first), the Doc Session Server serializes the current document state and uploads it to object storage with a key like docs/abc123/snap_1042.bin. The op log is then trimmable up to that point for recovery purposes (we keep older ops elsewhere for revision history). When a new client opens the doc, the server fetches the latest snapshot and tail ops, ships both to the client, and the client reconstructs the live state in two steps.
Solves: doc load time and recovery time. Without snapshots, opening a 6-month-old doc with 5M ops would mean replaying 5M ops on the client — minutes of wait. With snapshots, it's "fetch a 100KB blob + the last 50 ops" — sub-second load even on huge docs.
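The snapshot-plus-tail load path is easy to make concrete. A sketch where `fetch_snapshot` and `fetch_ops_after` are hypothetical stand-ins for the S3 and Cassandra reads:

```python
def load_doc(fetch_snapshot, fetch_ops_after):
    """Doc open: latest snapshot + replay of the tail ops past it."""
    snap_op_id, doc = fetch_snapshot()
    for op in fetch_ops_after(snap_op_id):   # ops with id > snap_op_id
        if op["type"] == "insert":
            doc = doc[:op["pos"]] + op["text"] + doc[op["pos"]:]
        elif op["type"] == "delete":
            doc = doc[:op["pos"]] + doc[op["pos"] + op["len"]:]
    return doc

# Toy stand-ins: snapshot taken at op 1042, two tail ops bring us to head.
snapshot = lambda: (1042, "the cat sat")
tail = lambda after: [{"type": "insert", "pos": 4, "text": "big "},
                      {"type": "delete", "pos": 8, "len": 4}]
print(load_doc(snapshot, tail))   # "the big sat"
```

Crash recovery is the same function pointed at a different consumer: a replacement Doc Session Server runs it to rebuild in-memory state before accepting new ops.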
A separate service that owns the ACL (Access Control List) — the table that says who can view, comment, or edit each doc. Schema: (doc_id, user_id, role). Called twice in the lifecycle: (1) at WebSocket-establish time, the server checks the user has at least viewer role before accepting the connection; (2) before every op write, the session server checks the user has editor role (cached for the session, invalidated on ACL change). When an owner revokes someone's access, this service emits an event that closes their WebSocket on the spot.
Solves: security at the right layer. Permissions are a horizontal concern across docs/comments/sharing — keeping them in one place prevents the bug where someone loses edit access on a doc but their existing WebSocket keeps accepting their writes.
A separate channel for cursor positions and "who's here". Sarah's cursor blinks at position 42; her client publishes {user_id, cursor_pos:42, color:#orange} to the Redis channel presence:abc123. Every other connected client subscribes to that channel and renders the floating cursor with Sarah's name. This data is not persisted — if you reload, presence resets fresh from whoever's currently in the doc. Volume: dozens of cursor updates per second per editor, but tiny payloads.
Solves: presence without polluting the op log. Cursor positions don't need to be ordered or transformed or persisted forever — they're transient state. Mixing them into the op stream would balloon the log with throwaway data and slow down the OT engine. Separating them onto a pub/sub fabric keeps the edit plane fast.
Comments are a parallel feature: threaded discussions anchored to specific text spans in the doc. Stored in their own table keyed by (doc_id, comment_id) with an anchor_op_id field that points at the op the comment refers to. When the anchored text moves (because of unrelated edits before it), the OT engine emits an "anchor adjusted" event so the comment service updates its position. Comments are loaded on demand when the user opens the side panel; they don't flow through the WebSocket op stream.
Solves: separating discussion from content. Comments have a different lifecycle (resolved/unresolved, threaded replies, notification-worthy). Mixing them into the op log would make doc loading slower and complicate the OT logic.
The "see all changes" tab. When Sarah clicks "Version history", this service queries the op log for ops between two timestamps, replays them on top of the snapshot from that period, and renders a side-by-side diff. To restore an old version: it generates a new op stream that reverses subsequent edits, applied through the normal Doc Session Server so other connected editors see the restoration in real-time as a sequence of operations.
Solves: the "I deleted the wrong section" recovery story. Without revision history, an editor in a panic could erase three days of work. Because every op is immutable in the log, we can reconstruct any past state to the millisecond.
Sends emails and in-app notifications: "Sarah commented on your doc", "Raj shared a doc with you", "Priya replied to your comment". Listens to events from Comments Service and Auth Service via Kafka. Aggregates and de-dupes (you don't want 50 emails because Sarah typed for 50 minutes). Async — never on the critical path of an edit.
Solves: getting people back into the doc after a collaborative event. Without it, a comment Tom leaves on Friday at 5pm waits silently in the doc until somebody happens to open it — the loop never closes.
A library that lives inside the Doc Session Server (and inside every client too — the algorithm is symmetric). Implements the transform function for every pair of op types: insert/insert, insert/delete, delete/delete, format/format, etc. Given two concurrent ops A and B that were both based on the same parent state, it produces transformed versions A' and B' such that applying A then B' equals B then A' equals the same final state. This convergence property is the entire correctness contract of the system.
Solves: the conflict-resolution problem in §5. Without transform, concurrent ops applied in different orders on different machines produce different final states — the doc forks. With transform, every ordering produces the same byte-for-byte result.
Two scenarios traced through the production architecture above:
The doc is currently "the cat sat", last server op-id 1042. Sarah is at position 4, Raj is at position 4. Both are about to insert.
1. Sarah types 'a' at position 4; her screen instantly shows "the acat sat". She sends {op:insert, pos:4, text:'a', based_on:1042, client_op:17} over the WebSocket to the Load Balancer ② → routes to Doc Session Server ③.
2. Raj types 'b' at position 4 in the same instant; his screen shows "the bcat sat". He sends {op:insert, pos:4, text:'b', based_on:1042, client_op:5}.
3. The server receives Sarah's op first, assigns server_op_id:1043, appends to Op Log ⑤, broadcasts to Raj. Doc state is now "the acat sat".
4. The server receives Raj's op, sees it was based on 1042 while the head is now 1043, and transforms it to {insert, pos:5, text:'b'} — shifted past Sarah's character. It assigns server_op_id:1044, appends to log, broadcasts the transformed op to Sarah.
5. Raj's client receives the transformed echo: his local op was insert at 4, but the server says insert at 5. It rewinds local state, replays Sarah's op at 4, then reapplies his op at 5. Final state on Raj's screen: "the abcat sat".
6. Sarah's client receives Raj's transformed op: insert at 5. Final state on Sarah's screen: "the abcat sat".

OT's whole job is to take two ops that were both written against the same parent state and produce two transformed versions such that applying them in either order yields the same result. The proof is constructive: define a transform function for every pair of op types, and the algorithm just composes them.
Two clients both think the doc is "the cat". Op_A is insert(pos=4, text='x'). Op_B is insert(pos=4, text='y'). They were sent concurrently — both based on the same parent. The server picks an order: A first, then B. After applying A, the doc is "the xcat". Now we need to transform B so it inserts after A's character: insert(pos=5, text='y'). Applying that gives "the xycat". Both clients converge.
If B inserts before A's position, B is unchanged. If B inserts at or after A, shift B's position right by A's length. Ties (same position) are broken by a deterministic client_id rule so all sites pick the same order.
If A's insert lands before B's delete range, shift B's position. If A lands inside B's delete range, split B into two deletes around the inserted text — neither side wanted to delete what wasn't there.
If ranges don't overlap, just adjust positions. If they overlap, subtract the overlap from B's length — A already deleted it. If B is fully inside A's range, B becomes a no-op. Critical: never double-delete.
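The rules above can be written down and checked for convergence directly. A sketch covering insert/insert and the delete/delete overlap rule (the insert-inside-delete split is elided for brevity):

```python
def xform_ins_ins(b, a):
    """b, a = (pos, text), concurrent inserts on the same parent state.
    Shift b right if a landed at or before it (a is ordered first)."""
    return (b[0] + len(a[1]), b[1]) if a[0] <= b[0] else b

def xform_del_del(b, a):
    """b, a = (pos, length). Rewrite b to apply after a, subtracting
    any overlap so the shared range is never double-deleted."""
    b_pos, b_len = b
    a_pos, a_len = a
    b_end, a_end = b_pos + b_len, a_pos + a_len
    left = max(0, min(b_end, a_pos) - b_pos)    # part of b before a's range
    right = max(0, b_end - max(b_pos, a_end))   # part of b after a's range
    new_pos = b_pos if b_pos <= a_pos else max(a_pos, b_pos - a_len)
    return (new_pos, left + right)              # length 0 => no-op

def apply_del(doc, op):
    pos, length = op
    return doc[:pos] + doc[pos + length:]

# Convergence check: overlapping deletes applied in either order agree.
doc = "the big cat"
a = (4, 4)                  # delete "big "
b = (6, 5)                  # delete "g cat" — overlaps a
one = apply_del(apply_del(doc, a), xform_del_del(b, a))
two = apply_del(apply_del(doc, b), xform_del_del(a, b))
print(one == two, repr(one))     # True 'the '
```

The final assertion — A then B' equals B then A' — is exactly the convergence contract named above; production OT libraries test every op-type pair this way, ideally with randomized fuzzing.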
The cleanest way to guarantee convergence is to have one machine be the global ordering authority for each doc. That's also the most uncomfortable architectural pattern for a distributed system — stateful, sticky, single-point-of-something. Let's look at what that machine holds and how we keep it from being a brittle single point of failure.
One commodity server with 64 GB RAM can comfortably host ~10,000 active docs at 100KB each + buffers + connection overhead. With 1M concurrent editors spread across docs (avg 3 editors/doc → ~330K active docs), we need ~33 Doc Session Servers. With headroom for skew (one viral doc with 100 editors counts more), call it 50 servers.
Pinning each doc to one server gives us simple, correct OT. But it caps a single doc at one server's CPU. A pathological case — 500 people editing a viral document — could saturate one box. There's also a failover blast radius: if the server dies, every doc on it is offline until recovery.
Two-layer persistence: every op is durably appended to Cassandra synchronously before broadcast (this is what backs the "no edits lost" promise). Periodically, the doc state is snapshotted to S3 so we can load fast and prune old ops.
Every op the Doc Session Server accepts goes through INSERT INTO ops (doc_id, server_op_id, op_blob, author, ts) VALUES (...) with a partition key on doc_id and clustering on server_op_id ascending. Cassandra's LSM-tree storage makes this a pure-append workload — no random I/O, no read-modify-write. We use a quorum write (W=2 of 3 replicas) so no edit is acknowledged until two nodes have it on disk.
Why Cassandra: append-only ops are its happy path; sharding by doc_id falls out of the partition model; multi-region replication is built-in for global doc availability.
A background task on the Doc Session Server serializes the current state and uploads docs/abc123/snap_1042.bin to object storage. Cadence: every 1000 ops or 5 minutes, whichever first. After the snapshot is durable, ops up to that point can be archived from the hot Cassandra table to a cold tier.
Why object storage: snapshots are large blobs (~100KB-10MB), cheap to store, and accessed only on doc-load. S3's 11-nines durability is exactly what we want for content that must outlive any one database.
Sarah opens a doc that's been edited for 3 years. Naïvely, replaying 5 million ops on the client would take minutes. With snapshots:
1. Fetch the latest snapshot — snap_4998000.bin (~100KB blob).
2. Fetch the tail ops with server_op_id > 4998000 — typically 50-1000 ops.
3. Apply the tail on top of the snapshot — sub-second load even on a huge doc.

"Show me the doc as it was last Tuesday at 3pm" is a slightly different read pattern. Find the snapshot just before that timestamp, replay ops up to (but not past) the target timestamp. To restore that version: generate a fresh op stream that "diffs" from current state back to the target — apply through the normal Doc Session Server so other connected editors see the restoration as a series of natural ops, with normal undo/redo semantics.
Sarah on a transatlantic flight. No Wi-Fi for 8 hours. She opens her doc, edits for 90 minutes, makes 1500 changes. When she lands, she needs all her work to merge cleanly with whatever Raj has been doing on the ground. This is OT's strongest selling point — the same algorithm that handles 50ms latency handles 8-hour latency without changes.
When the connection returns:

1. The client remembers the last server_op_id it saw before disconnect, plus its queue of buffered local ops.
2. It sends last_seen:1042 + the buffered ops.
3. The server ships ops 1043..head — the missed remote ops — and transforms her buffered ops against them before applying.

The longer Sarah is gone, the more remote ops her local ops must transform against. In the worst case (8 hours offline, busy doc), the transform pass might be expensive — but it's still milliseconds per op, so even 10K transforms is <1s of CPU.
If Sarah's offline long enough that the doc has been re-snapshotted, she may need to fetch a newer snapshot and rebase her ops on it. Server detects this from her stale last_seen and ships her the new snapshot first.
If Raj deleted the section Sarah was editing, her edits are technically in a no-longer-existing region. OT handles this: her inserts become no-ops on a deleted range. UI surfaces a "your edits to deleted section" message with the recovered text.
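The reconnect transform pass can be sketched as a single rebase loop. A hypothetical helper, insert-only; a real rebase must also interleave the client's own ops and use the server's exact tie-break:

```python
def rebase(buffered, missed):
    """Shift each buffered local insert (pos, text) past every remote
    insert the client missed while offline. Remote ops win ties
    because the server has already ordered them."""
    rebased = []
    for pos, text in buffered:
        for r_pos, r_text in missed:
            if r_pos <= pos:
                pos += len(r_text)
        rebased.append((pos, text))
    return rebased

# Sarah typed at pos 10 while offline; Raj inserted 6 chars at pos 0.
print(rebase([(10, "offline edit")], [(0, "intro ")]))
# [(16, 'offline edit')]
```

This is the sense in which "the same algorithm handles 8-hour latency": the `missed` list is just longer, the loop is the same.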
Cursor positions and "who's online" data have a totally different shape from edit ops: frequent, ephemeral, never persisted, no transform needed. Forcing them through the OT pipeline would bloat the op log with throwaway data. We carry them on a separate Redis Pub/Sub channel.
Message types on the presence channel:

- {user_id, cursor_pos, color} — every time the cursor moves
- {user_id, selection_start, selection_end} — when text is highlighted
- {user_id, joined} / {user_id, left} — connection events
- {user_id, typing:true/false} — debounced "is typing now" indicator

Volume: ~10 cursor updates/sec per active editor at peak. With 1M concurrent editors that's 10M presence msgs/sec across the whole system — but each is <100 bytes and Redis Pub/Sub fans it out via doc-specific channels (only same-doc collaborators see each other's presence). One doc with 5 collaborators = 50 msgs/sec on its channel — trivial.
Sarah's cursor is at position 156. Raj inserts 10 characters at position 50. Sarah's cursor should now be at position 166, not 156 (it's still on the same word). The Doc Session Server, after applying Raj's op, fires a presence event to shift every other user's cursor positions accordingly. Without this, cursors would drift to incorrect positions every time someone else edited.
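The adjustment rule itself is tiny. A sketch of one plausible formulation:

```python
def shift_cursor(cursor, op):
    """Keep a cursor attached to its text after someone else's edit.
    op = ("ins", pos, length) or ("del", pos, length)."""
    kind, pos, length = op
    if kind == "ins":
        return cursor + length if pos <= cursor else cursor
    if cursor <= pos:                  # deletion entirely after the cursor
        return cursor
    return max(pos, cursor - length)   # inside the range collapses to its start

print(shift_cursor(156, ("ins", 50, 10)))   # 166 — still on the same word
print(shift_cursor(156, ("del", 150, 20)))  # 150 — was inside the deletion
```

The same mapping is reused for comment anchors and selection ranges — any position-valued metadata must ride along with the text it points at.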
Comments are a feature of the doc but a separate data model. Tom highlights a sentence and writes "needs more detail" — that comment must stick to the sentence even as Sarah keeps editing the surrounding paragraphs.
Each comment stores an anchor_op_id — the server_op_id of the op that produced the text it refers to. As subsequent ops shift the anchored text around, the OT engine emits anchor-adjustment events that the Comments Service consumes to update the position. If the anchored text is fully deleted, the comment becomes "orphaned" and shows in a side panel without an inline marker.
Comments are a tree: root + replies. Stored as (doc_id, comment_id, parent_id, author, body, ts, resolved). Loaded on demand when the side panel opens — not streamed through WebSocket like ops, because they're rare and bursty.
A "suggestion" is an op that's marked pending instead of applied directly. Other editors see it as inline strikethrough/insertion-style markup. The author or an editor with permission accepts it (op promoted to applied) or rejects it (op discarded). The OT layer handles the math identically — the only difference is the rendering and the gating on accept.
Comment posted → emit event to Kafka → Notification Service ⑪ aggregates and sends email/in-app. We de-dupe so Tom doesn't get 50 emails for Sarah's 50-comment review session.
Every doc has an ACL — the table that says who can do what. Permission checks happen at two layers: the front door (WebSocket establish) and the inner kitchen (every write).
Before accepting a connection, the front-end checks that the user's session token + doc_id pair has at least viewer role in the ACL. No role → 403, connection refused. This is the cheap gate that keeps unauthorized users out entirely.
Inside the Doc Session Server, every op is checked against the user's role (cached from the open). Viewer trying to write → reject with error frame. The cache is invalidated when the Permissions Service emits a "role changed for this user" event.
Owner removes Sarah's edit access while she's actively editing → Permissions Service emits an event to the Doc Session Server → server downgrades her role mid-session and starts rejecting her ops; or for "removed entirely", server closes her WebSocket. Within 1-2 seconds her tab shows "you no longer have access".
Sarah clicks Share, enters Raj's email, picks "editor". Front-end POSTs to /api/v1/docs/abc123/share with body {user_email, role}. Permissions Service writes to the ACL table, then fires a notification event ("Sarah shared a doc with you"). Raj gets an email link; clicking it opens the doc with the editor role baked into his session.
18 PB of doc data and 25 TB/day of op logs do not fit on one box. We shard everywhere — but the sharding key is identical across services to keep things simple: partition by doc_id, hashed consistently.
Partition key on doc_id; clustering on server_op_id. All ops for one doc land on the same partition so they can be read in order with one query. Replication factor 3 across AZs.
Object key includes doc_id as a prefix; S3's internal partitioning handles the rest. Lifecycle rules move snapshots older than 90 days to Glacier for cold storage at 1/10 the cost.
Same consistent-hash ring on doc_id, so the same doc always lands on the same server (until reshuffle). The Session Locator caches the ring; on server-add or server-remove, only ~1/N of docs migrate.
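A minimal version of that ring is worth seeing. This sketch (class and server names are hypothetical) uses virtual nodes so that adding or removing a server remaps only a small fraction of docs:

```python
import bisect
import hashlib

class Ring:
    """Sketch of the Session Locator's consistent-hash ring:
    doc_id -> server, with virtual nodes to smooth the distribution."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, doc_id):
        # First vnode clockwise from the doc's hash owns the doc.
        i = bisect.bisect(self.keys, self._h(doc_id)) % len(self.ring)
        return self.ring[i][1]

ring = Ring([f"dss-{n}" for n in range(50)])
print(ring.lookup("abc123"))   # deterministic: same answer on every instance
```

Because every LB and Locator instance computes the same ring, they agree on doc placement without coordination; adding a 51st server moves only ~1/51 of the docs, which is exactly the "~1/N migrate" property claimed above.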
Partition by doc_id. The hot read is "give me Sarah's role on doc abc123" — a point query, fast on either store. Permissions tables are tiny relative to op logs (one row per share, not per edit), so a single Postgres cluster shard could hold years of growth.
Scenario — point-in-time view ("show me the doc as of 3pm"): fetch the snapshot just before that moment — snap_4982000.bin from 2:55pm. Replay ops 4982001..N from Cassandra, where N is the last op-id with a timestamp ≤ 3pm. Stop. The reconstructed state is exactly what users saw at 3pm. To restore that as the live version, generate a delta op stream that takes the current doc back to that state — and feed those ops through the normal Doc Session Server so other connected editors see the restoration as a sequence of natural ops with normal animation and undo support.

Scenario — concurrent identical deletes: two users simultaneously send delete(pos=42, len=1) against the same character. The server applies the first, deleting the character; assigns op-id 1043, broadcasts. The server receives the second, calls transform: "this delete is fully inside an already-deleted range → becomes a no-op." The op is logged as a no-op for audit, broadcast as a no-op, and both clients converge. The second user's intention (remove that character) is satisfied by the first user's action — nothing was lost, nothing was double-applied.