A globally consistent file sync service — chunked uploads, in-line dedup, metadata sharding, cloud block storage, and a long-poll notification fabric that keeps every device in step.
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Sync isn't one feature — it's a contract between every device a user owns and the cloud. Pin the contract first. The interesting design constraint isn't "store files," it's "make any change visible everywhere within seconds while never losing data."
The numbers drive every later choice — chunk size, shard count, and whether metadata fits on one box (it doesn't).
| Metric | Assumption | Result |
|---|---|---|
| Total users | 500 M | — |
| Daily active users | 100 M | — |
| Devices per user | ~3 (laptop + phone + tablet) | — |
| Files per user (avg) | 200 | ~100 B files total |
| Avg file size | 100 KB | — |
| Total file storage | 100 B × 100 KB | ~10 PB |
| Concurrent open connections (long-poll) | ~1 M | — |
| Chunk size (design choice) | fixed 4 MB | balances IOPS vs delta overhead |
| Metadata per file | ~1 KB (path, chunks, hashes, ACL) | ~100 TB metadata → must shard |
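Sanity-check the table with a few lines of arithmetic (a minimal sketch in decimal units, using the same assumptions as above):

```python
# Back-of-envelope capacity math from the assumptions in the table above.
TOTAL_USERS = 500e6
FILES_PER_USER = 200
AVG_FILE_SIZE_KB = 100
METADATA_PER_FILE_KB = 1

total_files = TOTAL_USERS * FILES_PER_USER                 # 1e11 -> ~100 B files
file_storage_pb = total_files * AVG_FILE_SIZE_KB / 1e12    # KB -> PB
metadata_tb = total_files * METADATA_PER_FILE_KB / 1e9     # KB -> TB

print(f"{total_files:.0e} files, ~{file_storage_pb:.0f} PB of chunks, "
      f"~{metadata_tb:.0f} TB of metadata")
# 1e+11 files, ~10 PB of chunks, ~100 TB of metadata
```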
Three actor types drive every flow; they're color-coded so you can map them onto the architecture diagram below.
- Upload a file: Chunk → hash → dedupe check → upload missing chunks → commit metadata. Failures retry per-chunk.
- Download a file: Read metadata → fetch chunks (cached at CDN/edge) → reassemble locally.
- Edit a file (delta sync): Watcher detects diff → only changed chunks upload → metadata bumps version → notification fanout.
- Work offline: Local DB queues changes. On reconnect, the conflict resolver runs — last-writer-wins or "Conflict copy".
- Share a folder: ACL row written to metadata. Collaborators see the folder in their next sync tick.
- Audit: Admins query who-did-what — a separate compliance store; the hot path never sees these queries.
This is the section most candidates rush. Don't. The architecture isn't five boxes connected by arrows — it's the answer to "what would the naive design get wrong, and how do we fix it without making the system unmaintainable?" We'll build it up in three passes: the naive version, why it breaks, and the production shape.
Imagine the simplest thing that could work: one app server, one database, one disk. The client POSTs the file, the server writes bytes to disk, writes a row to the DB, returns OK.
One million users uploading 1 GB videos pumps every byte through your app servers. A 10 Gbps link saturates at ~80 concurrent uploads. App servers die first.
Files live on local disk → server X has the file, server Y doesn't. Load balancer must pin requests to the "right" server. Lose that disk, lose the file.
How does the user's phone learn that the laptop just saved a new version? The naive design has no answer — there's no push channel, no event log, nothing.
The single most important architectural decision in this design: file bytes and file metadata travel on different paths through the system. Not just different databases — entirely different network paths, different scaling profiles, different failure modes.
Carries the actual file content. Big, slow, dumb. Optimized for throughput and durability, not latency.
Carries "what files exist, what versions, who owns them, who can access them, which chunks make them up." Tiny payloads, high QPS. Optimized for latency and consistency.
Now expand each plane to its real components. Read this diagram twice — once following the orange "data plane" lines (chunks → S3), once following the blue "control plane" lines (metadata → sync → notify).
Use the numbers in the diagram above to find the matching card below. For each component we cover what it does and — more importantly — what problem it solves: if you can't say what would break without it, it doesn't belong in the design.
The desktop/mobile binary running on every user device. Not a thin shell — it carries real software: a Watcher that hooks into OS file-system events, a Chunker that splits files into 4 MB chunks and computes SHA-256 hashes, an Indexer backed by a local SQLite DB that mirrors the server's view of the user's tree, and a Network Layer that talks to both planes (S3 for bytes, metadata service for control).
Solves: offline edits, delta sync, and resumable uploads. Without local state, every save would round-trip to the server just to compute a diff.
Sits in front of the stateless metadata service tier. L7 (HTTP-aware), sticky-hashed on user_id so the same user's requests warm the same caches and reuse TCP connections. Doesn't touch byte uploads — those bypass it entirely via presigned S3 URLs.
Solves: horizontal scale + zero-downtime deploys. Without it, you can't add or remove app servers under load and a single server failure surfaces to users.
Stateless application servers — the only tier that writes to the metadata DB. Owns all business logic: "is this user allowed to see this file?", "does this chunk hash already exist?", "what's the current version?", "is this within quota?" Three core APIs: BeginUpload (dedup interrogation), CommitUpload (atomic version flip), GetChangesSince (diff fetch).
Solves: single source of truth. Every metadata read & write funnels through here so invariants — ACLs, versioning, dedup, quotas — are enforced in one place. If clients wrote directly to the DB, you'd have N flavors of "almost correct."
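To make the three APIs concrete, here is a sketch of plausible request/response shapes; the field names are illustrative, not a spec:

```python
from dataclasses import dataclass

@dataclass
class BeginUploadRequest:            # dedup interrogation
    file_path: str
    chunk_hashes: list[str]          # SHA-256 of each 4 MB chunk, in order

@dataclass
class BeginUploadResponse:
    upload_id: str
    missing_hashes: list[str]        # only these chunks need uploading
    presigned_urls: dict[str, str]   # chunk_hash -> presigned PUT URL

@dataclass
class CommitUploadRequest:           # atomic version flip
    upload_id: str
    file_path: str
    chunk_hashes: list[str]          # full ordered list, including deduped chunks
    base_version: int                # version the client edited; detects concurrent commits

@dataclass
class GetChangesSinceRequest:        # diff fetch
    workspace_id: str
    since_version: int               # "what changed since v_99?"
```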
Computes the diff between what a device last knew and what the server knows now. Triggered both ways: on a device poll ("what changed since v_99?") and on a server commit ("publish this change to subscribers"). Internally walks a per-workspace change log keyed by version number, so answering "what's new?" is O(changes), not O(files).
Solves: efficient sync. Without it, every reconnect would have to enumerate every file in the tree to find what changed — pathological for users with 100k+ files.
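Here is one way the per-workspace change log could answer "what changed since v_99?" in O(changes), sketched with an in-memory log standing in for the real store:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Change:
    version: int      # monotonically increasing per workspace
    file_id: str
    kind: str         # "add" | "modify" | "delete"

# workspace_id -> append-only list of changes, ordered by version
change_log: dict[str, list[Change]] = defaultdict(list)

def get_changes_since(workspace_id: str, since_version: int) -> list[Change]:
    """Return every change newer than since_version: O(changes), not O(files)."""
    log = change_log[workspace_id]
    # Binary search for the first entry with version > since_version.
    lo, hi = 0, len(log)
    while lo < hi:
        mid = (lo + hi) // 2
        if log[mid].version <= since_version:
            lo = mid + 1
        else:
            hi = mid
    return log[lo:]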
Buffers change events between the sync service (producer) and notification servers (consumers). Each ChangeEvent carries {ws_id, version_id, kind} — small, fast, ordered per workspace. Why insert a queue here? Because fanout is fundamentally async — a single commit may need to notify 50 000 subscribers; the upload should not wait for that fanout.
Solves: decoupling + back-pressure. Producer and consumers scale independently. A notification-server slowdown can't block the upload commit.
Holds long-poll (or WebSocket) connections from every online device. When the queue delivers a ChangeEvent for a workspace, it pushes to all subscribers of that workspace. Sticky-LB'd on workspace_id so subscribers consolidate on a few nodes per workspace — easier and cheaper to fan out. Mobile usually opts out and uses APNs/FCM pushes instead, for battery and data reasons.
Solves: real-time multi-device sync. Polling every few seconds is wasteful (most polls return nothing) and laggy. Push delivers near-instantly.
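A minimal sketch of the decoupling: the commit path pays for one enqueue, and a separate consumer does the fanout. The queue, subscriber registry, and push call are all stand-ins for the real infrastructure:

```python
import queue
from collections import defaultdict

change_events: "queue.Queue[dict]" = queue.Queue()   # stands in for Kafka / SQS
subscribers: dict[str, list] = defaultdict(list)      # ws_id -> connected devices

def on_commit(ws_id: str, version_id: int) -> None:
    """Producer side: the upload commit only pays for one enqueue, never the fanout."""
    change_events.put({"ws_id": ws_id, "version_id": version_id, "kind": "commit"})

def notification_worker() -> None:
    """Consumer side: fanout runs async, so a slow subscriber can't block commits."""
    while True:
        event = change_events.get()
        for device in subscribers[event["ws_id"]]:
            device.push(event)   # long-poll response, WebSocket frame, or APNs/FCM push
        change_events.task_done()
```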
Sharded relational DB — Postgres / MySQL / Spanner. Stores users, files, file_versions, file_chunks, chunks, acl tables. Sharded by user_id / workspace_id using consistent hashing so a user's entire tree co-locates on one shard — folder listings and ACL checks hit one box.
Solves: consistent state with ACID guarantees. The chunk-list-of-a-version invariant must be atomic — a half-written version is a corrupted file. NoSQL would force application-level reconciliation we don't need.
Memcache/Redis sitting in front of the metadata DB. Caches the read-heavy hot rows: open folders, ACL rows, recent file-version pointers, dedup-check results. Writes invalidate: when the metadata service commits a new version, it punches the cached file row. ACL TTLs are kept short (≤ 60 s) because stale shares are a security bug, not just a perf bug.
Solves: read amplification. Every "open folder" or "list workspace" call would hit the DB without a cache. At 95%+ hit rate the DB only sees writes + cache misses.
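Cache-aside with commit-time invalidation, sketched with a plain dict standing in for Redis; load_from_db is whatever loader the caller supplies:

```python
import time

CACHE: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
ACL_TTL = 60        # stale shares are a security bug, so keep this short
FILE_TTL = 300

def cached_get(key: str, ttl: int, load_from_db):
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                          # cache hit: the DB never sees this read
    value = load_from_db(key)                  # cache miss: read through to the DB
    CACHE[key] = (time.time() + ttl, value)
    return value

def invalidate_on_commit(file_id: str) -> None:
    # Writes invalidate: punch the cached file row so readers see the new version.
    CACHE.pop(f"file:{file_id}", None)
```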
The actual chunk bytes live here, content-addressed by chunk_hash (SHA-256 of the chunk content). Same hash → same object → automatic dedup at the storage layer. Replication, erasure coding, cross-region failover, 11-nines durability — all handled by S3-class systems. Critically: clients upload directly via presigned URLs — bytes never traverse our app servers.
Solves: durability + petabyte scale. Building this ourselves would cost years and a whole org. We offload the hard problems and keep our own service tier small and stateless.
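If S3 is the blob store, the presigned-URL handoff could look like this boto3 sketch (bucket name, key scheme, and expiry are placeholder choices):

```python
import boto3

s3 = boto3.client("s3")

def presign_chunk_upload(chunk_hash: str, expires_s: int = 900) -> str:
    """Return a URL the client can PUT the chunk bytes to directly.

    The object key is the chunk's content hash, so identical chunks land on the
    same object and dedup falls out of the key scheme.
    """
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "sync-chunks", "Key": f"chunks/{chunk_hash}"},
        ExpiresIn=expires_s,
    )
```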
Caches popular chunks at the network edge close to the user. Cache key is the immutable chunk_hash, so cache invalidation is a non-problem. Especially powerful for shared files: a viral "all-hands.mp4" downloaded by 50 000 employees → after the first few requests, every other download is served from a POP near the user.
Solves: origin pressure + global latency. Without a CDN, every download from Sydney round-trips to us-east. With it, the origin sees ~K requests instead of M, and users get sub-100 ms downloads regardless of geography.
Sarah edits roadmap.docx on her laptop. It's a 12 MB file; she changed two paragraphs. Here's exactly which components touch the request, and why each one is on the path.
1. Sarah's client calls BeginUpload on the Metadata Service through the Load Balancer, sending all 3 chunk hashes; the server answers with the hashes it doesn't already have.
2. The client uploads only the missing chunk(s) directly to blob storage via presigned URLs.
3. The client calls CommitUpload. The Metadata Service writes a new FileVersion row + 3 FileChunk rows to the Metadata DB and atomically bumps current_version.
4. The Metadata Service publishes a ChangeEvent to the Message Queue; notification servers push it to Sarah's other online devices.
5. Her phone calls GetChangesSince on the Metadata Service, gets back "version v100, missing chunk #2," and fetches just that chunk.

If you remember one thing from this design: files are stored as a list of chunks, not as a single object. Every benefit of the design — dedup, delta sync, resumability — falls out of this single decision.
1 GB upload drops at chunk 198/250 → resume from 198, not from zero.
Edit one paragraph in a 50 MB doc → re-upload one 4 MB chunk, not 50 MB.
Same chunk hash already exists? Skip the upload. Two users uploading the same movie pay storage once.
Upload 10 chunks in parallel. Saturate the user's uplink instead of TCP-streaming one big request.
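Putting the walk-through and these properties together, here is a client-side sketch of the chunked, dedup-aware upload. The api object, its methods, and the response fields are hypothetical stand-ins for the metadata APIs described above:

```python
import hashlib
import requests

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB

def upload_file(path: str, api) -> None:
    # 1. Chunk and hash locally (Chunker).
    hashes: list[str] = []
    blocks: dict[str, bytes] = {}
    with open(path, "rb") as f:
        while block := f.read(CHUNK_SIZE):
            h = hashlib.sha256(block).hexdigest()
            hashes.append(h)
            blocks[h] = block

    # 2. Dedup interrogation: send every hash, learn which chunks the server lacks.
    resp = api.begin_upload(path, hashes)

    # 3. Upload only the missing chunks, straight to blob storage via presigned URLs.
    for chunk_hash in resp.missing_hashes:
        requests.put(resp.presigned_urls[chunk_hash],
                     data=blocks[chunk_hash], timeout=60)

    # 4. Atomic version flip: the server writes FileVersion + FileChunk rows
    #    and bumps current_version in one transaction.
    api.commit_upload(resp.upload_id, path, hashes)
```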
Why 4 MB and not smaller or larger? The trade-off:

```
# Smaller chunks (e.g., 256 KB):
#   + Better delta sync — finer granularity for edits
#   + Faster retry on transient failures
#   - More metadata rows per file (a 1 GB file is ~4,000 chunks vs ~250 at 4 MB)
#   - More object-store API calls → higher cost & latency
#
# Larger chunks (e.g., 16 MB):
#   + Less metadata, fewer S3 calls
#   + Better throughput per request
#   - Re-uploading on tiny edits is wasteful
#   - A failed chunk costs more bandwidth to retry
#
# Industry-standard sweet spot: 4 MB.
# Dropbox uses 4 MB. Google Drive uses variable rolling-hash chunking
# (Rabin fingerprinting) so identical content remains aligned across edits.
```
The desktop client is a real piece of software with four cooperating modules. Understanding it earns you a lot of credit in interviews because most candidates wave their hands at "the client" and skip to the server.
Hooks into OS file-system events (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). Debounces — a "save" can fire 5 events in 100 ms, we coalesce.
Splits files into 4 MB chunks. Computes SHA-256 per chunk. Compares against the local index first — if the chunk hash matches the previous version, skip it. This is delta sync.
The client's local "metadata DB" — file path, version, chunk list, last-synced timestamp. Survives restarts. Used to compute diffs and to support offline edits.
Three responsibilities: (1) upload chunks via presigned URLs, (2) call metadata APIs, (3) hold a long-poll connection to the notification server. Exponential back-off on failures.
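A sketch of how the Chunker and Indexer might cooperate to find changed chunks after a save, using SQLite as the local index; the table layout is made up for illustration:

```python
import hashlib
import sqlite3

CHUNK_SIZE = 4 * 1024 * 1024

db = sqlite3.connect("sync_index.db")
db.execute("""CREATE TABLE IF NOT EXISTS chunk_index (
                path TEXT, seq INTEGER, chunk_hash TEXT,
                PRIMARY KEY (path, seq))""")

def changed_chunks(path: str) -> list[tuple[int, str]]:
    """Return (seq, hash) for chunks whose hash differs from the last-synced version."""
    previous = dict(db.execute(
        "SELECT seq, chunk_hash FROM chunk_index WHERE path = ?", (path,)))
    changed = []
    with open(path, "rb") as f:
        seq = 0
        while block := f.read(CHUNK_SIZE):
            h = hashlib.sha256(block).hexdigest()
            if previous.get(seq) != h:        # new or modified chunk: needs upload
                changed.append((seq, h))
            seq += 1
    return changed
```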
Keep the schema small. The trick is recognizing that file and chunk are separate first-class entities — chunks have their own lifecycle, hash-keyed identity, and dedup story.
chunk_hash is the primary key, so identical content collapses to one row. Two users uploading the same PDF reuse the same chunks — storage cost is paid once. The ref_count field is what lets us safely garbage-collect chunks when no version references them anymore.
"Restore to yesterday" becomes a metadata-only operation: point FILE.current_version at an older row. No bytes move. Same trick powers undo, time-machine, and ransomware recovery.
Walk through what happens when User A saves a 1 GB file. The sequence below is the single most important diagram in the design — every chunk-skip and dedup choice is visible here.
Asking the server "which of these 250 hashes do you already have?" lets us skip 247 chunk uploads in the common edit-one-paragraph case. This is the dedup interrogation step.
Bytes go client → S3 directly. Our app servers never proxy a gigabyte. Dramatically cheaper bandwidth and statelessness — any app server can handle the next request.
The version row is only written after every chunk is durable. A crash mid-upload leaves orphan chunks (cleaned up by a sweeper job) but never a corrupt visible version.
CommitUpload bumps FILE.current_version. Until that single row update, B keeps reading the old version. This is how we get atomic file replacement on top of an eventually-consistent blob store.

Polling every 5 seconds wastes bandwidth and lags by up to 5 seconds in the worst case. Push is faster but harder. Production systems use HTTP long-polling (or WebSockets) — the client opens a connection that the server holds open until something happens.
Simple — just retry on a timer. Wasteful — most polls return nothing. Latency upper-bounded by poll interval. Fine for low-priority sync (mobile on cellular).
Server holds the request until data is available or 30 s elapses. Near-instant push without the operational complexity of WebSockets. Works through corporate proxies. Default for desktop sync.
Persistent bidirectional. Lower per-message overhead than long-poll. Right answer when notification volume is high (e.g., active collab session). Cost: stateful connection management.
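A sketch of the long-poll hold using a per-workspace asyncio.Condition: the handler parks until a ChangeEvent arrives or roughly 30 s passes (HTTP framework plumbing omitted):

```python
import asyncio
from collections import defaultdict

conditions: dict[str, asyncio.Condition] = defaultdict(asyncio.Condition)
latest_version: dict[str, int] = defaultdict(int)

async def long_poll(ws_id: str, client_version: int, timeout_s: float = 30.0) -> dict:
    """Hold the request open until something changes or the timeout elapses."""
    cond = conditions[ws_id]
    async with cond:
        try:
            await asyncio.wait_for(
                cond.wait_for(lambda: latest_version[ws_id] > client_version),
                timeout=timeout_s)
        except asyncio.TimeoutError:
            return {"changed": False}          # client reconnects immediately
    return {"changed": True, "version": latest_version[ws_id]}

async def publish(ws_id: str, version: int) -> None:
    """Called by the queue consumer when a ChangeEvent for ws_id arrives."""
    cond = conditions[ws_id]
    async with cond:
        latest_version[ws_id] = version
        cond.notify_all()                      # wake every parked long-poll for this workspace
```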
Deduplication is "don't store the same chunk twice." The where matters: doing it on the client saves bandwidth; doing it on the server saves only storage.
Client hashes every chunk first, asks the server "got this hash?", uploads only the misses. Saves storage + bandwidth + battery. The cost: one extra round-trip before each upload.
Client uploads everything. A background job scans hashes and merges duplicates. Simpler client; wastes upload bandwidth.
Three sharding strategies, each with a failure mode. Pick the one whose failure mode you can live with.
| Scheme | How it works | Wins | Loses |
|---|---|---|---|
| Vertical | Tables on different DBs (users on DB1, files on DB2) | Easy first step; isolates feature load | Cross-table joins now go cross-network. Hits a ceiling fast. |
| Range (by path prefix) | All paths starting with A on shard 1, B on shard 2… | Predictable; range queries fast | Hot shards (everyone has /Documents); rebalancing painful. |
| Hash (consistent hashing on file_id) | Hash function maps each file to a shard | Uniform load; easy to add shards (consistent hashing) | Listing a folder now hits multiple shards — needs scatter-gather. |
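To make the "consistent hashing" rows concrete, here is a minimal hash-ring sketch; the hash function and virtual-node count are arbitrary choices:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for shard in shards:
            for i in range(vnodes):
                self.ring.append((self._h(f"{shard}#{i}"), shard))
        self.ring.sort()

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next virtual node."""
        pos = bisect.bisect(self.ring, (self._h(key), ""))
        return self.ring[pos % len(self.ring)][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("user:42"))   # the same user_id always lands on the same shard
# Adding "shard-4" only reassigns the keys that fall on its virtual nodes (~1/N of them).
```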
The recommended shard key: user_id (or workspace_id). It co-locates a user's whole tree on one shard so folder listings and ACL checks hit one box. Use consistent hashing so adding shards moves only ~1/N of the keys, not all of them — the same trick Cassandra and DynamoDB use under the hood.

Two caches do almost all the work. Three load-balancing layers cover the rest.
Hot chunks served from CDN — never touch our origin S3. Especially powerful for shared files and "training_video.mp4" that 50 K employees download in the same week.
Cache key: chunk_hash (immutable, never invalidated).

The metadata cache (Memcache/Redis) holds FILE + FILE_VERSION + ACL rows. Read-heavy: every folder open is a metadata read. Writes go through to the DB; the cache invalidates on commit.
L7 load balancer; sticky on user_id hash for connection reuse.
Sticky on ws_id so a workspace's subscribers consolidate on a few nodes — easier fanout.
Usually the CDN itself. Origin S3 sits behind global anycast.
The interesting question isn't "how do we encrypt?" — it's "what does shared mean when the file lives as 250 chunks scattered across an object store?"
What you'd say in the interview when asked "why did you do X and not Y?"
| Decision | Alternative | Why this choice |
|---|---|---|
| Fixed 4 MB chunks | Variable (Rabin-fingerprint) chunks | Simpler client; "good enough" dedup. Variable adds CPU cost on every save and complicates the chunk store. |
| In-line (client) dedup | Post-process server dedup | Saves bandwidth on the upload, not just storage. Costs one extra RTT but it's tiny. |
| HTTP long-poll for notifications | WebSockets / SSE | Long-poll traverses corporate proxies cleanly and is easier to load-balance. WS is better for high-frequency push but overkill for sync. |
| Sharded SQL for metadata | Single Cassandra cluster | Metadata invariants (chunk lists, version pointers) want ACID. Cassandra would force application-level reconciliation we don't need. |
| Direct client → S3 uploads | Proxy through app servers | Saves a full bandwidth hop. Lets app tier stay stateless. Presigned URLs handle the auth handoff. |
| Eventual consistency for cross-device sync | Strong (Spanner-style) consistency | Few-second sync delay is fine for a file sync product. Strong consistency would 10× the cost for no perceivable user benefit. |
If you nail the architecture, the next 15 minutes are these. Have answers ready.
- Two devices edit the same file while offline: keep both. The losing write is preserved as file (Conflict copy from device-X 2026-05-02).ext. Surface the conflict to the user — never silently lose data. Real CRDT-style merge requires content-aware logic (text vs image vs binary) and is out of scope for plain file sync.
- A big upload dies partway through: on retry, the BeginUpload call returns "missing" only for chunks 750+, since 0–749 are durable in S3. The client uploads the remaining 500. Total wasted bandwidth: zero — and crucially, no full restart. This is why chunking exists.
- Garbage-collecting unreferenced chunks: each chunk row carries a ref_count, incremented when a FileVersion references it and decremented when the version is deleted or rolled back. A nightly job lists chunks with ref_count = 0 AND created_at < now() - 24h and deletes them from S3. The 24-hour grace handles in-progress uploads (see the sketch after this list).
- A million users open the same shared file: GetFileVersion calls are mostly hot reads. The bottleneck shifts to our notification fanout if those 1 M users are all subscribed to the same shared workspace — which is why workspaces should be sharded across notification servers, not pinned to one.
- Quotas and abuse: enforcement lives in CommitUpload — it counts the committed bytes of new (non-deduped) chunks against the user's plan. Rate limiting at the API gateway caps RPS. Dedup ironically helps DoS resistance — uploading the same chunk 1000× costs the attacker 1000 round-trips but us only one row update.
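A sketch of that ref-count sweeper; table, column, and bucket names are illustrative, and boto3's delete_object stands in for whatever blob-store client is used:

```python
import sqlite3
import time

def sweep_orphan_chunks(db: sqlite3.Connection, s3, bucket: str = "sync-chunks") -> int:
    """Delete chunks nothing references anymore, with a 24 h grace for in-flight uploads."""
    cutoff = time.time() - 24 * 3600          # created_at stored as epoch seconds here
    rows = db.execute(
        "SELECT chunk_hash FROM chunk WHERE ref_count = 0 AND created_at < ?",
        (cutoff,),
    ).fetchall()
    for (chunk_hash,) in rows:
        s3.delete_object(Bucket=bucket, Key=f"chunks/{chunk_hash}")
        db.execute("DELETE FROM chunk WHERE chunk_hash = ?", (chunk_hash,))
    db.commit()
    return len(rows)
```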
What you'd verify before shipping. Order roughly maps to severity.

- CommitUpload is the only operation that bumps current_version — atomic single-row write.
- Metadata is sharded by user_id / workspace_id.
- Notification subscribers are consolidated per workspace_id (sticky LB).