High-Level Design

Dropbox / Google Drive

A globally consistent file sync service — chunked uploads, in-line dedup, metadata sharding, cloud block storage, and a long-poll notification fabric that keeps every device in step.

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

Clarify Requirements

Sync isn't one feature — it's a contract between every device a user owns and the cloud. Pin the contract first. The interesting design constraint isn't "store files," it's "make any change visible everywhere within seconds while never losing data."

✅ Functional Requirements

  • Upload, download, view files from any device
  • Auto-sync changes across all of a user's devices
  • Share files / folders with other users (read or write)
  • Support large files — single file up to multiple GB
  • Offline edits queue and reconcile on reconnect
  • Version history / snapshots — rollback to a prior revision

⚡ Non-Functional Requirements

  • Durability: 11 nines — files cannot be lost
  • Consistency: ACID for metadata; eventual is fine for cross-device push
  • Availability: 99.95% — auth + metadata path must stay up
  • Bandwidth: minimize — ship only changed bytes (delta sync)
  • Latency: small-file save → visible on other device < 5 s
Out of scope (confirm with interviewer): client-side encryption (E2EE), real-time collaborative editing (that's a different problem — operational transforms / CRDTs), media transcoding, search inside file contents, mobile background-sync battery optimization.
Step 2

Capacity & Scale Estimates

The numbers drive every later choice — chunk size, shard count, and whether metadata fits on one box (it doesn't).

| Metric | Assumption | Result |
|---|---|---|
| Total users | 500 M | — |
| Daily active users | 100 M | — |
| Devices per user | ~3 (laptop + phone + tablet) | — |
| Files per user (avg) | 200 | ~100 B files total |
| Avg file size | 100 KB | — |
| Total file storage | 100 B × 100 KB | ~10 PB |
| Concurrent open long-poll connections | ~1 M | — |
| Chunk size (design choice) | fixed 4 MB | balances IOPS vs delta overhead |
| Metadata per file | ~1 KB (path, chunks, hashes, ACL) | ~100 TB metadata → must shard |
Takeaways: File bytes go to object storage (S3-class). Metadata is small but high-QPS — sits in a sharded RDBMS or NoSQL with a Memcache hot tier. Chunking @ 4 MB means a 1 GB file is 250 chunks — failures retry one chunk, not a gigabyte. And 1 M open long-poll connections means we need a lot of small, sticky front-end servers — or we move to WebSockets.
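To sanity-check the arithmetic before leaning on it, here is a back-of-envelope sketch in Python; the constants are exactly the table's assumptions, nothing more:

```python
users            = 500_000_000
files_per_user   = 200
avg_file_kb      = 100
meta_per_file_kb = 1
chunk_mb         = 4

total_files  = users * files_per_user               # 1e11 → ~100 B files
storage_pb   = total_files * avg_file_kb / 1e12     # 1 PB = 1e12 KB → ~10 PB
metadata_tb  = total_files * meta_per_file_kb / 1e9 # 1 TB = 1e9 KB → ~100 TB
chunks_per_gb = 1024 // chunk_mb                    # 256 chunks per GiB (~250 per GB)

print(total_files, storage_pb, metadata_tb, chunks_per_gb)
# 100000000000 10.0 100.0 256
```

100 TB of metadata confirms the takeaway: it cannot live on one box.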
Step 3

Actors & Use Cases

Three actor types drive every flow. Color-coded so you can map them onto the architecture diagram below.

flowchart LR
  U([User])
  CO([Collaborator])
  AD([Admin])
  U --> UP[Upload File]
  U --> DL[Download File]
  U --> ED[Edit / Save]
  U --> OFF[Offline Edit]
  U --> SH[Share Folder]
  CO --> VW[View Shared]
  CO --> EDS[Edit Shared]
  AD --> ACL[Manage ACL]
  AD --> AUD[Audit Logs]
  style U fill:#e8743b,stroke:#e8743b,color:#fff
  style CO fill:#4a90d9,stroke:#4a90d9,color:#fff
  style AD fill:#9b72cf,stroke:#9b72cf,color:#fff

📤 Upload

Chunk → hash → dedupe check → upload missing chunks → commit metadata. Failures retry per-chunk.

📥 Download

Read metadata → fetch chunks (cached at CDN/edge) → reassemble locally.

✏️ Edit / Save

Watcher detects diff → only changed chunks upload → metadata bumps version → notification fanout.

📵 Offline Edit

Local DB queues changes. On reconnect, conflict resolver runs — last-writer-wins or "Conflict copy".

🔗 Share

ACL row written to metadata. Collaborators see the folder in their next sync tick.

🛡️ Audit

Admins query who-did-what — a separate compliance store; the hot path never sees these queries.

Step 4

High-Level Architecture

This is the section most candidates rush. Don't. The architecture isn't five boxes connected by arrows — it's the answer to "what would the naive design get wrong, and how do we fix it without making the system unmaintainable?" We'll build it up in three passes: the naive version, why it breaks, and the production shape.

Pass 1 — the naive design (and why it dies)

Imagine the simplest thing that could work: one app server, one database, one disk. The client POSTs the file, the server writes bytes to disk, writes a row to the DB, returns OK.

flowchart LR
  C([Client]) -- POST file --> A[App Server]
  A -- write bytes --> D[(Local Disk<br/>file content)]
  A -- write row --> DB[(Database<br/>file metadata)]
  style C fill:#e8743b,stroke:#e8743b,color:#fff
  style A fill:#4a90d9,stroke:#4a90d9,color:#fff
  style D fill:#d4a838,stroke:#d4a838,color:#000
  style DB fill:#9b72cf,stroke:#9b72cf,color:#fff
Why two stores even in the naive design? Files have two kinds of data with totally different access patterns. The bytes (a 12 MB doc, a 5 GB video) belong on a filesystem — disks stream big blobs well but can't answer "list every file Sarah owns starting with R." The metadata (file_id, owner, path, size, created_at — a few hundred bytes) belongs in a database — DBs do indexed lookups, joins, and ACL checks instantly but choke if you stuff terabytes of BLOBs into them (backups balloon, replication lags, queries accidentally stream gigabytes).

So even the simplest sane design already splits storage by data shape: bytes → disk, structured records → DB. The production design just supercharges each side — bytes go to S3 + CDN (instead of one disk), metadata goes to sharded SQL + cache + queue (instead of one DB). The split itself is the same idea.

💥 Bandwidth cliff

One million users uploading 1 GB videos pumps every byte through your app servers. A 10 Gbps link saturates at ~80 concurrent uploads. App servers die first.

💥 Stateful servers

Files live on local disk → server X has the file, server Y doesn't. Load balancer must pin requests to the "right" server. Lose that disk, lose the file.

💥 No multi-device sync

How does the user's phone learn that the laptop just saved a new version? The naive design has no answer — there's no push channel, no event log, nothing.

The lesson: the moment files get big and devices get many, "one server does everything" stops working. Each problem above pushes us toward a specific component in the production design.

Pass 2 — split into two planes

The single most important architectural decision in this design: file bytes and file metadata travel on different paths through the system. Not just different databases — entirely different network paths, different scaling profiles, different failure modes.

📦 Data plane — bytes

Carries the actual file content. Big, slow, dumb. Optimized for throughput and durability, not latency.

  • Object storage (S3-class) — designed for petabyte scale & 11-nines durability
  • CDN edge cache for hot reads
  • Client talks directly to S3 via short-lived presigned URLs — our app servers are out of the byte path entirely
  • Scales horizontally by buying more storage, not more app servers

🧠 Control plane — metadata

Carries "what files exist, what versions, who owns them, who can access them, which chunks make them up." Tiny payloads, high QPS. Optimized for latency and consistency.

  • Sharded SQL DB for ACID guarantees on the chunk-list invariant
  • In-memory cache (Memcache/Redis) for the hot-read path
  • Message queue + notification servers to push changes to other devices
  • Scales by sharding on user_id / workspace_id
Why this split matters: a 1 GB upload is the data plane's problem (1 GB of bytes to S3). The act of "telling other devices the file changed" is the control plane's problem (a few KB of metadata + a fanout). They scale independently because they have nothing in common except a chunk hash. You can throw a viral 5 GB video at this system and it never threatens the metadata path.

Pass 3 — the production shape

Now expand each plane to its real components. Read this diagram twice — once following the orange "data plane" lines (chunks → S3), once following the blue "control plane" lines (metadata → sync → notify).

flowchart LR
  CL([① Client App<br/>Watcher · Chunker · Indexer])
  LB[② Load Balancer]
  MS[③ Metadata Service<br/>API + business logic]
  SS[④ Sync Service<br/>diff resolver]
  MQ[⑤ Message Queue<br/>Kafka / SQS]
  NS[⑥ Notification Server<br/>long-poll / WS]
  MDB[(⑦ Metadata DB<br/>sharded SQL / NoSQL)]
  CACHE[(⑧ Metadata Cache<br/>Memcache / Redis)]
  BS[⑨ Block Storage<br/>S3 / object store]
  CDN[⑩ CDN / Edge Cache]
  CL == chunks ==> CDN
  CL == chunks ==> BS
  CL -- metadata ops --> LB --> MS
  MS --> CACHE
  MS --> MDB
  MS --> SS
  SS --> MQ
  MQ --> NS
  NS == push updates ==> CL
  style CL fill:#e8743b,stroke:#e8743b,color:#fff
  style MS fill:#4a90d9,stroke:#4a90d9,color:#fff
  style SS fill:#38b265,stroke:#38b265,color:#fff
  style MDB fill:#9b72cf,stroke:#9b72cf,color:#fff
  style CACHE fill:#9b72cf,stroke:#9b72cf,color:#fff
  style BS fill:#d4a838,stroke:#d4a838,color:#000
  style MQ fill:#3cbfbf,stroke:#3cbfbf,color:#000
  style NS fill:#3cbfbf,stroke:#3cbfbf,color:#000
  style CDN fill:#d4a838,stroke:#d4a838,color:#000

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. For each component we cover what it does and — more importantly — what problem it solves: if you can't say what would break without it, it doesn't belong in the design.

Client App

The desktop/mobile binary running on every user device. Not a thin shell — it carries real software: a Watcher that hooks into OS file-system events, a Chunker that splits files into 4 MB chunks and computes SHA-256 hashes, an Indexer backed by a local SQLite DB that mirrors the server's view of the user's tree, and a Network Layer that talks to both planes (S3 for bytes, metadata service for control).

Solves: offline edits, delta sync, and resumable uploads. Without local state, every save would round-trip to the server just to compute a diff.

Load Balancer

Sits in front of the stateless metadata service tier. L7 (HTTP-aware), sticky-hashed on user_id so the same user's requests warm the same caches and reuse TCP connections. Doesn't touch byte uploads — those bypass it entirely via presigned S3 URLs.

Solves: horizontal scale + zero-downtime deploys. Without it, you can't add or remove app servers under load and a single server failure surfaces to users.

Metadata Service

Stateless application servers — the only tier that writes to the metadata DB. Owns all business logic: "is this user allowed to see this file?", "does this chunk hash already exist?", "what's the current version?", "is this within quota?" Three core APIs: BeginUpload (dedup interrogation), CommitUpload (atomic version flip), GetChangesSince (diff fetch).

Solves: single source of truth. Every metadata read & write funnels through here so invariants — ACLs, versioning, dedup, quotas — are enforced in one place. If clients wrote directly to the DB, you'd have N flavors of "almost correct."
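A sketch of the dedup-interrogation half of BeginUpload, with an in-memory set standing in for the chunk table (names and shapes here are illustrative, not a real API):

```python
# Stand-in for the CHUNK table keyed by chunk_hash (illustration only).
known_chunks: set[str] = set()

def begin_upload(file_id: str, version_id: str, chunk_hashes: list[str]) -> list[str]:
    """Return only the hashes the store has never seen; the client uploads just these."""
    return [h for h in chunk_hashes if h not in known_chunks]
```

In the common edit-one-paragraph case this returns one or two hashes out of hundreds.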

Sync Service

Computes the diff between what a device last knew and what the server knows now. Triggered both ways: on a device poll ("what changed since v_99?") and on a server commit ("publish this change to subscribers"). Internally walks a per-workspace change log keyed by version number, so answering "what's new?" is O(changes), not O(files).

Solves: efficient sync. Without it, every reconnect would have to enumerate every file in the tree to find what changed — pathological for users with 100k+ files.
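The per-workspace change log fits in a few lines; the point is that "what's new since v?" slices the log instead of scanning the tree (the structures below are simplified stand-ins):

```python
from collections import defaultdict

# Append-only change log per workspace; list index doubles as version number.
change_log: dict[str, list[dict]] = defaultdict(list)

def record_change(ws_id: str, change: dict) -> int:
    change_log[ws_id].append(change)
    return len(change_log[ws_id])          # the new version number

def get_changes_since(ws_id: str, since_version: int) -> list[dict]:
    # O(number of changes since since_version), never O(number of files)
    return change_log[ws_id][since_version:]
```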

Message Queue (Kafka / SQS)

Buffers change events between the sync service (producer) and notification servers (consumers). Each ChangeEvent carries {ws_id, version_id, kind} — small, fast, ordered per workspace. Why insert a queue here? Because fanout is fundamentally async — a single commit may need to notify 50 000 subscribers; the upload should not wait for that fanout.

Solves: decoupling + back-pressure. Producer and consumers scale independently. A notification-server slowdown can't block the upload commit.

Notification Server

Holds long-poll (or WebSocket) connections from every online device. When the queue delivers a ChangeEvent for a workspace, it pushes to all subscribers of that workspace. Sticky-LB'd on workspace_id so subscribers consolidate on a few nodes per workspace — easier and cheaper to fan out. Mobile usually opts out and uses APNs/FCM pushes instead, for battery and data reasons.

Solves: real-time multi-device sync. Polling every few seconds is wasteful (most polls return nothing) and laggy. Push delivers near-instantly.

Metadata DB

Sharded relational DB — Postgres / MySQL / Spanner. Stores users, files, file_versions, file_chunks, chunks, acl tables. Sharded by user_id / workspace_id using consistent hashing so a user's entire tree co-locates on one shard — folder listings and ACL checks hit one box.

Solves: consistent state with ACID guarantees. The chunk-list-of-a-version invariant must be atomic — a half-written version is a corrupted file. NoSQL would force application-level reconciliation we don't need.

Metadata Cache

Memcache/Redis sitting in front of the metadata DB. Caches the read-heavy hot rows: open folders, ACL rows, recent file-version pointers, dedup-check results. Writes invalidate: when the metadata service commits a new version, it punches the cached file row. ACL TTLs are kept short (≤ 60 s) because stale shares are a security bug, not just a perf bug.

Solves: read amplification. Every "open folder" or "list workspace" call would hit the DB without a cache. At 95%+ hit rate the DB only sees writes + cache misses.

Block Storage (S3)

The actual chunk bytes live here, content-addressed by chunk_hash (SHA-256 of the chunk content). Same hash → same object → automatic dedup at the storage layer. Replication, erasure coding, cross-region failover, 11-nines durability — all handled by S3-class systems. Critically: clients upload directly via presigned URLs — bytes never traverse our app servers.

Solves: durability + petabyte scale. Building this ourselves would cost years and a whole org. We offload the hard problems and keep our own service tier small and stateless.
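Presigned URLs are a stock S3 capability; here is a minimal boto3 sketch (the bucket name and 15-minute expiry are illustrative):

```python
import boto3

s3 = boto3.client("s3")

def presign_chunk_put(chunk_hash: str, expires_s: int = 900) -> str:
    # The client PUTs chunk bytes straight to this URL; app servers stay out of the byte path.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "chunk-store", "Key": chunk_hash},  # content-addressed key
        ExpiresIn=expires_s,
    )
```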

CDN / Edge Cache

Caches popular chunks at the network edge close to the user. Cache key is the immutable chunk_hash, so cache invalidation is a non-problem. Especially powerful for shared files: a viral "all-hands.mp4" downloaded by 50 000 employees → after the first few requests, every other download is served from a POP near the user.

Solves: origin pressure + global latency. Without a CDN, every download from Sydney round-trips to us-east. With it, the origin sees thousands of requests instead of millions, and users get sub-100 ms downloads regardless of geography.

Putting it together — a concrete walkthrough

Sarah edits roadmap.docx on her laptop. It's a 12 MB file; she changed two paragraphs. Here's exactly which components touch the request, and why each one is on the path.

  1. Watcher sees the FS event on the laptop and notifies the chunker.
  2. Chunker splits the new file into 3 × 4 MB chunks, hashes each, and compares against the local index. Two hashes match (unchanged); one is new.
  3. Client calls BeginUpload on the Metadata Service through the Load Balancer — sends all 3 hashes.
  4. Metadata Service hits the Metadata Cache first to dedup-check. Two hashes are present; one is missing. Returns "upload chunk #2 only."
  5. Client uploads just chunk #2 directly to Block Storage via a presigned URL. Bytes never touch our app servers.
  6. Client calls CommitUpload. Metadata Service writes a new FileVersion row + 3 FileChunk rows to the Metadata DB, atomically bumps current_version.
  7. On commit, Metadata Service hands the changeset to the Sync Service, which publishes a ChangeEvent to the Message Queue.
  8. The Notification Server holding Sarah's phone's long-poll connection receives the event from the queue and pushes "ws_42 changed."
  9. Sarah's phone calls GetChangesSince on the Metadata Service, gets back "version v100, missing chunk #2."
  10. Phone fetches chunk #2 from CDN (which pulls from Block Storage on cache miss), assembles the file locally. Done.
Trace the planes: steps 5 and 10 are the only times bytes move — that's the data plane. Steps 1–4 and 6–9 are all metadata + events — that's the control plane. The 12 MB upload contributed exactly one S3 PUT (4 MB) and a handful of small DB writes. That's the whole architectural payoff.
Key design call: separation of data and control. A large upload is "stream the missing chunks to S3 + write a few rows to the metadata DB" — never "POST 1 GB to our app server." Our service stays small and stateless; the storage tier does the heavy lifting; the queue absorbs notification spikes. Every component earns its place by solving a problem the naive design couldn't.
Step 5

Why Chunking is the Whole Game

If you remember one thing from this design: files are stored as a list of chunks, not as a single object. Every benefit of the design — dedup, delta sync, resumability — falls out of this single decision.

🔁 Resumability

1 GB upload drops at chunk 198/250 → resume from 198, not from zero.

📡 Delta sync

Edit one paragraph in a 50 MB doc → re-upload one 4 MB chunk, not 50 MB.

♻️ Dedup

Same chunk hash already exists? Skip the upload. Two users uploading the same movie pay storage once.

🚀 Parallelism

Upload 10 chunks in parallel. Saturate the user's uplink instead of TCP-streaming one big request.

Chunk-size trade-off
# Smaller chunks (e.g., 256 KB):
+  Better delta sync — finer granularity for edits
+  Faster retry on transient failures
-  More metadata rows per file (a 1 GB file = ~4,000 chunks vs 250 at 4 MB)
-  More object-store API calls → higher cost & latency

# Larger chunks (e.g., 16 MB):
+  Less metadata, fewer S3 calls
+  Better throughput per request
-  Re-uploading on tiny edits is wasteful
-  Failed chunk costs more bandwidth to retry

# Industry-standard sweet spot: 4 MB
# Dropbox uses 4 MB. Google Drive uses variable rolling-hash chunking
# (Rabin fingerprinting) so identical content remains aligned across edits.
Advanced: fixed-size chunking has a famous weakness — insert one byte at the start of a file and every chunk boundary shifts, defeating dedup. Production systems use content-defined chunking (Rabin fingerprinting) where boundaries are determined by content hash, not byte offset. That way an insert only changes the chunks around the edit point.
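A toy content-defined chunker in the Gear/FastCDC style: boundaries depend on a rolling hash of recent bytes, so an insert only perturbs nearby chunks (chunk sizes are scaled down from 4 MB for readability; the gear table and parameters are illustrative):

```python
import random

_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]   # per-byte random constants

def cdc_split(data: bytes, min_sz: int = 2048, avg_sz: int = 8192,
              max_sz: int = 32768) -> list[bytes]:
    mask = avg_sz - 1                  # avg_sz must be a power of two
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Bytes older than 32 positions shift out of the 32-bit hash: a rolling window.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= min_sz and ((h & mask) == 0 or size >= max_sz):
            out.append(data[start:i + 1])   # the content itself chose this boundary
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out
```

Prepend one byte to `data` and compare the two chunk lists: only the first chunk or two differ, because boundaries re-synchronize at the next content-defined cut. Fixed-size chunking would shift every boundary.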
Step 6

Client Components — More Than a Folder Watcher

The desktop client is a real piece of software with four cooperating modules. Understanding it earns you a lot of credit in interviews because most candidates wave their hands at "the client" and skip to the server.

flowchart LR
  W[Watcher<br/>fs events]
  CH[Chunker<br/>split + hash]
  IDX[Indexer<br/>local SQLite]
  NET[Network Layer<br/>HTTP / long-poll]
  W -- file changed --> CH
  CH -- chunks + hashes --> IDX
  IDX -- diff vs server --> NET
  NET -- upload chunks --> BS[(Block Storage)]
  NET -- update metadata --> MS[Metadata Svc]
  NET -. listen for remote changes .-> NS[Notification Server]
  NS -. push notifies .-> IDX
  style W fill:#e8743b,stroke:#e8743b,color:#fff
  style CH fill:#4a90d9,stroke:#4a90d9,color:#fff
  style IDX fill:#38b265,stroke:#38b265,color:#fff
  style NET fill:#9b72cf,stroke:#9b72cf,color:#fff

👁️ Watcher

Hooks into OS file-system events (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). Debounces — a "save" can fire 5 events in 100 ms, we coalesce.

✂️ Chunker

Splits files into 4 MB chunks. Computes SHA-256 per chunk. Compares against the local index first — if the chunk hash matches the previous version, skip it. This is delta sync.

🗂️ Indexer (local SQLite)

The client's local "metadata DB" — file path, version, chunk list, last-synced timestamp. Survives restarts. Used to compute diffs and to support offline edits.

🌐 Network Layer

Three responsibilities: (1) upload chunks via presigned URLs, (2) call metadata APIs, (3) hold a long-poll connection to the notification server. Exponential back-off on failures.

Why the client carries a real database: without local state, every save would round-trip to the server to ask "what's the current version?" — that's both slow and broken offline. The local SQLite is the source of truth for "what this device thinks the world looks like" — sync is the act of reconciling that view with the server's view.
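The core of delta sync fits in a dozen lines: hash the new content per fixed 4 MB chunk, then diff against the hashes the index recorded at last sync (a sketch; treat `last_synced` as what the local SQLite returned):

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # fixed 4 MB chunks

def chunk_hashes(data: bytes) -> list[str]:
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def chunks_to_upload(new: list[str], last_synced: list[str]) -> list[int]:
    """Indices whose chunk content changed since the last synced version."""
    old = set(last_synced)
    return [i for i, h in enumerate(new) if h not in old]
```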
Step 7

Metadata Schema — Five Core Entities

Keep the schema small. The trick is recognizing that file and chunk are separate first-class entities — chunks have their own lifecycle, hash-keyed identity, and dedup story.

erDiagram
  USER ||--o{ FILE : owns
  USER ||--o{ DEVICE : registers
  USER ||--o{ WORKSPACE : has
  WORKSPACE ||--o{ FILE : contains
  FILE ||--o{ FILE_VERSION : "history of"
  FILE_VERSION ||--o{ FILE_CHUNK : "made of"
  CHUNK ||--o{ FILE_CHUNK : "referenced by"
  FILE ||--o{ ACL : "shared via"
  USER {
    string user_id PK
    string email
    timestamp created_at
  }
  DEVICE {
    string device_id PK
    string user_id FK
    string platform
    timestamp last_seen
  }
  WORKSPACE {
    string ws_id PK
    string user_id FK
    string root_path
  }
  FILE {
    string file_id PK
    string ws_id FK
    string path
    string current_version FK
    bigint size
    timestamp updated_at
  }
  FILE_VERSION {
    string version_id PK
    string file_id FK
    int rev_number
    string user_id FK
    timestamp created_at
  }
  CHUNK {
    string chunk_hash PK
    bigint size
    string storage_url
    int ref_count
  }
  FILE_CHUNK {
    string version_id FK
    int chunk_index
    string chunk_hash FK
  }
  ACL {
    string file_id FK
    string user_id FK
    string permission
  }

🔑 Why chunk_hash is the primary key

Identical content collapses to one row. Two users uploading the same PDF reuse the same chunks — storage cost is paid once. The ref_count field is what lets us safely garbage-collect chunks when no version references them anymore.

📜 Why versions are first-class

"Restore to yesterday" becomes a metadata-only operation: point FILE.current_version at an older row. No bytes move. Same trick powers undo, time-machine, and ransomware recovery.

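The same shapes in a toy SQLite schema (a column subset of the diagram; `:memory:` for illustration), with "restore" as the promised single-row update:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file         (file_id TEXT PRIMARY KEY, ws_id TEXT, path TEXT,
                           current_version TEXT);
CREATE TABLE file_version (version_id TEXT PRIMARY KEY, file_id TEXT,
                           rev_number INTEGER, created_at TEXT);
CREATE TABLE file_chunk   (version_id TEXT, chunk_index INTEGER, chunk_hash TEXT);
CREATE TABLE chunk        (chunk_hash TEXT PRIMARY KEY, size INTEGER,
                           storage_url TEXT, ref_count INTEGER DEFAULT 0);
""")

# Rollback = repoint the version pointer. No bytes move.
db.execute("UPDATE file SET current_version = ? WHERE file_id = ?", ("v_99", "f_1"))
```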
Step 8

Upload Flow — End to End

Walk through what happens when User A saves a 1 GB file. The sequence below is the single most important diagram in the design — every chunk-skip and dedup choice is visible here.

sequenceDiagram
  actor A as Client A
  participant CH as Chunker
  participant MS as Metadata Svc
  participant BS as Block Storage
  participant SS as Sync Svc
  participant Q as Queue
  participant NS as Notification Svc
  actor B as Client B
  A->>CH: file changed (1 GB)
  CH->>CH: split into 250 × 4 MB chunks<br/>compute SHA-256 per chunk
  CH->>MS: BeginUpload(file_id, version_id, [250 hashes])
  MS->>MS: lookup chunks by hash<br/>(dedup check)
  MS-->>A: missing_chunks = [12, 47, 198] (only 3 are new)
  loop for each missing chunk
    A->>BS: PUT chunk (via presigned URL)
    BS-->>A: 200 OK
  end
  A->>MS: CommitUpload(version_id, all 250 hashes)
  MS->>MS: write FileVersion + FileChunk rows<br/>bump current_version
  MS->>SS: changeset for ws_id
  SS->>Q: publish ChangeEvent
  Q->>NS: deliver to subscribers of ws_id
  NS->>B: long-poll response: "ws_id changed"
  B->>MS: GetChangesSince(last_synced)
  MS-->>B: new version + chunk list
  B->>BS: GET only the chunks B doesn't have
  B->>B: assemble & write to local FS

1️⃣ Why BeginUpload first

Asking the server "which of these 250 hashes do you already have?" lets us skip 247 chunk uploads in the common edit-one-paragraph case. This is the dedup interrogation step.

2️⃣ Why presigned URLs

Bytes go client → S3 directly. Our app servers never proxy a gigabyte. Dramatically cheaper bandwidth and statelessness — any app server can handle the next request.

3️⃣ Why CommitUpload at the end

The version row is only written after every chunk is durable. A crash mid-upload leaves orphan chunks (cleaned up by a sweeper job) but never a corrupt visible version.

Atomicity trick: the new version isn't visible to other devices until CommitUpload bumps FILE.current_version. Until that single row update, B keeps reading the old version. This is how we get atomic file replacement on top of an eventually-consistent blob store.
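Continuing the toy schema from Step 7, CommitUpload is one transaction whose last statement is the visibility flip (a sketch reusing that `db` connection, not production code):

```python
def commit_upload(db, file_id: str, version_id: str, rev: int,
                  chunk_hashes: list[str]) -> None:
    with db:  # sqlite3 connection as context manager = one atomic transaction
        db.execute("INSERT INTO file_version VALUES (?, ?, ?, datetime('now'))",
                   (version_id, file_id, rev))
        db.executemany("INSERT INTO file_chunk VALUES (?, ?, ?)",
                       [(version_id, i, h) for i, h in enumerate(chunk_hashes)])
        # The flip: readers see the old version or the new one, never a half-write.
        db.execute("UPDATE file SET current_version = ? WHERE file_id = ?",
                   (version_id, file_id))
```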
Step 9

Sync & Notifications — How Devices Hear About Changes

Polling every 5 seconds wastes bandwidth and lags by 5 seconds in the worst case. Push is faster but harder. Production systems use HTTP long-polling (or WebSockets) — the client opens a connection that the server holds open until something happens.

sequenceDiagram
  participant B as Client B
  participant LB as Load Balancer
  participant NS as Notification Svc
  participant Q as Queue
  B->>LB: GET /v1/changes/poll?ws=ws_42&since=v_99
  LB->>NS: route (sticky by ws_id)
  NS->>NS: register subscriber for ws_42<br/>start 30s timer
  Note over NS: holds the request open
  Q->>NS: ChangeEvent ws_42 v_100
  NS-->>B: 200 { changes: [...] }
  B->>B: process changes
  B->>LB: GET /v1/changes/poll?ws=ws_42&since=v_100
  Note over B,NS: cycle repeats

📥 Pull / Polling

Simple — just retry on a timer. Wasteful — most polls return nothing. Latency upper-bounded by poll interval. Fine for low-priority sync (mobile on cellular).

📤 Long-poll

Server holds the request until data is available or 30 s elapses. Near-instant push without the operational complexity of WebSockets. Works through corporate proxies. Default for desktop sync.

🔌 WebSockets

Persistent bidirectional. Lower per-message overhead than long-poll. Right answer when notification volume is high (e.g., active collab session). Cost: stateful connection management.

Mobile twist: mobile clients usually don't long-poll — battery and data cost too much. They subscribe to APNs / FCM push channels and only sync on user action or system push. The notification service publishes to APNs/FCM in addition to its own long-poll listeners.
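The hold-the-request-open mechanic, sketched with asyncio (single-process only; real notification servers shard subscribers by ws_id behind the sticky LB, and versions here are plain integers for simplicity):

```python
import asyncio
from collections import defaultdict

waiters: dict[str, set[asyncio.Event]] = defaultdict(set)
latest:  dict[str, int] = defaultdict(int)

def publish(ws_id: str, version: int) -> None:
    """Called by the queue consumer; wakes every open poll for the workspace."""
    latest[ws_id] = version
    for ev in waiters.pop(ws_id, set()):
        ev.set()

async def poll(ws_id: str, since: int, timeout: float = 30.0) -> int | None:
    if latest[ws_id] > since:              # changes already pending: answer instantly
        return latest[ws_id]
    ev = asyncio.Event()
    waiters[ws_id].add(ev)
    try:
        await asyncio.wait_for(ev.wait(), timeout)   # hold the request open
        return latest[ws_id]
    except asyncio.TimeoutError:
        return None                        # 30 s of silence; client re-polls
    finally:
        waiters[ws_id].discard(ev)
```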
Step 10

Data Deduplication — Where to Do It

Deduplication is "don't store the same chunk twice." The where matters: doing it on the client saves bandwidth; doing it on the server saves only storage.

📡 In-line (client-side, recommended)

Client hashes every chunk first, asks the server "got this hash?", uploads only the misses. Saves storage + bandwidth + battery. The cost: one extra round-trip before each upload.

  • Used by Dropbox, OneDrive
  • Wins big for backups and shared docs

🧹 Post-process (server-side)

Client uploads everything. A background job scans hashes and merges duplicates. Simpler client; wastes upload bandwidth.

  • Used when bandwidth is cheap and clients are dumb
  • Lets the upload path stay synchronous & trivial
Real-world payoff: on a corporate Dropbox account, > 50% of "uploaded" bytes are skipped because the chunks already exist somewhere in the global pool. The dedup table is one of the most cost-leveraged data structures in the entire stack.
Privacy footnote: chunk-hash dedup across different users has a subtle attack — knowing a hash exists in the system tells you someone, somewhere, has that file. Production systems either dedup only within a single user/team, or use convergent encryption to make hashes per-tenant.
Step 11

Metadata Partitioning — When 100 TB Doesn't Fit on One Box

Three sharding strategies, each with a failure mode. Pick the one whose failure mode you can live with.

| Scheme | How it works | Wins | Loses |
|---|---|---|---|
| Vertical | Tables on different DBs (users on DB1, files on DB2) | Easy first step; isolates feature load | Cross-table joins now go cross-network; hits a ceiling fast |
| Range (by path prefix) | All paths starting with A on shard 1, B on shard 2… | Predictable; range queries fast | Hot shards (everyone has /Documents); rebalancing painful |
| Hash (consistent hashing on file_id) | Hash function maps each file to a shard | Uniform load; easy to add shards | Listing a folder now hits multiple shards — needs scatter-gather |
Pragmatic answer: shard by user_id (or workspace_id). Co-locates a user's whole tree on one shard so folder listings and ACL checks hit one box. Use consistent hashing so adding shards moves only ~1/N of the keys, not all of them. This is the same trick Cassandra and DynamoDB use under the hood.
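A compact ring with virtual nodes (a sketch; SHA-256 placement and 100 vnodes per shard are illustrative choices):

```python
import bisect
import hashlib

def _h(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, shards: list[str], vnodes: int = 100):
        self._ring = sorted((_h(f"{s}#{v}"), s) for s in shards for v in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def shard_for(self, user_id: str) -> str:
        i = bisect.bisect(self._points, _h(user_id)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("user_42"))  # stable placement until the ring itself changes
# Adding shard-4 steals only the keys whose nearest vnode is now shard-4's: ~1/N of them.
```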
Step 12

Caching & Load Balancing

Two caches do almost all the work. Three load-balancing layers cover the rest.

⚡ Block Cache (CDN / edge)

Hot chunks served from CDN — never touch our origin S3. Especially powerful for shared files and "training_video.mp4" that 50 K employees download in the same week.

  • LRU eviction by chunk popularity
  • ~144 GB on a commodity node ≈ 36 K chunks @ 4 MB
  • Cache key = chunk_hash (immutable, never invalidated)

🧠 Metadata Cache (Memcache)

Caches FILE + FILE_VERSION + ACL rows. Read-heavy: every folder open is a metadata read. Writes go through to the DB; cache invalidates on commit.

  • Hit rate target: > 95%
  • TTL short (e.g., 60 s) for ACL — stale shares are a security bug
  • Pin hot folders (root + recents)
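Cache-aside with write-invalidate, sketched against Redis (the key format and the `load_file_row` accessor are mine, not a real API):

```python
import json
import redis

r = redis.Redis()

def load_file_row(file_id: str) -> dict:
    return {"file_id": file_id, "current_version": "v_100"}  # stand-in for the sharded-DB read

def get_file_meta(file_id: str, ttl_s: int = 60) -> dict:
    key = f"file:{file_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # hot path: no DB touch
    row = load_file_row(file_id)             # cache miss: read through to the DB
    r.setex(key, ttl_s, json.dumps(row))     # short TTL so ACL-bearing rows go stale fast
    return row

def on_commit(file_id: str) -> None:
    r.delete(f"file:{file_id}")              # punch the row; next read repopulates
```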

LB · client → metadata svc

L7 load balancer; sticky on user_id hash for connection reuse.

LB · client → notification svc

Sticky on ws_id so a workspace's subscribers consolidate on a few nodes — easier fanout.

LB · client → block storage

Usually the CDN itself. Origin S3 sits behind global anycast.

Step 13

Security, Permissions & Sharing

The interesting question isn't "how do we encrypt?" — it's "what does shared mean when the file lives as 250 chunks scattered across an object store?"

🔐 Permissions model

  • ACL row per (file_id, user_id) with permission level (read / write / admin)
  • Folder-level ACLs propagate down at query time (denormalize for hot reads)
  • Public links = ACL with a tokenized "anonymous" principal + optional expiry
  • Every metadata read does an ACL check; cached but with short TTL

🛡️ Encryption layers

  • In transit: TLS 1.3 client ↔ everything
  • At rest (server-side): S3 SSE-KMS — keys per workspace
  • Block-level integrity: SHA-256 verified on download
  • Optional E2EE tier — keys never leave client; loses dedup & preview
Sharing gotcha: when you "unshare" a folder, you must invalidate any cached ACL and any signed download URL the user might still hold. Short URL expiries (≤ 5 min) make this tractable — long-lived URLs become a revocation nightmare.
Step 14

Trade-offs & Talking Points

What you'd say in the interview when asked "why did you do X and not Y?"

| Decision | Alternative | Why this choice |
|---|---|---|
| Fixed 4 MB chunks | Variable (Rabin-fingerprint) chunks | Simpler client; "good enough" dedup. Variable adds CPU cost on every save and complicates the chunk store. |
| In-line (client) dedup | Post-process server dedup | Saves bandwidth on the upload, not just storage. Costs one extra RTT but it's tiny. |
| HTTP long-poll for notifications | WebSockets / SSE | Long-poll traverses corporate proxies cleanly and is easier to load-balance. WS is better for high-frequency push but overkill for sync. |
| Sharded SQL for metadata | Single Cassandra cluster | Metadata invariants (chunk lists, version pointers) want ACID. Cassandra would force application-level reconciliation we don't need. |
| Direct client → S3 uploads | Proxy through app servers | Saves a full bandwidth hop. Lets the app tier stay stateless. Presigned URLs handle the auth handoff. |
| Eventual consistency for cross-device sync | Strong (Spanner-style) consistency | A few-second sync delay is fine for a file sync product. Strong consistency would 10× the cost for no perceivable user benefit. |
Step 15

Interview Q&A — The Follow-up Round

If you nail the architecture, the next 15 minutes are these. Have answers ready.

Two devices edit the same file offline. Both come online. What happens?
Conflict resolution. Server's metadata service detects two versions claiming to derive from the same parent revision. Strategy: last-writer-wins for the canonical name, and the loser is renamed to file (Conflict copy from device-X 2026-05-02).ext. Surface the conflict to the user — never silently lose data. Real CRDT-style merge requires content-aware logic (text vs image vs binary) and is out of scope for plain file sync.
A user uploads a 5 GB file and their connection drops at 60%.
Resume from chunk 750/1250. The chunker re-hashes from where the upload paused. The BeginUpload call returns "missing" only for chunks 750+, since 0–749 are durable in S3. Client uploads the remaining 500. Wasted bandwidth: at most one partial chunk — and crucially, no full restart. This is why chunking exists.
How do you garbage-collect orphan chunks (uploaded but never committed)?
Reference counting + a sweeper. Each chunk row tracks ref_count, incremented when a FileVersion references it and decremented when the version is deleted or rolled back. A nightly job lists chunks with ref_count = 0 AND created_at < now() - 24h and deletes them from S3. The 24-hour grace handles in-progress uploads.
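The sweeper in sketch form (SQLite date arithmetic for brevity; `db` is a DB-API connection, `s3` a boto3 client, and the bucket name is illustrative):

```python
import boto3

s3 = boto3.client("s3")

def sweep_orphans(db, grace_hours: int = 24) -> None:
    rows = db.execute(
        "SELECT chunk_hash FROM chunk "
        "WHERE ref_count = 0 AND created_at < datetime('now', ?)",
        (f"-{grace_hours} hours",),
    ).fetchall()
    for (chunk_hash,) in rows:
        s3.delete_object(Bucket="chunk-store", Key=chunk_hash)
        # Re-check ref_count in the DELETE; a commit may have claimed the chunk meanwhile.
        db.execute("DELETE FROM chunk WHERE chunk_hash = ? AND ref_count = 0",
                   (chunk_hash,))
```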
A celebrity uploads a viral video — 1 M users download it in the same hour. How does your system survive?
The CDN absorbs it. All chunks for that file get pulled to edge caches on the first few requests; the remaining 999 K download from the user's nearest POP, never our origin. Origin S3 sees ~1 K requests, not 1 M. Metadata is cached in Memcache so the GetFileVersion calls are mostly hot reads. The bottleneck shifts to our notification fanout if those 1 M users are all subscribed to the same shared workspace — which is why workspaces should be sharded across notification servers, not pinned to one.
How do you prevent a malicious user from uploading 100 TB to fill our quota and DOS others?
Per-user quota + per-IP rate limit + chunk-level dedup. Quota is enforced on CommitUpload — counts the committed bytes of new (non-deduped) chunks against the user's plan. Rate limiting at the API gateway caps RPS. Dedup ironically helps DOS resistance — uploading the same chunk 1000× costs the attacker 1000 round-trips but us only one row update.
Why don't you use Kafka for everything instead of having a separate metadata DB?
Different consistency needs. "What chunks make up the current version of file X" is an invariant we need to atomically commit and read consistently — that's a relational write. Kafka is great for "tell every device that file X changed" — that's a fan-out event. Mixing them puts your invariants behind eventual consistency and muddies the API surface. Use the right tool: SQL for invariants, Kafka for events.
You said metadata is sharded by user_id. What about shared folders?
Owner of the folder owns the shard. Collaborator's metadata service reads the folder by routing to the owner's shard. Cached aggressively. Yes, this means a viral shared folder creates a hotspot on its owner's shard — mitigated by cache, and by promoting truly large shared spaces (companies, communities) to their own dedicated shards via consistent-hashing virtual nodes.
Step 16

Production Checklist

What you'd verify before shipping. Order roughly maps to severity.

  • Chunks immutable and content-addressed by SHA-256 — never mutated in place
  • CommitUpload is the only operation that bumps current_version — atomic single-row write
  • Reference counting on chunks + nightly orphan sweeper with 24 h grace window
  • Presigned upload URLs expire in ≤ 15 min; download URLs in ≤ 5 min
  • Per-user quota enforced at commit time, not at upload time (dedup-aware)
  • ACL cache TTL ≤ 60 s — stale shares are a security bug, not a perf bug
  • Metadata DB sharded with consistent hashing on user_id / workspace_id
  • Notification servers consolidate subscribers per workspace_id (sticky LB)
  • Mobile uses APNs/FCM, not long-poll — battery + data discipline
  • CDN warm-up for shared "team announcement" content via origin pre-fetch
  • End-to-end SHA-256 verification on download — corrupted chunks fail loud
  • Conflict copies surfaced to the user, never silently overwritten
  • Audit log on a separate store (Cassandra / S3+Athena) — never on the hot metadata path
  • Disaster recovery: cross-region async replication of both metadata and block storage
The principle: every "feature" of Dropbox — resumable uploads, rollback, sharing, dedup, offline edits — falls out of two design choices: files-as-chunks and data plane separated from control plane. Get those two right and the rest is plumbing. Get them wrong and no amount of caching will save you.
