From "store one MP4 on a disk" to a globally-distributed, transcode-driven, CDN-served streaming platform — the architecture that turns 25GB of uploads per second into 1080p that starts in 200ms anywhere on Earth
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine Priya, a creator in Bangalore, finishes editing a 1GB 4K travel vlog at 14:00 IST. She hits "Upload". Twelve hours later, Raj — a viewer in Mumbai sitting on a wobbly 6 Mbps mobile connection — taps her thumbnail. He expects the video to start playing in under a second, at the highest quality his bandwidth can sustain, with zero rebuffering. Those are wildly different bytes traveling wildly different paths: Priya's pristine 4K source has to be sliced, transcoded into half a dozen resolutions, replicated worldwide, and cached at the edge — so that by the time Raj presses play, there's a 720p version sitting on a server 30 km from his phone, ready to stream chunk by chunk.
That entire journey — from one creator's "Upload" click to a billion viewers tapping "Play" — is what we're designing. YouTube and Netflix differ in catalog (user-generated vs. licensed) but the streaming spine is identical: ingest → transcode → store → serve from edge.
Before drawing a single box, lock down what the system must do. In an interview, asking these questions out loud shows you're not memory-dumping a solution — you're scoping the problem.
Numbers are not optional in HLD. They drive every architectural choice — sharding, caching, CDN sizing — so do them out loud, even if rough. Video streaming is wildly read-heavy: assume a 200:1 view-to-upload ratio.
Assume 1.5B total users, 800M daily active users, each watching 5 videos/day. That's 4 billion video views/day — roughly 46K views/sec. With a 1:200 upload-to-view ratio, we get ~230 uploads/sec.
| Stat | Value | Basis |
|---|---|---|
| Daily active users | 800M | out of 1.5B total users |
| Views/sec | ~46K | 800M × 5 / 86,400 |
| Uploads/sec | ~230 | 1:200 upload-to-view ratio |
| Uploaded video | ~500 hrs/min | real-world YouTube number |
If 500 hours of video are uploaded per minute, and we assume an average bitrate of ~50 MB per minute of footage (a mid-range encode for 1080p), raw ingest is:

500 hrs × 60 min/hr × 50 MB/min = 1,500,000 MB/min ≈ 25 GB/sec of raw upload bytes.
But we don't store just the original — we transcode each upload into 5 resolution variants (240p, 480p, 720p, 1080p, 4K) and 2-3 codecs (H.264, VP9, AV1). After dedup and smarter encoding, total stored bytes per minute of source video come to roughly 2-3× the raw size. Over a year that's hundreds of petabytes — a number only large object stores can hold.
Inbound (uploads): 230/sec × ~25 MB average ≈ 6 GB/sec ingress (the 500-hrs/min figure above puts the ceiling closer to 25 GB/sec). Either way, ingress is manageable.
Outbound (playback): with a 200:1 view-to-upload ratio, egress is 200× ingress: ~1 TB/sec sustained outgoing. That number is the single most important constraint in the design — and it explains why CDN is non-negotiable, not optional.
| Metric | Value | Why it matters |
|---|---|---|
| Views/sec | 46 K/s | Drives CDN edge fleet sizing |
| Uploads/sec | 230/s | Drives upload service & transcode farm sizing |
| Raw ingest | 25 GB/s | Forces distributed object store, not a SAN |
| Egress | ~1 TB/s | Physically requires a CDN — no origin can do this |
| Hot cache | tens of TB | Justifies multi-tier cache: edge → regional → origin |
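These estimates are easy to sanity-check in code. A quick Python script using the same assumed inputs (nothing here is measured):

```python
# Sanity-check the capacity estimates above. All constants are the
# assumed inputs from this section, not measured values.
DAU = 800_000_000                  # daily active users
VIEWS_PER_USER_PER_DAY = 5
VIEW_TO_UPLOAD_RATIO = 200
UPLOAD_HOURS_PER_MIN = 500         # real-world YouTube figure
MB_PER_VIDEO_MINUTE = 50           # mid-range 1080p encode

views_per_sec = DAU * VIEWS_PER_USER_PER_DAY / 86_400
uploads_per_sec = views_per_sec / VIEW_TO_UPLOAD_RATIO
raw_ingest_gb_per_sec = (UPLOAD_HOURS_PER_MIN * 60 * MB_PER_VIDEO_MINUTE
                         / 60 / 1000)

print(f"{views_per_sec:,.0f} views/sec")                 # ~46,296
print(f"{uploads_per_sec:,.0f} uploads/sec")             # ~231
print(f"{raw_ingest_gb_per_sec:.0f} GB/sec raw ingest")  # ~25
```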
Three endpoints carry the system: upload a video, search the catalog, and stream a video. The third is the unusual one — it returns a stream of bytes, not JSON.
REST API surface:

```
// Upload — slow, large payload, resumable
POST /api/v1/videos
Headers: { "X-API-Key": "abc123..." }
Body (multipart): {
  "title": "Goa Travel Vlog 2026",
  "description": "Sunsets, beaches, and street food",
  "tags": ["travel", "goa", "vlog", "2026"],
  "category": "Travel",
  "language": "en",
  "video_blob": <binary 1 GB MP4>
}
→ 202 Accepted { "video_id": "v_8f2a1c", "state": "PROCESSING" }

// Search — keyword query against the catalog
GET /api/v1/search?q=goa+sunset&count=20&page_token=...
→ 200 OK {
  "results": [
    {
      "video_id": "v_8f2a1c",
      "title": "Goa Travel Vlog 2026",
      "thumbnail_url": "https://cdn.../thumb_8f2a1c.jpg",
      "duration_sec": 612,
      "views": 1250
    },
    ...
  ],
  "next_page_token": "eyJ..."
}

// Stream — returns BYTES not JSON; HLS/DASH manifest + chunked video
GET /api/v1/videos/:video_id/stream?codec=h264&resolution=720p&offset=0
→ 200 OK
Content-Type: application/vnd.apple.mpegurl  // HLS manifest
// (or video/mp4 for a single chunk)
// CDN serves this; origin never sees it in steady state
```
**streamVideo returns a STREAM, not JSON.** A 1080p video is hundreds of MB. We don't ship it in one response — we ship a tiny manifest (HLS .m3u8 or DASH .mpd) that lists 10-second chunk URLs. The player downloads chunks on demand, picks resolution per chunk based on current bandwidth (adaptive bitrate), and buffers ~30 seconds ahead. This is what makes "real-time playback" possible — the first chunk is served in 200ms while the rest is fetched in the background.

**X-API-Key.** Upload is expensive (storage + transcode + CDN), so tie every upload to an authenticated user and rate-limit per user (e.g., 10 uploads/hour for the free tier). Search is rate-limited per IP to deter scrapers building competitor catalogs.

Three completely different data shapes live in this system, and trying to put them in one database is the first instinct to override:
**Video metadata.** Small rows (a few KB each), billions of them, frequent reads, occasional updates (view count, likes). Goes in MySQL or Cassandra — Cassandra wins at our scale because of native sharding + multi-region replication.

**Video blobs.** Huge files (50 MB to several GB), written once, read many times, served by CDN. Goes in a distributed object store — HDFS, GlusterFS, or S3. Optimized for sequential reads and durability, not for low-latency lookups.

**Thumbnails.** Tiny files (5-20 KB), many per video (one per playback resolution + scrubber preview frames), absurdly hot (every search result page loads dozens). Goes in a BigTable-style store tuned for small files + high read rate.
One YouTube page can render 20+ thumbnails in the search grid. At 46K views/sec, the thumbnail read rate is several hundred K/sec — far higher than video reads. A general-purpose object store (S3) is tuned for big files; small files there pay metadata overhead per request. BigTable / HBase coalesces millions of small files into wide-column rows with internal compression — orders of magnitude better small-file read throughput.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.
Sketch the simplest possible system. One app server. It accepts an upload, stores the MP4 on its local disk, and serves the file back when someone hits the play URL.
Three concrete failures emerge the moment a few thousand viewers show up:
**Bandwidth.** A single server has roughly 10 Gbps of network egress — about 1.25 GB/s. We need 1 TB/s. To serve our actual playback load from one box you'd need 800 of these in parallel and a way to synchronize them — which is just "build a CDN", but worse.

**Latency.** If the only origin is in Virginia, every Mumbai viewer pays a 200 ms round-trip just for DNS and TCP setup before the first video byte arrives. Add the TLS handshake and the player waits ~600 ms before showing frame 1. Users abandon at 2 seconds — we'd be losing them before the video even starts.

**Durability.** A single disk failure deletes every video on it. There are no backups, no replicas. A creator who uploaded their wedding 5 years ago opens a 404. The product's core promise — "your videos live forever" — dies in one hardware fault.
The most important insight in this design is that video has three completely different workloads happening simultaneously, and they must scale independently:
**Ingest plane.** Slow, durable writes: 230 uploads/sec, each potentially gigabytes. Latency budget: minutes (users tolerate "uploading… 47%" for a long file). Must be resumable (if their wifi blips at 80% we don't restart). Optimized for write throughput and durability.

**Processing plane.** Heavy async work. Transcode each upload into 5+ resolutions × 2-3 codecs = ~15 output variants. Generate thumbnails. Run dedup checks. CPU-bound (transcoding is expensive). Scales horizontally with a worker pool draining a queue.

**Serving plane.** Fast, massive egress: 1 TB/s sustained, sub-200ms first byte, geographically everywhere. Must be served from edge POPs, not origin. This plane is dominated by CDN infrastructure; origin servers are only ever cache-fill sources.
The KEY insights this split surfaces:
Encoding a 10-minute 4K video into 5 lower resolutions takes several minutes of CPU even on a fast machine. We can't do it inline during upload — the request would time out. So uploads are accepted first, processed later, with state transitions visible to the user.
Bandwidth physics: no single origin can do petabit egress. We must push video bytes out to hundreds of POPs (Points of Presence) close to viewers. The CDN isn't a performance optimization — it's the only physically possible serving model.
Metadata is small, hot, queryable — it goes in Cassandra. Video bytes are huge, sequential, served via CDN — they go in object storage. Trying to share infrastructure between the two slows both down. Splitting them is mandatory, not optional.
Now the full picture. Every node in the architecture is numbered ①–⑫, and each numbered card below answers three questions: what is this, why is it here, and what would break without it.
**① Client.** Anything that uploads or plays a video — a desktop browser, a mobile app, a smart TV, a game console. The client is more sophisticated than in most systems: it runs an HLS/DASH player that watches its own bandwidth, fetches the next 10-second chunk at the right quality, and re-requests at lower quality if the buffer drains. The client is the first half of "adaptive bitrate streaming".
Solves: nothing on its own — but the client's intelligence is what makes the rest of the system feasible. By picking its own quality per chunk, the client lets us serve one set of pre-encoded variants and let the player negotiate. Without a smart client we'd have to do server-side bitrate switching, which can't scale.
**② Load Balancer.** The traffic cop in front of the application tier (upload service, search, comments). It distributes incoming HTTPS, terminates TLS, health-checks every backend every few seconds, and yanks unhealthy pods out of rotation. Note: the LB does NOT sit in front of video playback — viewers go directly to the CDN, not our origin. The LB only sees the small fraction of traffic that needs application logic.
Solves: single-server bottleneck and single-server failure for the application tier. Without an LB, one app crash takes down uploads, search, and comments. With it, we lose 1/N of capacity for a few seconds.
**③ Upload Service.** A stateless service whose only job is to ingest video bytes from creators. It supports resumable uploads via the tus protocol — the client sends the file in 5 MB chunks with a session ID, and if a chunk fails it just retries that chunk. Once all chunks are received, the service writes the full blob to Raw Video Storage ④, creates a metadata row in Video Metadata DB ⑨ with state PROCESSING, and drops a message onto the Processing Queue ⑤.
Solves: the slow-ingest problem. Mobile uploaders on 4G drop bytes constantly. A non-resumable upload would force them to restart from 0% on every flake — an unusable product. Resumable uploads + a dedicated upload tier (separated from the rest of the API) means no single user's 1 GB upload can starve the read tier of CPU.
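A minimal sketch of that resumable loop, assuming a tus-style endpoint (the chunk size and error handling are illustrative, not the full protocol):

```python
import os
import requests

CHUNK = 5 * 1024 * 1024          # 5 MB chunks, as described above
TUS = {"Tus-Resumable": "1.0.0"}

def resumable_upload(path: str, upload_url: str) -> None:
    size = os.path.getsize(path)
    # Ask the server how many bytes it already has; this is what lets a
    # client resume at 80% instead of restarting from 0%.
    offset = int(requests.head(upload_url, headers=TUS)
                 .headers.get("Upload-Offset", 0))
    with open(path, "rb") as f:
        f.seek(offset)
        while offset < size:
            chunk = f.read(CHUNK)
            resp = requests.patch(
                upload_url,
                data=chunk,
                headers={**TUS,
                         "Upload-Offset": str(offset),
                         "Content-Type": "application/offset+octet-stream"},
            )
            resp.raise_for_status()   # a failed chunk is retried, not the file
            offset += len(chunk)
```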
**④ Raw Video Storage.** A durable object store that holds the original uploaded blob exactly as the creator sent it — pre-transcode, full quality. S3-class storage replicates to 3+ availability zones automatically; once a write returns OK, the bytes are safe even if a whole zone burns down. We keep the raw forever (or for some retention period) so we can re-transcode in the future when better codecs land (AV1 today, something else tomorrow).
Solves: the "never lose a video" reliability requirement. If we transcoded directly into final variants and discarded the raw, a future codec improvement would force creators to re-upload — broken trust. Keeping the raw also makes transcode failures recoverable: if a worker crashes mid-encode we just retry from the source.
**⑤ Processing Queue (Kafka).** A durable, partitioned message queue. The Upload Service drops a job per video — {video_id, raw_blob_url, target_resolutions, target_codecs} — and Transcoding Workers consume the queue. Kafka's partitioning gives us parallelism: 100 partitions = up to 100 workers consuming in parallel without coordination. Messages survive worker crashes (acked only after a worker completes the transcode), so a worker dying mid-job means the next one picks the work up.
Solves: decoupling upload speed from transcode speed. A spike of 10× normal uploads (think a Coachella weekend) doesn't break uploads — it just lets the queue grow, and workers drain it as fast as they can. Without a queue, every upload would hang until its transcode finished, and the upload service would melt under back-pressure.
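A sketch of the queue handshake, assuming the kafka-python client and a hypothetical transcode-jobs topic; the key detail is committing the offset only after the encode finishes:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side (Upload Service): keying by video_id keeps all jobs for
# one video on one partition, preserving per-video ordering.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("transcode-jobs", key="v_8f2a1c", value={
    "video_id": "v_8f2a1c",
    "raw_blob_url": "s3://raw/v_8f2a1c.mp4",
    "target_resolutions": ["240p", "480p", "720p", "1080p", "2160p"],
    "target_codecs": ["h264", "vp9", "av1"],
})

# Consumer side (Transcoding Worker): 100 partitions → up to 100 workers.
consumer = KafkaConsumer(
    "transcode-jobs",
    bootstrap_servers="kafka:9092",
    group_id="transcode-workers",
    enable_auto_commit=False,      # ack only after the transcode completes
)
for msg in consumer:
    job = json.loads(msg.value)
    # ... download raw blob, run the encode ...
    consumer.commit()              # commit = "job done"; a crash re-delivers
```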
**⑥ Transcoding Workers.** A horizontally-scaled fleet of CPU-heavy workers. Each one pulls a job from the queue, downloads the raw blob from ④, and runs ffmpeg (or hardware-accelerated equivalents) to produce: 240p, 480p, 720p, 1080p, 4K variants × H.264, VP9, AV1 codecs — roughly 15 output files. It also extracts thumbnail frames at fixed intervals. Outputs are written to Encoded Video Storage ⑦ and Thumbnail Storage ⑧, then the worker updates the video state in Metadata DB ⑨ from PROCESSING to READY.
Solves: the "play smoothly on any device, any bandwidth" requirement. A single 1080p file would be unwatchable on a slow phone (rebuffers constantly) and bandwidth-wasting on a TV that can do 4K. Pre-transcoding into a ladder of resolutions lets the player pick what fits — adaptive bitrate streaming literally cannot exist without this stage.
**⑦ Encoded Video Storage.** An object store holding the output of transcoding — multiple resolution variants per video, each chunked into 10-second fragments for HLS/DASH. This is the bucket the CDN pulls from on cache misses. Storage layout: video_id/720p/h264/segment_0001.ts, segment_0002.ts, .... Replicated across regions so any CDN POP worldwide can pull a missing chunk from a nearby replica without a transatlantic round-trip.
Solves: being the CDN origin without being the bottleneck. Because the CDN absorbs ~95%+ of viewer traffic, this store only sees cache-fill reads — a tiny fraction of the 1 TB/s playback load. Cross-region replication ensures even cache fills are local.
**⑧ Thumbnail Storage (BigTable).** A wide-column store optimized for billions of small files (5-20 KB each). Each video produces dozens of thumbnails: one main thumb + scrubber-bar previews + per-resolution variants. Read rate is much higher than video reads (every search result page loads 20+ thumbnails). BigTable coalesces millions of small files into compressed wide rows, giving us 5 ms p99 reads at any scale.
Solves: the small-file read storm. If thumbnails lived in S3 alongside videos, every page load would issue 20+ S3 GETs — billable, slow, and rate-limited by S3's per-prefix limits. BigTable is purpose-built for this access pattern.
**⑨ Video Metadata DB (Cassandra).** The source of truth for everything-but-bytes about each video: title, description, tags, uploader, duration, view count, like count, dislike count, current state (PROCESSING / READY / FAILED), thumbnail ID, list of encoded-variant URLs. Sharded by video_id via consistent hashing, replicated 3× across availability zones. Cassandra's quorum reads/writes (R=2, W=2, N=3) give us strong-enough consistency without sacrificing availability.
Solves: serving the small-row, high-rate metadata workload that doesn't fit on a single MySQL box. View counts are write-heavy on the hottest videos; Cassandra's tunable consistency lets us accept "eventually consistent view counts" (a viewer in Tokyo seeing 1,000,002 while NYC sees 1,000,005 is fine) which lets writes scale linearly.
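What a quorum read against that table could look like with the DataStax Python driver (keyspace, table, and column names here are illustrative):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["cassandra-node-1"]).connect("videos")  # keyspace assumed

# QUORUM (R=2 of N=3) pairs with QUORUM writes to give read-your-writes
# without requiring every replica to be up.
stmt = SimpleStatement(
    "SELECT title, state, view_count FROM video_metadata WHERE video_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(stmt, ["v_8f2a1c"]).one()
```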
**⑩ Search Index (Elasticsearch).** An inverted-index search engine. Every time the metadata DB writes a new video (state transitions to READY), a CDC stream feeds the title, description, and tags into Elasticsearch. Search queries hit ES, which returns ranked video_ids in milliseconds; the API then enriches those IDs with the latest metadata from Cassandra.
Solves: "find videos about X" at scale. Cassandra cannot answer keyword queries efficiently — it's a key-value store, not a search engine. Without ES, every search would be a full-table scan — impossible at billions of rows. ES handles fuzzy matching, ranking, faceting, and language analysis natively.
**⑪ CDN.** A globally-distributed network of edge POPs in 200+ cities. Each POP caches the most-watched video chunks for its region. When Raj in Mumbai requests video_id/720p/segment_42.ts, his player connects to a Mumbai-area POP — which (most likely) already has the chunk and serves it in 20 ms. On a cache miss, the POP fetches from the nearest Encoded Video Storage ⑦ replica, populates its cache, and serves the viewer. Once a chunk is in the POP, every subsequent Mumbai viewer of that video gets it from cache.
Solves: the 1 TB/s egress problem. Without the CDN, our origin would need petabit egress — physically impossible at any cost. With the CDN, origin egress is roughly the cache-fill rate (a few percent of total) — manageable on commodity infrastructure. The CDN is also the latency story: a Mumbai POP is 20 ms away, our US origin is 200 ms.
**⑫ Comments & Likes Service.** A separate microservice with its own data store. Comments are append-mostly with low write rate per video; likes/dislikes are atomic counter increments. Likes update the metadata DB's counter via async aggregation (the same pattern as URL-shortener click counts) — never an inline UPDATE on the video row, which would melt under viral load.
Solves: isolating a noisy social workload from the playback critical path. A viral video getting 100K likes/sec must not slow down its own playback. Putting comments & likes in their own service with their own store + async aggregation keeps the core video_id → variants path fast.
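A sketch of that async aggregation, with metadata_db_increment_likes standing in as a placeholder for the real counter update:

```python
import threading
import time
from collections import Counter

_buffer = Counter()
_lock = threading.Lock()

def record_like(video_id: str) -> None:
    with _lock:
        _buffer[video_id] += 1    # O(1) in memory; no DB write per like

def flush_loop(interval_sec: float = 5.0) -> None:
    while True:
        time.sleep(interval_sec)
        with _lock:
            batch = dict(_buffer)
            _buffer.clear()
        for vid, n in batch.items():
            # One aggregated counter update per video per interval,
            # instead of 100K individual row updates under viral load.
            metadata_db_increment_likes(vid, n)   # placeholder DB call

threading.Thread(target=flush_loop, daemon=True).start()
```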
Two real flows, mapped to the numbered components above:
**Flow 1 — Priya uploads (write path):**
1. Priya's client sends the file in resumable chunks to the Upload Service ③, which assembles them and writes the blob to Raw Video Storage ④ as raw/v_8f2a1c.mp4.
2. The Upload Service creates a metadata row — {video_id: v_8f2a1c, title: "Goa Travel Vlog", state: PROCESSING} — and replies 202 Accepted to Priya. Total elapsed: ~12 minutes (mostly upload bandwidth).
3. A Transcoding Worker ⑥ pulls the job from the Processing Queue ⑤ and runs ffmpeg to produce 240p/480p/720p/1080p/4K × H.264/VP9/AV1, plus 8 thumbnail frames. Outputs go to Encoded Video Storage ⑦ and Thumbnail Storage ⑧.
4. The worker flips the video's state from PROCESSING to READY. A CDC stream propagates the new title/tags into Search Index ⑩. Total elapsed: ~25 minutes. Video is now playable worldwide.

**Flow 2 — Raj plays (read path):**
1. Raj taps the thumbnail; the app fetches metadata for video_id v_8f2a1c in 30 ms.
2. The player requests the HLS manifest (.m3u8) — a tiny text file listing chunk URLs at all available qualities. Manifest is served by CDN ⑪ from the Mumbai POP.
3. The player requests v_8f2a1c/720p/h264/segment_0001.ts from CDN.
"All videos uploaded by user X live on shard hash(X) % N." Easy: every uploader's catalog is co-located, perfect for "show me my videos" queries.
The fatal flaw — hot creators. A megastar uploader with 100M subscribers has every one of their videos on the same shard. When their new video drops, that single shard takes all the read traffic. The shard melts; every other user on it goes down too. This is the textbook celebrity hot-shard problem.
"Each video lives on shard consistent_hash(video_id)." Each new video lands on a uniformly-random shard, regardless of who uploaded it. Even MrBeast's videos are spread across all shards.
The trade-off: "show me all videos by user X" can no longer be answered by hitting one shard — it must scatter-gather across all shards (or be answered by a separate index/projection). For a search-heavy product like YouTube, that's a fine trade — search already hits Elasticsearch, not the metadata DB directly.
If video_id is the shard key, it had better be globally unique and well-distributed. We generate it via a separate ID Generation Service (similar pattern to the URL shortener's KGS) that produces 64-bit IDs by combining: timestamp (41 bits) + machine_id (10 bits) + sequence (12 bits) — Snowflake-style. Embedding the timestamp lets us also use the ID for time-range queries ("videos uploaded last hour") without an index scan.
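A minimal single-node sketch of that bit layout (the epoch is an arbitrary assumption; production code must also wait out sequence overflow within a millisecond):

```python
import time

EPOCH_MS = 1_577_836_800_000   # 2020-01-01; any fixed custom epoch works

class SnowflakeGen:
    def __init__(self, machine_id: int):
        self.machine_id = machine_id & 0x3FF   # 10 bits
        self.last_ms = -1
        self.seq = 0

    def next_id(self) -> int:
        ms = int(time.time() * 1000) - EPOCH_MS
        if ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit per-ms sequence
            # (real impl: if seq wrapped to 0, spin until the next ms)
        else:
            self.last_ms, self.seq = ms, 0
        # 41-bit timestamp | 10-bit machine | 12-bit sequence
        return (ms << 22) | (self.machine_id << 12) | self.seq
```

Because the timestamp occupies the high bits, IDs sort by creation time, which is what enables the "videos uploaded last hour" range query.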
Many uploaders submit the same video — movie clips ripped from the same source, viral memes shared by ten thousand reposters, news footage everyone re-uploads. Without dedup, we'd transcode and store the same bytes thousands of times: wasted storage, wasted CPU, wasted CDN bandwidth. Inline dedup at upload time is the single biggest cost saver in the system.
**Level 1: exact content hash.** As bytes arrive at the Upload Service, we compute a SHA-256 of the file contents. Before transcoding, we look up the hash in a dedup index: if a video with this exact hash already exists, we skip transcoding entirely and just point the new metadata row at the existing variants. Zero storage, zero CPU.
Catches: bit-for-bit identical re-uploads (the same WhatsApp meme passed around 10K times). Misses: anything re-encoded — a different bitrate, a watermark, even a 1-byte metadata change.
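Level 1 is only a few lines. A sketch, assuming some hash-to-video_id index (here just a dict):

```python
import hashlib

def content_hash(path: str) -> str:
    """Stream the file through SHA-256 without loading it into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # 1 MB blocks
            h.update(block)
    return h.hexdigest()

def exact_dedup(path: str, dedup_index: dict[str, str]) -> str | None:
    """Return the existing video_id on a bit-for-bit match, else None."""
    return dedup_index.get(content_hash(path))
```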
**Level 2: perceptual hash.** For each video we compute a perceptual hash — a fingerprint based on visual features (luminance histograms, motion vectors) that's stable under re-encoding, watermarking, and small crops. Sample frames every few seconds and fingerprint each one. Compare against known fingerprints with a similarity threshold (e.g., 95%). If matched, we link to the existing variants and only store the new metadata.
Catches: the same movie clip uploaded at different bitrates, with different watermarks, with different intro/outro crops. Cost: CPU per upload to fingerprint, plus an approximate-search index (FAISS, ScaNN). Worth every cycle: at our scale this saves petabytes.
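For intuition, here is a toy member of the perceptual-hash family: a difference hash (dHash) over a sampled frame, using Pillow. A real pipeline uses richer features and an ANN index (FAISS/ScaNN) for the similarity lookup:

```python
from PIL import Image

def dhash(frame_path: str, size: int = 8) -> int:
    """64-bit difference hash: stable under re-encoding and mild edits."""
    img = Image.open(frame_path).convert("L").resize((size + 1, size))
    px = list(img.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = px[row * (size + 1) + col]
            right = px[row * (size + 1) + col + 1]
            bits = (bits << 1) | (left > right)   # encode gradient direction
    return bits

def similarity(h1: int, h2: int) -> float:
    """1.0 = identical frames; compare against a ~0.95 match threshold."""
    return 1 - bin(h1 ^ h2).count("1") / 64
```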
This is the same conceptual approach Dropbox uses for file dedup, only adapted for video — a known viral clip uploaded by 10,000 different users gets transcoded once, stored once, served from one set of CDN-cached variants for all 10,000 metadata pointers.
Load balancers sit at three layers in our system, and each plays a different role.
**Application tier.** Public-facing LB (AWS ALB / nginx). Distributes incoming HTTPS for upload, search, comments, and metadata APIs across the application pods. Health-checks every 5s, evicts unhealthy pods. Terminates TLS so backends don't pay the crypto cost.

**CDN tier.** DNS-level steering routes the viewer's player to the nearest CDN POP based on geo-IP. Inside a POP, requests are balanced across cache nodes using consistent hashing on the chunk URL — so the same chunk always hits the same cache node, maximizing hit rate.

**Data tier.** Cassandra and Memcached drivers do this themselves — clients know the cluster topology and route to the right shard's coordinator. For read-heavy paths the driver also balances across replicas to avoid hammering the primary.
The CDN's job is to keep the most-watched chunks in cache. If incoming chunk requests are randomly assigned to cache nodes inside a POP (round-robin), the same chunk ends up cached on every node — wasting cache capacity. With consistent hashing on the chunk URL, every cache node holds a disjoint slice of the keyspace, multiplying effective cache size by the node count.
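A sketch of that routing decision: a consistent-hash ring with virtual nodes (node names and vnode count are arbitrary):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each cache node gets `vnodes` points on the ring to smooth
        # out the key distribution.
        self.ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, chunk_url: str) -> str:
        # First ring point clockwise of the key's hash owns the key.
        i = bisect.bisect(self.keys, self._h(chunk_url)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
ring.node_for("v_8f2a1c/720p/h264/segment_0042.ts")  # always the same node
```

Adding or removing a cache node moves only the keys adjacent to its ring points, so a node failure doesn't invalidate the whole POP's cache.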
Smart LBs at the CDN level also use predictive analytics: if the Mumbai POP is approaching saturation (say a viral cricket match clip), DNS steering temporarily redirects new viewers to the Singapore POP — slightly higher latency, but no rebuffering. The viewer sees a smooth experience even as the network reshapes itself.
Caching at YouTube has two completely different jobs: caching small hot metadata (titles, view counts, search results) and caching huge video chunks at the edge. Both follow the 80/20 rule — a small fraction of content drives most reads — but they live on different infrastructure.
**Tier 1: edge POP.** Each CDN POP holds a few TB of the most-watched chunks for that region. A trending music video in Brazil sits in São Paulo POPs but not Tokyo POPs. Latency: ~20 ms. Eviction: LRU.

**Tier 2: regional shield.** Per-region "shield" cache that backs many edge POPs. If a chunk misses at an edge POP, the POP first asks the regional shield before going to origin. Catches the long tail that any one POP doesn't have room for. Latency: ~40 ms.

**Tier 3: origin.** Encoded Video Storage ⑦ itself — multi-region S3. Hit rarely (only on cold-tail content or new videos before they're cached anywhere). Latency: ~100 ms. Sized for total catalog, not for traffic.
The metadata DB is fronted by a Memcached/Redis cluster. The hottest 20% of video_id → metadata rows fit in tens of GB of RAM — easily holding millions of trending videos. Read flow: app server GET video_id → cache hit returns in microseconds; cache miss falls through to Cassandra and back-fills.
Eviction is LRU — newly trending videos push old ones out naturally. Search result snippets are also cached: "goa travel" → [video_id_1, video_id_2, ...] with a 60-second TTL, so a viral search query (the next World Cup final highlights) hits memory rather than Elasticsearch.
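The read path in code: a cache-aside sketch with redis-py, where fetch_from_cassandra stands in for the quorum read shown earlier:

```python
import json
import redis

cache = redis.Redis(host="meta-cache", port=6379)

def get_video_meta(video_id: str) -> dict:
    key = f"meta:{video_id}"
    hit = cache.get(key)
    if hit is not None:                     # hot path: microseconds
        return json.loads(hit)
    row = fetch_from_cassandra(video_id)    # placeholder: ~ms quorum read
    # Back-fill with a TTL; LRU eviction handles newly trending videos
    # pushing stale ones out.
    cache.set(key, json.dumps(row), ex=300)
    return row
```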
The CDN is the single most important piece of infrastructure in any video service. Without a CDN, this whole design collapses — there's literally no way to serve 1 TB/s of egress from a single origin. With a CDN, viewer traffic terminates at edge POPs near the viewer, and our origin only sees a tiny cache-fill trickle.
The CDN doesn't just serve files — it serves the adaptive streaming protocol. Each video is sliced into 10-second chunks, encoded at multiple bitrates. The player downloads a small manifest (HLS .m3u8 or DASH .mpd) listing all the chunk URLs at all qualities, then measures its bandwidth on every chunk download, picks the highest bitrate that won't drain its ~30-second buffer, and requests the next chunk at that quality.
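For concreteness, an illustrative, hand-written HLS master playlist of the kind the player fetches first (the bandwidth values are assumptions, not measured encodes):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240
240p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/h264/playlist.m3u8
```

Each rung's playlist.m3u8 in turn lists the 10-second segment_NNNN.ts URLs for that quality.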
At our scale, failures aren't exceptions — they're a constant background. A disk dies somewhere every few minutes; a whole AZ goes dark every few months. The system must absorb these without users noticing.
**Metadata node failure.** Cassandra's ring uses consistent hashing — when a node dies, only the keys that mapped to it need to be recovered (from replicas), and only the next node on the ring picks up the new write traffic. No global rebalancing. When we add a node to scale up, only 1/N of keys move. Compared to hash % N, this is the difference between a few hours of background work and a multi-day cluster-wide migration.

**Storage-zone failure.** Both Raw Video Storage ④ and Encoded Video Storage ⑦ replicate every blob to 3 availability zones (S3 does this automatically). A whole zone burning down doesn't lose a single video. For cross-region disasters, we replicate hot/critical content to 2+ regions; cold archive content lives in one region with cross-region backup.

**Origin outage.** If the origin Encoded Storage in US-East goes offline, viewers don't notice — the CDN keeps serving cached chunks for hours from edge POPs. Cache TTLs of several hours mean even a multi-hour origin outage is invisible to viewers of currently-trending content. Only cold-tail viewers (first-ever play of an obscure video) see errors during the outage.

**Worker crash.** Each transcode job in Kafka has a unique video_id + variant_spec key. If a worker crashes mid-encode, the job is re-delivered to another worker — which simply overwrites any partial output with a clean encode. No coordination needed. Workers themselves are stateless and disposable; we run on spot instances and tolerate kills.
To recap adaptive bitrate end to end: (1) each video is pre-encoded into a ladder of bitrates, chunked into aligned 10-second segments; (2) the player downloads a tiny manifest (HLS .m3u8 / DASH .mpd) listing all chunk URLs at all qualities; (3) the player measures its bandwidth on each chunk, picks the highest bitrate that won't drain its buffer, and requests the next chunk at that bitrate. Quality changes are seamless because chunks align across bitrates — same time offset, just different file. The viewer sees video that "magically" matches their network.
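A toy version of step (3), the per-chunk quality decision; the ladder bitrates are assumptions matching the sample manifest above:

```python
# Toy per-chunk quality picker: highest rung that fits within a safety
# fraction of the measured throughput.
LADDER_KBPS = {"240p": 400, "480p": 1000, "720p": 2800, "1080p": 5000}

def pick_quality(measured_kbps: float, safety: float = 0.8) -> str:
    budget = measured_kbps * safety      # headroom so the buffer never drains
    best = "240p"                        # always have a floor rung
    for name, kbps in LADDER_KBPS.items():
        if kbps <= budget and kbps > LADDER_KBPS[best]:
            best = name
    return best

# Raj's wobbly 6 Mbps connection: 6000 × 0.8 = 4800 kbps budget → "720p",
# exactly the variant the Mumbai POP was holding for him.
print(pick_quality(6000))
```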