From "store one MP4 on a disk" to a globally-distributed, transcode-driven, CDN-served streaming platform — the architecture that turns 25GB of uploads per second into 1080p that starts in 200ms anywhere on Earth
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine Priya, a creator in Bangalore, finishes editing a 1GB 4K travel vlog at 14:00 IST. She hits "Upload". Twelve hours later, Raj — a viewer in Mumbai sitting on a wobbly 6 Mbps mobile connection — taps her thumbnail. He expects the video to start playing in under a second, at the highest quality his bandwidth can sustain, with zero rebuffering. Those are wildly different bytes traveling wildly different paths: Priya's pristine 4K source has to be sliced, transcoded into half a dozen resolutions, replicated worldwide, and cached at the edge — so that by the time Raj presses play, there's a 720p version sitting on a server 30 km from his phone, ready to stream chunk by chunk.
That entire journey — from one creator's "Upload" click to a billion viewers tapping "Play" — is what we're designing. YouTube and Netflix differ in catalog (user-generated vs. licensed) but the streaming spine is identical: ingest → transcode → store → serve from edge.
Before drawing a single box, lock down what the system must do. In an interview, asking these questions out loud shows you're not memory-dumping a solution — you're scoping the problem.
Numbers are not optional in HLD. They drive every architectural choice — sharding, caching, CDN sizing — so do them out loud, even if rough. Video streaming is wildly read-heavy: assume a 200:1 view-to-upload ratio.
Assume 1.5B total users, 800M daily active users, each watching 5 videos/day. That's 4 billion video views/day — roughly 46K views/sec. With a 1:200 upload-to-view ratio, we get ~230 uploads/sec.
| Stat | Value | Basis |
|---|---|---|
| Daily active users | 800M | out of 1.5B total users |
| Views/sec | ~46K | 800M × 5 / 86,400 |
| Uploads/sec | ~230 | 1:200 upload-to-view ratio |
| Uploaded video | ~500 hrs/min | real-world YouTube number |
If 500 hours of video are uploaded per minute, and we assume an average bitrate of ~50 MB per minute of footage (a mid-range encode for 1080p), raw ingest is:

500 hrs × 60 min/hr × 50 MB/min = 1,500,000 MB/min ≈ 25 GB/sec of raw upload bytes.
But we don't store just the original — we transcode each upload into 5 resolution variants (240p, 480p, 720p, 1080p, 4K) and 2-3 codecs (H.264, VP9, AV1). After dedup and smarter encoding, total stored bytes per minute of source video come to roughly 2-3× the raw size. Over a year that's hundreds of petabytes — a number only large object stores can hold.
Inbound (uploads): 230/sec × ~25 MB average ≈ 6 GB/sec ingress (the 500-hrs/min figure above puts the ceiling closer to 25 GB/sec). Either way, ingress is manageable.
Outbound (playback): with a 200:1 view-to-upload ratio, egress is 200× ingress: ~1 TB/sec sustained outgoing. That number is the single most important constraint in the design — and it explains why CDN is non-negotiable, not optional.
| Metric | Value | Why it matters |
|---|---|---|
| Views/sec | 46 K/s | Drives CDN edge fleet sizing |
| Uploads/sec | 230/s | Drives upload service & transcode farm sizing |
| Raw ingest | 25 GB/s | Forces distributed object store, not a SAN |
| Egress | ~1 TB/s | Physically requires a CDN — no origin can do this |
| Hot cache | tens of TB | Justifies multi-tier cache: edge → regional → origin |
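These estimates are easy to sanity-check in code. A quick Python script using the same assumed inputs (nothing here is measured):

```python
# Sanity-check the capacity estimates above. All constants are the
# assumed inputs from this section, not measured values.
DAU = 800_000_000                  # daily active users
VIEWS_PER_USER_PER_DAY = 5
VIEW_TO_UPLOAD_RATIO = 200
UPLOAD_HOURS_PER_MIN = 500         # real-world YouTube figure
MB_PER_VIDEO_MINUTE = 50           # mid-range 1080p encode

views_per_sec = DAU * VIEWS_PER_USER_PER_DAY / 86_400
uploads_per_sec = views_per_sec / VIEW_TO_UPLOAD_RATIO
raw_ingest_gb_per_sec = (UPLOAD_HOURS_PER_MIN * 60 * MB_PER_VIDEO_MINUTE
                         / 60 / 1000)

print(f"{views_per_sec:,.0f} views/sec")                 # ~46,296
print(f"{uploads_per_sec:,.0f} uploads/sec")             # ~231
print(f"{raw_ingest_gb_per_sec:.0f} GB/sec raw ingest")  # ~25
```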
Three endpoints carry the system: upload a video, search the catalog, and stream a video. The third is the unusual one — it returns a stream of bytes, not JSON.
REST API surface:

```
// Upload — slow, large payload, resumable
POST /api/v1/videos
Headers: { "X-API-Key": "abc123..." }
Body (multipart): {
  "title": "Goa Travel Vlog 2026",
  "description": "Sunsets, beaches, and street food",
  "tags": ["travel", "goa", "vlog", "2026"],
  "category": "Travel",
  "language": "en",
  "video_blob": <binary 1 GB MP4>
}
→ 202 Accepted { "video_id": "v_8f2a1c", "state": "PROCESSING" }

// Search — keyword query against the catalog
GET /api/v1/search?q=goa+sunset&count=20&page_token=...
→ 200 OK {
  "results": [
    {
      "video_id": "v_8f2a1c",
      "title": "Goa Travel Vlog 2026",
      "thumbnail_url": "https://cdn.../thumb_8f2a1c.jpg",
      "duration_sec": 612,
      "views": 1250
    },
    ...
  ],
  "next_page_token": "eyJ..."
}

// Stream — returns BYTES not JSON; HLS/DASH manifest + chunked video
GET /api/v1/videos/:video_id/stream?codec=h264&resolution=720p&offset=0
→ 200 OK
Content-Type: application/vnd.apple.mpegurl  // HLS manifest
// (or video/mp4 for a single chunk)
// CDN serves this; origin never sees it in steady state
```
**streamVideo returns a STREAM, not JSON.** A 1080p video is hundreds of MB. We don't ship it in one response — we ship a tiny manifest (HLS .m3u8 or DASH .mpd) that lists 10-second chunk URLs. The player downloads chunks on demand, picks resolution per chunk based on current bandwidth (adaptive bitrate), and buffers ~30 seconds ahead. This is what makes "real-time playback" possible — the first chunk is served in 200ms while the rest is fetched in the background.

**X-API-Key.** Upload is expensive (storage + transcode + CDN), so tie every upload to an authenticated user and rate-limit per user (e.g., 10 uploads/hour for the free tier). Search is rate-limited per IP to deter scrapers building competitor catalogs.

Three completely different data shapes live in this system, and trying to put them in one database is the first instinct to override:
**Video metadata.** Small rows (a few KB each), billions of them, frequent reads, occasional updates (view count, likes). Goes in MySQL or Cassandra — Cassandra wins at our scale because of native sharding + multi-region replication.

**Video blobs.** Huge files (50 MB to several GB), written once, read many times, served by CDN. Goes in a distributed object store — HDFS, GlusterFS, or S3. Optimized for sequential reads and durability, not for low-latency lookups.

**Thumbnails.** Tiny files (5-20 KB), many per video (one per playback resolution + scrubber preview frames), absurdly hot (every search result page loads dozens). Goes in a BigTable-style store tuned for small files + high read rate.
One YouTube page can render 20+ thumbnails in the search grid. At 46K views/sec, the thumbnail read rate is several hundred K/sec — far higher than video reads. A general-purpose object store (S3) is tuned for big files; small files there pay metadata overhead per request. BigTable / HBase coalesces millions of small files into wide-column rows with internal compression — orders of magnitude better small-file read throughput.
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.
Sketch the simplest possible system. One app server. It accepts an upload, stores the MP4 on its local disk, and serves the file back when someone hits the play URL.
Three concrete failures emerge the moment a few thousand viewers show up:
**Bandwidth.** A single server has roughly 10 Gbps of network egress — about 1.25 GB/s. We need 1 TB/s. To serve our actual playback load from one box you'd need 800 of these in parallel and a way to synchronize them — which is just "build a CDN", but worse.

**Latency.** If the only origin is in Virginia, every Mumbai viewer pays a 200 ms round-trip just for DNS and TCP setup before the first video byte arrives. Add the TLS handshake and the player waits ~600 ms before showing frame 1. Users abandon at 2 seconds — we'd be losing them before the video even starts.

**Durability.** A single disk failure deletes every video on it. There are no backups, no replicas. A creator who uploaded their wedding 5 years ago opens a 404. The product's core promise — "your videos live forever" — dies in one hardware fault.
The most important insight in this design is that video has three completely different workloads happening simultaneously, and they must scale independently:
**Ingest plane.** Slow, durable writes: 230 uploads/sec, each potentially gigabytes. Latency budget: minutes (users tolerate "uploading… 47%" for a long file). Must be resumable (if their wifi blips at 80% we don't restart). Optimized for write throughput and durability.

**Processing plane.** Heavy async work. Transcode each upload into 5+ resolutions × 2-3 codecs = ~15 output variants. Generate thumbnails. Run dedup checks. CPU-bound (transcoding is expensive). Scales horizontally with a worker pool draining a queue.

**Serving plane.** Fast, massive egress: 1 TB/s sustained, sub-200ms first byte, geographically everywhere. Must be served from edge POPs, not origin. This plane is dominated by CDN infrastructure; origin servers are only ever cache-fill sources.
The KEY insights this split surfaces:
Encoding a 10-minute 4K video into 5 lower resolutions takes several minutes of CPU even on a fast machine. We can't do it inline during upload — the request would time out. So uploads are accepted first, processed later, with state transitions visible to the user.
Bandwidth physics: no single origin can do petabit egress. We must push video bytes out to hundreds of POPs (Points of Presence) close to viewers. The CDN isn't a performance optimization — it's the only physically possible serving model.
Metadata is small, hot, queryable — it goes in Cassandra. Video bytes are huge, sequential, served via CDN — they go in object storage. Trying to share infrastructure between the two slows both down. Splitting them is mandatory, not optional.
Now the full picture. Every node in the architecture is numbered ①–⑫, and each numbered card below answers three questions: what is this, why is it here, and what would break without it.
**① Client.** Anything that uploads or plays a video — a desktop browser, a mobile app, a smart TV, a game console. The client is more sophisticated than in most systems: it runs an HLS/DASH player that watches its own bandwidth, fetches the next 10-second chunk at the right quality, and re-requests at lower quality if the buffer drains. The client is the first half of "adaptive bitrate streaming".
Solves: nothing on its own — but the client's intelligence is what makes the rest of the system feasible. By picking its own quality per chunk, the client lets us serve one set of pre-encoded variants and let the player negotiate. Without a smart client we'd have to do server-side bitrate switching, which can't scale.
**② Load Balancer.** The traffic cop in front of the application tier (upload service, search, comments). It distributes incoming HTTPS, terminates TLS, health-checks every backend every few seconds, and yanks unhealthy pods out of rotation. Note: the LB does NOT sit in front of video playback — viewers go directly to the CDN, not our origin. The LB only sees the small fraction of traffic that needs application logic.
Solves: single-server bottleneck and single-server failure for the application tier. Without an LB, one app crash takes down uploads, search, and comments. With it, we lose 1/N of capacity for a few seconds.
**③ Upload Service.** A stateless service whose only job is to ingest video bytes from creators. It supports resumable uploads via the tus protocol — the client sends the file in 5 MB chunks with a session ID, and if a chunk fails it just retries that chunk. Once all chunks are received, the service writes the full blob to Raw Video Storage ④, creates a metadata row in Video Metadata DB ⑨ with state PROCESSING, and drops a message onto the Processing Queue ⑤.
Solves: the slow-ingest problem. Mobile uploaders on 4G drop bytes constantly. A non-resumable upload would force them to restart from 0% on every flake — an unusable product. Resumable uploads + a dedicated upload tier (separated from the rest of the API) means no single user's 1 GB upload can starve the read tier of CPU.
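A minimal sketch of that resumable loop, assuming a tus-style endpoint (the chunk size and error handling are illustrative, not the full protocol):

```python
import os
import requests

CHUNK = 5 * 1024 * 1024          # 5 MB chunks, as described above
TUS = {"Tus-Resumable": "1.0.0"}

def resumable_upload(path: str, upload_url: str) -> None:
    size = os.path.getsize(path)
    # Ask the server how many bytes it already has; this is what lets a
    # client resume at 80% instead of restarting from 0%.
    offset = int(requests.head(upload_url, headers=TUS)
                 .headers.get("Upload-Offset", 0))
    with open(path, "rb") as f:
        f.seek(offset)
        while offset < size:
            chunk = f.read(CHUNK)
            resp = requests.patch(
                upload_url,
                data=chunk,
                headers={**TUS,
                         "Upload-Offset": str(offset),
                         "Content-Type": "application/offset+octet-stream"},
            )
            resp.raise_for_status()   # a failed chunk is retried, not the file
            offset += len(chunk)
```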
**④ Raw Video Storage.** A durable object store that holds the original uploaded blob exactly as the creator sent it — pre-transcode, full quality. S3-class storage replicates to 3+ availability zones automatically; once a write returns OK, the bytes are safe even if a whole zone burns down. We keep the raw forever (or for some retention period) so we can re-transcode in the future when better codecs land (AV1 today, something else tomorrow).
Solves: the "never lose a video" reliability requirement. If we transcoded directly into final variants and discarded the raw, a future codec improvement would force creators to re-upload — broken trust. Keeping the raw also makes transcode failures recoverable: if a worker crashes mid-encode we just retry from the source.
**⑤ Processing Queue (Kafka).** A durable, partitioned message queue. The Upload Service drops a job per video — {video_id, raw_blob_url, target_resolutions, target_codecs} — and Transcoding Workers consume the queue. Kafka's partitioning gives us parallelism: 100 partitions = up to 100 workers consuming in parallel without coordination. Messages survive worker crashes (acked only after a worker completes the transcode), so a worker dying mid-job means the next one picks the work up.
Solves: decoupling upload speed from transcode speed. A spike of 10× normal uploads (think a Coachella weekend) doesn't break uploads — it just lets the queue grow, and workers drain it as fast as they can. Without a queue, every upload would hang until its transcode finished, and the upload service would melt under back-pressure.
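A sketch of the queue handshake, assuming the kafka-python client and a hypothetical transcode-jobs topic; the key detail is committing the offset only after the encode finishes:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side (Upload Service): keying by video_id keeps all jobs for
# one video on one partition, preserving per-video ordering.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("transcode-jobs", key="v_8f2a1c", value={
    "video_id": "v_8f2a1c",
    "raw_blob_url": "s3://raw/v_8f2a1c.mp4",
    "target_resolutions": ["240p", "480p", "720p", "1080p", "2160p"],
    "target_codecs": ["h264", "vp9", "av1"],
})

# Consumer side (Transcoding Worker): 100 partitions → up to 100 workers.
consumer = KafkaConsumer(
    "transcode-jobs",
    bootstrap_servers="kafka:9092",
    group_id="transcode-workers",
    enable_auto_commit=False,      # ack only after the transcode completes
)
for msg in consumer:
    job = json.loads(msg.value)
    # ... download raw blob, run the encode ...
    consumer.commit()              # commit = "job done"; a crash re-delivers
```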
**⑥ Transcoding Workers.** A horizontally-scaled fleet of CPU-heavy workers. Each one pulls a job from the queue, downloads the raw blob from ④, and runs ffmpeg (or hardware-accelerated equivalents) to produce: 240p, 480p, 720p, 1080p, 4K variants × H.264, VP9, AV1 codecs — roughly 15 output files. It also extracts thumbnail frames at fixed intervals. Outputs are written to Encoded Video Storage ⑦ and Thumbnail Storage ⑧, then the worker updates the video state in Metadata DB ⑨ from PROCESSING to READY.
Solves: the "play smoothly on any device, any bandwidth" requirement. A single 1080p file would be unwatchable on a slow phone (rebuffers constantly) and bandwidth-wasting on a TV that can do 4K. Pre-transcoding into a ladder of resolutions lets the player pick what fits — adaptive bitrate streaming literally cannot exist without this stage.
**⑦ Encoded Video Storage.** An object store holding the output of transcoding — multiple resolution variants per video, each chunked into 10-second fragments for HLS/DASH. This is the bucket the CDN pulls from on cache misses. Storage layout: video_id/720p/h264/segment_0001.ts, segment_0002.ts, .... Replicated across regions so any CDN POP worldwide can pull a missing chunk from a nearby replica without a transatlantic round-trip.
Solves: being the CDN origin without being the bottleneck. Because the CDN absorbs ~95%+ of viewer traffic, this store only sees cache-fill reads — a tiny fraction of the 1 TB/s playback load. Cross-region replication ensures even cache fills are local.
**⑧ Thumbnail Storage (BigTable).** A wide-column store optimized for billions of small files (5-20 KB each). Each video produces dozens of thumbnails: one main thumb + scrubber-bar previews + per-resolution variants. Read rate is much higher than video reads (every search result page loads 20+ thumbnails). BigTable coalesces millions of small files into compressed wide rows, giving us 5 ms p99 reads at any scale.
Solves: the small-file read storm. If thumbnails lived in S3 alongside videos, every page load would issue 20+ S3 GETs — billable, slow, and rate-limited by S3's per-prefix limits. BigTable is purpose-built for this access pattern.
**⑨ Video Metadata DB (Cassandra).** The source of truth for everything-but-bytes about each video: title, description, tags, uploader, duration, view count, like count, dislike count, current state (PROCESSING / READY / FAILED), thumbnail ID, list of encoded-variant URLs. Sharded by video_id via consistent hashing, replicated 3× across availability zones. Cassandra's quorum reads/writes (R=2, W=2, N=3) give us strong-enough consistency without sacrificing availability.
Solves: serving the small-row, high-rate metadata workload that doesn't fit on a single MySQL box. View counts are write-heavy on the hottest videos; Cassandra's tunable consistency lets us accept "eventually consistent view counts" (a viewer in Tokyo seeing 1,000,002 while NYC sees 1,000,005 is fine) which lets writes scale linearly.
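What a quorum read against that table could look like with the DataStax Python driver (keyspace, table, and column names here are illustrative):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["cassandra-node-1"]).connect("videos")  # keyspace assumed

# QUORUM (R=2 of N=3) pairs with QUORUM writes to give read-your-writes
# without requiring every replica to be up.
stmt = SimpleStatement(
    "SELECT title, state, view_count FROM video_metadata WHERE video_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(stmt, ["v_8f2a1c"]).one()
```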
**⑩ Search Index (Elasticsearch).** An inverted-index search engine. Every time the metadata DB writes a new video (state transitions to READY), a CDC stream feeds the title, description, and tags into Elasticsearch. Search queries hit ES, which returns ranked video_ids in milliseconds; the API then enriches those IDs with the latest metadata from Cassandra.
Solves: "find videos about X" at scale. Cassandra cannot answer keyword queries efficiently — it's a key-value store, not a search engine. Without ES, every search would be a full-table scan — impossible at billions of rows. ES handles fuzzy matching, ranking, faceting, and language analysis natively.
**⑪ CDN.** A globally-distributed network of edge POPs in 200+ cities. Each POP caches the most-watched video chunks for its region. When Raj in Mumbai requests video_id/720p/segment_42.ts, his player connects to a Mumbai-area POP — which (most likely) already has the chunk and serves it in 20 ms. On a cache miss, the POP fetches from the nearest Encoded Video Storage ⑦ replica, populates its cache, and serves the viewer. Once a chunk is in the POP, every subsequent Mumbai viewer of that video gets it from cache.
Solves: the 1 TB/s egress problem. Without the CDN, our origin would need petabit egress — physically impossible at any cost. With the CDN, origin egress is roughly the cache-fill rate (a few percent of total) — manageable on commodity infrastructure. The CDN is also the latency story: a Mumbai POP is 20 ms away, our US origin is 200 ms.
**⑫ Comments & Likes Service.** A separate microservice with its own data store. Comments are append-mostly with low write rate per video; likes/dislikes are atomic counter increments. Likes update the metadata DB's counter via async aggregation (the same pattern as URL-shortener click counts) — never an inline UPDATE on the video row, which would melt under viral load.
Solves: isolating a noisy social workload from the playback critical path. A viral video getting 100K likes/sec must not slow down its own playback. Putting comments & likes in their own service with their own store + async aggregation keeps the core video_id → variants path fast.
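A sketch of that async aggregation, with metadata_db_increment_likes standing in as a placeholder for the real counter update:

```python
import threading
import time
from collections import Counter

_buffer = Counter()
_lock = threading.Lock()

def record_like(video_id: str) -> None:
    with _lock:
        _buffer[video_id] += 1    # O(1) in memory; no DB write per like

def flush_loop(interval_sec: float = 5.0) -> None:
    while True:
        time.sleep(interval_sec)
        with _lock:
            batch = dict(_buffer)
            _buffer.clear()
        for vid, n in batch.items():
            # One aggregated counter update per video per interval,
            # instead of 100K individual row updates under viral load.
            metadata_db_increment_likes(vid, n)   # placeholder DB call

threading.Thread(target=flush_loop, daemon=True).start()
```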
Two real flows, mapped to the numbered components above:
**Flow 1 — Priya uploads (write path):**
1. Priya's client sends the file in resumable chunks to the Upload Service ③, which assembles them and writes the blob to Raw Video Storage ④ as raw/v_8f2a1c.mp4.
2. The Upload Service creates a metadata row — {video_id: v_8f2a1c, title: "Goa Travel Vlog", state: PROCESSING} — and replies 202 Accepted to Priya. Total elapsed: ~12 minutes (mostly upload bandwidth).
3. A Transcoding Worker ⑥ pulls the job from the Processing Queue ⑤ and runs ffmpeg to produce 240p/480p/720p/1080p/4K × H.264/VP9/AV1, plus 8 thumbnail frames. Outputs go to Encoded Video Storage ⑦ and Thumbnail Storage ⑧.
4. The worker flips the video's state from PROCESSING to READY. A CDC stream propagates the new title/tags into Search Index ⑩. Total elapsed: ~25 minutes. Video is now playable worldwide.

**Flow 2 — Raj plays (read path):**
1. Raj taps the thumbnail; the app fetches metadata for video_id v_8f2a1c in 30 ms.
2. The player requests the HLS manifest (.m3u8) — a tiny text file listing chunk URLs at all available qualities. Manifest is served by CDN ⑪ from the Mumbai POP.
3. The player requests v_8f2a1c/720p/h264/segment_0001.ts from CDN.
"All videos uploaded by user X live on shard hash(X) % N." Easy: every uploader's catalog is co-located, perfect for "show me my videos" queries.
The fatal flaw — hot creators. A megastar uploader with 100M subscribers has every one of their videos on the same shard. When their new video drops, that single shard takes all the read traffic. The shard melts; every other user on it goes down too. This is the textbook celebrity hot-shard problem.
"Each video lives on shard consistent_hash(video_id)." Each new video lands on a uniformly-random shard, regardless of who uploaded it. Even MrBeast's videos are spread across all shards.
The trade-off: "show me all videos by user X" can no longer be answered by hitting one shard — it must scatter-gather across all shards (or be answered by a separate index/projection). For a search-heavy product like YouTube, that's a fine trade — search already hits Elasticsearch, not the metadata DB directly.
If video_id is the shard key, it had better be globally unique and well-distributed. We generate it via a separate ID Generation Service (similar pattern to the URL shortener's KGS) that produces 64-bit IDs by combining: timestamp (41 bits) + machine_id (10 bits) + sequence (12 bits) — Snowflake-style. Embedding the timestamp lets us also use the ID for time-range queries ("videos uploaded last hour") without an index scan.
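A minimal single-node sketch of that bit layout (the epoch is an arbitrary assumption; production code must also wait out sequence overflow within a millisecond):

```python
import time

EPOCH_MS = 1_577_836_800_000   # 2020-01-01; any fixed custom epoch works

class SnowflakeGen:
    def __init__(self, machine_id: int):
        self.machine_id = machine_id & 0x3FF   # 10 bits
        self.last_ms = -1
        self.seq = 0

    def next_id(self) -> int:
        ms = int(time.time() * 1000) - EPOCH_MS
        if ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit per-ms sequence
            # (real impl: if seq wrapped to 0, spin until the next ms)
        else:
            self.last_ms, self.seq = ms, 0
        # 41-bit timestamp | 10-bit machine | 12-bit sequence
        return (ms << 22) | (self.machine_id << 12) | self.seq
```

Because the timestamp occupies the high bits, IDs sort by creation time, which is what enables the "videos uploaded last hour" range query.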
Many uploaders submit the same video — movie clips ripped from the same source, viral memes shared by ten thousand reposters, news footage everyone re-uploads. Without dedup, we'd transcode and store the same bytes thousands of times: wasted storage, wasted CPU, wasted CDN bandwidth. Inline dedup at upload time is the single biggest cost saver in the system.
**Level 1: exact content hash.** As bytes arrive at the Upload Service, we compute a SHA-256 of the file contents. Before transcoding, we look up the hash in a dedup index: if a video with this exact hash already exists, we skip transcoding entirely and just point the new metadata row at the existing variants. Zero storage, zero CPU.
Catches: bit-for-bit identical re-uploads (the same WhatsApp meme passed around 10K times). Misses: anything re-encoded — a different bitrate, a watermark, even a 1-byte metadata change.
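Level 1 is only a few lines. A sketch, assuming some hash-to-video_id index (here just a dict):

```python
import hashlib

def content_hash(path: str) -> str:
    """Stream the file through SHA-256 without loading it into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # 1 MB blocks
            h.update(block)
    return h.hexdigest()

def exact_dedup(path: str, dedup_index: dict[str, str]) -> str | None:
    """Return the existing video_id on a bit-for-bit match, else None."""
    return dedup_index.get(content_hash(path))
```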
**Level 2: perceptual hash.** For each video we compute a perceptual hash — a fingerprint based on visual features (luminance histograms, motion vectors) that's stable under re-encoding, watermarking, and small crops. Sample frames every few seconds and fingerprint each one. Compare against known fingerprints with a similarity threshold (e.g., 95%). If matched, we link to the existing variants and only store the new metadata.
Catches: the same movie clip uploaded at different bitrates, with different watermarks, with different intro/outro crops. Cost: CPU per upload to fingerprint, plus an approximate-search index (FAISS, ScaNN). Worth every cycle: at our scale this saves petabytes.
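For intuition, here is a toy member of the perceptual-hash family: a difference hash (dHash) over a sampled frame, using Pillow. A real pipeline uses richer features and an ANN index (FAISS/ScaNN) for the similarity lookup:

```python
from PIL import Image

def dhash(frame_path: str, size: int = 8) -> int:
    """64-bit difference hash: stable under re-encoding and mild edits."""
    img = Image.open(frame_path).convert("L").resize((size + 1, size))
    px = list(img.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = px[row * (size + 1) + col]
            right = px[row * (size + 1) + col + 1]
            bits = (bits << 1) | (left > right)   # encode gradient direction
    return bits

def similarity(h1: int, h2: int) -> float:
    """1.0 = identical frames; compare against a ~0.95 match threshold."""
    return 1 - bin(h1 ^ h2).count("1") / 64
```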
This is the same conceptual approach Dropbox uses for file dedup, only adapted for video — a known viral clip uploaded by 10,000 different users gets transcoded once, stored once, served from one set of CDN-cached variants for all 10,000 metadata pointers.
Load balancers sit at three layers in our system, and each plays a different role.
**Application tier.** Public-facing LB (AWS ALB / nginx). Distributes incoming HTTPS for upload, search, comments, and metadata APIs across the application pods. Health-checks every 5s, evicts unhealthy pods. Terminates TLS so backends don't pay the crypto cost.

**CDN tier.** DNS-level steering routes the viewer's player to the nearest CDN POP based on geo-IP. Inside a POP, requests are balanced across cache nodes using consistent hashing on the chunk URL — so the same chunk always hits the same cache node, maximizing hit rate.

**Data tier.** Cassandra and Memcached drivers do this themselves — clients know the cluster topology and route to the right shard's coordinator. For read-heavy paths the driver also balances across replicas to avoid hammering the primary.
The CDN's job is to keep the most-watched chunks in cache. If incoming chunk requests are randomly assigned to cache nodes inside a POP (round-robin), the same chunk ends up cached on every node — wasting cache capacity. With consistent hashing on the chunk URL, every cache node holds a disjoint slice of the keyspace, multiplying effective cache size by the node count.
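A sketch of that routing decision: a consistent-hash ring with virtual nodes (node names and vnode count are arbitrary):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each cache node gets `vnodes` points on the ring to smooth
        # out the key distribution.
        self.ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, chunk_url: str) -> str:
        # First ring point clockwise of the key's hash owns the key.
        i = bisect.bisect(self.keys, self._h(chunk_url)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
ring.node_for("v_8f2a1c/720p/h264/segment_0042.ts")  # always the same node
```

Adding or removing a cache node moves only the keys adjacent to its ring points, so a node failure doesn't invalidate the whole POP's cache.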
Smart LBs at the CDN level also use predictive analytics: if the Mumbai POP is approaching saturation (say a viral cricket match clip), DNS steering temporarily redirects new viewers to the Singapore POP — slightly higher latency, but no rebuffering. The viewer sees a smooth experience even as the network reshapes itself.
Caching at YouTube has two completely different jobs: caching small hot metadata (titles, view counts, search results) and caching huge video chunks at the edge. Both follow the 80/20 rule — a small fraction of content drives most reads — but they live on different infrastructure.
**Tier 1: edge POP.** Each CDN POP holds a few TB of the most-watched chunks for that region. A trending music video in Brazil sits in São Paulo POPs but not Tokyo POPs. Latency: ~20 ms. Eviction: LRU.

**Tier 2: regional shield.** Per-region "shield" cache that backs many edge POPs. If a chunk misses at an edge POP, the POP first asks the regional shield before going to origin. Catches the long tail that any one POP doesn't have room for. Latency: ~40 ms.

**Tier 3: origin.** Encoded Video Storage ⑦ itself — multi-region S3. Hit rarely (only on cold-tail content or new videos before they're cached anywhere). Latency: ~100 ms. Sized for total catalog, not for traffic.
The metadata DB is fronted by a Memcached/Redis cluster. The hottest 20% of video_id → metadata rows fit in tens of GB of RAM — easily holding millions of trending videos. Read flow: app server GET video_id → cache hit returns in microseconds; cache miss falls through to Cassandra and back-fills.
Eviction is LRU — newly trending videos push old ones out naturally. Search result snippets are also cached: "goa travel" → [video_id_1, video_id_2, ...] with a 60-second TTL, so a viral search query (the next World Cup final highlights) hits memory rather than Elasticsearch.
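The read path in code: a cache-aside sketch with redis-py, where fetch_from_cassandra stands in for the quorum read shown earlier:

```python
import json
import redis

cache = redis.Redis(host="meta-cache", port=6379)

def get_video_meta(video_id: str) -> dict:
    key = f"meta:{video_id}"
    hit = cache.get(key)
    if hit is not None:                     # hot path: microseconds
        return json.loads(hit)
    row = fetch_from_cassandra(video_id)    # placeholder: ~ms quorum read
    # Back-fill with a TTL; LRU eviction handles newly trending videos
    # pushing stale ones out.
    cache.set(key, json.dumps(row), ex=300)
    return row
```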
The CDN is the single most important piece of infrastructure in any video service. Without a CDN, this whole design collapses — there's literally no way to serve 1 TB/s of egress from a single origin. With a CDN, viewer traffic terminates at edge POPs near the viewer, and our origin only sees a tiny cache-fill trickle.
The CDN doesn't just serve files — it serves the adaptive streaming protocol. Each video is sliced into 10-second chunks, encoded at multiple bitrates. The player downloads a small manifest (HLS .m3u8 or DASH .mpd) listing all the chunk URLs at all qualities, then measures its bandwidth on every chunk download, picks the highest bitrate that won't drain its ~30-second buffer, and requests the next chunk at that quality.
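For concreteness, an illustrative, hand-written HLS master playlist of the kind the player fetches first (the bandwidth values are assumptions, not measured encodes):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240
240p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/h264/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/h264/playlist.m3u8
```

Each rung's playlist.m3u8 in turn lists the 10-second segment_NNNN.ts URLs for that quality.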
At our scale, failures aren't exceptions — they're a constant background. A disk dies somewhere every few minutes; a whole AZ goes dark every few months. The system must absorb these without users noticing.
**Metadata node failure.** Cassandra's ring uses consistent hashing — when a node dies, only the keys that mapped to it need to be recovered (from replicas), and only the next node on the ring picks up the new write traffic. No global rebalancing. When we add a node to scale up, only 1/N of keys move. Compared to hash % N, this is the difference between a few hours of background work and a multi-day cluster-wide migration.

**Storage-zone failure.** Both Raw Video Storage ④ and Encoded Video Storage ⑦ replicate every blob to 3 availability zones (S3 does this automatically). A whole zone burning down doesn't lose a single video. For cross-region disasters, we replicate hot/critical content to 2+ regions; cold archive content lives in one region with cross-region backup.

**Origin outage.** If the origin Encoded Storage in US-East goes offline, viewers don't notice — the CDN keeps serving cached chunks for hours from edge POPs. Cache TTLs of several hours mean even a multi-hour origin outage is invisible to viewers of currently-trending content. Only cold-tail viewers (first-ever play of an obscure video) see errors during the outage.

**Worker crash.** Each transcode job in Kafka has a unique video_id + variant_spec key. If a worker crashes mid-encode, the job is re-delivered to another worker — which simply overwrites any partial output with a clean encode. No coordination needed. Workers themselves are stateless and disposable; we run on spot instances and tolerate kills.
To recap adaptive bitrate end to end: (1) each video is pre-encoded into a ladder of bitrates, chunked into aligned 10-second segments; (2) the player downloads a tiny manifest (HLS .m3u8 / DASH .mpd) listing all chunk URLs at all qualities; (3) the player measures its bandwidth on each chunk, picks the highest bitrate that won't drain its buffer, and requests the next chunk at that bitrate. Quality changes are seamless because chunks align across bitrates — same time offset, just different file. The viewer sees video that "magically" matches their network.
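A toy version of step (3), the per-chunk quality decision; the ladder bitrates are assumptions matching the sample manifest above:

```python
# Toy per-chunk quality picker: highest rung that fits within a safety
# fraction of the measured throughput.
LADDER_KBPS = {"240p": 400, "480p": 1000, "720p": 2800, "1080p": 5000}

def pick_quality(measured_kbps: float, safety: float = 0.8) -> str:
    budget = measured_kbps * safety      # headroom so the buffer never drains
    best = "240p"                        # always have a floor rung
    for name, kbps in LADDER_KBPS.items():
        if kbps <= budget and kbps > LADDER_KBPS[best]:
            best = name
    return best

# Raj's wobbly 6 Mbps connection: 6000 × 0.8 = 4800 kbps budget → "720p",
# exactly the variant the Mumbai POP was holding for him.
print(pick_quality(6000))
```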