From a single MySQL box that crumbles under 200KB photo blobs to a three-plane production system where uploads, reads, and feed generation each scale on their own clock
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 14:02 in San Francisco. Sarah lifts her phone, snaps a photo of the sunset over the Golden Gate Bridge, and taps "Share". Thirty seconds later, in Tokyo — 5,100 miles away — Raj opens Instagram on his commute home. The very first thing he sees on his feed is Sarah's sunset, sharp and instantly loaded, sitting at the top of a personalised stream of photos curated from the 312 people he follows. He didn't search for it. He didn't refresh and wait. The system knew the photo would matter to him, fetched it from a server geographically close to him, and rendered it in under 200ms. That choreography — between Sarah's upload, the global photo storage, the follow graph, and Raj's pre-computed timeline — is what we're designing.
Instagram (in its original photo-sharing form) lets users upload photos and videos, follow other users, and see a personalised news feed made up of the latest top photos from the people they follow. Two ingredients drive every architectural decision: photos are big and write-once read-many, and feeds are per-user and have to feel instant.
Before drawing a single box, pin down what the system must (and must not) do. Asking these questions in an interview shows you're not just regurgitating a memorised solution — you're listening for the actual constraints.
Numbers drive every architectural choice — sharding, caching, load balancing, replication factor. Do them out loud, even if rough. Instagram is read-heavy by an enormous margin: every uploaded photo is viewed hundreds of times by followers, often thousands of times for popular accounts.
Assume 500M total users, 1M daily active users (DAU). Each DAU posts ~2 photos per day on average → 2M new photos/day.
- Write rate: ~23 photos/sec (2M / 86,400)
- Average photo size: ~200 KB (typical compressed JPEG)
- New photo bytes: ~400 GB/day (2M × 200KB)
- 10-year photo storage: ~1,425 TB (400GB × 365 × 10)
Big numbers, but each table has very different shape — and that drives where each lives.
| Table | Row size | Rows over 10 yrs | Total | Notes |
|---|---|---|---|---|
| Photo metadata | ~284 bytes | 2M × 365 × 10 = 7.3B | ~1.88 TB | Photo blobs in S3; only metadata in DB |
| User | ~68 bytes | 500M | ~32 GB | Tiny; fits comfortably on one shard |
| UserFollow | ~16 bytes | 500M × ~250 follows = 125B | ~1.82 TB | Wide-column friendly (one row per user) |
| Total metadata | — | — | ~3.7 TB | Excludes photo blobs (1,425 TB in S3) |
Uploads are bounded: 23 photos/sec × 200KB ≈ ~4.6 MB/s ingress. Reads are not. If each DAU views ~500 photos/day across feed-scrolling, profile visits, and explore pages, that's 1M × 500 = 500M views/day ≈ ~5,800 reads/sec, or ~1.16 GB/s egress. Reads are roughly 250× writes by request count and bandwidth, which is exactly why CDN + cache + read-replica DB are non-negotiable.
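A quick sanity check of those estimates, as a minimal sketch using only the assumptions stated above:

```python
# Back-of-envelope check of the read/write estimates above.
DAU = 1_000_000
PHOTOS_PER_DAU_PER_DAY = 2
VIEWS_PER_DAU_PER_DAY = 500            # feed scrolling + profiles + explore
AVG_PHOTO_BYTES = 200_000              # ~200 KB compressed JPEG
SECONDS_PER_DAY = 86_400

uploads_per_sec = DAU * PHOTOS_PER_DAU_PER_DAY / SECONDS_PER_DAY   # ~23
reads_per_sec = DAU * VIEWS_PER_DAU_PER_DAY / SECONDS_PER_DAY      # ~5,800
ingress_mb_per_sec = uploads_per_sec * AVG_PHOTO_BYTES / 1e6       # ~4.6 MB/s
egress_gb_per_sec = reads_per_sec * AVG_PHOTO_BYTES / 1e9          # ~1.16 GB/s
# ~1,460 TB decimal; the ~1,425 TB figure above rounds using binary TB.
ten_year_photo_tb = DAU * PHOTOS_PER_DAU_PER_DAY * AVG_PHOTO_BYTES * 365 * 10 / 1e12

print(f"{uploads_per_sec:.0f} uploads/s vs {reads_per_sec:.0f} reads/s "
      f"(read:write ≈ {reads_per_sec / uploads_per_sec:.0f}x), "
      f"{ten_year_photo_tb:,.0f} TB of photos over 10 years")
```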
Five endpoints carry virtually all the traffic: upload a photo, fetch a photo, get the feed, follow a user, and search. Defining the API surface up front locks down the contract before architecture.
REST API surface:

```
// Upload — write path. Multipart (the photo bytes ride in the body).
POST /api/v1/photos
Content-Type: multipart/form-data
{
  "api_dev_key": "abc123...",
  "user_id": 42,
  "title": "Sunset over Golden Gate",
  "photo": <binary blob, ~200KB>
}
→ 201 Created
{ "photo_id": "p_8f2a", "cdn_url": "https://cdn.instagram.com/p_8f2a.jpg" }

// Fetch a single photo — high QPS, served via CDN almost always
GET /api/v1/photos/:photo_id
→ 302 Found
Location: https://cdn.instagram.com/p_8f2a.jpg

// News feed — the personalised stream for one user
GET /api/v1/feed?user_id=99&limit=20&cursor=...
→ 200 OK
{
  "photos": [
    { "photo_id": "p_8f2a", "owner": "sarah", "cdn_url": "...", "created_at": "..." },
    ...
  ],
  "next_cursor": "..."
}

// Follow another user
POST /api/v1/follow/:target_user_id
Headers: { "X-API-Dev-Key": "abc123..." }
→ 204 No Content

// Search photos by title / caption
GET /api/v1/search?q=sunset&limit=20
→ 200 OK
{ "photos": [ ... ] }
```
Paginate the feed with a cursor (last-seen photo_id + timestamp) instead of offset=200. Offsets break when new photos arrive between page fetches; cursors are stable.

Three observations drive the schema design: (1) photo blobs are huge but metadata is tiny, so they live in different stores; (2) the access patterns are key-based and graph-based, not relational-join-heavy; (3) the follow graph is read on every feed fetch, so it must be indexed for "give me everyone user X follows" in O(1).
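To make that cursor concrete, here is a minimal sketch of one possible encoding; the format is an assumption for illustration, not Instagram's actual cursor:

```python
import base64
import json

def encode_cursor(last_photo_id: str, last_created_at: int) -> str:
    """Opaque cursor: the last photo the client has already seen."""
    raw = json.dumps({"id": last_photo_id, "ts": last_created_at})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["id"], data["ts"]

# Server side: "give me the next 20 photos strictly older than the cursor".
# New uploads land in front of the cursor, so this page never shifts.
cursor = encode_cursor("p_8f2a", 1714500000)
photo_id, ts = decode_cursor(cursor)
```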
The 200KB JPEGs themselves go to distributed object storage — S3 or HDFS. Each blob is named by its photo_id. Replication-3, multi-AZ, served via CDN. Never store these in a database.
Photo metadata is stored in a wide-column NoSQL store like Cassandra. Primary key = photo_id. Cassandra's quorum reads, multi-region replication, and tunable consistency are exactly what we need.
The follow graph is wide-column friendly: one row per user, columns are the user_ids they follow. Reading "who does Raj follow?" becomes a single row read. Same store (Cassandra), keyed by the follower's user_id.
NoSQL wide-column shines here. For a "user's photos" table, we use the user_id as row key and photo_ids as columns. Reading "all of Sarah's photos in reverse-chrono order" is a single row read instead of a fanout query. Same trick for the follow graph: row key = user, columns = users they follow.
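A sketch of that layout in CQL, issued via the Python driver. The keyspace, table, and column names are illustrative assumptions, not a production schema:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("instagram")

# One partition per user; clustering by (created_at DESC, photo_id) means
# "Sarah's photos, newest first" is a single-partition range read.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_photos (
        user_id    bigint,
        created_at timestamp,
        photo_id   text,
        title      text,
        PRIMARY KEY ((user_id), created_at, photo_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, photo_id ASC)
""")

# Same trick for the follow graph: one partition per follower.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_follows (
        follower_id bigint,
        followee_id bigint,
        PRIMARY KEY ((follower_id), followee_id)
    )
""")

rows = session.execute(
    "SELECT photo_id, title FROM user_photos WHERE user_id = %s LIMIT 20", [42]
)
```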
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.
The simplest possible system: one app server, one MySQL database, and we'll just be lazy and store the photo bytes as BLOB columns in MySQL alongside the metadata. Uploads write to MySQL. Reads SELECT the BLOB column and stream it back to the client. Feed generation runs a giant join across users, follows, and photos at request time.
Three concrete failures emerge the moment real traffic shows up:
A single 200KB photo upload pins one MySQL connection for the duration of the byte transfer. At 23 uploads/sec each holding a connection for 1-2 seconds, write connections saturate. Read queries queue up behind them — feed latency spikes from 50ms to 5 seconds. The "uploads choke reads" failure is the first one a real workload exposes.
Relational DBs are tuned for small rows. A 200KB BLOB blows the row out of one page, fragments the buffer pool, and shreds throughput. At 1,425 TB over 10 years, we'd need a MySQL cluster the size of a data center just to hold the bytes — at 100× the cost of S3, with worse durability.
To build Raj's feed, the naive query is SELECT photos.* FROM follows JOIN photos ON ... WHERE follows.user = raj ORDER BY created DESC LIMIT 20. With 250 follows × thousands of photos each, this scans millions of rows on every feed open. Doing it under 200ms for 1M DAU? Impossible without pre-computation.
The single most important insight in this design is that three workloads have wildly different shapes and must scale independently. Pass 1 fails because all three live on the same box, fighting for the same CPU, disk, and connection pool.
The upload (write) plane: 23 req/sec, slow, bandwidth-bound. Each request streams a 200KB photo. The bottleneck is network and disk, not CPU. Latency budget is generous — users tolerate a 1-2 second upload spinner. Goal: never drop a byte, never lose a photo.
The read plane: 5,800+ req/sec, fast, latency-sensitive. Photos are immutable and cacheable up to the moon. Served from CDN edge POPs globally so a Tokyo user doesn't pay a trans-Pacific round trip. Goal: under 50ms for a cached photo, under 200ms for a feed.
The feed-generation plane: async, expensive, runs offline. Pre-computes per-user timelines so the feed read is a single cache lookup, not a triple join. Workers fan out new photos to followers' timelines in the background, decoupled from the request path.
This three-way split is THE central insight. Splitting these planes lets each scale independently and prevents uploads from breaking reads — which was the very first failure of Pass 1. Goodbye, blocked connection pool.
Now the full picture. Every node in the diagram is numbered ①–⑫, and each has a matching card below that answers what this thing is, why it's here, and crucially what would break without it.
① Client: The Instagram mobile app on Sarah's iPhone or the web client in Raj's browser. The client is responsible for compressing the photo locally before upload (resizing to a sane max resolution and JPEG-compressing) so the server doesn't waste bandwidth on raw camera bytes. It also caches recently-viewed photos in local memory so scrolling back up is instant.
Solves: nothing on its own — but every design choice flows backward from "what does the client experience?" Latency, cell-network reliability, and the upload protocol are all client-facing concerns.
② Load Balancer: The traffic cop. Sits in front of upload, read, and feed services, distributes incoming HTTPS, terminates TLS, and yanks unhealthy backends out of rotation via 5-second health checks. Round-robin is fine to start; switch to least-connections once you have enough nodes that load skew matters.
Solves: single-server failure and uneven load. Without an LB, one app crash takes down the service. With it, we lose 1/N of capacity for a few seconds until the bad node gets evicted.
③ Upload Service: Stateless service handling POST /api/v1/photos. Per request: validate the multipart payload, generate a unique photo_id, stream the bytes to Image Storage ④ in chunks, then write a metadata row to the Image Metadata DB ⑤. Finally it emits an async event to the Feed Generation Service ⑧ saying "user X just posted photo Y, fan it out to their followers" and another event to Search ⑫ to index the title. Stateless → we just add pods behind the LB to scale. (A sketch of this request path follows the card below.)
Solves: isolating the slow, bandwidth-heavy upload work from everything else. Without a dedicated upload tier, photo bytes streaming through the same connection pool as feed reads would block reads — the original Pass 1 failure.
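A minimal sketch of that request path. The metadata and event-bus clients are passed in as abstract dependencies, and the bucket name, topic name, and helper methods are assumptions for illustration:

```python
import uuid
import boto3

s3 = boto3.client("s3")
PHOTO_BUCKET = "instagram-photos"   # hypothetical bucket name

def handle_upload(user_id: int, title: str, photo_stream,
                  metadata_db, event_bus) -> dict:
    """Upload path: durable blob first, metadata row second, async events last."""
    # A real deployment would use the Snowflake-style IDs from the deep dive below.
    photo_id = f"p_{uuid.uuid4().hex[:8]}"

    # 1. Stream the bytes to object storage: this is the copy that must never be lost.
    s3.upload_fileobj(photo_stream, PHOTO_BUCKET, f"{photo_id}.jpg")

    # 2. Write the metadata row (Cassandra in this design).
    metadata_db.insert_photo(photo_id=photo_id, user_id=user_id, title=title)

    # 3. Fire-and-forget events: feed fan-out ⑧, notifications ⑪, search indexing ⑫.
    event_bus.publish("photo.created", {"photo_id": photo_id, "user_id": user_id})

    return {"photo_id": photo_id,
            "cdn_url": f"https://cdn.instagram.com/{photo_id}.jpg"}
```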
④ Image Storage (S3 / object store): The actual photo bytes — billions of 200KB blobs — live here. Each photo is written to a path keyed by photo_id, replicated 3× across availability zones, never deleted (only soft-deleted via metadata flag). At 1,425 TB over 10 years, this is the biggest bill in the system, but storage at this scale is dirt-cheap compared to running it on relational DB disks.
Solves: the "where do the bytes physically live" problem. A relational DB cannot economically hold 1.4 PB of opaque blobs. Object storage is purpose-built: cheap, durable (11-nines), automatically replicated, and directly addressable from a CDN.
⑤ Image Metadata DB (Cassandra): A wide-column NoSQL store holding the photo table — photo_id, user_id, photo_path, lat, long, creation_date, title. Sharded by photo_id with consistent hashing, replicated 3× across AZs. Cassandra's R=2, W=2, N=3 quorum gives us strong-enough consistency without sacrificing availability. Total size: ~1.88 TB over 10 years — easy to handle at this size.
Solves: indexing the photo blobs so we can answer "what photos does Sarah have?", "when was photo X uploaded?", "where was it taken?" without reading the bytes themselves. Without metadata, the blob store is opaque — we couldn't even find a photo without already knowing its path.
⑥ Read Service: Stateless service handling GET /api/v1/photos/:photo_id. Per request: look up the photo metadata, then return a 302 redirect to the CDN ⑦ URL. For most photos, the client never even talks to this service again — the second hit comes straight from CDN. Scaled aggressively because read traffic is 250× write traffic.
Solves: serving the redirect at scale and acting as the gatekeeper for visibility / privacy checks before exposing the CDN URL. Without it, the upload service would have to also handle reads — combining the slow path with the hot path, which we've already said is fatal.
⑦ CDN: A globally-distributed edge network in 200+ cities. Photos are immutable and cache-friendly to a fault — once a CDN edge in Tokyo has the bytes, every Tokyo user's request is answered in 20ms with zero load on our origin. Hot photos (a celebrity's latest post) live in every edge POP within seconds.
Solves: global latency and origin bandwidth. Without a CDN, every photo view from Tokyo travels to our US-East data center and back — 200ms of speed-of-light penalty just on physics. And our origin would have to push 1.16 GB/s constantly. With a CDN, origin sees only the cold-tail traffic — maybe 5% of total reads.
⑧ Feed Generation Service: The brain of the personalised stream. Async workers consume "new photo" events from the upload service, look up the uploader's followers in the User Graph DB ⑩, and push the new photo onto each follower's pre-computed timeline in the Timeline Cache ⑨. When a user opens the app, fetching their feed becomes a single Redis read — no joins, no scans. Hybrid push/pull (see §9) handles celebrity edge cases.
Solves: the 200ms feed latency requirement. Without pre-computation, every feed open would scan all of a user's follows' photos and rank them — impossible to do under 200ms for 1M DAU. By doing the work at write time once, we make every read O(1).
⑨ Timeline Cache (Redis): An in-memory key-value store keyed by user_id, whose value is the user's pre-computed timeline — a list of the most recent ~500 photo_ids (with metadata) from people they follow. Backed by Redis sorted sets (score = timestamp) so range queries by time are trivially fast.
Solves: the speed of feed reads. Memory is microsecond-fast; disk is millisecond-fast. With 1M DAU × ~10KB timeline each = 10GB — fits trivially in a small Redis cluster. Without this, every feed open hits Cassandra with a multi-key fanout query.
⑩ User Graph DB: Holds the follow graph. Wide-column layout: row key = user_id, columns = users they follow (~125B rows total over 10 years). Reading "who does Raj follow?" is a single row read. Reading "who follows Sarah?" (the inverse) needs a denormalised second table — same layout, keyed by the followee.
Solves: graph queries on the follow relationship at scale. The feed generator can't fan a new photo out to followers without a fast "give me this user's followers" query, and a relational JOIN on a 125B-row table doesn't end well.
⑪ Notification Service: Pushes a phone notification ("Sarah just posted!") to followers who have notifications enabled. Consumes the same "new photo" event as feed generation, looks up notification preferences, and fires off APNs/FCM messages. Async — never on the upload path.
Solves: engagement (telling Raj a photo he'd care about exists). Without it, users would have to open the app to discover new content. Crucially separated from upload so notification backpressure can never slow uploads.
⑫ Search (Elasticsearch): An inverted index of photo titles and captions, kept up to date by ingesting "new photo" events from the upload service. Handles GET /api/v1/search?q=sunset in tens of milliseconds. Photo metadata in Cassandra is the source of truth; Elasticsearch is purely a derived index.
Solves: the search-by-title functional requirement. Cassandra is great at key lookups but terrible at full-text search. Elasticsearch is purpose-built for it. Decoupled and async — losing the search index for an hour doesn't affect uploads or feeds.
Two real flows mapped to the numbered components above:
Flow 1: Sarah uploads a photo.
1. Sarah's phone compresses the sunset shot and POSTs it to /api/v1/photos via the Load Balancer ②.
2. The Upload Service ③ generates photo_id = p_8f2a, streams the bytes to Image Storage ④ (S3), and writes the metadata row to Image Metadata DB ⑤.
3. It emits async events to Feed Generation ⑧, Notifications ⑪, and Search ⑫, then returns { photo_id, cdn_url } to Sarah's phone. Total elapsed: ~1.2 seconds, mostly bandwidth.

Flow 2: Raj opens his feed.
1. Raj's app calls GET /api/v1/feed?user_id=raj.
2. The feed service reads timeline:raj from the Timeline Cache ⑨ → gets back a list of 20 photo_ids including p_8f2a at the top. ~3ms.
3. The response carries photo metadata with cdn_urls embedded. Raj's app fetches each photo from the CDN ⑦ — Tokyo edge POP, ~20ms each, fully parallel.

Instagram's promise is "your wedding photo from 2014 will still be there in 2034". One byte lost is reputational damage forever. Reliability is non-negotiable — and it has a price tag, which we pay deliberately.
Every photo blob is written to three availability zones, in different buildings on different power grids. S3 promises 99.999999999% durability (11 nines); at our scale of ~7.3 billion photos over ten years, the expected loss is well under one photo across the entire library. Acceptable.
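The arithmetic behind that claim, assuming 11 nines means an annual per-object loss probability of 10⁻¹¹ and pessimistically treating every photo as stored for the full decade:

```python
# Expected number of photos lost over 10 years under 11-nines durability.
annual_loss_probability = 1e-11   # 99.999999999% durable per object per year
photos_stored = 7.3e9             # 2M/day * 365 * 10 years
years = 10

expected_losses = photos_stored * annual_loss_probability * years
print(f"expected photos lost: {expected_losses:.2f}")   # ~0.73, i.e. under one photo
```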
Upload, read, and feed services hold no local state. Killing any pod loses zero data — the next request just hits a sibling pod. We run N+2 capacity per service so we can lose two pods at once and keep serving. Deployments are rolling: 10% of pods at a time, health-checked between waves.
Every service is deployed to multiple regions (US-East, EU, APAC), all serving live traffic. DNS routes a user to their nearest region. If a whole region falls over, traffic shifts to the others within minutes. Cassandra's multi-region replication keeps metadata consistent eventually.
1.88 TB of photo metadata fits on one box only on paper — at this volume we need to shard before we hit a single-node ceiling on either disk or QPS. The interesting question is what to shard on. Two strategies, and the trade-off matters.
Compute shard = hash(user_id) % N. All of a user's photos live on the same shard. Reading "Sarah's profile" is a single-shard query — fast. But three problems: hot users (a celebrity's shard absorbs wildly disproportionate traffic), uneven growth (prolific posters fill some shards far faster than others), and blast radius (one overloaded or down shard takes a user's entire photo history with it).
Compute shard = consistent_hash(photo_id). Each photo lands on a uniformly-random shard, and consistent hashing means adding a shard remaps only a small slice of keys. Distribution is even by construction — no hot shards. The price: "all of Sarah's photos" becomes a scatter-gather across shards, which is why the wide-column user-to-photos index and the pre-computed timelines carry the per-user read patterns.
Like the URL shortener's KGS, we want photo_ids that are globally unique without coordination. Use a 64-bit ID with the layout:
```
┌──────────────────┬──────────────┬────────────┐
│ 41-bit timestamp │ 10-bit shard │ 13-bit seq │
└──────────────────┴──────────────┴────────────┘
```
The top 41 bits are milliseconds since a custom epoch (good for ~70 years). The next 10 bits identify the generator shard (1,024 possible). The low 13 bits are a per-millisecond counter (8,192 IDs/ms per shard). This is essentially the Twitter Snowflake scheme. Big win: because the timestamp occupies the most significant bits, sorting photo_ids gives time-sorted access for free — useful for "the last 20 photos" queries.
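A minimal single-process sketch of that layout. The epoch constant is an arbitrary assumption, and a production generator would also handle clock skew:

```python
import threading
import time

EPOCH_MS = 1_262_304_000_000  # custom epoch (2010-01-01 UTC); illustrative, not Instagram's

class PhotoIdGenerator:
    """64-bit IDs: 41-bit ms timestamp | 10-bit shard | 13-bit per-ms sequence."""

    def __init__(self, shard_id: int):
        assert 0 <= shard_id < 1024
        self.shard_id = shard_id
        self.lock = threading.Lock()
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000) - EPOCH_MS
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) % 8192
                if self.sequence == 0:              # 8,192 IDs used this ms; wait for the next
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000) - EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # Timestamp in the high bits, so sorting by ID sorts by creation time.
            return (now_ms << 23) | (self.shard_id << 13) | self.sequence

gen = PhotoIdGenerator(shard_id=7)
print(gen.next_id())
```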
The feed is Instagram's killer feature and the hardest engineering problem in the system. The question: when Sarah uploads a photo, when does it materialise in Raj's feed? Three strategies, and the right answer is "all three, depending on user".
Do nothing on upload. When Raj opens the app, query all 312 of his follows for their recent photos, merge by timestamp, return the top 20. Writes are free, but every feed open pays the full merge cost at read time, which is exactly the work we already said cannot finish in 200ms for 1M DAU.
On every upload, push the photo onto every follower's pre-computed timeline. Reads are O(1).
Most users have a few hundred followers. Pushing 600 timeline writes per upload is cheap and keeps reads fast. But celebrities (Cristiano Ronaldo with 600M followers) break push entirely — one of his uploads would trigger 600M timeline writes, melting Redis.
The hybrid: classify each user as normal or celebrity (e.g., >1M followers). When Sarah (normal) uploads, push to her followers' timelines. When a celebrity uploads, don't push — instead, on every feed read, the feed service merges the user's pushed timeline with a pull-query of any celebrities they follow. Celebrities are few, so the pull-side merge stays cheap.
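A sketch of the hybrid decision. The follower threshold is the example value from above, and the graph, timeline, and photo-store clients are abstract dependencies assumed for illustration:

```python
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000   # tune against real follower distributions

def on_photo_uploaded(uploader_id: int, photo_id: str, ts: int, graph, timelines) -> None:
    """Write path: push to followers' timelines unless the uploader is a celebrity."""
    if graph.follower_count(uploader_id) >= CELEBRITY_FOLLOWER_THRESHOLD:
        return  # celebrities are pulled at read time instead of fanned out
    for follower_id in graph.followers(uploader_id):
        timelines.push(follower_id, photo_id, ts)

def read_feed(user_id: int, graph, timelines, photos, limit: int = 20):
    """Read path: pre-pushed timeline merged with a live pull of followed celebrities."""
    pushed = timelines.recent(user_id, limit)                 # list of (photo_id, ts)
    celebs = [u for u in graph.follows(user_id)
              if graph.follower_count(u) >= CELEBRITY_FOLLOWER_THRESHOLD]
    pulled = [(p, t) for c in celebs for p, t in photos.recent_by_user(c, limit)]
    merged = sorted(pushed + pulled, key=lambda pair: pair[1], reverse=True)
    return merged[:limit]
```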
Each active user's timeline holds the top 500 most recent photos in Redis. Sliding window — when a new photo arrives, push it to the front, evict the oldest. Inactive users (no app open in 30 days) get their timeline torn down — we'll regenerate it lazily if they come back.
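What the Timeline Cache operations could look like with redis-py sorted sets; the key naming, the 500-entry cap, and the 30-day expiry follow the description above, and the rest is an illustrative sketch:

```python
import redis

r = redis.Redis()           # the Timeline Cache cluster
TIMELINE_CAP = 500

def push_to_timeline(follower_id: int, photo_id: str, created_at_ms: int) -> None:
    key = f"timeline:{follower_id}"
    r.zadd(key, {photo_id: created_at_ms})           # score = timestamp
    r.zremrangebyrank(key, 0, -(TIMELINE_CAP + 1))   # evict everything beyond the newest 500
    r.expire(key, 60 * 60 * 24 * 30)                 # 30 days inactive -> timeline torn down

def read_timeline(user_id: int, limit: int = 20):
    # Newest first; a single Redis call per feed open.
    return r.zrevrange(f"timeline:{user_id}", 0, limit - 1)
```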
The single biggest performance lever in the system. Without caching, every photo view hits S3 and every feed open hits Cassandra. With the right cache layout, the origin sees a fraction of the actual user load.
CloudFront / Akamai caches the actual photo bytes at edge POPs in 200+ cities. Photos are immutable (once a photo_id is taken, the bytes never change), so the CDN can cache aggressively with a long TTL — days or weeks. The CDN has its own LRU; popular photos stay in every edge POP, cold photos roll off naturally.
In front of Cassandra, we run a Memcache cluster holding hot photo_id → metadata mappings and hot user profile rows. Eviction is LRU. Sized at ~10% of total metadata = ~200GB across a small cluster.
20% of photos generate 80% of view traffic. Caching just the hot 20% absorbs the bulk of reads. Specifically: cache the top 20% of daily volume, not the top 20% of all-time volume — yesterday's viral post is irrelevant today.
Public-facing AWS ALB / nginx. Distributes incoming HTTPS to upload, read, and feed pods. Health-checks every 5s, evicts unhealthy pods. Terminates TLS so backend pods don't pay the crypto cost.
Client-side or sidecar LB. Uses consistent hashing on the cache key so the same key always lands on the same Memcache node — maximises hit rate, avoids cache duplication.
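A toy consistent-hash ring to make that concrete; the virtual-node count and hash function are arbitrary choices for the sketch:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps cache keys to nodes; adding or removing a node only remaps ~1/N of keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping at the end).
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

ring = ConsistentHashRing(["memcache-1", "memcache-2", "memcache-3"])
print(ring.node_for("photo:p_8f2a"))   # the same key always routes to the same node
```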
Cassandra drivers handle this themselves — the client knows the cluster topology and routes each query to the coordinator node for the relevant token range.