From a single MySQL box that crumbles under 200KB photo blobs to a three-plane production system where uploads, reads, and feed generation each scale on their own clock
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 14:02 in San Francisco. Sarah lifts her phone, snaps a photo of the sunset over the Golden Gate Bridge, and taps "Share". Thirty seconds later, in Tokyo — 5,100 miles away — Raj opens Instagram on his commute home. The very first thing he sees on his feed is Sarah's sunset, sharp and instantly loaded, sitting at the top of a personalised stream of photos curated from the 312 people he follows. He didn't search for it. He didn't refresh and wait. The system knew the photo would matter to him, fetched it from a server geographically close to him, and rendered it in under 200ms. That choreography — between Sarah's upload, the global photo storage, the follow graph, and Raj's pre-computed timeline — is what we're designing.
Instagram (in its original photo-sharing form) lets users upload photos and videos, follow other users, and see a personalised news feed made up of the latest top photos from the people they follow. Two ingredients drive every architectural decision: photos are big and write-once read-many, and feeds are per-user and have to feel instant.
Before drawing a single box, pin down what the system must (and must not) do. Asking these questions in an interview shows you're not just regurgitating a memorised solution — you're listening for the actual constraints.
Numbers drive every architectural choice — sharding, caching, load balancing, replication factor. Do them out loud, even if rough. Instagram is read-heavy by an enormous margin: every uploaded photo is viewed hundreds of times by followers, often thousands of times for popular accounts.
Assume 500M total users, 1M daily active users (DAU). Each DAU posts ~2 photos per day on average → 2M new photos/day.
- Write rate: ~23 photos/sec (2M / 86,400)
- Average photo size: ~200 KB (typical compressed JPEG)
- New photo bytes: ~400 GB/day (2M × 200KB)
- 10-year photo storage: ~1,425 TB (400GB × 365 × 10)
Big numbers, but each table has very different shape — and that drives where each lives.
| Table | Row size | Rows over 10 yrs | Total | Notes |
|---|---|---|---|---|
| Photo metadata | ~284 bytes | 2M × 365 × 10 = 7.3B | ~1.88 TB | Photo blobs in S3; only metadata in DB |
| User | ~68 bytes | 500M | ~32 GB | Tiny; fits comfortably on one shard |
| UserFollow | ~16 bytes | 500M × ~250 follows = 125B | ~1.82 TB | Wide-column friendly (one row per user) |
| Total metadata | — | — | ~3.7 TB | Excludes photo blobs (1,425 TB in S3) |
Uploads are bounded: 23 photos/sec × 200KB ≈ ~4.6 MB/s ingress. Reads are not. If each DAU views ~500 photos/day across feed-scrolling, profile visits, and explore pages, that's 1M × 500 = 500M views/day ≈ ~5,800 reads/sec, or ~1.16 GB/s egress. Reads are roughly 250× writes by request count and bandwidth, which is exactly why CDN + cache + read-replica DB are non-negotiable.
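A quick sanity check of those estimates, as a minimal sketch using only the assumptions stated above:

```python
# Back-of-envelope check of the read/write estimates above.
DAU = 1_000_000
PHOTOS_PER_DAU_PER_DAY = 2
VIEWS_PER_DAU_PER_DAY = 500            # feed scrolling + profiles + explore
AVG_PHOTO_BYTES = 200_000              # ~200 KB compressed JPEG
SECONDS_PER_DAY = 86_400

uploads_per_sec = DAU * PHOTOS_PER_DAU_PER_DAY / SECONDS_PER_DAY   # ~23
reads_per_sec = DAU * VIEWS_PER_DAU_PER_DAY / SECONDS_PER_DAY      # ~5,800
ingress_mb_per_sec = uploads_per_sec * AVG_PHOTO_BYTES / 1e6       # ~4.6 MB/s
egress_gb_per_sec = reads_per_sec * AVG_PHOTO_BYTES / 1e9          # ~1.16 GB/s
# ~1,460 TB decimal; the ~1,425 TB figure above rounds using binary TB.
ten_year_photo_tb = DAU * PHOTOS_PER_DAU_PER_DAY * AVG_PHOTO_BYTES * 365 * 10 / 1e12

print(f"{uploads_per_sec:.0f} uploads/s vs {reads_per_sec:.0f} reads/s "
      f"(read:write ≈ {reads_per_sec / uploads_per_sec:.0f}x), "
      f"{ten_year_photo_tb:,.0f} TB of photos over 10 years")
```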
Five endpoints carry virtually all the traffic: upload a photo, fetch a photo, get the feed, follow a user, and search. Defining the API surface up front locks down the contract before architecture.
REST API surface:

```
// Upload — write path. Multipart (the photo bytes ride in the body).
POST /api/v1/photos
Content-Type: multipart/form-data
{
  "api_dev_key": "abc123...",
  "user_id": 42,
  "title": "Sunset over Golden Gate",
  "photo": <binary blob, ~200KB>
}
→ 201 Created
{ "photo_id": "p_8f2a", "cdn_url": "https://cdn.instagram.com/p_8f2a.jpg" }

// Fetch a single photo — high QPS, served via CDN almost always
GET /api/v1/photos/:photo_id
→ 302 Found
Location: https://cdn.instagram.com/p_8f2a.jpg

// News feed — the personalised stream for one user
GET /api/v1/feed?user_id=99&limit=20&cursor=...
→ 200 OK
{
  "photos": [
    { "photo_id": "p_8f2a", "owner": "sarah", "cdn_url": "...", "created_at": "..." },
    ...
  ],
  "next_cursor": "..."
}

// Follow another user
POST /api/v1/follow/:target_user_id
Headers: { "X-API-Dev-Key": "abc123..." }
→ 204 No Content

// Search photos by title / caption
GET /api/v1/search?q=sunset&limit=20
→ 200 OK
{ "photos": [ ... ] }
```
Paginate the feed with a cursor (last-seen photo_id + timestamp) instead of offset=200. Offsets break when new photos arrive between page fetches; cursors are stable.

Three observations drive the schema design: (1) photo blobs are huge but metadata is tiny, so they live in different stores; (2) the access patterns are key-based and graph-based, not relational-join-heavy; (3) the follow graph is read on every feed fetch, so it must be indexed for "give me everyone user X follows" in O(1).
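To make that cursor concrete, here is a minimal sketch of one possible encoding; the format is an assumption for illustration, not Instagram's actual cursor:

```python
import base64
import json

def encode_cursor(last_photo_id: str, last_created_at: int) -> str:
    """Opaque cursor: the last photo the client has already seen."""
    raw = json.dumps({"id": last_photo_id, "ts": last_created_at})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["id"], data["ts"]

# Server side: "give me the next 20 photos strictly older than the cursor".
# New uploads land in front of the cursor, so this page never shifts.
cursor = encode_cursor("p_8f2a", 1714500000)
photo_id, ts = decode_cursor(cursor)
```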
The 200KB JPEGs themselves go to distributed object storage — S3 or HDFS. Each blob is named by its photo_id. Replication-3, multi-AZ, served via CDN. Never store these in a database.
Photo metadata is stored in a wide-column NoSQL store like Cassandra. Primary key = photo_id. Cassandra's quorum reads, multi-region replication, and tunable consistency are exactly what we need.
The follow graph is wide-column friendly: one row per user, columns are the user_ids they follow. Reading "who does Raj follow?" becomes a single row read. Same store (Cassandra), keyed by the follower's user_id.
NoSQL wide-column shines here. For a "user's photos" table, we use the user_id as row key and photo_ids as columns. Reading "all of Sarah's photos in reverse-chrono order" is a single row read instead of a fanout query. Same trick for the follow graph: row key = user, columns = users they follow.
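A sketch of that layout in CQL, issued via the Python driver. The keyspace, table, and column names are illustrative assumptions, not a production schema:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("instagram")

# One partition per user; clustering by (created_at DESC, photo_id) means
# "Sarah's photos, newest first" is a single-partition range read.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_photos (
        user_id    bigint,
        created_at timestamp,
        photo_id   text,
        title      text,
        PRIMARY KEY ((user_id), created_at, photo_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, photo_id ASC)
""")

# Same trick for the follow graph: one partition per follower.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_follows (
        follower_id bigint,
        followee_id bigint,
        PRIMARY KEY ((follower_id), followee_id)
    )
""")

rows = session.execute(
    "SELECT photo_id, title FROM user_photos WHERE user_id = %s LIMIT 20", [42]
)
```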
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. Numbers from §3 drive every decision.
The simplest possible system: one app server, one MySQL database, and we'll just be lazy and store the photo bytes as BLOB columns in MySQL alongside the metadata. Uploads write to MySQL. Reads SELECT the BLOB column and stream it back to the client. Feed generation runs a giant join across users, follows, and photos at request time.
Three concrete failures emerge the moment real traffic shows up:
A single 200KB photo upload pins one MySQL connection for the duration of the byte transfer. At 23 uploads/sec each holding a connection for 1-2 seconds, write connections saturate. Read queries queue up behind them — feed latency spikes from 50ms to 5 seconds. The "uploads choke reads" failure is the first one a real workload exposes.
Relational DBs are tuned for small rows. A 200KB BLOB blows the row out of one page, fragments the buffer pool, and shreds throughput. At 1,425 TB over 10 years, we'd need a MySQL cluster the size of a data center just to hold the bytes — at 100× the cost of S3, with worse durability.
To build Raj's feed, the naive query is SELECT photos.* FROM follows JOIN photos ON ... WHERE follows.user = raj ORDER BY created DESC LIMIT 20. With 250 follows × thousands of photos each, this scans millions of rows on every feed open. Doing it under 200ms for 1M DAU? Impossible without pre-computation.
The single most important insight in this design is that three workloads have wildly different shapes and must scale independently. Pass 1 fails because all three live on the same box, fighting for the same CPU, disk, and connection pool.
The upload (write) plane: 23 req/sec, slow, bandwidth-bound. Each request streams a 200KB photo. The bottleneck is network and disk, not CPU. Latency budget is generous — users tolerate a 1-2 second upload spinner. Goal: never drop a byte, never lose a photo.
The read plane: 5,800+ req/sec, fast, latency-sensitive. Photos are immutable and cacheable up to the moon. Served from CDN edge POPs globally so a Tokyo user doesn't pay a trans-Pacific round trip. Goal: under 50ms for a cached photo, under 200ms for a feed.
The feed-generation plane: async, expensive, runs offline. Pre-computes per-user timelines so the feed read is a single cache lookup, not a triple join. Workers fan out new photos to followers' timelines in the background, decoupled from the request path.
This three-way split is THE central insight. Splitting these planes lets each scale independently and prevents uploads from breaking reads — which was the very first failure of Pass 1. Goodbye, blocked connection pool.
Now the full picture. Every node in the diagram is numbered ①–⑫, and each has a matching card below that answers what this thing is, why it's here, and crucially what would break without it.
① Client: The Instagram mobile app on Sarah's iPhone or the web client in Raj's browser. The client is responsible for compressing the photo locally before upload (resizing to a sane max resolution and JPEG-compressing) so the server doesn't waste bandwidth on raw camera bytes. It also caches recently-viewed photos in local memory so scrolling back up is instant.
Solves: nothing on its own — but every design choice flows backward from "what does the client experience?" Latency, cell-network reliability, and the upload protocol are all client-facing concerns.
② Load Balancer: The traffic cop. Sits in front of upload, read, and feed services, distributes incoming HTTPS, terminates TLS, and yanks unhealthy backends out of rotation via 5-second health checks. Round-robin is fine to start; switch to least-connections once you have enough nodes that load skew matters.
Solves: single-server failure and uneven load. Without an LB, one app crash takes down the service. With it, we lose 1/N of capacity for a few seconds until the bad node gets evicted.
③ Upload Service: Stateless service handling POST /api/v1/photos. Per request: validate the multipart payload, generate a unique photo_id, stream the bytes to Image Storage ④ in chunks, then write a metadata row to the Image Metadata DB ⑤. Finally it emits an async event to the Feed Generation Service ⑧ saying "user X just posted photo Y, fan it out to their followers" and another event to Search ⑫ to index the title. Stateless → we just add pods behind the LB to scale. (A sketch of this request path follows the card below.)
Solves: isolating the slow, bandwidth-heavy upload work from everything else. Without a dedicated upload tier, photo bytes streaming through the same connection pool as feed reads would block reads — the original Pass 1 failure.
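A minimal sketch of that request path. The metadata and event-bus clients are passed in as abstract dependencies, and the bucket name, topic name, and helper methods are assumptions for illustration:

```python
import uuid
import boto3

s3 = boto3.client("s3")
PHOTO_BUCKET = "instagram-photos"   # hypothetical bucket name

def handle_upload(user_id: int, title: str, photo_stream,
                  metadata_db, event_bus) -> dict:
    """Upload path: durable blob first, metadata row second, async events last."""
    # A real deployment would use the Snowflake-style IDs from the deep dive below.
    photo_id = f"p_{uuid.uuid4().hex[:8]}"

    # 1. Stream the bytes to object storage: this is the copy that must never be lost.
    s3.upload_fileobj(photo_stream, PHOTO_BUCKET, f"{photo_id}.jpg")

    # 2. Write the metadata row (Cassandra in this design).
    metadata_db.insert_photo(photo_id=photo_id, user_id=user_id, title=title)

    # 3. Fire-and-forget events: feed fan-out ⑧, notifications ⑪, search indexing ⑫.
    event_bus.publish("photo.created", {"photo_id": photo_id, "user_id": user_id})

    return {"photo_id": photo_id,
            "cdn_url": f"https://cdn.instagram.com/{photo_id}.jpg"}
```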
④ Image Storage (S3 / object store): The actual photo bytes — billions of 200KB blobs — live here. Each photo is written to a path keyed by photo_id, replicated 3× across availability zones, never deleted (only soft-deleted via metadata flag). At 1,425 TB over 10 years, this is the biggest bill in the system, but storage at this scale is dirt-cheap compared to running it on relational DB disks.
Solves: the "where do the bytes physically live" problem. A relational DB cannot economically hold 1.4 PB of opaque blobs. Object storage is purpose-built: cheap, durable (11-nines), automatically replicated, and directly addressable from a CDN.
⑤ Image Metadata DB (Cassandra): A wide-column NoSQL store holding the photo table — photo_id, user_id, photo_path, lat, long, creation_date, title. Sharded by photo_id with consistent hashing, replicated 3× across AZs. Cassandra's R=2, W=2, N=3 quorum gives us strong-enough consistency without sacrificing availability. Total size: ~1.88 TB over 10 years — easy to handle at this size.
Solves: indexing the photo blobs so we can answer "what photos does Sarah have?", "when was photo X uploaded?", "where was it taken?" without reading the bytes themselves. Without metadata, the blob store is opaque — we couldn't even find a photo without already knowing its path.
⑥ Read Service: Stateless service handling GET /api/v1/photos/:photo_id. Per request: look up the photo metadata, then return a 302 redirect to the CDN ⑦ URL. For most photos, the client never even talks to this service again — the second hit comes straight from CDN. Scaled aggressively because read traffic is 250× write traffic.
Solves: serving the redirect at scale and acting as the gatekeeper for visibility / privacy checks before exposing the CDN URL. Without it, the upload service would have to also handle reads — combining the slow path with the hot path, which we've already said is fatal.
⑦ CDN: A globally-distributed edge network in 200+ cities. Photos are immutable and cache-friendly to a fault — once a CDN edge in Tokyo has the bytes, every Tokyo user's request is answered in 20ms with zero load on our origin. Hot photos (a celebrity's latest post) live in every edge POP within seconds.
Solves: global latency and origin bandwidth. Without a CDN, every photo view from Tokyo travels to our US-East data center and back — 200ms of speed-of-light penalty just on physics. And our origin would have to push 1.16 GB/s constantly. With a CDN, origin sees only the cold-tail traffic — maybe 5% of total reads.
⑧ Feed Generation Service: The brain of the personalised stream. Async workers consume "new photo" events from the upload service, look up the uploader's followers in the User Graph DB ⑩, and push the new photo onto each follower's pre-computed timeline in the Timeline Cache ⑨. When a user opens the app, fetching their feed becomes a single Redis read — no joins, no scans. Hybrid push/pull (see §9) handles celebrity edge cases.
Solves: the 200ms feed latency requirement. Without pre-computation, every feed open would scan all of a user's follows' photos and rank them — impossible to do under 200ms for 1M DAU. By doing the work at write time once, we make every read O(1).
⑨ Timeline Cache (Redis): An in-memory key-value store keyed by user_id, whose value is the user's pre-computed timeline — a list of the most recent ~500 photo_ids (with metadata) from people they follow. Backed by Redis sorted sets (score = timestamp) so range queries by time are trivially fast.
Solves: the speed of feed reads. Memory is microsecond-fast; disk is millisecond-fast. With 1M DAU × ~10KB timeline each = 10GB — fits trivially in a small Redis cluster. Without this, every feed open hits Cassandra with a multi-key fanout query.
⑩ User Graph DB: Holds the follow graph. Wide-column layout: row key = user_id, columns = users they follow (~125B rows total over 10 years). Reading "who does Raj follow?" is a single row read. Reading "who follows Sarah?" (the inverse) needs a denormalised second table — same layout, keyed by the followee.
Solves: graph queries on the follow relationship at scale. The feed generator can't fan a new photo out to followers without a fast "give me this user's followers" query, and a relational JOIN on a 125B-row table doesn't end well.
⑪ Notification Service: Pushes a phone notification ("Sarah just posted!") to followers who have notifications enabled. Consumes the same "new photo" event as feed generation, looks up notification preferences, and fires off APNs/FCM messages. Async — never on the upload path.
Solves: engagement (telling Raj a photo he'd care about exists). Without it, users would have to open the app to discover new content. Crucially separated from upload so notification backpressure can never slow uploads.
⑫ Search (Elasticsearch): An inverted index of photo titles and captions, kept up to date by ingesting "new photo" events from the upload service. Handles GET /api/v1/search?q=sunset in tens of milliseconds. Photo metadata in Cassandra is the source of truth; Elasticsearch is purely a derived index.
Solves: the search-by-title functional requirement. Cassandra is great at key lookups but terrible at full-text search. Elasticsearch is purpose-built for it. Decoupled and async — losing the search index for an hour doesn't affect uploads or feeds.
Two real flows mapped to the numbered components above:
Flow 1: Sarah uploads a photo.
1. Sarah's phone compresses the sunset shot and POSTs it to /api/v1/photos via the Load Balancer ②.
2. The Upload Service ③ generates photo_id = p_8f2a, streams the bytes to Image Storage ④ (S3), and writes the metadata row to Image Metadata DB ⑤.
3. It emits async events to Feed Generation ⑧, Notifications ⑪, and Search ⑫, then returns { photo_id, cdn_url } to Sarah's phone. Total elapsed: ~1.2 seconds, mostly bandwidth.

Flow 2: Raj opens his feed.
1. Raj's app calls GET /api/v1/feed?user_id=raj.
2. The feed service reads timeline:raj from the Timeline Cache ⑨ → gets back a list of 20 photo_ids including p_8f2a at the top. ~3ms.
3. The response carries photo metadata with cdn_urls embedded. Raj's app fetches each photo from the CDN ⑦ — Tokyo edge POP, ~20ms each, fully parallel.

Instagram's promise is "your wedding photo from 2014 will still be there in 2034". One byte lost is reputational damage forever. Reliability is non-negotiable — and it has a price tag, which we pay deliberately.
Every photo blob is written to three availability zones, in different buildings on different power grids. S3 promises 99.999999999% durability (11 nines); at our scale of ~7.3 billion photos over ten years, the expected loss is well under one photo across the entire library. Acceptable.
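The arithmetic behind that claim, assuming 11 nines means an annual per-object loss probability of 10⁻¹¹ and pessimistically treating every photo as stored for the full decade:

```python
# Expected number of photos lost over 10 years under 11-nines durability.
annual_loss_probability = 1e-11   # 99.999999999% durable per object per year
photos_stored = 7.3e9             # 2M/day * 365 * 10 years
years = 10

expected_losses = photos_stored * annual_loss_probability * years
print(f"expected photos lost: {expected_losses:.2f}")   # ~0.73, i.e. under one photo
```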
Upload, read, and feed services hold no local state. Killing any pod loses zero data — the next request just hits a sibling pod. We run N+2 capacity per service so we can lose two pods at once and keep serving. Deployments are rolling: 10% of pods at a time, health-checked between waves.
Every service is deployed to multiple regions (US-East, EU, APAC), all serving live traffic. DNS routes a user to their nearest region. If a whole region falls over, traffic shifts to the others within minutes. Cassandra's multi-region replication keeps metadata consistent eventually.
1.88 TB of photo metadata fits on one box only on paper — at this volume we need to shard before we hit a single-node ceiling on either disk or QPS. The interesting question is what to shard on. Two strategies, and the trade-off matters.
Compute shard = hash(user_id) % N. All of a user's photos live on the same shard. Reading "Sarah's profile" is a single-shard query — fast. But three problems: hot users (a celebrity's shard absorbs wildly disproportionate traffic), uneven growth (prolific posters fill some shards far faster than others), and blast radius (one overloaded or down shard takes a user's entire photo history with it).
Compute shard = consistent_hash(photo_id). Each photo lands on a uniformly-random shard, and consistent hashing means adding a shard remaps only a small slice of keys. Distribution is even by construction — no hot shards. The price: "all of Sarah's photos" becomes a scatter-gather across shards, which is why the wide-column user-to-photos index and the pre-computed timelines carry the per-user read patterns.
Like the URL shortener's KGS, we want photo_ids that are globally unique without coordination. Use a 64-bit ID with the layout:
```
┌──────────────────┬──────────────┬────────────┐
│ 41-bit timestamp │ 10-bit shard │ 13-bit seq │
└──────────────────┴──────────────┴────────────┘
```
The top 41 bits are milliseconds since a custom epoch (good for ~70 years). The next 10 bits identify the generator shard (1,024 possible). The low 13 bits are a per-millisecond counter (8,192 IDs/ms per shard). This is essentially the Twitter Snowflake scheme. Big win: because the timestamp occupies the most significant bits, sorting photo_ids gives time-sorted access for free — useful for "the last 20 photos" queries.
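A minimal single-process sketch of that layout. The epoch constant is an arbitrary assumption, and a production generator would also handle clock skew:

```python
import threading
import time

EPOCH_MS = 1_262_304_000_000  # custom epoch (2010-01-01 UTC); illustrative, not Instagram's

class PhotoIdGenerator:
    """64-bit IDs: 41-bit ms timestamp | 10-bit shard | 13-bit per-ms sequence."""

    def __init__(self, shard_id: int):
        assert 0 <= shard_id < 1024
        self.shard_id = shard_id
        self.lock = threading.Lock()
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000) - EPOCH_MS
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) % 8192
                if self.sequence == 0:              # 8,192 IDs used this ms; wait for the next
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000) - EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # Timestamp in the high bits, so sorting by ID sorts by creation time.
            return (now_ms << 23) | (self.shard_id << 13) | self.sequence

gen = PhotoIdGenerator(shard_id=7)
print(gen.next_id())
```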
The feed is Instagram's killer feature and the hardest engineering problem in the system. The question: when Sarah uploads a photo, when does it materialise in Raj's feed? Three strategies, and the right answer is "all three, depending on user".
Do nothing on upload. When Raj opens the app, query all 312 of his follows for their recent photos, merge by timestamp, return the top 20. Writes are free, but every feed open pays the full merge cost at read time, which is exactly the work we already said cannot finish in 200ms for 1M DAU.
On every upload, push the photo onto every follower's pre-computed timeline. Reads are O(1).
Most users have a few hundred followers. Pushing 600 timeline writes per upload is cheap and keeps reads fast. But celebrities (Cristiano Ronaldo with 600M followers) break push entirely — one of his uploads would trigger 600M timeline writes, melting Redis.
The hybrid: classify each user as normal or celebrity (e.g., >1M followers). When Sarah (normal) uploads, push to her followers' timelines. When a celebrity uploads, don't push — instead, on every feed read, the feed service merges the user's pushed timeline with a pull-query of any celebrities they follow. Celebrities are few, so the pull-side merge stays cheap.
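A sketch of the hybrid decision. The follower threshold is the example value from above, and the graph, timeline, and photo-store clients are abstract dependencies assumed for illustration:

```python
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000   # tune against real follower distributions

def on_photo_uploaded(uploader_id: int, photo_id: str, ts: int, graph, timelines) -> None:
    """Write path: push to followers' timelines unless the uploader is a celebrity."""
    if graph.follower_count(uploader_id) >= CELEBRITY_FOLLOWER_THRESHOLD:
        return  # celebrities are pulled at read time instead of fanned out
    for follower_id in graph.followers(uploader_id):
        timelines.push(follower_id, photo_id, ts)

def read_feed(user_id: int, graph, timelines, photos, limit: int = 20):
    """Read path: pre-pushed timeline merged with a live pull of followed celebrities."""
    pushed = timelines.recent(user_id, limit)                 # list of (photo_id, ts)
    celebs = [u for u in graph.follows(user_id)
              if graph.follower_count(u) >= CELEBRITY_FOLLOWER_THRESHOLD]
    pulled = [(p, t) for c in celebs for p, t in photos.recent_by_user(c, limit)]
    merged = sorted(pushed + pulled, key=lambda pair: pair[1], reverse=True)
    return merged[:limit]
```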
Each active user's timeline holds the top 500 most recent photos in Redis. Sliding window — when a new photo arrives, push it to the front, evict the oldest. Inactive users (no app open in 30 days) get their timeline torn down — we'll regenerate it lazily if they come back.
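What the Timeline Cache operations could look like with redis-py sorted sets; the key naming, the 500-entry cap, and the 30-day expiry follow the description above, and the rest is an illustrative sketch:

```python
import redis

r = redis.Redis()           # the Timeline Cache cluster
TIMELINE_CAP = 500

def push_to_timeline(follower_id: int, photo_id: str, created_at_ms: int) -> None:
    key = f"timeline:{follower_id}"
    r.zadd(key, {photo_id: created_at_ms})           # score = timestamp
    r.zremrangebyrank(key, 0, -(TIMELINE_CAP + 1))   # evict everything beyond the newest 500
    r.expire(key, 60 * 60 * 24 * 30)                 # 30 days inactive -> timeline torn down

def read_timeline(user_id: int, limit: int = 20):
    # Newest first; a single Redis call per feed open.
    return r.zrevrange(f"timeline:{user_id}", 0, limit - 1)
```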
The single biggest performance lever in the system. Without caching, every photo view hits S3 and every feed open hits Cassandra. With the right cache layout, the origin sees a fraction of the actual user load.
CloudFront / Akamai caches the actual photo bytes at edge POPs in 200+ cities. Photos are immutable (once a photo_id is taken, the bytes never change), so the CDN can cache aggressively with a long TTL — days or weeks. The CDN has its own LRU; popular photos stay in every edge POP, cold photos roll off naturally.
In front of Cassandra, we run a Memcache cluster holding hot photo_id → metadata mappings and hot user profile rows. Eviction is LRU. Sized at ~10% of total metadata = ~200GB across a small cluster.
20% of photos generate 80% of view traffic. Caching just the hot 20% absorbs the bulk of reads. Specifically: cache the top 20% of daily volume, not the top 20% of all-time volume — yesterday's viral post is irrelevant today.
Public-facing AWS ALB / nginx. Distributes incoming HTTPS to upload, read, and feed pods. Health-checks every 5s, evicts unhealthy pods. Terminates TLS so backend pods don't pay the crypto cost.
Client-side or sidecar LB. Uses consistent hashing on the cache key so the same key always lands on the same Memcache node — maximises hit rate, avoids cache duplication.
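A toy consistent-hash ring to make that concrete; the virtual-node count and hash function are arbitrary choices for the sketch:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps cache keys to nodes; adding or removing a node only remaps ~1/N of keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping at the end).
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

ring = ConsistentHashRing(["memcache-1", "memcache-2", "memcache-3"])
print(ring.node_for("photo:p_8f2a"))   # the same key always routes to the same node
```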
Cassandra drivers handle this themselves — the client knows the cluster topology and routes each query to the coordinator node for the relevant token range.