From "JOIN every friend's posts on read" to a hybrid push/pull feed with ML ranking — the architecture, the trade-offs, and why every box earns its place
This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine Sarah opens Facebook on her phone over morning coffee. The first thing she sees is a feed: Raj's vacation photo from Bali at the top, then her mom's birthday post (with 47 comments already), then her favorite cooking page's latest recipe video, then a friend's status from yesterday she missed. None of these are sorted by raw time. They're ranked — by what's most relevant to her, right now. Behind that one swipe is a system that picks 50 stories out of potentially 50,000 candidate posts, scores each one with an ML model, and assembles the result in under 2 seconds, 1.5 billion times a day.
Far from a simple chronological list, the Newsfeed is a personalized newspaper printed for each reader — Raj's morning paper looks nothing like Sarah's, even though they share dozens of friends. The system has to handle text posts, photos, videos, and shared links from people, pages, and groups a user follows, and surface them in the order the user is most likely to engage with.
Before drawing a single box, pin down what the system must do. The Newsfeed has more moving parts than most because each user's feed is unique, content types are heterogeneous, and the ranking is the product.
Numbers are not optional. The Newsfeed is read-heavy and fan-out heavy — both shape every architectural choice that follows.
Assume 300 million daily active users (DAU). Each user fetches their feed roughly 5 times per day (open the app, scroll, leave, come back). That's 1.5 billion feed fetches/day.
- 300M users/day: active users who hit the feed
- 1.5B feed fetches/day: 5 fetches × 300M users
- ~17,500 req/sec: 1.5B / 86,400
- 500 follows per user: 300 friends + 200 pages
The biggest design lever in this system is pre-computing each user's feed and caching it instead of building it from scratch on every request. If we cache the top 500 items per user at ~1KB each (post id + metadata + media URLs):
500 × 1KB = 500KB per user × 300M active users ≈ 150 TB of cache. At 100GB per cache server, that's ~1,500 cache servers in the user-feed cache tier. Big number, but Facebook-scale, and the alternative (rebuilding on every request) costs 100× more in compute.
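A quick sanity check of that arithmetic, as a Python back-of-envelope (the constants are the article's assumptions, not measured values):

```python
# Cache-tier sizing sketch using the article's assumed numbers.
ITEMS_PER_USER  = 500            # pre-ranked feed item ids + metadata
BYTES_PER_ITEM  = 1_000          # ~1 KB per cached item
ACTIVE_USERS    = 300_000_000    # DAU
NODE_CAPACITY   = 100 * 10**9    # ~100 GB of usable cache RAM per node

per_user = ITEMS_PER_USER * BYTES_PER_ITEM        # 500 KB
total    = per_user * ACTIVE_USERS                # ~150 TB
nodes    = total / NODE_CAPACITY                  # ~1,500 cache nodes

print(f"{per_user / 1e3:.0f} KB/user, {total / 1e12:.0f} TB total, ~{nodes:.0f} nodes")
```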
If 10% of DAU posts something each day, that's 30 million new posts/day = ~350 posts/sec. Each post may fan out to hundreds of followers' feed caches — so the fan-out write rate is enormous: 350 × ~500 followers avg = ~175K cache writes/sec. This is why the fan-out service is a separate, async tier.
| Metric | Value | Why it matters |
|---|---|---|
| Feed fetch QPS | 17.5K/s | Drives read tier sizing & cache hit rate target |
| Per-user cache | 500 KB | 500 pre-ranked feed item ids + minimal metadata |
| Total cache | 150 TB | Forces sharding across ~1,500 Redis nodes |
| New posts/sec | ~350/s | Drives Post Service throughput |
| Fan-out writes/sec | ~175K/s | Forces async fan-out tier — can't run inline |
One endpoint carries the bulk of feed traffic — getUserFeed. Another writes posts. The read API is paginated using cursor-style since_id/max_id markers, not page numbers, because the feed grows under the user's feet.
```
// Read — feed fetch, ~17.5K QPS
GET /api/v1/feed
{
  "api_key": "abc123...",
  "user_id": 42,
  "since_id": "post_99812",   // only items newer than this
  "count": 50,                // page size
  "max_id": null,             // for scrolling backward
  "exclude_replies": true
}
→ 200 OK
{
  "items": [
    { "post_id": "...", "author_id": 88, "type": "image", "score": 0.91, ... },
    ...
  ],
  "next_cursor": "post_98477"
}

// Write — new post, ~350 QPS
POST /api/v1/posts
{
  "api_key": "abc123...",
  "user_id": 42,
  "content": "Birthday photo from mom!",
  "media_ids": ["m_771", "m_772"],
  "audience": "friends"
}
→ 201 Created
{ "post_id": "post_99813" }
```
Why since_id instead of page=N? The feed is a moving target — by the time a user scrolls to "page 3", the system has injected 12 new posts at the top, shifting everything down. Page numbers would show duplicates or skip items. A cursor anchored to since_id says "give me 50 items strictly older than this one I already saw", which is stable across scrolls.

Why a score on every item? Because the feed isn't strictly chronological — the client may want to display the score (rare) or use it for client-side re-ordering after the user marks a post as "not interested". The server is the source of truth for ranking, but exposing the score keeps the contract honest.

Six core entities. Notice that the schema looks broadly relational — there are relationships (a post belongs to a user, a follow links two users) — but at our scale we'll partition each table aggressively and skip cross-table joins entirely. Most lookups are by primary key.
USER holds people accounts. ENTITY holds pages and groups — anything a user can follow that isn't itself a user. We could merge these, but separating them lets us evolve their lifecycle independently (a Page has admins, monetization, and verification flags that a User does not).
The follow graph. (user_id, target_id, target_type) — so Sarah follows person 88, page 451, and group 99 in three rows. target_type distinguishes them. This table is read on every fan-out: "who follows post-author 42?" and "which entities does Sarah follow?". Indexed both directions.
The post itself. PK is post_id (Snowflake-style: timestamp + shard + sequence, monotonically increasing). Stored in Cassandra, sharded by post_id for uniform write distribution. The content text lives here; media is referenced by id.
One post can have many media files (a 5-photo album), so we model it as a join: FEED_MEDIA links post_id ↔ media_id, and MEDIA stores the metadata (s3 URL, dimensions, MIME). The actual bytes live in S3 and are served via CDN — never through our app servers.
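A minimal sketch of those six entities as Python dataclasses; the field names are illustrative guesses based on the descriptions above, not a published schema:

```python
from dataclasses import dataclass, field

@dataclass
class User:                # a person's account
    user_id: int
    name: str

@dataclass
class Entity:              # pages and groups: followable, but not a person
    entity_id: int
    kind: str              # "page" | "group"
    admin_ids: list[int] = field(default_factory=list)

@dataclass
class Follow:              # the follow graph, indexed in both directions
    user_id: int           # the follower
    target_id: int         # the person, page, or group being followed
    target_type: str       # "user" | "page" | "group"

@dataclass
class FeedItem:            # the post itself; PK is a Snowflake-style post_id
    post_id: int           # timestamp + shard + sequence
    author_id: int
    content: str
    created_at: int        # epoch millis

@dataclass
class FeedMedia:           # join table: one post can have many media files
    post_id: int
    media_id: int

@dataclass
class Media:               # metadata only; the bytes live in S3, served via CDN
    media_id: int
    s3_url: str
    mime_type: str
    width: int
    height: int
```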
This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The capacity numbers from §3 drive every decision.
Sketch the simplest possible system: when Sarah opens Facebook, an app server runs a single SQL query — "give me all posts authored by users Sarah follows, ordered by date, limit 100". One database holds everything.
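Concretely, the Pass-1 read path boils down to one query along these lines (table and column names are illustrative):

```python
# The naive "build the feed at read time" query. One DB, one statement.
NAIVE_FEED_SQL = """
SELECT p.*
FROM   feed_item AS p
WHERE  p.author_id IN (
         SELECT target_id FROM follow WHERE user_id = %(viewer_id)s
       )
ORDER  BY p.created_at DESC
LIMIT  100;
"""
```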
Four concrete failures emerge the moment real traffic shows up:
Sarah follows 300 people + 200 pages = 500 sources. The query becomes WHERE author_id IN (... 500 ids ...). The DB has to gather, merge-sort, and rank posts from 500 partitions for every single fetch. At 17.5K req/sec each fanning into 500 sources, the DB is doing ~8.75M sub-lookups per second. Latency goes from 50ms to 5+ seconds — feed never loads.
No cache means 17,500 req/sec slamming the database directly. Each query touches hundreds of rows. A single Postgres node tops out around 5K simple SELECTs/sec; we need thousands of complex multi-source queries/sec. The DB CPU pegs and p99 explodes.
Cristiano Ronaldo posts. He has 600M followers. If we tried to push his post to each follower's feed inline, that's 600M cache writes triggered by one HTTP POST. The post would take 30+ minutes to "publish", and during that time the fan-out service blocks every other write.
Even if we could serve the SQL fast, sorting purely by created_at shows Sarah a stranger-in-her-group's promo post above her mom's birthday photo. Engagement collapses. The feed must be ranked by relevance — and ranking adds 50ms+ of ML scoring per item, making Pass 1 even less viable.
The single insight that unlocks the whole design: feeds are read 100× more often than they're written to. So move the work to write time. When a user posts, fan out to all their followers' pre-computed feed caches asynchronously. When a user opens the app, the feed is just a single cache fetch — already assembled, already ranked.
Push (fan-out on write). Used for normal users. When mom posts a birthday photo, the system writes the post id into all 200 followers' feed caches. Read becomes O(1). Works beautifully when fan-out per post is bounded — say under 10K followers.
Pull (fan-out on read). Used for celebrities. Ronaldo's posts are not pushed. Instead, on every feed read, the system fetches Ronaldo's 50 most-recent posts directly from the post DB and merges them with the reader's pre-computed cache. Write becomes O(1) regardless of follower count.
The hybrid model wins. Push for the long tail of normal users; pull for the head of celebrities; merge both at read time. This is the same pattern Twitter uses for its home timeline, but Facebook adds a second layer — ML ranking — so the merged set is then scored and re-ordered before returning the top 50.
Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
① Client. The Facebook app or web page. Issues a GET /feed when the user opens the app or pulls down to refresh, and a POST /posts when they publish. Holds a small client-side cache of the last fetched feed so a quick re-open doesn't always hit the network. Maintains a websocket for real-time notifications.
Solves: nothing on its own — but every latency budget flows backward from "what does the client experience?" The 2-second feed-load goal is measured from the client's tap to first paint, not from the server's response.
② Load Balancer. Public-facing layer-7 LB (e.g., AWS ALB or in-house equivalent). Distributes incoming HTTPS across web-server pods, terminates TLS, and yanks unhealthy backends via 5-second health checks. Sticky sessions are not needed because everything below is stateless.
Solves: single-server failure and traffic-spike absorption. Without it, one app crash takes down the whole region. With it, we lose 1/N of capacity for a few seconds until health checks kick the bad pod out.
③ Web Server. Stateless edge service that handles the HTTP/HTTPS surface — auth, request validation, rate-limit enforcement, response compression. Hands the parsed request off to the application server tier. Easy to scale horizontally because no state lives here.
Solves: isolating protocol concerns (TLS, HTTP parsing, auth) from business logic. Without this split, every app-server pod would have to ship a TLS termination stack and rate-limit Redis client — wasteful and harder to upgrade.
④ Application Server. The orchestrator. For a feed fetch, it calls Feed Generation, applies user-level filters (muted authors, language preference), and returns the response. For a new post, it forwards to Post Service. Holds zero per-user state — every request is independent.
Solves: a clean seam between "what the client asked for" and "which downstream services do the work". Without it, web servers would have to know about Cassandra, Redis, S3, and the ranking model — coupling that breaks every refactor.
⑤ Post Service. Handles the POST /posts write path. Validates content, generates a Snowflake-style post_id, writes the row to FeedItem DB, uploads any media to S3, and emits a "post created" event onto the fan-out queue. Latency budget: under 200ms — the user is staring at a spinner.
Solves: isolating the write path so it scales independently of reads. Without a dedicated post tier, write spikes (e.g., a major news event) would compete with feed-read traffic on the same pods, and reads would slow during the spike.
⑥ FeedItem DB. The source of truth for posts — a partitioned, replicated NoSQL store sharded by post_id. Cassandra was chosen because the access pattern is "look up post by id" + "list recent posts by author" — both indexable, neither needing joins. Replicated 3× across availability zones with quorum reads/writes for durability without sacrificing availability.
Solves: durable post storage at petabyte scale. A single MySQL box maxes out around 1TB; we have decades of posts to keep. Cassandra spreads it across hundreds of nodes and survives single-node failure transparently.
⑦ Media Storage + CDN. Photos and videos go to S3. URLs are then cached at edge POPs by a CDN (CloudFront / Akamai). When Sarah's feed includes mom's photo, the feed response carries the CDN URL — Sarah's browser fetches the bytes directly from her nearest edge node, never through our app servers.
Solves: the bandwidth problem. A single 4MB photo viewed by 10K followers is 40GB of egress. Pushing that through our app-server tier would saturate it instantly. The CDN absorbs it at the edge — that's the entire reason CDNs exist.
⑧ Fan-out Service. The async heart of the push side. Consumes "post created" events from a Kafka topic. For each event: look up the author's followers, filter out anyone above the celebrity threshold, and write the new post_id into each remaining follower's User Feed Cache. Capped at e.g. 500 entries per user, oldest evicted. Runs on hundreds of worker pods — embarrassingly parallel across posts.
Solves: the "make new posts visible to followers fast" requirement without blocking the publish HTTP request. Without async fan-out, mom's post couldn't return 201 until 200 cache writes completed — slow on the happy path and catastrophic if any cache shard is briefly unhealthy.
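A sketch of one fan-out worker under those assumptions. follower_count and get_followers stand in for a hypothetical social-graph lookup; the Kafka topic name and Redis layout follow the description above.

```python
import json
import redis
from kafka import KafkaConsumer   # kafka-python

CELEBRITY_THRESHOLD = 1_000_000   # above this, skip push; pull at read time instead
FEED_CAP = 500                    # max entries per user feed cache

def follower_count(author_id: int) -> int:          # hypothetical graph-service call
    raise NotImplementedError

def get_followers(author_id: int) -> list[int]:     # hypothetical graph-service call
    raise NotImplementedError

r = redis.Redis(host="feed-cache", port=6379)
consumer = KafkaConsumer(
    "post-created",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b),
)

for event in consumer:
    post = event.value
    if follower_count(post["author_id"]) >= CELEBRITY_THRESHOLD:
        continue                                     # celebrity: no push, pull on read
    pipe = r.pipeline(transaction=False)
    for follower_id in get_followers(post["author_id"]):
        key = f"user:{follower_id}:feed"
        pipe.zadd(key, {post["post_id"]: post["created_at"]})   # score = timestamp
        pipe.zremrangebyrank(key, 0, -(FEED_CAP + 1))           # evict beyond newest 500
    pipe.execute()
```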
⑨ User Feed Cache. The crown jewel. Per-user pre-computed list of feed-item ids — a Redis sorted set keyed by user:{id}:feed, scored by post timestamp, capped at 500 entries. Sharded by user_id across ~1,500 nodes. A feed fetch is one ZREVRANGE on the right shard — sub-millisecond.
Solves: the 17.5K-req/sec read load. Without this cache, every fetch would hit Cassandra with a fan-in query across hundreds of authors — orders of magnitude slower and impossible to scale linearly. The cache makes the read path effectively free.
⑩ Feed Generation Service. The read-path orchestrator. For each feed fetch: (1) ZREVRANGE the user's pre-computed cache for the top 500 ids; (2) for each celebrity the user follows, SELECT the celebrity's last 50 posts from FeedItem DB; (3) merge both sets; (4) hand to Ranking Service; (5) hydrate the top-N with full post bodies and media URLs. Total budget: under 500ms.
Solves: the merge-and-rank step that the cache alone can't do. The cache holds candidates; ranking picks the winners. This service also handles cache misses (regenerate on the fly for inactive users) and pagination.
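Putting those five steps into code, roughly; the helper functions are hypothetical stand-ins for the graph service, the Cassandra client, the ranking service, and the hydration layer:

```python
import redis

r = redis.Redis(host="feed-cache", port=6379)

def get_celebrities_followed(viewer_id: int) -> list[int]:       # hypothetical graph lookup
    return []

def recent_posts_by(author_id: int, limit: int) -> list[int]:    # hypothetical Cassandra read
    return []

def rank(viewer_id: int, post_ids: list[int]) -> list[tuple[int, float]]:
    return [(p, 0.5) for p in post_ids]                          # hypothetical model call

def hydrate(post_ids: list[int]) -> list[dict]:                  # hypothetical body + media fill-in
    return [{"post_id": p} for p in post_ids]

def get_feed(viewer_id: int, page_size: int = 50) -> list[dict]:
    key = f"user:{viewer_id}:feed"
    # (1) pre-computed candidates: one sorted-set range read, newest first
    candidates = [int(pid) for pid in r.zrevrange(key, 0, 499)]
    # (2) pull path: last 50 posts from each celebrity the viewer follows
    for celeb_id in get_celebrities_followed(viewer_id):
        candidates += recent_posts_by(celeb_id, limit=50)
    # (3) + (4) score every candidate for this viewer, keep the best page
    scored = rank(viewer_id, candidates)
    top = sorted(scored, key=lambda item: item[1], reverse=True)[:page_size]
    # (5) hydrate ids into full post bodies and CDN media URLs
    return hydrate([post_id for post_id, _ in top])
```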
⑪ Ranking Service. The ML brain. Given a candidate list of post ids and a viewer id, returns each post's relevance score — a function of recency, post engagement (likes, comments, shares), viewer-author affinity (how often Sarah interacts with mom), content type (Sarah loves videos), and predicted dwell time. Implemented as a model server (e.g., TorchServe) with a feature store. Scoring 500 candidates takes ~30-50ms.
Solves: the relevance problem. Without ranking, the feed is reverse-chronological — and reverse-chrono is empirically worse for engagement than ranked. Personalization is the product.
⑫ Notification Service. Pushes "your friend just posted" alerts to active users via websocket / APNs / FCM. Subscribes to the same fan-out events as the cache writers, but emits user-visible notifications instead of cache mutations. Throttled per-user so a friend who posts 10 times in an hour doesn't generate 10 buzz notifications.
Solves: the "new post visible within 5 seconds" goal for users who already have the app open. Without it, fresh posts only appear when the user manually refreshes — defeating the freshness requirement.
Two real flows mapped to the numbered components above.
Flow 1: mom publishes a post (write path).
1. The client sends POST /api/v1/posts with text + photo bytes.
2. The Post Service ⑤ generates post_id = post_99813, writes {post_id, user_id, content, media_id, created_at} into FeedItem DB ⑥, uploads the photo to S3, and emits a "post created" event onto the fan-out queue.
3. The API returns 201 {post_id} to mom. Total: ~250ms — mom sees her post on screen.
4. Asynchronously, the Fan-out Service ⑧ looks up her followers and writes post_99813 into each one's User Feed Cache ⑨ via ZADD user:{id}:feed. Total fan-out: under 5 seconds.

Flow 2: Sarah fetches her feed (read path).
1. The client sends GET /api/v1/feed with her user_id and a since_id cursor.
2. The Feed Generation Service ⑩ runs ZREVRANGE user:42:feed 0 499 against User Feed Cache ⑨ → 500 candidate post ids back, including mom's brand-new post_99813.
3. Celebrity posts are pulled from FeedItem DB ⑥, merged with the candidates, scored by the Ranking Service ⑪, and the top 50 are hydrated and returned (roughly 800ms end to end on a warm cache).

The core trade-off: do we keep a pre-computed feed for every user 24/7, or only for active ones? Storage vs. compute.
For users who open Facebook at least once a week, we maintain a continuously-updated feed cache. Every time the fan-out service receives a new post from someone they follow, the user's cache is updated immediately. By the time Sarah opens the app, her feed is already assembled — the read path is just a cache fetch + ranking.
Top 500 candidates (post ids + lightweight metadata). 500 × ~1KB = ~500KB per user. Why 500? Because even a heavy scroller rarely goes past 200 items in one session, and 500 gives us headroom for ranking to discard half of them.
The fan-out service is the only writer. It does ZADD user:{id}:feed score=timestamp post_id and ZREMRANGEBYRANK user:{id}:feed 0 -501 to keep the cap at 500. Old entries are evicted automatically as new ones arrive.
What about a user who hasn't opened Facebook in 6 months? Keeping their feed cache live is wasted memory. We let inactive users fall out of cache via an LRU policy on the cache cluster — if their entry hasn't been read in 30 days, it's evicted. When they finally return, the Feed Generation Service detects the cache miss and rebuilds the feed on demand: query FeedItem DB for the last 200 posts from each followed entity, merge, rank, return. Slower (~3-5 seconds) but only happens once per dormant user per session.
An optimization: when a dormant user logs in, the auth flow fires an async "warm cache" job in parallel with serving the login response. By the time the user navigates to the feed (a second or two later), the cache may already be populated, hiding the cold-start latency.
This is the question every Newsfeed/timeline interview circles back to. The honest answer is "hybrid" — but you only earn that answer by walking through the two extremes first.
Pull everything (fan-out on read). How: on every feed read, query each followed entity's recent posts and merge.
Pros: trivially handles celebrities; no wasted work for dormant users; fresh by definition.
Cons: read latency scales with follower count — Sarah's 500 follows = 500 sub-queries per fetch. At 17.5K req/sec, that's 8.75M sub-queries/sec slamming the DB.
Push everything (fan-out on write). How: on every post, write the post id into all followers' pre-computed feed caches.
Pros: read is O(1) — just a cache fetch.
Cons: celebrities are catastrophic — Ronaldo's post triggers 600M cache writes. Also wasteful: dormant users' caches are constantly updated for posts they'll never see.
Hybrid. How: push for users with under 1M followers, pull for celebrities, merge at read time.
Pros: bounded fan-out cost per write, bounded read cost (cache + small celeb pull).
Cons: two code paths, threshold tuning, complexity in the merge step.
| Author follower count | Strategy | Why |
|---|---|---|
| < 10K | Push to all | Fan-out is cheap; bounded cost |
| 10K – 1M | Push to active followers only | Skip dormant followers via the same LRU rule from §7 — saves memory without hurting any active user |
| > 1M (celebrity) | Pull on read | Push would be 1M+ cache writes per post — too expensive (decision rule sketched below) |
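The routing rule from the table, as a sketch; the 10K and 1M thresholds are the article's illustrative numbers, and production would tune them:

```python
def fanout_strategy(follower_count: int) -> str:
    """Pick how a new post reaches followers, based on the author's reach."""
    if follower_count < 10_000:
        return "push_all"            # write into every follower's feed cache
    if follower_count < 1_000_000:
        return "push_active_only"    # skip followers whose cache was LRU-evicted
    return "pull_on_read"            # celebrity: merged into feeds at read time
```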
The feed isn't sorted by created_at DESC. It's sorted by a relevance score computed per (viewer, post) pair — meaning Sarah and Raj could both follow mom and both see her birthday post, but at different positions in their respective feeds.
Recency. Decay factor on now - created_at. A post from 30 minutes ago beats one from yesterday, all else equal. Half-life around a few hours for friends, longer for groups.
Engagement. Total likes, comments, shares. Higher engagement = the network has signaled this is interesting. Normalized so a celebrity's 10K likes don't always beat a friend's 50.
Viewer-author affinity. How often does Sarah interact with this author? Mom (Sarah's mom, lots of likes) scores way higher than a high-school acquaintance she never engages with. Computed offline daily and fetched from a feature store.
Content-type preference. Sarah watches videos to completion; Raj scrolls past them. The model learns each viewer's content-type lean and boosts matching content.
Predicted dwell time. The model predicts how long Sarah will linger on this post. Long predicted dwell = boost. Used to penalize clickbait that gets clicks but no read-time.
Negative signals. Does Sarah hide posts from this author? Has she marked similar content "not interested"? Has the author been recently rate-limited for low-quality content? All subtract from the score.
The Feed Generation Service hands the Ranking Service a batch — viewer features + 500 candidate post features. The model is a gradient-boosted tree or a small neural network served by a model server (TorchServe, TensorFlow Serving). Scoring 500 candidates in a single batch takes ~30-50ms on a modern GPU/CPU. The top 50 by score come back; the rest are discarded.
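For intuition only, here is a deliberately simplified, hand-weighted scoring function that combines the signals above; the real ranker is a learned model (GBT or neural net), and these weights and field names are invented for the sketch:

```python
import math
import time

def relevance_score(viewer: dict, post: dict, half_life_hours: float = 4.0) -> float:
    """Toy stand-in for the ML ranker: one score per (viewer, post) pair."""
    age_hours = (time.time() - post["created_at"]) / 3600
    recency    = 0.5 ** (age_hours / half_life_hours)                 # exponential time decay
    engagement = math.log1p(post["likes"] + 2 * post["comments"] + 3 * post["shares"]) / 10
    affinity   = viewer["affinity"].get(post["author_id"], 0.0)       # offline-computed, 0..1
    type_match = viewer["type_pref"].get(post["type"], 0.0)           # e.g. Sarah loves video
    penalty    = 1.0 if post["author_id"] in viewer["muted"] else 0.0 # negative signals

    score = 0.35 * recency + 0.25 * engagement + 0.25 * affinity + 0.15 * type_match
    return max(0.0, score - penalty)
```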
Two storage tiers, two different partitioning keys. Choosing the wrong one for either tier breaks scaling.
FeedItem DB: shard by post_id. Posts are written ~350/sec from millions of authors. Sharding by post_id (Snowflake-style id, includes timestamp + shard hint) gives uniform write distribution — no hot author can melt one shard. Reads are by post_id, which is also fast.
What about "list posts by author"? Use a secondary index on (author_id, created_at) — Cassandra can index it natively. Slightly slower than primary-key reads but only used for the celebrity pull path, which is the long-tail traffic.
User Feed Cache: shard by user_id. Each user's pre-computed feed lives entirely on one shard, keyed by user:{user_id}:feed. Why not shard by post? Because the read pattern is "give me Sarah's feed" — single-user, single-key. Spreading Sarah's feed across multiple shards would force a scatter-gather on every read.
Consistent hashing on user_id across the ~1,500-node Redis cluster. Adding/removing nodes only relocates 1/N of users instead of all of them.
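A minimal hash-ring sketch of that idea; real deployments use the Redis Cluster slot map or a battle-tested client library rather than hand-rolled code like this:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: each key maps to the next node clockwise on the ring."""
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"redis-{i}" for i in range(1500)])
print(ring.node_for("user:42:feed"))   # the same user always lands on the same shard
```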
Each Cassandra shard is replicated 3× across availability zones with quorum writes (W=2, N=3) and quorum reads (R=2). Each Redis cache node has a hot replica — if the primary dies, the replica takes over in seconds; we lose at most a few seconds of fan-out writes that hadn't replicated yet, which is acceptable for a feed cache.
The cache isn't one box; it's a layered hierarchy. Each tier catches a different class of repeat work.
On each Feed Generation pod, an in-memory LRU keyed by user_id with a 30-second TTL. If Sarah refreshes twice in a row, the second refresh hits the in-process cache and skips Redis entirely. Tiny (a few hundred MB per pod) but kills millisecond-level repeat-fetch noise.
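A per-pod L1 sketch; here cachetools.TTLCache stands in for whatever in-process LRU-with-TTL structure the pod actually uses, and fetch_from_redis is a hypothetical L2 lookup:

```python
from cachetools import TTLCache

l1 = TTLCache(maxsize=50_000, ttl=30)               # 30-second TTL, bounded memory per pod

def fetch_from_redis(user_id: int) -> list[int]:    # hypothetical L2 (sharded Redis) lookup
    raise NotImplementedError

def get_feed_candidates(user_id: int) -> list[int]:
    feed = l1.get(user_id)
    if feed is None:                                 # miss: fall through to the Redis tier
        feed = fetch_from_redis(user_id)
        l1[user_id] = feed
    return feed
```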
The big one. The 150TB sharded Redis cluster from §6 holding pre-computed feeds. Hit rate target: >95%. LRU eviction at the cluster level for inactive users.
Last resort — only hit on cache miss (cold user) or for the celebrity pull path. Not technically a cache, but the source of truth that the upper tiers fall back on. Sized to handle the long-tail miss rate, not the full 17.5K req/sec.
When a user logs in, the auth service emits an async "user is active" event. The Feed Generation Service consumes it and pre-warms the in-process LRU for that user before they navigate to the feed. Hides the cold-start latency.
What happens when a celebrity posts? Nothing is fanned out. On the next feed read, the Feed Generation Service runs a SELECT for Ronaldo's last 50 posts and merges them with the follower's cached feed. The post is delivered in the next read; the latency budget is unaffected because the celebrity-pull queries are tiny.

How fast does normal fan-out complete? Each follower cache takes one ZADD — typically all 200-500 follower caches are updated in 1-3 seconds. We monitor end-to-end p99 fan-out latency; if it creeps past 5 seconds we add workers. For active followers with the app open, the Notification Service also pushes a websocket message so they see "1 new post" instantly.

What happens on a cache miss? The Feed Generation Service runs ZREVRANGE user:42:feed against Redis. On miss (empty result), it falls back to: (1) query FeedItem DB for the last 200 posts from each entity user 42 follows; (2) merge, sort by timestamp, take top 500; (3) write the result back to Redis as the user's new feed cache; (4) hand to ranking and return. Latency: 3-5 seconds vs. 800ms warm. Acceptable because the miss rate is <5% (only dormant users).
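That cold-user fallback, sketched; followed_ids and recent_posts_by are hypothetical lookups against the follow graph and FeedItem DB:

```python
import redis

r = redis.Redis(host="feed-cache", port=6379)

def followed_ids(user_id: int) -> list[int]:            # hypothetical follow-graph read
    raise NotImplementedError

def recent_posts_by(author_id: int, limit: int) -> list[tuple[int, int]]:
    raise NotImplementedError                            # hypothetical Cassandra read: [(post_id, created_at), ...]

def rebuild_feed(user_id: int) -> list[int]:
    """On a cache miss, regenerate the user's feed from source and write it back."""
    key = f"user:{user_id}:feed"
    candidates: list[tuple[int, int]] = []
    for target_id in followed_ids(user_id):              # every followed user, page, and group
        candidates += recent_posts_by(target_id, limit=200)
    candidates.sort(key=lambda item: item[1], reverse=True)   # newest first
    top = candidates[:500]
    if top:
        r.zadd(key, {post_id: created_at for post_id, created_at in top})
    return [post_id for post_id, _ in top]
```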