From "one Redis box" to a sharded, replicated, consistent-hashed cluster — how Memcached and Redis Cluster deliver sub-millisecond reads at 100M QPS
This deep-dive follows the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 7:30pm on Cyber Monday. A new product page goes viral on TikTok and the link starts pinging Maya's e-commerce site at 1 million requests per second. Without a cache, every single page load runs the same Postgres query — SELECT * FROM products WHERE id = 42 — and a single Postgres box that comfortably serves 20K reads per second now drowns in fifty times its rated load. The DB CPU pegs at 100%, queries pile up in the wait queue, and within 90 seconds the entire shop returns 503s. A multi-million-dollar sales day, gone.
Now imagine the same scenario with a distributed cache sitting in front of Postgres. The first user who loads the page runs the DB query — 8 milliseconds — and the result gets stashed in the cache. Every other user for the next hour gets the cached copy in 0.4 milliseconds, served straight from RAM on a cache server. 99% of those 1M requests never touch Postgres at all. The DB sees its normal trickle, the page stays fast, and Maya keeps every dollar.
A distributed cache is, in plain English, a network-accessible, in-memory key-value store that sits between your application and your slower database. "Distributed" because no single box has enough RAM (or enough network bandwidth) to serve all the traffic — so we spread the data across many boxes and route each key to the right one. Memcached and Redis Cluster are the two most famous open-source examples; AWS ElastiCache and Google Memorystore are the managed equivalents.
Before drawing a single ring or arrow, pin down what the cache must do — and what it explicitly will not do. A cache is not a database; setting that boundary up front prevents half the design mistakes.
The core command surface:

- GET key — fetch a value by key, return null on miss
- SET key value [TTL] — store a value, optionally with time-to-live
- DELETE key — explicit invalidation
- MGET k1 k2 k3, MSET — amortize network round trips
- INCR key, CAS key version value — for counters and optimistic locks

Numbers drive every box on the diagram. Let's size for a realistic large-scale deployment — what you'd find at a top-100 web property.
Target: 100 million requests per second across the cluster, with 1 TB of total cached data, ~95% read / 5% write split (typical for read-through caches). Average value size: 1 KB.
| Figure | Meaning |
|---|---|
| 100M req/sec | Aggregate read+write across the cluster |
| ~200K req/sec | Per-node serving target, with headroom for hot shards |
| 64 GB | Per-node RAM (1 TB / 16 primaries) |
| 32 nodes | 16 primary + 16 replica |
200K req/sec × 1 KB/value = 200 MB/sec per node, or ~1.6 Gbps. That fits comfortably within a 10 Gbps NIC — but it's a real number that constrains how hot a single shard can get before we have to split it.
Pick the smallest cluster that meets every constraint with headroom. 8 primaries × 200K = 1.6M commands/sec — way under target. 32 primaries — overspending and harder to operate. 16 primaries × ~200K = 3.2M commands/sec sustained — and because the application pipelines and batches (a single MGET carries tens of keys), that command rate stretches to the 100M key-lookups/sec target during traffic spikes. Each primary gets a replica, so the operational cluster is 32 nodes total.
| Metric | Value | Why it matters |
|---|---|---|
| Total cached data | 1 TB | Forces sharding — no single box has 1 TB of RAM |
| Per-node RAM | 64 GB | Standard cache instance class (r6g.2xlarge, etc.) |
| Cluster QPS | 100M | Drives node count and replication degree |
| p99 latency target | under 1ms | Forces in-memory storage and same-AZ deployment |
| Replicas per primary | 1-2 | Survives node + AZ failure without data loss |
The Redis/Memcached protocol is delightfully boring, which is the point — clients connect over a single TCP socket, send text commands, and parse text responses. No HTTP overhead, no JSON, no auth handshake on every call. The protocol is so thin that a single CPU core can serve 200K+ commands/sec.
Cache command surface:

```
// Read — the bread and butter
GET user:42                        // → "{name:'Sarah',email:'...'}" or null
MGET user:42 user:43 user:44       // batch — one round trip for N keys

// Write — populate or update
SET user:42 "" EX 3600             // EX = expire in seconds
MSET user:42 "v1" user:43 "v2"     // batch write
SETNX lock:order:99 "owner-A"      // SET if Not eXists — atomic lock

// Delete / invalidate
DEL user:42                        // remove a key

// Atomic counters — the killer feature
INCR page:home:views               // atomic +1, returns new value
INCRBY rate:user:42 1              // rate-limit token-bucket math

// Compare-and-set (optimistic lock)
GETS cart:99                       // returns value + version (CAS token)
CAS cart:99 v=37 " "               // succeeds only if version still 37

// Pub/sub (Redis only — fan-out notifications)
PUBLISH channel:orders "order-99-paid"
SUBSCRIBE channel:orders
```
MGET matters more than people realize: rendering a single web page often needs 20-50 cache lookups (user, profile, recent orders, friend list, ...). Doing them one by one is 20 round trips × 0.5ms = 10ms wasted on network. MGET bundles them into one round trip — 0.6ms total. That's a 16× speedup at zero cost. Always batch.

CAS exists for the read-modify-write race: two app servers read cart:99, both add an item, both write back — last write wins, one item silently lost. CAS includes a version token in the read; the write only succeeds if the version still matches. Race resolved without a heavyweight lock.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The numbers from §3 drive every decision.
Sketch the simplest possible cache: one Redis box that the entire application connects to. App servers send GET/SET commands over TCP, the box answers from RAM, everyone goes home happy.
Four concrete failures emerge the moment real traffic shows up:
Every app server in the fleet hits the same box. A single node tops out at a couple hundred thousand requests/sec at 1 KB values once you account for TCP overhead and CPU limits. Need 100M? At 1 KB per value, that's ~800 Gbps of value traffic into one machine — far beyond any NIC you can buy.
Even high-end cache instance classes cap out at a few hundred GB of RAM. We need 1 TB — and even a monster box that held it all couldn't serve the traffic. There is no practical single machine for this dataset; we must shard.
The box reboots after a kernel panic. RAM clears. Suddenly every app server starts hitting the backing DB on every request. Postgres, sized for 5% of traffic, drowns under 100% — a "cache stampede" cascade outage that takes down the whole service.
Day 1 of a new deploy: cache is empty, every app request misses. Backing DB sees the full 100M req/sec for the first few minutes — long enough to crash it. Same problem hits a fresh node added to expand capacity.
The single most important set of insights in distributed cache design boils down to three ideas working together. Internalize these and the rest of the architecture writes itself.
Distribute keys across N nodes by hashing each key onto a circular ring. Each node owns an arc of the ring. The magic property: adding a node only relocates 1/N of keys — vs. nearly 100% with naive hash % N. This is what lets us grow the cluster without an outage.
Every primary node has a replica that mirrors its data via async replication. When a primary dies, the replica gets promoted in seconds. The shard's slice of the cache survives the failure — no cold-start, no DB stampede.
Either the client driver knows the ring (Redis Cluster's "smart client") and connects directly to the right shard, or a thin routing proxy (Twemproxy, Envoy) fronts the cluster. Either way, no extra hop — keys land on their owning shard in one network call.
The ring + replication + smart routing trio gives us horizontal scale, fault tolerance, and sub-millisecond latency simultaneously. Every other component on the production diagram is in service of one of these three ideas.
Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
The application — a web server, mobile API gateway, batch worker — that wants to read or write a cache entry. The "smart driver" (e.g., Jedis, lettuce, redis-py-cluster) holds an in-memory copy of the ring topology. When the app calls GET user:42, the driver hashes user:42, looks up which primary owns that arc of the ring, and opens a TCP connection straight to that node — no extra hop, no proxy needed.
Solves: the routing problem with zero added latency. Without smart drivers, every client would either need to know all nodes (and risk hitting the wrong one) or pay for a proxy hop. Smart drivers make the cluster look like a single logical cache to the application.
A thin abstraction inside the application that implements the cache-aside pattern: try cache first, on miss fetch from DB and populate the cache, return the value. On writes, update the DB then DEL the cache key (don't try to update the cache value directly — that's a race condition waiting to happen). This is application code, not infrastructure, but it's drawn in the architecture because the choice of pattern materially shapes the system.
Solves: the "cache is just a side store" mental model. Without an explicit pattern, developers do ad-hoc caching, forget invalidation, and ship stale-data bugs. Having one well-tested helper (getOrFetch(key, () -> db.load(key))) makes correctness the default.
An optional thin TCP proxy that fronts the cluster for clients that don't have smart drivers (legacy apps, scripting languages, multi-language polyglot fleets). The proxy reads the topology from ZooKeeper/etcd, accepts plain Redis or Memcached protocol, and forwards each command to the right shard. Adds ~0.1-0.3ms per call.
Solves: heterogeneous client environments. A 200-microservice fleet in 8 languages can't all upgrade to smart drivers simultaneously — the proxy lets them keep using a vanilla Redis client and still get sharded routing. Without it, you'd either run 200 driver upgrade projects or accept dumb routing.
A small, highly-available coordination service (ZooKeeper, etcd, Consul, or Redis Sentinel for Redis-specific deployments) that holds the canonical mapping from ring arc → primary node → replica node. Smart drivers and proxies refresh this every few seconds (or on a routing miss). When Sentinel promotes a replica during a failover, it updates the topology here, and clients pick up the change on their next refresh.
Solves: the "where does this key live right now?" problem in a cluster where membership is changing. Without it, a client whose driver was cached an hour ago would route to a dead node forever. Topology services are the cluster's nervous system.
The 16 primary nodes that actually hold the cached data. Each owns an arc of the consistent-hash ring — typically 1/16th of the keyspace, but adjusted with virtual nodes (~256 vnodes per physical node) for uniform distribution. Each primary holds ~64 GB of data and serves ~200K req/sec on commodity hardware. Single-threaded event loop (Redis) or multi-threaded with per-thread cache (Memcached) — both squeeze enormous throughput out of one CPU socket.
Solves: the throughput-and-RAM ceiling from Pass 1. 16 primaries collectively serve 100M req/sec and hold 1 TB. Without sharding, you'd need a single machine with impossible specs.
Each primary has 1-2 replicas in different availability zones. The replica subscribes to the primary's write stream — every SET, DEL, INCR is asynchronously shipped over and replayed on the replica. Lag is typically under 100ms. When a primary dies, the failover coordinator promotes the replica and the cluster keeps serving from RAM — no DB stampede.
Solves: the "one crash = total cache loss" failure mode. With replication, a primary death loses 0 - 100ms of in-flight writes (acceptable for a cache), and the replicated 64 GB of hot data stays warm. Cluster availability stays at 99.99%.
A per-node component that runs when memory pressure crosses a configured threshold (e.g., 95% of maxmemory). It picks victims based on the configured policy — LRU (default), LFU, TTL-first, or random — and removes them to make room for incoming writes. Implemented internally as a sampled probabilistic algorithm (Redis samples N keys and evicts the worst), not a strict global LRU, because true LRU would require a doubly-linked list of every key in RAM.
Solves: the bounded-memory problem. A cache with no eviction either OOMs or starts rejecting writes — both of which break the application. Eviction lets the cache always accept new entries while gracefully shedding cold ones.
Optional disk-backed storage for crash recovery (Redis supports it; Memcached does not). Two modes: RDB snapshots dump a point-in-time copy of all keys to disk every N minutes; AOF (append-only file) logs every write command so the cache can replay them on restart. Most production deployments run AOF + occasional RDB.
Solves: the cold-start problem after a planned restart or a region-wide power loss. Without persistence, a restarted node serves zero traffic for the first 10 minutes while organic reads slowly repopulate it — and the backing DB pays for that warm-up. With AOF, the node restarts already-warm.
A small cluster of 3-5 monitor processes that ping every primary every second. When a quorum of Sentinels agrees a primary is down, they pick the most up-to-date replica, promote it (it becomes the new primary), update the topology service, and notify clients to refresh. Failover typically completes in 5-30 seconds.
Solves: automated failover. Without it, a dead primary requires a human to manually promote a replica and update DNS — minutes-to-hours of partial outage. Sentinel turns that into seconds, fully automated.
Per-node telemetry pushed to Prometheus / Datadog / CloudWatch. The signals that matter: hit rate (cached / total reads — under 90% is a smell), eviction rate (high evictions = under-provisioned RAM), memory pressure (close to maxmemory = imminent eviction storm), p99 latency (sudden spike = hot key or slow command), replication lag (high lag = imminent data loss on failover).
Solves: visibility. Without metrics, you find out the cache is broken when the application starts timing out — minutes after it actually stopped working. With them, alerts fire before users notice.
The source of truth — Postgres, Cassandra, MySQL, whatever the application's primary store happens to be. The cache fronts it; on cache miss the app falls back here. Sized for the cold-tail traffic that misses the cache (typically 5-10% of total reads) plus the full write rate, not the full read rate. This is the entire reason caches exist: they let you size the DB for "miss traffic" instead of "all traffic", which is often a 10-20× cost savings.
Solves: nothing on its own — but the cache exists to protect it. Without the DB, the cache has no source of truth; without the cache, the DB has to be 10× bigger.
Two real flows mapped to the numbered components above:
cache.get("product:42").product:42 → ring position 142° → owned by Primary 2 ④ (arc 90-180°).GET product:42. Primary 2 looks up the key in its hash table — HIT — returns the value in 0.4 ms.SET product:42 ... EX 3600, then return. Next read for product 42 in the next hour is a hit.SLAVEOF NO ONE — Replica 2 is now a primary.arc 90-180° → Replica-2) to the Topology Service ③.GET product:42 to Replica 2.Of every idea in distributed systems, consistent hashing is the one most worth understanding deeply — it powers Cassandra, DynamoDB, Memcached, Redis Cluster, and CDN routing. The whole idea is one trick that solves one problem.
The problem with hash % N

Suppose you have 4 cache nodes and use shard = hash(key) % 4. user:42 hashes to 7 → shard 3. Easy. Now add a fifth node and switch to % 5. The same key now maps to 7 % 5 → shard 2. The key just moved to a different node. Worse — almost every key now lives on a different node. You've invalidated nearly your entire cache, the new node has to be populated from scratch, the backing DB takes the full traffic for hours. Adding capacity has caused an outage.
Imagine a circular number line from 0 to 2³² (the output range of a 32-bit hash). Now:
1. Hash each node's ID onto the ring — each node lands at some position (Node B at 800M, say).
2. Hash each key onto the same ring.
3. A key belongs to the first node you reach walking clockwise from the key's position.
Add Node D at position 2.0B. Only the keys whose ring positions fall between Node B (800M) and Node D (2.0B) change ownership — they move from C to D. Every other key stays put. With N evenly-distributed nodes, adding a node moves 1/N of keys — vs. ~100% with mod-N hashing. At 16 nodes, that's 6.25% rebalanced instead of 100%. The cache stays warm.
Plain consistent hashing has a problem: if you only have 3 nodes, their random ring positions might leave one node with 60% of the arc and another with 10% — load imbalance. The fix: each physical node is hashed onto the ring at ~256 different positions ("virtual nodes" or "vnodes"). With thousands of points scattered around the ring, the law of large numbers makes the per-node arc length nearly uniform. Each physical node ends up owning ~1/N of the total arc, just split into 256 small pieces instead of one big one.
Mod-N hashing — hash(key) % N: simple, fast. But changing N means rehoming nearly every key. A 4→5 node migration moves ~80% of the cache. Useless in a growing cluster.
Consistent hashing: slightly more complex (you maintain a sorted ring data structure). Wins: a 4→5 node migration moves ~20% of keys (1/N of the new cluster size). Distribution is uniform thanks to vnodes. Production-grade.
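Here's a minimal Java sketch of the ring with virtual nodes. CRC32 is a stand-in hash (real deployments use something like murmur3 or xxhash), and the TreeMap is the "sorted ring data structure" mentioned above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class HashRing {
    private static final int VNODES = 256;                  // virtual nodes per physical node
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position → physical node

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
    }

    // Owner = first vnode clockwise from the key's position (wrap around to the start).
    String ownerOf(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();                            // stand-in 32-bit hash
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();                              // 0 .. 2^32 - 1, matching the ring
    }
}
```

addNode inserts 256 points, so a new node steals 256 tiny arcs scattered evenly around the ring — which is exactly why only ~1/N of keys move.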
Replication is the answer to "what happens when a node dies?" but the choice of sync vs async is one of the deepest trade-offs in distributed systems — latency vs durability, picked once and lived with for years.
The primary writes to its own RAM, returns success to the client immediately, and ships the change to the replica in the background. Write latency: ~0.2 ms (the local write only). Risk: if the primary crashes after acking but before replicating, the last few milliseconds of writes are lost.
When this is right: caches. Losing a few writes during a node crash is acceptable — the application will repopulate from the DB on the next read. The latency win (every write is fast) is worth far more than the durability cost.
The primary writes to its own RAM AND ships the change to the replica, waits for the replica's ack, then returns success to the client. Write latency: ~1-2 ms (one extra network round trip). Risk: none for data loss; but a slow/dead replica makes every write slow.
When this is right: when the cache holds something you can't easily reconstruct — session tokens, rate-limit counters, distributed locks. The strict consistency guarantee is worth the latency penalty.
Most production caches default to async because the typical use case (database query results, rendered HTML fragments, computed values) is fully reconstructable from the source of truth. The 1-100ms of writes potentially lost on a crash are recovered by the next cache miss naturally.
The clever middle ground used by Redis: WAIT N timeout_ms — a write is async by default, but the application can opt-in to "wait for N replicas to ack before continuing" for specific critical writes (e.g., committing a payment session). Best of both worlds — fast by default, durable when it matters.
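In raw protocol terms the opt-in looks like this — WAIT is a real Redis command; the key name is illustrative:

```
SET session:pay:99 "token-abc" EX 900   // normal async-replicated write, fast ack
WAIT 1 50                               // block until ≥1 replica has acked all prior
                                        // writes, or 50 ms pass; returns the number
                                        // of replicas that acked
```

If WAIT returns 0, the application knows the write hasn't reached a replica yet and can retry or flag that session as at-risk.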
The cache holds 64 GB. The application is writing 200K SETs/sec. Eventually 64 GB gets full. What do we drop to make room? This is the eviction problem, and the answer matters because the wrong policy can tank your hit rate by 30%.
Drop the entry that has gone the longest without being read. Implemented as a hash map plus a doubly-linked list — every access bumps the entry to the head of the list; eviction always pops the tail. O(1) for both operations.
Wins for: "things you used recently are likely to be used again soon" workloads — most web caches, session stores, rendered-page caches. Default for Memcached and Redis.
Trap: a one-time scan that touches a million cold keys (e.g., a backup job) bumps every cold key to "recently used", evicting genuinely-hot entries. Workaround: have batch jobs issue the CLIENT NO-TOUCH command (Redis 7.2+) so their reads don't update access times.
Drop the entry with the fewest total reads (over a recent window). Tracks per-key access counter that decays over time.
Wins for: highly-skewed access patterns where a small set of keys is consistently hot — celebrity profiles, top-100 product pages, viral video metadata. Outperforms LRU by 5-15% hit rate when the long tail is dominated by a small head.
Cost: more bookkeeping per key (counter + decay). Redis 4+ supports it as maxmemory-policy allkeys-lfu.
If multiple keys are eligible, prefer the one with the soonest expiration time. Often combined with LRU as a tie-breaker (volatile-lru: only evict keys that have a TTL set).
Wins for: caches that mix permanent reference data and short-TTL session data — evict the sessions first, leave the permanent stuff alone.
Pick a random victim. Zero bookkeeping. Sounds dumb, but at high hit rates (over 95%) the difference between Random and LRU is tiny — most keys are hot, so any one you pick to evict was probably cold. Used by Redis as allkeys-random for low-overhead deployments.
Wins for: low-overhead workloads where the hit rate is dominated by working-set size, not by access recency.
True LRU requires a doubly-linked list of every key, which costs ~32 bytes of overhead per key. With 64 GB of values, that overhead can be gigabytes. So Redis cheats: when eviction is needed, it samples 5 random keys, evicts the one with the oldest access time, and moves on. Not a true global LRU, but statistically close — and it costs zero per-key overhead. The maxmemory-samples tunable controls the sample size.
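A toy model of the sampled approach, assuming a plain in-process map — real Redis samples its internal dict in place and keeps a compact per-entry access clock rather than full timestamps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class SampledLru {
    static final class Entry { byte[] value; long lastAccessMs; }

    private final HashMap<String, Entry> map = new HashMap<>();

    void touch(String key) {                         // called on every GET
        Entry e = map.get(key);
        if (e != null) e.lastAccessMs = System.currentTimeMillis();
    }

    // Called when memory pressure crosses the threshold: sample K keys,
    // evict the coldest of the sample (Redis: maxmemory-samples, default 5).
    void evictOne(int sampleSize) {
        List<String> keys = new ArrayList<>(map.keySet()); // toy-only: O(n) copy
        if (keys.isEmpty()) return;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        String victim = null;
        long oldest = Long.MAX_VALUE;
        for (int i = 0; i < sampleSize; i++) {
            String k = keys.get(rnd.nextInt(keys.size()));
            long t = map.get(k).lastAccessMs;
            if (t < oldest) { oldest = t; victim = k; }
        }
        map.remove(victim);                           // statistically close to true LRU
    }
}
```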
Practical guidance: start with allkeys-lru and watch your hit rate. If you have a clear power-law access pattern (some keys always hot), switch to allkeys-lfu — typically a 5-10% hit-rate gain. Don't agonize over the choice initially; eviction policy matters less than having enough RAM in the first place.

"There are only two hard things in computer science: cache invalidation and naming things." The reason it's hard: when the database changes, the cached copy must somehow learn about it — and there's no universally right way to do that. Three patterns dominate.
Every write goes to the cache AND the DB synchronously, in the same code path. App returns success only after both succeed.
Pros: cache is always consistent with DB. No stale reads ever.
Cons: writes are slower (two systems instead of one). Both must be available — cache outage blocks all writes.
Use when: read-after-write consistency is mandatory (e.g., e-commerce inventory, account balances).
Write goes to the DB only. Cache is populated lazily on the next read (cache miss → fetch from DB → store).
Pros: writes are fast (one system). No "writing data nobody will ever read" wasted RAM.
Cons: the first read after a write is a miss. And if an invalidation is ever missed, a stale cached entry can keep being served until its TTL expires.
Use when: write-heavy workloads where writes are rarely re-read soon (logs, infrequent profile updates). Most common pattern in practice.
Write goes to the cache only. A background process flushes dirty entries to the DB asynchronously.
Pros: fastest possible writes — RAM speed.
Cons: if the cache crashes before flushing, those writes are lost forever. Most caches don't support this safely.
Use when: you can tolerate data loss (analytics counters, click counts) and absolutely need RAM-speed writes.
The pragmatic default: cache-aside with DEL on writes. Combining write-around with TTL covers 95% of cases.

The single most common cache integration in production code. Every senior engineer should have this written into muscle memory. Here it is in 12 lines.
Cache-aside — read path

```java
User getUser(long userId) {
    String key = "user:" + userId;
    // 1. Try cache first
    User cached = cache.get(key);
    if (cached != null) return cached;  // HIT — done in 0.4ms
    // 2. MISS — load from DB
    User user = db.findById(userId);    // ~8ms
    if (user == null) return null;
    // 3. Populate cache for next time, with TTL
    cache.set(key, user, 3600);         // 1 hour TTL
    return user;
}
```

Cache-aside — write path (the trick)
```java
void updateUser(User user) {
    // 1. Write to DB first (source of truth)
    db.save(user);
    // 2. INVALIDATE the cache, do NOT update it
    cache.delete("user:" + user.id);    // next read will repopulate
}
```
Why DEL and not SET on writes? This is the subtle part that catches most engineers. Updating the cache directly seems faster — "write to DB and cache in parallel". But consider this race:
{name: "Sarah"}.{name: "Sarah Smith"}, then sets cache.{name: "Sarah"} — stomping on the newer write.Using DEL instead of SET on writes eliminates this entire class of bug — there's no value to stomp on, just a hole the next reader will fill from the DB (which has the truth).
It's 9 PM. A celebrity tweets a link to your top-trending video. The key video:viral_42 goes from 100 req/sec to 100,000 req/sec in under a minute. That single key's owner — one shard out of 16 — is now saturating its 10 Gbps NIC, while the other 15 shards sit at 5% utilization. Welcome to the hot key problem — the most common production headache in cached systems.
Sharding distributes load across shards, but a single key has only one home. No matter how many nodes you add, all 100K req/sec land on the one shard that owns video:viral_42. Adding more nodes literally cannot help.
The app server itself caches ultra-hot keys in its local process memory for 1-5 seconds. With 1000 app servers each holding the value locally, the distributed cache sees almost zero traffic for that key — 1 fetch per app server every 5 seconds, instead of 100K fetches/sec aggregate. Caffeine, Guava Cache, or a small HashMap with TTL.
Trade-off: 1-5 seconds of staleness across the fleet. Acceptable for views, dangerous for prices.
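A sketch of that local layer using Caffeine — RemoteCache here is a hypothetical handle to the distributed cache:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

class HotKeyShield {
    interface RemoteCache { byte[] get(String key); }   // assumed distributed-cache client

    private final LoadingCache<String, byte[]> local;

    HotKeyShield(RemoteCache remote) {
        this.local = Caffeine.newBuilder()
                .maximumSize(10_000)                      // bound heap usage per app server
                .expireAfterWrite(Duration.ofSeconds(5))  // accept ≤5s of staleness
                .build(remote::get);                      // local miss → one fetch from the shard
    }

    byte[] get(String key) {
        // Hot keys are served from process RAM; the owning shard sees
        // roughly one request per app server per 5 seconds.
        return local.get(key);
    }
}
```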
Configure the smart driver to spread reads across the primary and all its replicas (e.g., READONLY mode in Redis Cluster). With 1 primary + 2 replicas, hot-key load divides by 3. Helps but doesn't scale infinitely.
Trade-off: replicas are slightly stale (async replication), so reads can return data that's a fraction of a second old.
Store the hot value under N suffixed keys: video:viral_42:v1 through video:viral_42:v10. On read, pick a random suffix → load divides by 10 across 10 different shards. On write, update all 10 copies (acceptable cost for a single hot record).
Trade-off: 10× write amplification for hot keys, plus invalidation must touch all replicas.
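A sketch of the split, with a hypothetical Cache interface standing in for the real driver:

```java
import java.util.concurrent.ThreadLocalRandom;

class KeySplitter {
    static final int COPIES = 10;

    interface Cache { byte[] get(String k); void set(String k, byte[] v, int ttlSec); }

    // Read one random copy — each suffix hashes to a different arc of the ring,
    // so load divides by COPIES across physical shards.
    static byte[] readHot(Cache cache, String baseKey) {
        int i = ThreadLocalRandom.current().nextInt(COPIES) + 1;
        return cache.get(baseKey + ":v" + i);
    }

    // Write all copies — 10× write amplification, acceptable for one hot record.
    static void writeHot(Cache cache, String baseKey, byte[] value, int ttlSec) {
        for (int i = 1; i <= COPIES; i++)
            cache.set(baseKey + ":v" + i, value, ttlSec);
    }
}
```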
The cache emits per-key access samples (Redis --hotkeys, Memcached stats slabs, or eBPF probes on the network layer). A monitoring service watches for any key whose req/sec exceeds, say, 5× the average. Some advanced systems automatically apply solution 3 (auto-shard the key) when a threshold is crossed — Twitter's old "Cache+Cassandra" stack famously did this.
It's 3:00:00 AM. The cached entry for homepage:html has a 1-hour TTL and just expired. In the same millisecond, 1,000 concurrent web requests all try to load the homepage. They all hit the cache → all miss → all hammer the database with the same query simultaneously. The DB, sized for normal cache-fronted load, gets clobbered by 1,000× normal traffic. p99 spikes from 50ms to 30 seconds. The DB CPU pegs and replication lag explodes.
This is the cache stampede (also called "dogpile" or "thundering herd"), and it's the most dangerous failure mode in cache-fronted systems because it always strikes at the worst possible time — when something hot just expired.
When 1,000 requests miss the same key simultaneously, only the first one fetches from the DB. The other 999 wait on a shared CompletableFuture that the first request will complete. When the first request finishes, all 1,000 get the same answer. DB load: 1, not 1000.
Implemented in-process with a ConcurrentHashMap<Key, Future>; some libraries (Caffeine, Guava) have it built in.
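A minimal in-process coalescer, assuming the DB fetch is expressed as a Supplier:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class CoalescingLoader<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    // All concurrent misses for the same key share one DB call.
    V load(String key, Supplier<V> dbFetch) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) return existing.join();   // someone else is fetching — wait
        try {
            V value = dbFetch.get();                    // only one caller reaches the DB
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);              // waiters see the same failure
            throw e;
        } finally {
            inFlight.remove(key);                       // next expiry starts a fresh fetch
        }
    }
}
```

DB load for N simultaneous misses: 1, not N — exactly the property described above.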
Don't let the entry actually expire. Each read computes a probability of refresh based on remaining TTL: p = exp(-TTL_remaining / mean_lifetime). Most reads do nothing. A random few reads, becoming more likely as expiry approaches, asynchronously refresh the entry. The cache is statistically never empty — no stampede.
This is the XFetch algorithm; brilliant once you see it.
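The check itself is a few lines. This sketch uses the formula above; the async reload it triggers is left as a hypothetical asyncRefresh step:

```java
import java.util.concurrent.ThreadLocalRandom;

class EarlyRefresh {
    // ttlRemaining large → p ≈ 0 (do nothing);
    // ttlRemaining → 0   → p → 1 (almost certainly refresh).
    static boolean shouldRefresh(double ttlRemainingSec, double meanLifetimeSec) {
        double p = Math.exp(-ttlRemainingSec / meanLifetimeSec);
        return ThreadLocalRandom.current().nextDouble() < p;
    }
}
```

On each read the app calls shouldRefresh(ttlRemaining, meanLifetime) and, if true, kicks off an asynchronous reload — so the entry is statistically refreshed before it ever expires and the herd never sees a hole.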
Serve the stale entry to readers while one async worker refreshes it in the background. Readers never wait for the DB; they get an old (but recent) value while the refresh happens out-of-band. Pattern made famous by HTTP Cache-Control: stale-while-revalidate and adopted by most modern caches.
Requires storing a "soft TTL" (refresh after) and a "hard TTL" (expire after) for each entry.
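A sketch of the soft/hard TTL idea, with an in-process map standing in for the cache and a stubbed DB loader; in the real system the hard TTL is enforced by the cache server and only the soft TTL travels with the entry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SwrCache {
    static final class Entry {
        final Object value; final long softExpiryMs;
        Entry(Object v, long t) { value = v; softExpiryMs = t; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final ExecutorService refreshPool = Executors.newFixedThreadPool(4);

    Object read(String key) {
        Entry e = cache.get(key);
        if (e == null) return refresh(key);              // true miss: load synchronously
        if (System.currentTimeMillis() > e.softExpiryMs)
            refreshPool.submit(() -> refresh(key));      // stale: refresh out-of-band
        return e.value;                                  // readers never wait on the DB
    }

    private Object refresh(String key) {
        Object fresh = loadFromDb(key);                  // hypothetical DB loader
        cache.put(key, new Entry(fresh, System.currentTimeMillis() + 60_000));
        return fresh;
    }

    private Object loadFromDb(String key) { return "fresh:" + key; } // stub
}
```

A production version would also coalesce duplicate refreshes (Solution 1) so only one worker per key actually hits the DB.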
A simple version of request coalescing: when you miss, try to SETNX recompute_lock:key. If you got the lock, recompute and write the entry. If you didn't get the lock, sleep 50ms and retry the cache. Works well for moderate concurrency; falls over above ~10K concurrent waiters per key (use Solution 1 instead).
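A Jedis-flavored sketch of that loop — exact driver signatures vary by version, and loadFromDb is a stub:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

class StampedeGuard {
    // Only the lock winner recomputes; everyone else polls the cache.
    String readWithLock(Jedis jedis, String key) throws InterruptedException {
        while (true) {
            String v = jedis.get(key);
            if (v != null) return v;                         // hit (or freshly recomputed)
            String won = jedis.set("recompute_lock:" + key, "me",
                    SetParams.setParams().nx().ex(30));      // atomic SETNX + 30s TTL
            if ("OK".equals(won)) {
                String fresh = loadFromDb(key);              // the lone DB call
                jedis.setex(key, 3600, fresh);
                jedis.del("recompute_lock:" + key);
                return fresh;
            }
            Thread.sleep(50);                                // lost the race — retry the cache
        }
    }

    private String loadFromDb(String key) { return "..."; }  // hypothetical loader stub
}
```

The 30-second TTL on the lock matters: if the recomputing process crashes, the lock self-expires instead of wedging the key forever.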
Most caches are pure RAM. Some optionally persist to disk so a node restart doesn't mean cold-start. The two patterns and the do-nothing case:
Every N minutes (configurable), the cache forks and writes its full RAM contents to a compact binary file on disk. On restart, load the file into RAM in seconds and resume.
Pros: compact files, fast warm-up. Cons: all writes between snapshots are lost on crash. Forking briefly doubles memory usage during the snapshot.
Every write command (SET, DEL, INCR) is appended to a log file as it happens. On restart, replay the log to reconstruct state.
Pros: minimal data loss (down to last command with fsync always). Cons: file grows large, replay is slow, fsync hurts write throughput.
Cache is pure RAM. Restart = empty cache. Application falls back to DB until the cache repopulates organically over a few minutes.
Pros: simpler, faster, no disk I/O. Cons: cold-start hammers the DB; planned reboots cause brief latency spikes.
Run AOF + occasional RDB. AOF provides safety (~1s data loss with fsync everysec), RDB provides fast restart (load the snapshot, then replay only the AOF entries since the snapshot). When AOF gets too big, the cache rewrites it from the current state — keeping the file size bounded. Best of both worlds at modest disk I/O cost.
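In Redis terms, the hybrid setup is a handful of standard redis.conf directives (values here are illustrative):

```
appendonly yes                      # AOF on
appendfsync everysec                # fsync once per second → ≤1s loss window
save 3600 1                         # RDB snapshot if ≥1 change in the last hour
aof-use-rdb-preamble yes            # fast restart: load snapshot, replay AOF tail
auto-aof-rewrite-percentage 100     # rewrite the AOF when it doubles in size
auto-aof-rewrite-min-size 64mb      # ...but not before it reaches 64 MB
```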
The cluster grew. Or it shrank. Or a node died and we replaced it. How does the cluster handle membership changes without dropping requests?
The new node joins the ring at position X, claiming an arc that was previously owned by its clockwise neighbor. The migration: the neighbor streams the arc's keys to the new node while continuing to serve traffic; once the copy has caught up, the topology service flips ownership of the arc and clients start routing there; the old owner then drops the migrated keys. Reads that race the handoff simply miss and fall back to the DB — a brief, bounded cost.
The reverse: drain the leaving node by copying its keys to its clockwise neighbor first. Once empty, update topology to skip the node. Clients never see a missing-shard error.
If a primary dies suddenly (no graceful drain), Sentinel promotes its replica. Topology updates. Clients refresh. The keys on the dead primary that hadn't yet replicated are lost (acceptable for cache use cases). Replication factor stays at 1 until a new replica is provisioned and caught up.
Users in Tokyo shouldn't pay 200ms transpacific latency to hit a US-East cache. The fix is regional clusters with cross-region replication — and the choice of replication topology is a classic CAP-theorem trade-off.
Every region accepts both reads and writes. Writes replicate to all other regions asynchronously. Eventually consistent — a write in US might take 200ms to show up in Tokyo.
Pros: low-latency writes everywhere; no single-region failure mode. Cons: conflict resolution gets ugly (two regions write to the same key) — needs CRDTs or last-write-wins. Use when: regional independence matters more than global consistency (social feeds, content metadata).
One region accepts writes; others are read-only replicas. Writes flow from the primary region to all secondaries asynchronously. A regional failover process promotes a secondary on disaster.
Pros: no conflicts, simple consistency model, strict global ordering. Cons: writes from far-away regions pay the round-trip; primary region failure needs orchestrated failover. Use when: consistency matters (financial caches, account state).
Why consistent hashing instead of hash % N? With hash % N, going from 4 nodes to 5 changes nearly every key's home — invalidating ~80% of the cache and forcing a stampede on the backing DB. Consistent hashing maps keys to a ring; adding a node only steals an arc from one neighbor. At 16 nodes, that's 6.25% rebalanced vs 100%. Combined with virtual nodes for uniform distribution, this is the difference between scaling being a routine deploy and a midnight outage.

Cache-aside or write-through? Default to cache-aside with DEL-not-SET. Write-through (write to both atomically) is for cases where readers must never see a stale value — financial balances, inventory counts. The cost: every write is slower (two systems), and a cache outage blocks all writes. Most production code defaults to cache-aside + short TTL because it's resilient to half-broken cache infrastructure; write-through is reserved for the few keys where staleness is unacceptable.

How do you survive a hot key? Three escalating defenses: a short-TTL local cache in each app server, replica reads, and key splitting (key:v1...key:v10), random suffix on read; load divides by 10 across multiple physical shards. Detection: monitor per-key QPS, alert at 5× average. Often combine all three for true-viral content.

How do you stop a cache stampede? (1) Request coalescing — concurrent misses for the same key share one Future and get the same result. DB load: 1, not N. (2) Probabilistic early refresh — random reads near expiry asynchronously refresh the entry, so it never actually empties. (3) Stale-while-revalidate — serve the old value while one async worker refreshes. Implementations like Caffeine and Redis Lock-and-Refresh have these built in. Pick one — production traffic will find unprotected hot keys eventually.

Sync or async replication? Async by default — cached data is reconstructable from the source of truth, so a small loss window on a crash is acceptable. Redis's WAIT N timeout opts a single critical write into sync replication while keeping everything else fast. Always default to async; explicitly opt-in to sync when you can't afford the data loss window.