From "one Redis box" to a sharded, replicated, consistent-hashed cluster — how Memcached and Redis Cluster deliver sub-millisecond reads at 100M QPS
This deep-dive follows the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 7:30pm on Cyber Monday. A new product page goes viral on TikTok and the link starts pinging Maya's e-commerce site at 1 million requests per second. Without a cache, every single page load runs the same Postgres query — SELECT * FROM products WHERE id = 42 — and a single Postgres box that comfortably serves 20K reads per second now drowns in fifty times its rated load. The DB CPU pegs at 100%, queries pile up in the wait queue, and within 90 seconds the entire shop returns 503s. A multi-million-dollar sales day, gone.
Now imagine the same scenario with a distributed cache sitting in front of Postgres. The first user who loads the page runs the DB query — 8 milliseconds — and the result gets stashed in the cache. Every other user for the next hour gets the cached copy in 0.4 milliseconds, served straight from RAM on a cache server. 99% of those 1M requests never touch Postgres at all. The DB sees its normal trickle, the page stays fast, and Maya keeps every dollar.
A distributed cache is, in plain English, a network-accessible, in-memory key-value store that sits between your application and your slower database. "Distributed" because no single box has enough RAM (or enough network bandwidth) to serve all the traffic — so we spread the data across many boxes and route each key to the right one. Memcached and Redis Cluster are the two most famous open-source examples; AWS ElastiCache and Google Memorystore are the managed equivalents.
Before drawing a single ring or arrow, pin down what the cache must do — and what it explicitly will not do. A cache is not a database; setting that boundary up front prevents half the design mistakes.
The core command surface:

- GET key — fetch a value by key, return null on miss
- SET key value [TTL] — store a value, optionally with time-to-live
- DELETE key — explicit invalidation
- MGET k1 k2 k3, MSET — amortize network round trips
- INCR key, CAS key version value — for counters and optimistic locks

Numbers drive every box on the diagram. Let's size for a realistic large-scale deployment — what you'd find at a top-100 web property.
Target: 100 million requests per second across the cluster, with 1 TB of total cached data, ~95% read / 5% write split (typical for read-through caches). Average value size: 1 KB.
| Figure | Meaning |
|---|---|
| 100M req/sec | Aggregate read+write across the cluster |
| ~200K req/sec | Per-node serving target, with headroom for hot shards |
| 64 GB | Per-node RAM (1 TB / 16 primaries) |
| 32 nodes | 16 primary + 16 replica |
200K req/sec × 1 KB/value = 200 MB/sec per node, or ~1.6 Gbps. That fits comfortably within a 10 Gbps NIC — but it's a real number that constrains how hot a single shard can get before we have to split it.
Pick the smallest cluster that meets every constraint with headroom. 8 primaries × 200K = 1.6M commands/sec — way under target. 32 primaries — overspending and harder to operate. 16 primaries × ~200K = 3.2M commands/sec sustained — and because the application pipelines and batches (a single MGET carries tens of keys), that command rate stretches to the 100M key-lookups/sec target during traffic spikes. Each primary gets a replica, so the operational cluster is 32 nodes total.
| Metric | Value | Why it matters |
|---|---|---|
| Total cached data | 1 TB | Forces sharding — no single box has 1 TB of RAM |
| Per-node RAM | 64 GB | Standard cache instance class (r6g.2xlarge, etc.) |
| Cluster QPS | 100M | Drives node count and replication degree |
| p99 latency target | under 1ms | Forces in-memory storage and same-AZ deployment |
| Replicas per primary | 1-2 | Survives node + AZ failure without data loss |
The Redis/Memcached protocol is delightfully boring, which is the point — clients connect over a single TCP socket, send text commands, and parse text responses. No HTTP overhead, no JSON, no auth handshake on every call. The protocol is so thin that a single CPU core can serve 200K+ commands/sec.
Cache command surface:

```
// Read — the bread and butter
GET user:42                        // → "{name:'Sarah',email:'...'}" or null
MGET user:42 user:43 user:44       // batch — one round trip for N keys

// Write — populate or update
SET user:42 "" EX 3600             // EX = expire in seconds
MSET user:42 "v1" user:43 "v2"     // batch write
SETNX lock:order:99 "owner-A"      // SET if Not eXists — atomic lock

// Delete / invalidate
DEL user:42                        // remove a key

// Atomic counters — the killer feature
INCR page:home:views               // atomic +1, returns new value
INCRBY rate:user:42 1              // rate-limit token-bucket math

// Compare-and-set (optimistic lock)
GETS cart:99                       // returns value + version (CAS token)
CAS cart:99 v=37 " "               // succeeds only if version still 37

// Pub/sub (Redis only — fan-out notifications)
PUBLISH channel:orders "order-99-paid"
SUBSCRIBE channel:orders
```
MGET matters more than people realize: rendering a single web page often needs 20-50 cache lookups (user, profile, recent orders, friend list, ...). Doing them one by one is 20 round trips × 0.5ms = 10ms wasted on network. MGET bundles them into one round trip — 0.6ms total. That's a 16× speedup at zero cost. Always batch.

CAS exists for the read-modify-write race: two app servers read cart:99, both add an item, both write back — last write wins, one item silently lost. CAS includes a version token in the read; the write only succeeds if the version still matches. Race resolved without a heavyweight lock.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The numbers from §3 drive every decision.
Sketch the simplest possible cache: one Redis box that the entire application connects to. App servers send GET/SET commands over TCP, the box answers from RAM, everyone goes home happy.
Four concrete failures emerge the moment real traffic shows up:
Every app server in the fleet hits the same box. A single node tops out at a couple hundred thousand requests/sec at 1 KB values once you account for TCP overhead and CPU limits. Need 100M? At 1 KB per value, that's ~800 Gbps of value traffic into one machine — far beyond any NIC you can buy.
Even high-end cache instance classes cap out at a few hundred GB of RAM. We need 1 TB — and even a monster box that held it all couldn't serve the traffic. There is no practical single machine for this dataset; we must shard.
The box reboots after a kernel panic. RAM clears. Suddenly every app server starts hitting the backing DB on every request. Postgres, sized for 5% of traffic, drowns under 100% — a "cache stampede" cascade outage that takes down the whole service.
Day 1 of a new deploy: cache is empty, every app request misses. Backing DB sees the full 100M req/sec for the first few minutes — long enough to crash it. Same problem hits a fresh node added to expand capacity.
The single most important set of insights in distributed cache design boils down to three ideas working together. Internalize these and the rest of the architecture writes itself.
Distribute keys across N nodes by hashing each key onto a circular ring. Each node owns an arc of the ring. The magic property: adding a node only relocates 1/N of keys — vs. nearly 100% with naive hash % N. This is what lets us grow the cluster without an outage.
Every primary node has a replica that mirrors its data via async replication. When a primary dies, the replica gets promoted in seconds. The shard's slice of the cache survives the failure — no cold-start, no DB stampede.
Either the client driver knows the ring (Redis Cluster's "smart client") and connects directly to the right shard, or a thin routing proxy (Twemproxy, Envoy) fronts the cluster. Either way, no extra hop — keys land on their owning shard in one network call.
The ring + replication + smart routing trio gives us horizontal scale, fault tolerance, and sub-millisecond latency simultaneously. Every other component on the production diagram is in service of one of these three ideas.
Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
The application — a web server, mobile API gateway, batch worker — that wants to read or write a cache entry. The "smart driver" (e.g., Jedis, lettuce, redis-py-cluster) holds an in-memory copy of the ring topology. When the app calls GET user:42, the driver hashes user:42, looks up which primary owns that arc of the ring, and opens a TCP connection straight to that node — no extra hop, no proxy needed.
Solves: the routing problem with zero added latency. Without smart drivers, every client would either need to know all nodes (and risk hitting the wrong one) or pay for a proxy hop. Smart drivers make the cluster look like a single logical cache to the application.
A thin abstraction inside the application that implements the cache-aside pattern: try cache first, on miss fetch from DB and populate the cache, return the value. On writes, update the DB then DEL the cache key (don't try to update the cache value directly — that's a race condition waiting to happen). This is application code, not infrastructure, but it's drawn in the architecture because the choice of pattern materially shapes the system.
Solves: the "cache is just a side store" mental model. Without an explicit pattern, developers do ad-hoc caching, forget invalidation, and ship stale-data bugs. Having one well-tested helper (getOrFetch(key, () -> db.load(key))) makes correctness the default.
An optional thin TCP proxy that fronts the cluster for clients that don't have smart drivers (legacy apps, scripting languages, multi-language polyglot fleets). The proxy reads the topology from ZooKeeper/etcd, accepts plain Redis or Memcached protocol, and forwards each command to the right shard. Adds ~0.1-0.3ms per call.
Solves: heterogeneous client environments. A 200-microservice fleet in 8 languages can't all upgrade to smart drivers simultaneously — the proxy lets them keep using a vanilla Redis client and still get sharded routing. Without it, you'd either run 200 driver upgrade projects or accept dumb routing.
A small, highly-available coordination service (ZooKeeper, etcd, Consul, or Redis Sentinel for Redis-specific deployments) that holds the canonical mapping from ring arc → primary node → replica node. Smart drivers and proxies refresh this every few seconds (or on a routing miss). When Sentinel promotes a replica during a failover, it updates the topology here, and clients pick up the change on their next refresh.
Solves: the "where does this key live right now?" problem in a cluster where membership is changing. Without it, a client whose driver was cached an hour ago would route to a dead node forever. Topology services are the cluster's nervous system.
The 16 primary nodes that actually hold the cached data. Each owns an arc of the consistent-hash ring — typically 1/16th of the keyspace, but adjusted with virtual nodes (~256 vnodes per physical node) for uniform distribution. Each primary holds ~64 GB of data and serves ~200K req/sec on commodity hardware. Single-threaded event loop (Redis) or multi-threaded with per-thread cache (Memcached) — both squeeze enormous throughput out of one CPU socket.
Solves: the throughput-and-RAM ceiling from Pass 1. 16 primaries collectively serve 100M req/sec and hold 1 TB. Without sharding, you'd need a single machine with impossible specs.
Each primary has 1-2 replicas in different availability zones. The replica subscribes to the primary's write stream — every SET, DEL, INCR is asynchronously shipped over and replayed on the replica. Lag is typically under 100ms. When a primary dies, the failover coordinator promotes the replica and the cluster keeps serving from RAM — no DB stampede.
Solves: the "one crash = total cache loss" failure mode. With replication, a primary death loses 0 - 100ms of in-flight writes (acceptable for a cache), and the replicated 64 GB of hot data stays warm. Cluster availability stays at 99.99%.
A per-node component that runs when memory pressure crosses a configured threshold (e.g., 95% of maxmemory). It picks victims based on the configured policy — LRU (default), LFU, TTL-first, or random — and removes them to make room for incoming writes. Implemented internally as a sampled probabilistic algorithm (Redis samples N keys and evicts the worst), not a strict global LRU, because true LRU would require a doubly-linked list of every key in RAM.
Solves: the bounded-memory problem. A cache with no eviction either OOMs or starts rejecting writes — both of which break the application. Eviction lets the cache always accept new entries while gracefully shedding cold ones.
Optional disk-backed storage for crash recovery (Redis supports it; Memcached does not). Two modes: RDB snapshots dump a point-in-time copy of all keys to disk every N minutes; AOF (append-only file) logs every write command so the cache can replay them on restart. Most production deployments run AOF + occasional RDB.
Solves: the cold-start problem after a planned restart or a region-wide power loss. Without persistence, a restarted node serves zero traffic for the first 10 minutes while organic reads slowly repopulate it — and the backing DB pays for that warm-up. With AOF, the node restarts already-warm.
A small cluster of 3-5 monitor processes that ping every primary every second. When a quorum of Sentinels agrees a primary is down, they pick the most up-to-date replica, promote it (it becomes the new primary), update the topology service, and notify clients to refresh. Failover typically completes in 5-30 seconds.
Solves: automated failover. Without it, a dead primary requires a human to manually promote a replica and update DNS — minutes-to-hours of partial outage. Sentinel turns that into seconds, fully automated.
Per-node telemetry pushed to Prometheus / Datadog / CloudWatch. The signals that matter: hit rate (cached / total reads — under 90% is a smell), eviction rate (high evictions = under-provisioned RAM), memory pressure (close to maxmemory = imminent eviction storm), p99 latency (sudden spike = hot key or slow command), replication lag (high lag = imminent data loss on failover).
Solves: visibility. Without metrics, you find out the cache is broken when the application starts timing out — minutes after it actually stopped working. With them, alerts fire before users notice.
The source of truth — Postgres, Cassandra, MySQL, whatever the application's primary store happens to be. The cache fronts it; on cache miss the app falls back here. Sized for the cold-tail traffic that misses the cache (typically 5-10% of total reads) plus the full write rate, not the full read rate. This is the entire reason caches exist: they let you size the DB for "miss traffic" instead of "all traffic", which is often a 10-20× cost savings.
Solves: nothing on its own — but the cache exists to protect it. Without the DB, the cache has no source of truth; without the cache, the DB has to be 10× bigger.
Two real flows mapped to the numbered components above:
cache.get("product:42").product:42 → ring position 142° → owned by Primary 2 ④ (arc 90-180°).GET product:42. Primary 2 looks up the key in its hash table — HIT — returns the value in 0.4 ms.SET product:42 ... EX 3600, then return. Next read for product 42 in the next hour is a hit.SLAVEOF NO ONE — Replica 2 is now a primary.arc 90-180° → Replica-2) to the Topology Service ③.GET product:42 to Replica 2.Of every idea in distributed systems, consistent hashing is the one most worth understanding deeply — it powers Cassandra, DynamoDB, Memcached, Redis Cluster, and CDN routing. The whole idea is one trick that solves one problem.
The problem with hash % N

Suppose you have 4 cache nodes and use shard = hash(key) % 4. user:42 hashes to 7 → shard 3. Easy. Now add a fifth node and switch to % 5. The same key now maps to 7 % 5 → shard 2. The key just moved to a different node. Worse — almost every key now lives on a different node. You've invalidated nearly your entire cache, the new node has to be populated from scratch, the backing DB takes the full traffic for hours. Adding capacity has caused an outage.
Imagine a circular number line from 0 to 2³² (the output range of a 32-bit hash). Now:
1. Hash each node's ID onto the ring — each node lands at some position (Node B at 800M, say).
2. Hash each key onto the same ring.
3. A key belongs to the first node you reach walking clockwise from the key's position.
Add Node D at position 2.0B. Only the keys whose ring positions fall between Node B (800M) and Node D (2.0B) change ownership — they move from C to D. Every other key stays put. With N evenly-distributed nodes, adding a node moves 1/N of keys — vs. ~100% with mod-N hashing. At 16 nodes, that's 6.25% rebalanced instead of 100%. The cache stays warm.
Plain consistent hashing has a problem: if you only have 3 nodes, their random ring positions might leave one node with 60% of the arc and another with 10% — load imbalance. The fix: each physical node is hashed onto the ring at ~256 different positions ("virtual nodes" or "vnodes"). With thousands of points scattered around the ring, the law of large numbers makes the per-node arc length nearly uniform. Each physical node ends up owning ~1/N of the total arc, just split into 256 small pieces instead of one big one.
Mod-N hashing — hash(key) % N: simple, fast. But changing N means rehoming nearly every key. A 4→5 node migration moves ~80% of the cache. Useless in a growing cluster.
Consistent hashing: slightly more complex (you maintain a sorted ring data structure). Wins: a 4→5 node migration moves ~20% of keys (1/N of the new cluster size). Distribution is uniform thanks to vnodes. Production-grade.
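Here's a minimal Java sketch of the ring with virtual nodes. CRC32 is a stand-in hash (real deployments use something like murmur3 or xxhash), and the TreeMap is the "sorted ring data structure" mentioned above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class HashRing {
    private static final int VNODES = 256;                  // virtual nodes per physical node
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position → physical node

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
    }

    // Owner = first vnode clockwise from the key's position (wrap around to the start).
    String ownerOf(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();                            // stand-in 32-bit hash
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();                              // 0 .. 2^32 - 1, matching the ring
    }
}
```

addNode inserts 256 points, so a new node steals 256 tiny arcs scattered evenly around the ring — which is exactly why only ~1/N of keys move.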
Replication is the answer to "what happens when a node dies?" but the choice of sync vs async is one of the deepest trade-offs in distributed systems — latency vs durability, picked once and lived with for years.
The primary writes to its own RAM, returns success to the client immediately, and ships the change to the replica in the background. Write latency: ~0.2 ms (the local write only). Risk: if the primary crashes after acking but before replicating, the last few milliseconds of writes are lost.
When this is right: caches. Losing a few writes during a node crash is acceptable — the application will repopulate from the DB on the next read. The latency win (every write is fast) is worth far more than the durability cost.
The primary writes to its own RAM AND ships the change to the replica, waits for the replica's ack, then returns success to the client. Write latency: ~1-2 ms (one extra network round trip). Risk: none for data loss; but a slow/dead replica makes every write slow.
When this is right: when the cache holds something you can't easily reconstruct — session tokens, rate-limit counters, distributed locks. The strict consistency guarantee is worth the latency penalty.
Most production caches default to async because the typical use case (database query results, rendered HTML fragments, computed values) is fully reconstructable from the source of truth. The 1-100ms of writes potentially lost on a crash are recovered by the next cache miss naturally.
The clever middle ground used by Redis: WAIT N timeout_ms — a write is async by default, but the application can opt-in to "wait for N replicas to ack before continuing" for specific critical writes (e.g., committing a payment session). Best of both worlds — fast by default, durable when it matters.
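In raw protocol terms the opt-in looks like this — WAIT is a real Redis command; the key name is illustrative:

```
SET session:pay:99 "token-abc" EX 900   // normal async-replicated write, fast ack
WAIT 1 50                               // block until ≥1 replica has acked all prior
                                        // writes, or 50 ms pass; returns the number
                                        // of replicas that acked
```

If WAIT returns 0, the application knows the write hasn't reached a replica yet and can retry or flag that session as at-risk.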
The cache holds 64 GB. The application is writing 200K SETs/sec. Eventually 64 GB gets full. What do we drop to make room? This is the eviction problem, and the answer matters because the wrong policy can tank your hit rate by 30%.
Drop the entry that has gone the longest without being read. Implemented as a hash map plus a doubly-linked list — every access bumps the entry to the head of the list; eviction always pops the tail. O(1) for both operations.
Wins for: "things you used recently are likely to be used again soon" workloads — most web caches, session stores, rendered-page caches. Default for Memcached and Redis.
Trap: a one-time scan that touches a million cold keys (e.g., a backup job) bumps every cold key to "recently used", evicting genuinely-hot entries. Workaround: have batch jobs issue the CLIENT NO-TOUCH command (Redis 7.2+) so their reads don't update access times.
Drop the entry with the fewest total reads (over a recent window). Tracks per-key access counter that decays over time.
Wins for: highly-skewed access patterns where a small set of keys is consistently hot — celebrity profiles, top-100 product pages, viral video metadata. Outperforms LRU by 5-15% hit rate when the long tail is dominated by a small head.
Cost: more bookkeeping per key (counter + decay). Redis 4+ supports it as maxmemory-policy allkeys-lfu.
If multiple keys are eligible, prefer the one with the soonest expiration time. Often combined with LRU as a tie-breaker (volatile-lru: only evict keys that have a TTL set).
Wins for: caches that mix permanent reference data and short-TTL session data — evict the sessions first, leave the permanent stuff alone.
Pick a random victim. Zero bookkeeping. Sounds dumb, but at high hit rates (over 95%) the difference between Random and LRU is tiny — most keys are hot, so any one you pick to evict was probably cold. Used by Redis as allkeys-random for low-overhead deployments.
Wins for: low-overhead workloads where the hit rate is dominated by working-set size, not by access recency.
True LRU requires a doubly-linked list of every key, which costs ~32 bytes of overhead per key. With 64 GB of values, that overhead can be gigabytes. So Redis cheats: when eviction is needed, it samples 5 random keys, evicts the one with the oldest access time, and moves on. Not a true global LRU, but statistically close — and it costs zero per-key overhead. The maxmemory-samples tunable controls the sample size.
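A toy model of the sampled approach, assuming a plain in-process map — real Redis samples its internal dict in place and keeps a compact per-entry access clock rather than full timestamps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class SampledLru {
    static final class Entry { byte[] value; long lastAccessMs; }

    private final HashMap<String, Entry> map = new HashMap<>();

    void touch(String key) {                         // called on every GET
        Entry e = map.get(key);
        if (e != null) e.lastAccessMs = System.currentTimeMillis();
    }

    // Called when memory pressure crosses the threshold: sample K keys,
    // evict the coldest of the sample (Redis: maxmemory-samples, default 5).
    void evictOne(int sampleSize) {
        List<String> keys = new ArrayList<>(map.keySet()); // toy-only: O(n) copy
        if (keys.isEmpty()) return;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        String victim = null;
        long oldest = Long.MAX_VALUE;
        for (int i = 0; i < sampleSize; i++) {
            String k = keys.get(rnd.nextInt(keys.size()));
            long t = map.get(k).lastAccessMs;
            if (t < oldest) { oldest = t; victim = k; }
        }
        map.remove(victim);                           // statistically close to true LRU
    }
}
```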
Practical guidance: start with allkeys-lru and watch your hit rate. If you have a clear power-law access pattern (some keys always hot), switch to allkeys-lfu — typically a 5-10% hit-rate gain. Don't agonize over the choice initially; eviction policy matters less than having enough RAM in the first place.

"There are only two hard things in computer science: cache invalidation and naming things." The reason it's hard: when the database changes, the cached copy must somehow learn about it — and there's no universally right way to do that. Three patterns dominate.
Every write goes to the cache AND the DB synchronously, in the same code path. App returns success only after both succeed.
Pros: cache is always consistent with DB. No stale reads ever.
Cons: writes are slower (two systems instead of one). Both must be available — cache outage blocks all writes.
Use when: read-after-write consistency is mandatory (e.g., e-commerce inventory, account balances).
Write goes to the DB only. Cache is populated lazily on the next read (cache miss → fetch from DB → store).
Pros: writes are fast (one system). No "writing data nobody will ever read" wasted RAM.
Cons: the first read after a write is a miss. And if an invalidation is ever missed, a stale cached entry can keep being served until its TTL expires.
Use when: write-heavy workloads where writes are rarely re-read soon (logs, infrequent profile updates). Most common pattern in practice.
Write goes to the cache only. A background process flushes dirty entries to the DB asynchronously.
Pros: fastest possible writes — RAM speed.
Cons: if the cache crashes before flushing, those writes are lost forever. Most caches don't support this safely.
Use when: you can tolerate data loss (analytics counters, click counts) and absolutely need RAM-speed writes.
The pragmatic default: cache-aside with DEL on writes. Combining write-around with TTL covers 95% of cases.

The single most common cache integration in production code. Every senior engineer should have this written into muscle memory. Here it is in 12 lines.
Cache-aside — read path

```java
User getUser(long userId) {
    String key = "user:" + userId;
    // 1. Try cache first
    User cached = cache.get(key);
    if (cached != null) return cached;  // HIT — done in 0.4ms
    // 2. MISS — load from DB
    User user = db.findById(userId);    // ~8ms
    if (user == null) return null;
    // 3. Populate cache for next time, with TTL
    cache.set(key, user, 3600);         // 1 hour TTL
    return user;
}
```

Cache-aside — write path (the trick)
```java
void updateUser(User user) {
    // 1. Write to DB first (source of truth)
    db.save(user);
    // 2. INVALIDATE the cache, do NOT update it
    cache.delete("user:" + user.id);    // next read will repopulate
}
```
Why DEL and not SET on writes? This is the subtle part that catches most engineers. Updating the cache directly seems faster — "write to DB and cache in parallel". But consider this race:
{name: "Sarah"}.{name: "Sarah Smith"}, then sets cache.{name: "Sarah"} — stomping on the newer write.Using DEL instead of SET on writes eliminates this entire class of bug — there's no value to stomp on, just a hole the next reader will fill from the DB (which has the truth).
It's 9 PM. A celebrity tweets a link to your top-trending video. The key video:viral_42 goes from 100 req/sec to 100,000 req/sec in under a minute. That single key's owner — one shard out of 16 — is now saturating its 10 Gbps NIC, while the other 15 shards sit at 5% utilization. Welcome to the hot key problem — the most common production headache in cached systems.
Sharding distributes load across shards, but a single key has only one home. No matter how many nodes you add, all 100K req/sec land on the one shard that owns video:viral_42. Adding more nodes literally cannot help.
The app server itself caches ultra-hot keys in its local process memory for 1-5 seconds. With 1000 app servers each holding the value locally, the distributed cache sees almost zero traffic for that key — 1 fetch per app server every 5 seconds, instead of 100K fetches/sec aggregate. Caffeine, Guava Cache, or a small HashMap with TTL.
Trade-off: 1-5 seconds of staleness across the fleet. Acceptable for views, dangerous for prices.
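A sketch of that local layer using Caffeine — RemoteCache here is a hypothetical handle to the distributed cache:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

class HotKeyShield {
    interface RemoteCache { byte[] get(String key); }   // assumed distributed-cache client

    private final LoadingCache<String, byte[]> local;

    HotKeyShield(RemoteCache remote) {
        this.local = Caffeine.newBuilder()
                .maximumSize(10_000)                      // bound heap usage per app server
                .expireAfterWrite(Duration.ofSeconds(5))  // accept ≤5s of staleness
                .build(remote::get);                      // local miss → one fetch from the shard
    }

    byte[] get(String key) {
        // Hot keys are served from process RAM; the owning shard sees
        // roughly one request per app server per 5 seconds.
        return local.get(key);
    }
}
```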
Configure the smart driver to spread reads across the primary and all its replicas (e.g., READONLY mode in Redis Cluster). With 1 primary + 2 replicas, hot-key load divides by 3. Helps but doesn't scale infinitely.
Trade-off: replicas are slightly stale (async replication), so reads can return data that's a fraction of a second old.
Store the hot value under N suffixed keys: video:viral_42:v1 through video:viral_42:v10. On read, pick a random suffix → load divides by 10 across 10 different shards. On write, update all 10 copies (acceptable cost for a single hot record).
Trade-off: 10× write amplification for hot keys, plus invalidation must touch all replicas.
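A sketch of the split, with a hypothetical Cache interface standing in for the real driver:

```java
import java.util.concurrent.ThreadLocalRandom;

class KeySplitter {
    static final int COPIES = 10;

    interface Cache { byte[] get(String k); void set(String k, byte[] v, int ttlSec); }

    // Read one random copy — each suffix hashes to a different arc of the ring,
    // so load divides by COPIES across physical shards.
    static byte[] readHot(Cache cache, String baseKey) {
        int i = ThreadLocalRandom.current().nextInt(COPIES) + 1;
        return cache.get(baseKey + ":v" + i);
    }

    // Write all copies — 10× write amplification, acceptable for one hot record.
    static void writeHot(Cache cache, String baseKey, byte[] value, int ttlSec) {
        for (int i = 1; i <= COPIES; i++)
            cache.set(baseKey + ":v" + i, value, ttlSec);
    }
}
```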
The cache emits per-key access samples (Redis --hotkeys, Memcached stats slabs, or eBPF probes on the network layer). A monitoring service watches for any key whose req/sec exceeds, say, 5× the average. Some advanced systems automatically apply solution 3 (auto-shard the key) when a threshold is crossed — Twitter's old "Cache+Cassandra" stack famously did this.
It's 3:00:00 AM. The cached entry for homepage:html has a 1-hour TTL and just expired. In the same millisecond, 1,000 concurrent web requests all try to load the homepage. They all hit the cache → all miss → all hammer the database with the same query simultaneously. The DB, sized for normal cache-fronted load, gets clobbered by 1,000× normal traffic. p99 spikes from 50ms to 30 seconds. The DB CPU pegs and replication lag explodes.
This is the cache stampede (also called "dogpile" or "thundering herd"), and it's the most dangerous failure mode in cache-fronted systems because it always strikes at the worst possible time — when something hot just expired.
When 1,000 requests miss the same key simultaneously, only the first one fetches from the DB. The other 999 wait on a shared CompletableFuture that the first request will complete. When the first request finishes, all 1,000 get the same answer. DB load: 1, not 1000.
Implemented in-process with a ConcurrentHashMap<Key, Future>; some libraries (Caffeine, Guava) have it built in.
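A minimal in-process coalescer, assuming the DB fetch is expressed as a Supplier:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class CoalescingLoader<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    // All concurrent misses for the same key share one DB call.
    V load(String key, Supplier<V> dbFetch) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) return existing.join();   // someone else is fetching — wait
        try {
            V value = dbFetch.get();                    // only one caller reaches the DB
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);              // waiters see the same failure
            throw e;
        } finally {
            inFlight.remove(key);                       // next expiry starts a fresh fetch
        }
    }
}
```

DB load for N simultaneous misses: 1, not N — exactly the property described above.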
Don't let the entry actually expire. Each read computes a probability of refresh based on remaining TTL: p = exp(-TTL_remaining / mean_lifetime). Most reads do nothing. A random few reads, becoming more likely as expiry approaches, asynchronously refresh the entry. The cache is statistically never empty — no stampede.
This is the XFetch algorithm; brilliant once you see it.
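The check itself is a few lines. This sketch uses the formula above; the async reload it triggers is left as a hypothetical asyncRefresh step:

```java
import java.util.concurrent.ThreadLocalRandom;

class EarlyRefresh {
    // ttlRemaining large → p ≈ 0 (do nothing);
    // ttlRemaining → 0   → p → 1 (almost certainly refresh).
    static boolean shouldRefresh(double ttlRemainingSec, double meanLifetimeSec) {
        double p = Math.exp(-ttlRemainingSec / meanLifetimeSec);
        return ThreadLocalRandom.current().nextDouble() < p;
    }
}
```

On each read the app calls shouldRefresh(ttlRemaining, meanLifetime) and, if true, kicks off an asynchronous reload — so the entry is statistically refreshed before it ever expires and the herd never sees a hole.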
Serve the stale entry to readers while one async worker refreshes it in the background. Readers never wait for the DB; they get an old (but recent) value while the refresh happens out-of-band. Pattern made famous by HTTP Cache-Control: stale-while-revalidate and adopted by most modern caches.
Requires storing a "soft TTL" (refresh after) and a "hard TTL" (expire after) for each entry.
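A sketch of the soft/hard TTL idea, with an in-process map standing in for the cache and a stubbed DB loader; in the real system the hard TTL is enforced by the cache server and only the soft TTL travels with the entry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SwrCache {
    static final class Entry {
        final Object value; final long softExpiryMs;
        Entry(Object v, long t) { value = v; softExpiryMs = t; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final ExecutorService refreshPool = Executors.newFixedThreadPool(4);

    Object read(String key) {
        Entry e = cache.get(key);
        if (e == null) return refresh(key);              // true miss: load synchronously
        if (System.currentTimeMillis() > e.softExpiryMs)
            refreshPool.submit(() -> refresh(key));      // stale: refresh out-of-band
        return e.value;                                  // readers never wait on the DB
    }

    private Object refresh(String key) {
        Object fresh = loadFromDb(key);                  // hypothetical DB loader
        cache.put(key, new Entry(fresh, System.currentTimeMillis() + 60_000));
        return fresh;
    }

    private Object loadFromDb(String key) { return "fresh:" + key; } // stub
}
```

A production version would also coalesce duplicate refreshes (Solution 1) so only one worker per key actually hits the DB.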
A simple version of request coalescing: when you miss, try to SETNX recompute_lock:key. If you got the lock, recompute and write the entry. If you didn't get the lock, sleep 50ms and retry the cache. Works well for moderate concurrency; falls over above ~10K concurrent waiters per key (use Solution 1 instead).
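A Jedis-flavored sketch of that loop — exact driver signatures vary by version, and loadFromDb is a stub:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

class StampedeGuard {
    // Only the lock winner recomputes; everyone else polls the cache.
    String readWithLock(Jedis jedis, String key) throws InterruptedException {
        while (true) {
            String v = jedis.get(key);
            if (v != null) return v;                         // hit (or freshly recomputed)
            String won = jedis.set("recompute_lock:" + key, "me",
                    SetParams.setParams().nx().ex(30));      // atomic SETNX + 30s TTL
            if ("OK".equals(won)) {
                String fresh = loadFromDb(key);              // the lone DB call
                jedis.setex(key, 3600, fresh);
                jedis.del("recompute_lock:" + key);
                return fresh;
            }
            Thread.sleep(50);                                // lost the race — retry the cache
        }
    }

    private String loadFromDb(String key) { return "..."; }  // hypothetical loader stub
}
```

The 30-second TTL on the lock matters: if the recomputing process crashes, the lock self-expires instead of wedging the key forever.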
Most caches are pure RAM. Some optionally persist to disk so a node restart doesn't mean cold-start. The two patterns and the do-nothing case:
Every N minutes (configurable), the cache forks and writes its full RAM contents to a compact binary file on disk. On restart, load the file into RAM in seconds and resume.
Pros: compact files, fast warm-up. Cons: all writes between snapshots are lost on crash. Forking briefly doubles memory usage during the snapshot.
Every write command (SET, DEL, INCR) is appended to a log file as it happens. On restart, replay the log to reconstruct state.
Pros: minimal data loss (down to last command with fsync always). Cons: file grows large, replay is slow, fsync hurts write throughput.
Cache is pure RAM. Restart = empty cache. Application falls back to DB until the cache repopulates organically over a few minutes.
Pros: simpler, faster, no disk I/O. Cons: cold-start hammers the DB; planned reboots cause brief latency spikes.
Run AOF + occasional RDB. AOF provides safety (~1s data loss with fsync everysec), RDB provides fast restart (load the snapshot, then replay only the AOF entries since the snapshot). When AOF gets too big, the cache rewrites it from the current state — keeping the file size bounded. Best of both worlds at modest disk I/O cost.
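In Redis terms, the hybrid setup is a handful of standard redis.conf directives (values here are illustrative):

```
appendonly yes                      # AOF on
appendfsync everysec                # fsync once per second → ≤1s loss window
save 3600 1                         # RDB snapshot if ≥1 change in the last hour
aof-use-rdb-preamble yes            # fast restart: load snapshot, replay AOF tail
auto-aof-rewrite-percentage 100     # rewrite the AOF when it doubles in size
auto-aof-rewrite-min-size 64mb      # ...but not before it reaches 64 MB
```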
The cluster grew. Or it shrank. Or a node died and we replaced it. How does the cluster handle membership changes without dropping requests?
The new node joins the ring at position X, claiming an arc that was previously owned by its clockwise neighbor. The migration: the neighbor streams the arc's keys to the new node while continuing to serve traffic; once the copy has caught up, the topology service flips ownership of the arc and clients start routing there; the old owner then drops the migrated keys. Reads that race the handoff simply miss and fall back to the DB — a brief, bounded cost.
The reverse: drain the leaving node by copying its keys to its clockwise neighbor first. Once empty, update topology to skip the node. Clients never see a missing-shard error.
If a primary dies suddenly (no graceful drain), Sentinel promotes its replica. Topology updates. Clients refresh. The keys on the dead primary that hadn't yet replicated are lost (acceptable for cache use cases). Replication factor stays at 1 until a new replica is provisioned and caught up.
Users in Tokyo shouldn't pay 200ms transpacific latency to hit a US-East cache. The fix is regional clusters with cross-region replication — and the choice of replication topology is a classic CAP-theorem trade-off.
Every region accepts both reads and writes. Writes replicate to all other regions asynchronously. Eventually consistent — a write in US might take 200ms to show up in Tokyo.
Pros: low-latency writes everywhere; no single-region failure mode. Cons: conflict resolution gets ugly (two regions write to the same key) — needs CRDTs or last-write-wins. Use when: regional independence matters more than global consistency (social feeds, content metadata).
One region accepts writes; others are read-only replicas. Writes flow from the primary region to all secondaries asynchronously. A regional failover process promotes a secondary on disaster.
Pros: no conflicts, simple consistency model, strict global ordering. Cons: writes from far-away regions pay the round-trip; primary region failure needs orchestrated failover. Use when: consistency matters (financial caches, account state).
Why consistent hashing instead of hash % N? With hash % N, going from 4 nodes to 5 changes nearly every key's home — invalidating ~80% of the cache and forcing a stampede on the backing DB. Consistent hashing maps keys to a ring; adding a node only steals an arc from one neighbor. At 16 nodes, that's 6.25% rebalanced vs 100%. Combined with virtual nodes for uniform distribution, this is the difference between scaling being a routine deploy and a midnight outage.

Cache-aside or write-through? Default to cache-aside with DEL-not-SET. Write-through (write to both atomically) is for cases where readers must never see a stale value — financial balances, inventory counts. The cost: every write is slower (two systems), and a cache outage blocks all writes. Most production code defaults to cache-aside + short TTL because it's resilient to half-broken cache infrastructure; write-through is reserved for the few keys where staleness is unacceptable.

How do you survive a hot key? Three escalating defenses: a short-TTL local cache in each app server, replica reads, and key splitting (key:v1...key:v10), random suffix on read; load divides by 10 across multiple physical shards. Detection: monitor per-key QPS, alert at 5× average. Often combine all three for true-viral content.

How do you stop a cache stampede? (1) Request coalescing — concurrent misses for the same key share one Future and get the same result. DB load: 1, not N. (2) Probabilistic early refresh — random reads near expiry asynchronously refresh the entry, so it never actually empties. (3) Stale-while-revalidate — serve the old value while one async worker refreshes. Implementations like Caffeine and Redis Lock-and-Refresh have these built in. Pick one — production traffic will find unprotected hot keys eventually.

Sync or async replication? Async by default — cached data is reconstructable from the source of truth, so a small loss window on a crash is acceptable. Redis's WAIT N timeout opts a single critical write into sync replication while keeping everything else fast. Always default to async; explicitly opt-in to sync when you can't afford the data loss window.