From "one attacker can brute-force a password in 90 seconds" to a Redis-backed sliding-window limiter that protects every endpoint at sub-5ms latency
This deep-dive follows the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 02:14 AM. An attacker sitting behind a botnet starts hammering Sarah's bank login at 10,000 password guesses per second — fast enough to burn through a dictionary of millions of common passwords in minutes. Sarah's password is "summer22" — it falls on guess number 4,200,000, barely seven minutes in, and her savings drain into a wallet in another country before her phone even buzzes. A rate limiter is the small piece of code that, between request #5 and request #6 in any 60-second window, says: "that's enough" and returns HTTP 429 Too Many Requests. The attack stops dead.
More formally: a rate limiter is a gate that decides — for every incoming request — whether the caller has stayed within an allowed budget (e.g., "100 requests per minute per user"). If yes, the request flows through. If no, the gate slams shut and returns 429, often with a Retry-After header telling the client when to try again. Conceptually it's the same thing as the bouncer at a club checking how many people from your group already went in tonight.
Rate limiting isn't a "nice to have" — it sits between your app and a long list of disasters. Some of those disasters are malicious; some are friendly fire from your own customers' buggy code. Either way, the symptoms look the same: traffic spikes, latency tanks, and on-call gets paged.
Distributed denial-of-service attacks try to drown your servers in traffic. Brute-force attacks try to drown a single endpoint (login, password reset, OTP verify) in guesses. Rate limiting throttles both at the edge before they reach app code.
A scraper that ignores robots.txt can hit /products/:id for every product in your catalog in minutes — pushing your DB cache hit rate from 95% to 30% and slowing every other user. Rate limit by IP and the scraper crawls you at human speed instead.
The most common offender isn't malicious — it's a customer's mobile app stuck in a retry loop after a backend hiccup. One bug pushes 2 million phones to retry every 100ms. Without a limiter, that takes you down. With one, each phone gets 429ed until the app ships a fix.
Every request you serve costs CPU, bandwidth, and downstream API calls (Stripe, Twilio, OpenAI). A scraper hammering an endpoint that internally calls GPT-4 can ring up $50K in a weekend. Rate limits cap that exposure to a known maximum.
"Free tier: 100 req/min. Pro: 10K req/min. Enterprise: unmetered." That entire SaaS pricing model is enforced by — and only by — a rate limiter. Without it, free users consume paid-tier capacity and your unit economics collapse.
A flash sale, a viral tweet, a Black Friday rush — sudden 10× spikes can melt downstream services that scale slower than the front door. Rate limiting acts as a back-pressure valve, holding excess traffic at 429 while autoscalers catch up.
Before drawing a single box, pin down exactly what the limiter must do. The non-functional requirements are where this design gets hard.
Functionally, the limiter must identify every caller, decide allow-or-deny against a configured budget, answer denied requests with a 429 and a Retry-After header, and support per-endpoint costs (e.g., /search costs 1 token, /generate-image costs 50).

"What does throttling mean?" sounds like a one-word answer, but production systems pick from three flavors depending on what kind of system they're protecting. The choice changes the user experience and the strictness of the guarantee.
The number of API requests cannot exceed the limit — ever. Period. Request 101 in a "100/min" window gets a flat 429. Use when: the resource you're protecting is a non-negotiable budget (paid OpenAI tokens, third-party SMS sends, financial transactions). Trade-off: a benign user who drifts one request over gets a jarring hard stop — you waste no resources, but you burn some goodwill.
Allow some buffer — say, 10% — over the configured limit. A 100/min limiter actually only 429s at request 111. Use when: the protected resource is elastic (a stateless service that can stretch a bit) and you'd rather forgive a benign burst than punish honest users. Trade-off: the "real" cap is fuzzy, which makes capacity planning slightly harder.
Allow exceeding the limit if the system has spare capacity right now. Look at downstream CPU/queue depth — if all green, let the burst through; if any sign of stress, drop to hard cap. Use when: you have great real-time observability and want to maximize throughput. Trade-off: harder to implement; users can't predict whether a request will be allowed, which complicates client retry logic.
Five algorithms compete for the rate limiter's job. Each is a different answer to the same question: "how do we count requests in a time window?" The trade-offs are accuracy vs. memory vs. burst-tolerance. Production systems almost universally pick Sliding Window with Counters — but it helps to walk through why the simpler ones fail first.
The naive approach. Bucket time into fixed windows (e.g., one bucket per minute). Maintain a counter per (user, minute). Increment on each request; reset to zero when the minute rolls over. If the count exceeds the limit, 429.
The boundary attack: a clever attacker sends 10 requests at 0:59:59 (filling the first minute's budget) and 10 more at 1:00:00 (the new bucket's fresh budget). In a 1-second window straddling the boundary, they pushed 20 requests through a 10-per-minute limiter. The limit is broken by 2× at the worst possible time.
Memory: tiny — one counter per user. Accuracy: bad. Verdict: only for very loose, non-critical limits.
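As a concrete reference point, here's a minimal fixed-window sketch in Python (assuming the redis-py client). The window id is baked into the key, so old windows expire on their own — and the boundary problem described above comes along for free.

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis()

def fixed_window_allow(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
    """Naive fixed-window check: one counter per (user, current window)."""
    window_id = int(time.time()) // window_sec       # e.g. the current minute number
    key = f"fw:{user_id}:{window_id}"
    count = r.incr(key)                              # atomic increment
    if count == 1:
        r.expire(key, window_sec)                    # let stale windows evaporate
    return count <= limit                            # caller returns 429 when False
```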
Store the timestamp of every single request from a user. On a new request: drop timestamps older than now - 60s, count what's left, allow if under limit, append the new timestamp. Perfectly accurate — the count is exactly "how many requests in the last 60 seconds".
Memory disaster: for a user making 500 req/hour at the limit, you store 500 timestamps. At 8 bytes each plus Redis sorted-set overhead (~40 bytes/entry) that's 24KB per active user. With 1M active users: 24GB. With 10M: 240GB. Memory scales linearly with traffic — a light user making one request a minute still accumulates 60 timestamps (~3KB) over an hour-long window; a power user at 500 costs 24KB. Workable at small scale, painful at internet scale.
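For comparison, a sketch of the log approach (again assuming redis-py): every request lands as one sorted-set entry, which is exactly why memory scales with request volume. Note the read-then-write gap here — the race-condition deep dive later explains why production code pushes this into a single atomic script.

```python
import time
import uuid
import redis  # assumes the redis-py client

r = redis.Redis()

def sliding_log_allow(user_id: str, limit: int = 500, window_sec: int = 3600) -> bool:
    """Exact sliding-window check: one sorted-set entry per request."""
    now = time.time()
    key = f"swl:{user_id}"
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_sec)  # drop timestamps outside the window
    pipe.zcard(key)                                  # count what's left
    _, count = pipe.execute()
    if count >= limit:
        return False                                 # over budget — reject, store nothing
    r.zadd(key, {str(uuid.uuid4()): now})            # record this request's timestamp
    r.expire(key, window_sec)
    return True
```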
The hybrid that wins. Bucket time at fine granularity (e.g., one bucket per minute, or per 10 seconds). For a "500 req/hour" limit using 1-minute buckets: store 60 counters per user (one per minute in the last hour). On a new request, sum the last 60 buckets — if the total is under 500, allow and increment the current bucket; otherwise 429.
Why this wins: as buckets roll off the back, the count adjusts smoothly — there's no "boundary moment" where 2× the limit slips through. And memory is bounded — 60 counters per user, regardless of how many requests they make.
Imagine a literal water bucket with a tap dripping in at constant rate R. The bucket has capacity B. Each request consumes one token (one drop). If the bucket has at least one token, the request succeeds and a token is removed. Empty bucket = 429.
Burst-tolerant by design: if a user is idle for 60 seconds and the tap fills the bucket to capacity B, they can fire B requests in quick succession before being throttled. Used heavily in API gateways (AWS API Gateway, Stripe) because it matches real user behavior — bursty work followed by quiet periods.
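A minimal in-process token-bucket sketch, just to make the refill math concrete (a real limiter would keep this state in Redis, as discussed later):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float              # tokens dripped in per second
    capacity: float          # bucket size B — also the maximum burst
    tokens: float = -1.0     # -1 means "not initialized yet"
    last_refill: float = 0.0

    def allow(self) -> bool:
        now = time.monotonic()
        if self.tokens < 0:                          # start with a full bucket
            self.tokens, self.last_refill = self.capacity, now
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0                       # spend one token on this request
            return True
        return False                                 # empty bucket → 429

# e.g. 100 requests/min sustained, bursts of up to 20 after a quiet period
bucket = TokenBucket(rate=100 / 60, capacity=20)
```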
The opposite shape: a queue with a hole in the bottom that drains at constant rate R. Requests enter the queue; if the queue is full, they get 429. Requests leave the queue at rate R, smoothing any input burst into a perfectly steady output.
Use when: the downstream system can only handle a steady rate (e.g., a third-party API that hates bursts). Trade-off: adds queueing latency — a request might wait in the bucket before getting served.
| Algorithm | Accuracy | Memory / user | Burst tolerance | Verdict |
|---|---|---|---|---|
| Fixed Window | ❌ 2× breach at boundary | ~10 B | None | Loose limits only |
| Sliding Window Log | ✅ Perfect | ~24 KB | Low | Memory disaster at scale |
| Sliding Window Counters | ✅ Within 1 bucket | ~1.6 KB | Low | 🏆 Production default |
| Token Bucket | ✅ Good | ~20 B | High (configurable) | Best for bursty APIs |
| Leaky Bucket | ✅ Smooths perfectly | ~20 B + queue | Buffers, doesn't allow | Best for steady downstream |
The rate limiter exposes one synchronous call that every API request hits before reaching app code. Treat it as middleware — invisible from the outside, but called millions of times a second internally.
Internal API surface:

```
// Called by API gateway / app server middleware before any business logic
checkRateLimit(api_key, identifier, endpoint) → {
  "allowed": boolean,
  "limit": 100,         // configured cap
  "remaining": 42,      // requests left in current window
  "retry_after": 17     // seconds until next slot opens (only when allowed=false)
}

// On allow → forward request to app server, set response headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1730000000

// On deny → return immediately with:
HTTP/1.1 429 Too Many Requests
Retry-After: 17
{ "error": "rate_limit_exceeded", "retry_after_seconds": 17 }
```
api_key identifies the caller (and which tier they're on). identifier is the entity being limited — usually the same as the API key, but sometimes the user_id or IP for hybrid limits. endpoint lets us apply per-endpoint multipliers (a /search request costs 1 token, a /generate-image costs 50). All three combine into the limiter key — typically {tier}:{identifier}:{endpoint_class}.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart in a multi-server world, and the production shape where every box justifies itself.
Sketch the simplest possible system: each app server keeps an in-memory hashmap userId → count. On every request, the server increments its local counter and 429s if the count exceeds the limit. No network calls, no external state, blazing fast.
Three concrete failures emerge the moment traffic shows up:
The load balancer sprays the user's requests across 10 app pods. Each pod sees only 1/10 of the traffic and counts only its own slice. A user with a "100/min" limit can hit each of 10 pods 100 times = 1,000 requests/min through the limiter. The limit is meaningless in any cluster bigger than one.
Counts live only in RAM. A deploy, a crash, or an autoscaler scale-down wipes them. An attacker who's used 99/100 of their budget can force a pod restart (or just wait for the next deploy) and get a fresh budget. Worse — the budget is silently reset, with no audit trail.
Operations needs to answer "who's getting throttled the most? which endpoints? which tier?" With per-pod counters, that data is fragmented across 50 pods and gone the moment any of them restart. No dashboards, no abuse detection, no capacity planning.
The single most important insight in this design is that rate limiting is fundamentally a shared-state problem. Multiple app servers must agree on a single counter per limiter key. There are exactly two architectural shapes:
One Redis cluster holds every counter. Every app server hits Redis on every request to atomically increment and check. Pros: dead-simple semantics, perfect accuracy, single source of truth for analytics. Cons: Redis is now on the critical path — its latency and uptime are yours; bursty traffic creates contention on hot keys.
Used by: the vast majority of production systems (Stripe, GitHub, AWS API Gateway).
Each app server keeps a local count and periodically (every 100ms or so) syncs deltas to a central aggregator. The local check is a microsecond; the cluster-wide view is eventually consistent. Pros: sub-millisecond latency, no per-request network hop. Cons: brief over-limit possible during sync intervals; harder to reason about; rarely worth the complexity.
Used by: ultra-high-throughput edge systems (Cloudflare's older limiters) where the network hop to a central store is itself the bottleneck.
Most production systems pick the centralized Redis path. Redis pipelining, Lua scripting, and proximity (same data center as the app pods) keep the latency tax to under 1ms p99, which is well within the 5ms budget. We'll build the rest of this design around that choice.
Now the full picture. Every node is numbered — find its matching card below to see what it does and crucially what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Anything that calls our API — a browser, a mobile app, a partner's server-side SDK, or a malicious script. From the client's point of view, the rate limiter is invisible until it isn't: most requests get a 200, but a small percentage come back as 429 with a Retry-After header telling them when to back off. Well-behaved clients honor that header; abusive ones ignore it and burn through 429s until they give up.
Solves: nothing on its own — but the client experience drives the whole 429+Retry-After contract. If the limiter just dropped requests silently, clients would retry harder and make things worse.
The front door. Terminates TLS, parses the HTTP request, extracts the api_key and user identity from headers/JWTs. In production this is nginx, Envoy, or a managed gateway like AWS API Gateway or Kong. The gateway runs the rate-limiter middleware as the very first plugin in its filter chain — before any auth, routing, or app code.
Solves: a single chokepoint where every external request must pass. Without a gateway tier, you'd have to bolt rate limiting into every microservice — duplicated code, inconsistent behavior, easy to miss.
The brain. A small library (or sidecar) loaded into the API gateway that runs on every request. Per request, it: builds the limiter key (e.g., "free:user-42:/search"), looks up the limit from the Configuration Service, runs the sliding-window algorithm against the Counter Store, and returns {allowed, retry_after}. Stays in-process for latency reasons — a sidecar adds 1-2ms of localhost loopback that we'd rather not pay.
Solves: running the algorithm itself. Without this layer, you'd be embedding rate-limit logic inside each handler — guaranteed to drift, guaranteed to be inconsistent across services.
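A sketch of the per-request flow the middleware runs, with the Configuration Service and Counter Store injected as callables (the names here are illustrative, not a real library API):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    remaining: int = 0
    retry_after: int = 0

def check_request(tier: str, identifier: str, endpoint_class: str,
                  get_limit, check_and_increment) -> Decision:
    """get_limit hits the in-process config cache; check_and_increment is the
    single atomic round trip to the Counter Store (Redis)."""
    key = f"{tier}:{identifier}:{endpoint_class}"        # e.g. "free:user-42:search"
    limit, window_sec = get_limit(tier, endpoint_class)  # refreshed every ~10s in the background
    return check_and_increment(key, limit, window_sec)   # -> Decision(allowed, remaining, retry_after)
```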
Holds the rules: "free tier = 100/min on /search, 5/min on /login; paid tier = 10K/min and 100/min". Backed by a small DB (Postgres, etcd) and aggressively cached in the middleware (refreshed every 10 seconds). When an admin changes a limit in the UI, every middleware instance picks up the new value within seconds — no restarts.
Solves: hot-reloading limits and per-tier/per-endpoint customization. Without a config service, every limit change would require a code deploy across every gateway pod — a bad afternoon waiting to happen.
Where the actual counts live. A Redis cluster sharded by limiter key (consistent hashing) — for a "1 million users" workload at 1.6KB per user, that's 1.6GB of hot data, easily a small Redis cluster of 3 shards. Each request hits Redis exactly once via an atomic Lua script that does the read-window-sum-and-INCR in a single round trip. Redis's single-threaded event loop guarantees atomicity for free.
Solves: the shared-state problem from Pass 2. Every gateway pod writes to and reads from the same Redis, so 10 pods seeing 10 slices of one user's traffic still agree on the total. Without a centralized store, you're back to the N× breach.
The actual business logic — the thing the rate limiter exists to protect. Search service, payment service, ML inference, whatever. Receives only the requests that survived the rate-limiter check. From its point of view, traffic looks pre-shaped — no spikes above the cluster-wide cap.
Solves: being able to do its real job. Without rate limiting in front, the app server has to defensively size for peak abusive load (e.g., a 10× DDoS) — meaning 10× the infrastructure bill, idle most of the time.
The "you're throttled" responder. When the middleware decides a request should be denied, the throttle responder constructs the 429 response with the right headers (Retry-After, X-RateLimit-Reset), maybe a friendly JSON body, and sends it directly back to the client — without ever touching the app server. Often this is just a helper function inside the middleware, but conceptually it's its own concern.
Solves: giving clients machine-readable information about when to retry. Without proper headers, clients fall back to exponential backoff with jitter and may take minutes to recover from a 1-second throttle.
Every throttle event is fired async to a Kafka topic: {ts, key, endpoint, tier, count, limit}. Downstream consumers fan out: real-time dashboards ("which API keys are getting throttled the most right now?"), abuse detection ("this IP just hit 429 a thousand times — likely a script"), and capacity planning ("free-tier traffic is approaching 70% of capacity, time to push the upsell"). Crucially this is fire-and-forget — the middleware never waits for the event to be acknowledged.
Solves: visibility into why the system is throttling. Without analytics you can see throttles happening but never know which keys, which endpoints, or whether a customer is about to churn because they keep hitting their cap.
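A sketch of the fire-and-forget publish described in the card above (assuming the kafka-python client; the topic name is illustrative). The key point is that nothing on the request path ever waits for the broker:

```python
import json
import time
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def emit_throttle_event(key: str, endpoint: str, tier: str, count: int, limit: int) -> None:
    """Fire-and-forget: send() returns a future we deliberately never await."""
    event = {"ts": time.time(), "key": key, "endpoint": endpoint,
             "tier": tier, "count": count, "limit": limit}
    producer.send("throttle-events", json.dumps(event).encode("utf-8"))
```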
An admin interface where ops can adjust limits ("bump enterprise customer X to 50K/min"), maintain blacklists ("permanently block IP 192.0.2.1"), maintain allowlists ("internal services skip rate limiting"), and roll out new limit policies. Changes go through the Configuration Service ④ which then propagates to every middleware ③ within ~10 seconds.
Solves: giving humans control. Without it, every limit change requires a code review, deploy, and rollout — making the system unable to react in real time when an enterprise customer needs an emergency bump or an attack pattern is detected.
Sarah's data-pipeline script has a bug — instead of polling /api/orders once a minute, it polls every 100ms after a backend hiccup confused its retry logic. By 14:02:00 the script has been firing 10 requests per second — roughly 600 per minute against her plan limit of 100 req/min. Here's what happens:
1. Sarah's script sends GET /api/orders with her API key.
2. The gateway middleware ③ builds the limiter key free:sarah-api-key:orders and runs the Lua script against Redis ⑤. Lua atomically sums the buckets covering the trailing 60 seconds → result = 147.
3. 147 ≥ 100, so the middleware denies: HTTP 429, Retry-After: 38 (next slot opens in 38 seconds), X-RateLimit-Remaining: 0. Sent directly to Sarah's client; the App Server ⑥ never saw the request.
4. A throttle event fires async to the analytics pipeline ⑧, keyed by free:sarah-api-key:orders (deterministic from the request).

Now we look inside the middleware. For a "500 req/hour" limit at 1-minute granularity, we maintain 60 counters per limiter key. The algorithm in pseudocode:
Sliding window with counters:

```
function checkAndIncrement(key, limit, windowSec, granularitySec):
    now = currentTimeSec()
    bucket = floor(now / granularitySec)                 // e.g. the current 1-min bucket id
    oldestBucket = bucket - windowSec / granularitySec   // buckets at or before this have expired

    // 1. Load every bucket for this key (a Redis hash: bucket id → count)
    buckets = redis.HGETALL(key)

    // 2. Drop buckets that rolled out of the window, sum the survivors
    total = 0
    for (b, count) in buckets:
        if b <= oldestBucket: redis.HDEL(key, b)
        else:                 total += count

    // 3. Decide
    if total >= limit:
        return { allowed: false, retryAfter: computeRetryAfter(buckets) }

    // 4. Increment the current bucket
    redis.HINCRBY(key, bucket, 1)
    redis.EXPIRE(key, windowSec + granularitySec)
    return { allowed: true, remaining: limit - total - 1 }
```
In production all four Redis ops are wrapped in a single Lua script sent to Redis once — atomic by construction (next section), and exactly one network round trip.
Per limiter key: 60 buckets × (bucket id ~4B + counter ~2B + Redis per-entry overhead ~20B) = ~1.6 KB / key.
At 1 million active users (one key per user): ~1.6 GB of hot data — fits on a single Redis node with room to spare.
| Algorithm | Memory / user | 1M users | 10M users |
|---|---|---|---|
| Sliding Window Log (one entry per request, 500 req/hr cap) | ~24 KB | 24 GB | 240 GB |
| Sliding Window Counters (60 buckets at 1-min granularity) | ~1.6 KB | 1.6 GB | 16 GB |
| Memory savings | ~93% | ~22 GB saved | ~224 GB saved |
And the accuracy cost is at most one bucket's width of slop — at 1-minute granularity, a user might briefly see their effective limit be off by up to ~8 requests in a 500/hour cap (well under 2%). For most production use cases this is invisible.
Here's a subtle bug that bites every naive implementation. Picture two API gateway pods, each receiving one of Sarah's requests at the same millisecond. Both fetch count = 99 from Redis. Both check "99 < 100, allowed". Both increment to 100. Sarah just made request 101 in a 100/min window — and worse, two separate gateways each think they enforced the limit correctly. The bug is invisible, intermittent, and gets worse as you add more gateway pods.
Bundle "read window sum, check, INCR" into a single Lua script and ship it to Redis with EVAL. Redis runs Lua scripts atomically on its single-threaded event loop — nothing else can interleave. One network round trip. Simple to reason about. This is the production default.
Wrap the ops in a Redis transaction. Atomic execution, but no conditional logic inside — you can't say "INCR only if the sum is below the limit" in a MULTI block. Forces a two-phase pattern (read, compute, write) with optimistic retry, which gets ugly fast.
Optimistic concurrency: WATCH the key, read, decide, EXEC. If anyone modified the key between WATCH and EXEC, the EXEC fails and you retry. Works but performance degrades sharply under contention — for hot keys (e.g., a viral API key getting hammered), retry storms can destroy throughput.
One Redis box can comfortably do ~100K ops/sec. At 50K req/sec across all our endpoints, one node holds. At 500K req/sec, it doesn't. We shard.
Shard by the limiter key: the same key always hashes to the same shard, so reads and writes for one user always coordinate on one Redis node. Use consistent hashing (not a plain hash-mod-N) so adding or removing a shard relocates only ~1/N of the keys, not all of them — important when you grow from 3 to 6 shards next year.
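A toy hash ring to make the point concrete — in practice you'd lean on Redis Cluster or your client library's slot mapping rather than hand-rolling this:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, shards, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{shard}#{v}"), shard) for shard in shards for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def shard_for(self, limiter_key: str) -> str:
        idx = bisect.bisect(self.points, self._hash(limiter_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
ring.shard_for("free:user-42:search")   # the same key always lands on the same shard
```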
Rate limiting is always a write — every request increments a counter. Read replicas only buy you free read scaling, which we don't need. Stick with primaries; scale horizontally by sharding instead. Replicas only earn their keep as failover targets.
The limit values ("free tier = 100/min") change rarely — every middleware caches them in-process for ~10 seconds, refreshing in the background. This avoids hitting the Configuration Service ④ on every request. The counter can never be cached in-process — that's exactly the broken local-counting design from Pass 1 — but the limit lookup absolutely should be.
"Limit by what?" is one of the most consequential design questions and it's surprisingly nuanced. Each identifier has a different blast radius and different blind spots.
Pros: works before any login or auth — the only option for protecting /login, /signup, and anonymous endpoints. Catches attackers who haven't signed up.
Cons: NAT / shared IPs — an entire office building, college campus, or coffee shop sits behind one IP. Throttling that IP punishes hundreds of innocent users. IPv6 — the address space is huge, an attacker can rotate through millions of IPs trivially. Mobile carriers rotate IPs frequently.
Pros: precise and fair — tied to one human, regardless of network. The right choice for any authenticated endpoint where you can identify the user.
Cons: requires login first — useless on the auth endpoints themselves. An attacker brute-forcing POST /login {user: "sarah"} doesn't have their user_id; they have Sarah's email. You'd be limiting Sarah, not the attacker.
Pros: the right choice for B2B APIs — every paying customer gets a key, and the limit ties to billing. Perfect for SaaS pricing tiers.
Cons: useless for unauthenticated traffic. Also, customers may share keys across many services — you can't distinguish noisy-neighbor microservices behind one key.
Login is the perfect storm: you can't limit by user_id (the attacker doesn't have their own — they're guessing Sarah's), and limiting only by IP gives an attacker on a botnet free rein. The production answer: limit by both IP AND target username, simultaneously.
Login endpoint hybrid limit:

```
POST /login { "username": "sarah", "password": "..." }

// Two independent limiter checks — fail if either trips
checkRateLimit(key="login:ip:" + request.ip,       limit=20, window=60)
checkRateLimit(key="login:user:" + body.username,  limit=5,  window=60)

// IP cap = 20/min: stops one attacker from hammering many users
// User cap = 5/min: stops a botnet from each-trying-once at one target user
```
The IP limit catches credential-stuffing scripts. The username limit catches distributed brute-force where each bot tries Sarah's password once. Together they close both holes.
Not all customers are equal, and not all endpoints are equal. The configuration service holds a 2D matrix: tier × endpoint → limit.
| Tier | /search (cheap) | /api/orders (medium) | /generate-image (expensive) |
|---|---|---|---|
| Free | 100 / min | 20 / min | 5 / hour |
| Pro | 10K / min | 1K / min | 100 / hour |
| Enterprise | unmetered | 50K / min | 5K / hour |
Some requests cost more downstream than others. A search query touches Elasticsearch for ~5ms; an image-generation request burns 2 seconds of GPU. Model the difference with token costs: /search consumes 1 token, /generate-image consumes 50. The user has a single per-minute budget of N tokens, deducted per request. This is the same spirit as OpenAI's token-based API limits — usage is metered in cost, not raw request count.
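A sketch of how the cost table plugs into the budget check (the cost values here are illustrative — the real ones live in the Configuration Service):

```python
ENDPOINT_COSTS = {"/search": 1, "/api/orders": 5, "/generate-image": 50}

def weighted_allow(tokens_used_this_minute: int, token_budget: int, endpoint: str) -> bool:
    """One shared per-minute token budget; expensive endpoints drain it faster."""
    cost = ENDPOINT_COSTS.get(endpoint, 1)
    return tokens_used_this_minute + cost <= token_budget

# A Free-tier budget of 100 tokens/min buys 100 searches or two image generations.
```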
The middleware polls the Configuration Service every 10 seconds for updates (or subscribes to a pub/sub channel for instant invalidation). Admin changes a limit in the UI ⑨ → Configuration Service ④ writes new value → all middleware ③ instances pick it up within seconds. Critical: never restart gateway pods to push a limit change — that's an operational nightmare and the whole reason this service exists.
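A sketch of the in-process cache with a background refresh loop; fetch_all_limits is a stand-in for whatever call reads the Configuration Service (Postgres, etcd, or an internal API):

```python
import threading
import time

class LimitCache:
    """Caches the tier × endpoint limit matrix in-process, refreshed every ~10s."""
    def __init__(self, fetch_all_limits, refresh_sec: float = 10.0):
        self._fetch = fetch_all_limits
        self._limits = fetch_all_limits()        # e.g. {("free", "search"): (100, 60), ...}
        self._refresh_sec = refresh_sec
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._refresh_sec)
            try:
                self._limits = self._fetch()     # swap the whole dict; readers never block
            except Exception:
                pass                             # on failure, keep serving the last good copy

    def get(self, tier: str, endpoint_class: str):
        return self._limits[(tier, endpoint_class)]
```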
The rate limiter sits on the critical path of every request. If Redis goes down, the middleware can't check limits. You have two choices, and they're both painful.
If Redis is unreachable, the middleware logs a loud alarm and allows the request. The protected service handles the traffic without rate limiting for the duration of the outage.
Pros: users see no impact from a limiter outage. Your availability SLO is preserved. Your engineers can fix Redis without the entire site being down.
Cons: during the outage, you have no protection — an attacker who notices could pile on. A noisy customer could exhaust downstream capacity.
If Redis is unreachable, the middleware denies all requests with 429 (or 503). The protected service stays safe but the user-visible service is effectively down.
Pros: backend is protected from any abuse during the outage. Strong consistency with the "limits are sacred" promise.
Cons: a Redis blip takes down your entire API. Your availability SLO craters. Customers churn.
Almost every production system picks fail open. The reasoning: rate limiting is a defense, not a feature — losing it for 60 seconds is a manageable security risk; losing your entire API is a business disaster. Three additional safeguards make fail-open less scary:
Wrap Redis calls in a circuit breaker. After 5 consecutive failures, the breaker opens and the middleware stops calling Redis for 30 seconds — saving the latency of timeout-waiting on every request. Auto-recovery via probe requests.
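A sketch of a consecutive-failure breaker wrapped around the Redis call, failing open both on individual errors and while tripped (thresholds are illustrative):

```python
import time

class RedisBreaker:
    """After `threshold` consecutive Redis failures, skip Redis for `cooldown` seconds."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def allow_with(self, redis_check) -> bool:
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown:
            return True                          # breaker open: fail open, don't even try Redis
        try:
            allowed = redis_check()              # the normal sliding-window call
            self.failures, self.opened_at = 0, 0.0
            return allowed
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip; first call after cooldown is the probe
            return True                          # single failure: fail open for this request
```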
Each middleware also runs a tiny per-pod sliding-window counter as a fallback. When Redis is down, the middleware uses local counts only. The cluster-wide guarantee weakens to "each pod still enforces its share" — not perfect but much better than zero protection.
Every fail-open event pages on-call. The expectation is: "we're vulnerable right now, fix Redis NOW". Combined with the L1 fallback, the window of total exposure is short.
The Retry-After header includes random jitter (e.g., 17 ± 5 seconds) so even a million clients hitting 429 in the same second don't all retry in the same second — the standard backoff-with-jitter pattern.

A related support headache: a customer swears they're under their limit yet keeps seeing 429s. Common causes: weighted endpoint costs — /heavy-endpoint costs 5 tokens, so their "50 requests" are really 250/100 — and time skew: their dashboard polls every 60s while the sliding window updates every second, so the dashboard lags. Fix: surface "tokens remaining" rather than "requests remaining" in the dashboard, backed by a real-time counter read from the same Redis the limiter uses.
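A minimal sketch of the jittered Retry-After computation from the first paragraph above:

```python
import random

def retry_after_with_jitter(base_seconds: int, spread: int = 5) -> int:
    """Randomize Retry-After (e.g. 17 ± 5s) so throttled clients don't all retry in the same second."""
    return max(1, base_seconds + random.randint(-spread, spread))
```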