From "one attacker can brute-force a password in 90 seconds" to a Redis-backed sliding-window limiter that protects every endpoint at sub-5ms latency
This deep-dive follows the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 02:14 AM. An attacker sitting behind a botnet starts hammering Sarah's bank login at 10,000 password guesses per second — fast enough to burn through a dictionary of millions of common passwords in minutes. Sarah's password is "summer22" — it falls on guess number 4,200,000, barely seven minutes in, and her savings drain into a wallet in another country before her phone even buzzes. A rate limiter is the small piece of code that, between request #5 and request #6 in any 60-second window, says: "that's enough" and returns HTTP 429 Too Many Requests. The attack stops dead.
More formally: a rate limiter is a gate that decides — for every incoming request — whether the caller has stayed within an allowed budget (e.g., "100 requests per minute per user"). If yes, the request flows through. If no, the gate slams shut and returns 429, often with a Retry-After header telling the client when to try again. Conceptually it's the same thing as the bouncer at a club checking how many people from your group already went in tonight.
Rate limiting isn't a "nice to have" — it sits between your app and a long list of disasters. Some of those disasters are malicious; some are friendly fire from your own customers' buggy code. Either way, the symptoms look the same: traffic spikes, latency tanks, and on-call gets paged.
Distributed denial-of-service attacks try to drown your servers in traffic. Brute-force attacks try to drown a single endpoint (login, password reset, OTP verify) in guesses. Rate limiting throttles both at the edge before they reach app code.
A scraper that ignores robots.txt can hit /products/:id for every product in your catalog in minutes — pushing your DB cache hit rate from 95% to 30% and slowing every other user. Rate limit by IP and the scraper crawls you at human speed instead.
The most common offender isn't malicious — it's a customer's mobile app stuck in a retry loop after a backend hiccup. One bug pushes 2 million phones to retry every 100ms. Without a limiter, that takes you down. With one, each phone gets 429ed until the app ships a fix.
Every request you serve costs CPU, bandwidth, and downstream API calls (Stripe, Twilio, OpenAI). A scraper hammering an endpoint that internally calls GPT-4 can ring up $50K in a weekend. Rate limits cap that exposure to a known maximum.
"Free tier: 100 req/min. Pro: 10K req/min. Enterprise: unmetered." That entire SaaS pricing model is enforced by — and only by — a rate limiter. Without it, free users consume paid-tier capacity and your unit economics collapse.
A flash sale, a viral tweet, a Black Friday rush — sudden 10× spikes can melt downstream services that scale slower than the front door. Rate limiting acts as a back-pressure valve, holding excess traffic at 429 while autoscalers catch up.
Before drawing a single box, pin down exactly what the limiter must do. The non-functional requirements are where this design gets hard.
Functionally, the limiter must identify every caller, decide allow-or-deny against a configured budget, answer denied requests with a 429 and a Retry-After header, and support per-endpoint costs (e.g., /search costs 1 token, /generate-image costs 50).

"What does throttling mean?" sounds like a one-word answer, but production systems pick from three flavors depending on what kind of system they're protecting. The choice changes the user experience and the strictness of the guarantee.
The number of API requests cannot exceed the limit — ever. Period. Request 101 in a "100/min" window gets a flat 429. Use when: the resource you're protecting is a non-negotiable budget (paid OpenAI tokens, third-party SMS sends, financial transactions). Trade-off: a benign user who drifts one request over gets a jarring hard stop — you waste no resources, but you burn some goodwill.
Allow some buffer — say, 10% — over the configured limit. A 100/min limiter actually only 429s at request 111. Use when: the protected resource is elastic (a stateless service that can stretch a bit) and you'd rather forgive a benign burst than punish honest users. Trade-off: the "real" cap is fuzzy, which makes capacity planning slightly harder.
Allow exceeding the limit if the system has spare capacity right now. Look at downstream CPU/queue depth — if all green, let the burst through; if any sign of stress, drop to hard cap. Use when: you have great real-time observability and want to maximize throughput. Trade-off: harder to implement; users can't predict whether a request will be allowed, which complicates client retry logic.
Five algorithms compete for the rate limiter's job. Each is a different answer to the same question: "how do we count requests in a time window?" The trade-offs are accuracy vs. memory vs. burst-tolerance. Production systems almost universally pick Sliding Window with Counters — but it helps to walk through why the simpler ones fail first.
The naive approach. Bucket time into fixed windows (e.g., one bucket per minute). Maintain a counter per (user, minute). Increment on each request; reset to zero when the minute rolls over. If the count exceeds the limit, 429.
The boundary attack: a clever attacker sends 10 requests at 0:59:59 (filling the first minute's budget) and 10 more at 1:00:00 (the new bucket's fresh budget). In a 1-second window straddling the boundary, they pushed 20 requests through a 10-per-minute limiter. The limit is broken by 2× at the worst possible time.
Memory: tiny — one counter per user. Accuracy: bad. Verdict: only for very loose, non-critical limits.
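As a concrete reference point, here's a minimal fixed-window sketch in Python (assuming the redis-py client). The window id is baked into the key, so old windows expire on their own — and the boundary problem described above comes along for free.

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis()

def fixed_window_allow(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
    """Naive fixed-window check: one counter per (user, current window)."""
    window_id = int(time.time()) // window_sec       # e.g. the current minute number
    key = f"fw:{user_id}:{window_id}"
    count = r.incr(key)                              # atomic increment
    if count == 1:
        r.expire(key, window_sec)                    # let stale windows evaporate
    return count <= limit                            # caller returns 429 when False
```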
Store the timestamp of every single request from a user. On a new request: drop timestamps older than now - 60s, count what's left, allow if under limit, append the new timestamp. Perfectly accurate — the count is exactly "how many requests in the last 60 seconds".
Memory disaster: for a user making 500 req/hour at the limit, you store 500 timestamps. At 8 bytes each plus Redis sorted-set overhead (~40 bytes/entry) that's 24KB per active user. With 1M active users: 24GB. With 10M: 240GB. Memory scales linearly with traffic — a light user making one request a minute still accumulates 60 timestamps (~3KB) over an hour-long window; a power user at 500 costs 24KB. Workable at small scale, painful at internet scale.
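For comparison, a sketch of the log approach (again assuming redis-py): every request lands as one sorted-set entry, which is exactly why memory scales with request volume. Note the read-then-write gap here — the race-condition deep dive later explains why production code pushes this into a single atomic script.

```python
import time
import uuid
import redis  # assumes the redis-py client

r = redis.Redis()

def sliding_log_allow(user_id: str, limit: int = 500, window_sec: int = 3600) -> bool:
    """Exact sliding-window check: one sorted-set entry per request."""
    now = time.time()
    key = f"swl:{user_id}"
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_sec)  # drop timestamps outside the window
    pipe.zcard(key)                                  # count what's left
    _, count = pipe.execute()
    if count >= limit:
        return False                                 # over budget — reject, store nothing
    r.zadd(key, {str(uuid.uuid4()): now})            # record this request's timestamp
    r.expire(key, window_sec)
    return True
```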
The hybrid that wins. Bucket time at fine granularity (e.g., one bucket per minute, or per 10 seconds). For a "500 req/hour" limit using 1-minute buckets: store 60 counters per user (one per minute in the last hour). On a new request, sum the last 60 buckets — if the total is under 500, allow and increment the current bucket; otherwise 429.
Why this wins: as buckets roll off the back, the count adjusts smoothly — there's no "boundary moment" where 2× the limit slips through. And memory is bounded — 60 counters per user, regardless of how many requests they make.
Imagine a literal water bucket with a tap dripping in at constant rate R. The bucket has capacity B. Each request consumes one token (one drop). If the bucket has at least one token, the request succeeds and a token is removed. Empty bucket = 429.
Burst-tolerant by design: if a user is idle for 60 seconds and the tap fills the bucket to capacity B, they can fire B requests in quick succession before being throttled. Used heavily in API gateways (AWS API Gateway, Stripe) because it matches real user behavior — bursty work followed by quiet periods.
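A minimal in-process token-bucket sketch, just to make the refill math concrete (a real limiter would keep this state in Redis, as discussed later):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float              # tokens dripped in per second
    capacity: float          # bucket size B — also the maximum burst
    tokens: float = -1.0     # -1 means "not initialized yet"
    last_refill: float = 0.0

    def allow(self) -> bool:
        now = time.monotonic()
        if self.tokens < 0:                          # start with a full bucket
            self.tokens, self.last_refill = self.capacity, now
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0                       # spend one token on this request
            return True
        return False                                 # empty bucket → 429

# e.g. 100 requests/min sustained, bursts of up to 20 after a quiet period
bucket = TokenBucket(rate=100 / 60, capacity=20)
```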
The opposite shape: a queue with a hole in the bottom that drains at constant rate R. Requests enter the queue; if the queue is full, they get 429. Requests leave the queue at rate R, smoothing any input burst into a perfectly steady output.
Use when: the downstream system can only handle a steady rate (e.g., a third-party API that hates bursts). Trade-off: adds queueing latency — a request might wait in the bucket before getting served.
| Algorithm | Accuracy | Memory / user | Burst tolerance | Verdict |
|---|---|---|---|---|
| Fixed Window | ❌ 2× breach at boundary | ~10 B | None | Loose limits only |
| Sliding Window Log | ✅ Perfect | ~24 KB | Low | Memory disaster at scale |
| Sliding Window Counters | ✅ Within 1 bucket | ~1.6 KB | Low | 🏆 Production default |
| Token Bucket | ✅ Good | ~20 B | High (configurable) | Best for bursty APIs |
| Leaky Bucket | ✅ Smooths perfectly | ~20 B + queue | Buffers, doesn't allow | Best for steady downstream |
The rate limiter exposes one synchronous call that every API request hits before reaching app code. Treat it as middleware — invisible from the outside, but called millions of times a second internally.
Internal API surface:

```
// Called by API gateway / app server middleware before any business logic
checkRateLimit(api_key, identifier, endpoint) → {
  "allowed": boolean,
  "limit": 100,         // configured cap
  "remaining": 42,      // requests left in current window
  "retry_after": 17     // seconds until next slot opens (only when allowed=false)
}

// On allow → forward request to app server, set response headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1730000000

// On deny → return immediately with:
HTTP/1.1 429 Too Many Requests
Retry-After: 17
{ "error": "rate_limit_exceeded", "retry_after_seconds": 17 }
```
api_key identifies the caller (and which tier they're on). identifier is the entity being limited — usually the same as the API key, but sometimes the user_id or IP for hybrid limits. endpoint lets us apply per-endpoint multipliers (a /search request costs 1 token, a /generate-image costs 50). All three combine into the limiter key — typically {tier}:{identifier}:{endpoint_class}.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart in a multi-server world, and the production shape where every box justifies itself.
Sketch the simplest possible system: each app server keeps an in-memory hashmap userId → count. On every request, the server increments its local counter and 429s if the count exceeds the limit. No network calls, no external state, blazing fast.
Three concrete failures emerge the moment traffic shows up:
The load balancer sprays the user's requests across 10 app pods. Each pod sees only 1/10 of the traffic and counts only its own slice. A user with a "100/min" limit can hit each of 10 pods 100 times = 1,000 requests/min through the limiter. The limit is meaningless in any cluster bigger than one.
Counts live only in RAM. A deploy, a crash, or an autoscaler scale-down wipes them. An attacker who's used 99/100 of their budget can force a pod restart (or just wait for the next deploy) and get a fresh budget. Worse — the budget is silently reset, with no audit trail.
Operations needs to answer "who's getting throttled the most? which endpoints? which tier?" With per-pod counters, that data is fragmented across 50 pods and gone the moment any of them restart. No dashboards, no abuse detection, no capacity planning.
The single most important insight in this design is that rate limiting is fundamentally a shared-state problem. Multiple app servers must agree on a single counter per limiter key. There are exactly two architectural shapes:
One Redis cluster holds every counter. Every app server hits Redis on every request to atomically increment and check. Pros: dead-simple semantics, perfect accuracy, single source of truth for analytics. Cons: Redis is now on the critical path — its latency and uptime are yours; bursty traffic creates contention on hot keys.
Used by: the vast majority of production systems (Stripe, GitHub, AWS API Gateway).
Each app server keeps a local count and periodically (every 100ms or so) syncs deltas to a central aggregator. The local check is a microsecond; the cluster-wide view is eventually consistent. Pros: sub-millisecond latency, no per-request network hop. Cons: brief over-limit possible during sync intervals; harder to reason about; rarely worth the complexity.
Used by: ultra-high-throughput edge systems (Cloudflare's older limiters) where the network hop to a central store is itself the bottleneck.
Most production systems pick the centralized Redis path. Redis pipelining, Lua scripting, and proximity (same data center as the app pods) keep the latency tax to under 1ms p99, which is well within the 5ms budget. We'll build the rest of this design around that choice.
Now the full picture. Every node is numbered — find its matching card below to see what it does and crucially what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Anything that calls our API — a browser, a mobile app, a partner's server-side SDK, or a malicious script. From the client's point of view, the rate limiter is invisible until it isn't: most requests get a 200, but a small percentage come back as 429 with a Retry-After header telling them when to back off. Well-behaved clients honor that header; abusive ones ignore it and burn through 429s until they give up.
Solves: nothing on its own — but the client experience drives the whole 429+Retry-After contract. If the limiter just dropped requests silently, clients would retry harder and make things worse.
The front door. Terminates TLS, parses the HTTP request, extracts the api_key and user identity from headers/JWTs. In production this is nginx, Envoy, or a managed gateway like AWS API Gateway or Kong. The gateway runs the rate-limiter middleware as the very first plugin in its filter chain — before any auth, routing, or app code.
Solves: a single chokepoint where every external request must pass. Without a gateway tier, you'd have to bolt rate limiting into every microservice — duplicated code, inconsistent behavior, easy to miss.
The brain. A small library (or sidecar) loaded into the API gateway that runs on every request. Per request, it: builds the limiter key (e.g., "free:user-42:/search"), looks up the limit from the Configuration Service, runs the sliding-window algorithm against the Counter Store, and returns {allowed, retry_after}. Stays in-process for latency reasons — a sidecar adds 1-2ms of localhost loopback that we'd rather not pay.
Solves: running the algorithm itself. Without this layer, you'd be embedding rate-limit logic inside each handler — guaranteed to drift, guaranteed to be inconsistent across services.
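A sketch of the per-request flow the middleware runs, with the Configuration Service and Counter Store injected as callables (the names here are illustrative, not a real library API):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    remaining: int = 0
    retry_after: int = 0

def check_request(tier: str, identifier: str, endpoint_class: str,
                  get_limit, check_and_increment) -> Decision:
    """get_limit hits the in-process config cache; check_and_increment is the
    single atomic round trip to the Counter Store (Redis)."""
    key = f"{tier}:{identifier}:{endpoint_class}"        # e.g. "free:user-42:search"
    limit, window_sec = get_limit(tier, endpoint_class)  # refreshed every ~10s in the background
    return check_and_increment(key, limit, window_sec)   # -> Decision(allowed, remaining, retry_after)
```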
Holds the rules: "free tier = 100/min on /search, 5/min on /login; paid tier = 10K/min and 100/min". Backed by a small DB (Postgres, etcd) and aggressively cached in the middleware (refreshed every 10 seconds). When an admin changes a limit in the UI, every middleware instance picks up the new value within seconds — no restarts.
Solves: hot-reloading limits and per-tier/per-endpoint customization. Without a config service, every limit change would require a code deploy across every gateway pod — a bad afternoon waiting to happen.
Where the actual counts live. A Redis cluster sharded by limiter key (consistent hashing) — for a "1 million users" workload at 1.6KB per user, that's 1.6GB of hot data, easily a small Redis cluster of 3 shards. Each request hits Redis exactly once via an atomic Lua script that does the read-window-sum-and-INCR in a single round trip. Redis's single-threaded event loop guarantees atomicity for free.
Solves: the shared-state problem from Pass 2. Every gateway pod writes to and reads from the same Redis, so 10 pods seeing 10 slices of one user's traffic still agree on the total. Without a centralized store, you're back to the N× breach.
The actual business logic — the thing the rate limiter exists to protect. Search service, payment service, ML inference, whatever. Receives only the requests that survived the rate-limiter check. From its point of view, traffic looks pre-shaped — no spikes above the cluster-wide cap.
Solves: being able to do its real job. Without rate limiting in front, the app server has to defensively size for peak abusive load (e.g., a 10× DDoS) — meaning 10× the infrastructure bill, idle most of the time.
The "you're throttled" responder. When the middleware decides a request should be denied, the throttle responder constructs the 429 response with the right headers (Retry-After, X-RateLimit-Reset), maybe a friendly JSON body, and sends it directly back to the client — without ever touching the app server. Often this is just a helper function inside the middleware, but conceptually it's its own concern.
Solves: giving clients machine-readable information about when to retry. Without proper headers, clients fall back to exponential backoff with jitter and may take minutes to recover from a 1-second throttle.
Every throttle event is fired async to a Kafka topic: {ts, key, endpoint, tier, count, limit}. Downstream consumers fan out: real-time dashboards ("which API keys are getting throttled the most right now?"), abuse detection ("this IP just hit 429 a thousand times — likely a script"), and capacity planning ("free-tier traffic is approaching 70% of capacity, time to push the upsell"). Crucially this is fire-and-forget — the middleware never waits for the event to be acknowledged.
Solves: visibility into why the system is throttling. Without analytics you can see throttles happening but never know which keys, which endpoints, or whether a customer is about to churn because they keep hitting their cap.
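A sketch of the fire-and-forget publish described in the card above (assuming the kafka-python client; the topic name is illustrative). The key point is that nothing on the request path ever waits for the broker:

```python
import json
import time
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def emit_throttle_event(key: str, endpoint: str, tier: str, count: int, limit: int) -> None:
    """Fire-and-forget: send() returns a future we deliberately never await."""
    event = {"ts": time.time(), "key": key, "endpoint": endpoint,
             "tier": tier, "count": count, "limit": limit}
    producer.send("throttle-events", json.dumps(event).encode("utf-8"))
```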
An admin interface where ops can adjust limits ("bump enterprise customer X to 50K/min"), maintain blacklists ("permanently block IP 192.0.2.1"), maintain allowlists ("internal services skip rate limiting"), and roll out new limit policies. Changes go through the Configuration Service ④ which then propagates to every middleware ③ within ~10 seconds.
Solves: giving humans control. Without it, every limit change requires a code review, deploy, and rollout — making the system unable to react in real time when an enterprise customer needs an emergency bump or an attack pattern is detected.
Sarah's data-pipeline script has a bug — instead of polling /api/orders once a minute, it polls every 100ms after a backend hiccup confused its retry logic. By 14:02:00 the script has been firing 10 requests per second — roughly 600 per minute against her plan limit of 100 req/min. Here's what happens:
1. Sarah's script sends GET /api/orders with her API key.
2. The gateway middleware ③ builds the limiter key free:sarah-api-key:orders and runs the Lua script against Redis ⑤. Lua atomically sums the buckets covering the trailing 60 seconds → result = 147.
3. 147 ≥ 100, so the middleware denies: HTTP 429, Retry-After: 38 (next slot opens in 38 seconds), X-RateLimit-Remaining: 0. Sent directly to Sarah's client; the App Server ⑥ never saw the request.
4. A throttle event fires async to the analytics pipeline ⑧, keyed by free:sarah-api-key:orders (deterministic from the request).

Now we look inside the middleware. For a "500 req/hour" limit at 1-minute granularity, we maintain 60 counters per limiter key. The algorithm in pseudocode:
Sliding window with counters:

```
function checkAndIncrement(key, limit, windowSec, granularitySec):
    now = currentTimeSec()
    bucket = floor(now / granularitySec)                 // e.g. the current 1-min bucket id
    oldestBucket = bucket - windowSec / granularitySec   // buckets at or before this have expired

    // 1. Load every bucket for this key (a Redis hash: bucket id → count)
    buckets = redis.HGETALL(key)

    // 2. Drop buckets that rolled out of the window, sum the survivors
    total = 0
    for (b, count) in buckets:
        if b <= oldestBucket: redis.HDEL(key, b)
        else:                 total += count

    // 3. Decide
    if total >= limit:
        return { allowed: false, retryAfter: computeRetryAfter(buckets) }

    // 4. Increment the current bucket
    redis.HINCRBY(key, bucket, 1)
    redis.EXPIRE(key, windowSec + granularitySec)
    return { allowed: true, remaining: limit - total - 1 }
```
In production all four Redis ops are wrapped in a single Lua script sent to Redis once — atomic by construction (next section), and exactly one network round trip.
Per limiter key: 60 buckets × (bucket id ~4B + counter ~2B + Redis per-entry overhead ~20B) = ~1.6 KB / key.
At 1 million active users (one key per user): ~1.6 GB of hot data — fits on a single Redis node with room to spare.
| Algorithm | Memory / user | 1M users | 10M users |
|---|---|---|---|
| Sliding Window Log (one entry per request, 500 req/hr cap) | ~24 KB | 24 GB | 240 GB |
| Sliding Window Counters (60 buckets at 1-min granularity) | ~1.6 KB | 1.6 GB | 16 GB |
| Memory savings | ~93% | ~22 GB saved | ~224 GB saved |
And the accuracy cost is at most one bucket's width of slop — at 1-minute granularity, a user might briefly see their effective limit be off by up to ~8 requests in a 500/hour cap (well under 2%). For most production use cases this is invisible.
Here's a subtle bug that bites every naive implementation. Picture two API gateway pods, each receiving one of Sarah's requests at the same millisecond. Both fetch count = 99 from Redis. Both check "99 < 100, allowed". Both increment to 100. Sarah just made request 101 in a 100/min window — and worse, two separate gateways each think they enforced the limit correctly. The bug is invisible, intermittent, and gets worse as you add more gateway pods.
Bundle "read window sum, check, INCR" into a single Lua script and ship it to Redis with EVAL. Redis runs Lua scripts atomically on its single-threaded event loop — nothing else can interleave. One network round trip. Simple to reason about. This is the production default.
Wrap the ops in a Redis transaction. Atomic execution, but no conditional logic inside — you can't say "INCR only if the sum is below the limit" in a MULTI block. Forces a two-phase pattern (read, compute, write) with optimistic retry, which gets ugly fast.
Optimistic concurrency: WATCH the key, read, decide, EXEC. If anyone modified the key between WATCH and EXEC, the EXEC fails and you retry. Works but performance degrades sharply under contention — for hot keys (e.g., a viral API key getting hammered), retry storms can destroy throughput.
One Redis box can comfortably do ~100K ops/sec. At 50K req/sec across all our endpoints, one node holds. At 500K req/sec, it doesn't. We shard.
Shard by the limiter key: the same key always hashes to the same shard, so reads and writes for one user always coordinate on one Redis node. Use consistent hashing (not a plain hash-mod-N) so adding or removing a shard relocates only ~1/N of the keys, not all of them — important when you grow from 3 to 6 shards next year.
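A toy hash ring to make the point concrete — in practice you'd lean on Redis Cluster or your client library's slot mapping rather than hand-rolling this:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, shards, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{shard}#{v}"), shard) for shard in shards for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def shard_for(self, limiter_key: str) -> str:
        idx = bisect.bisect(self.points, self._hash(limiter_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
ring.shard_for("free:user-42:search")   # the same key always lands on the same shard
```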
Rate limiting is always a write — every request increments a counter. Read replicas only buy you free read scaling, which we don't need. Stick with primaries; scale horizontally by sharding instead. Replicas only earn their keep as failover targets.
The limit values ("free tier = 100/min") change rarely — every middleware caches them in-process for ~10 seconds, refreshing in the background. This avoids hitting the Configuration Service ④ on every request. The counter can never be cached in-process — that's exactly the broken local-counting design from Pass 1 — but the limit lookup absolutely should be.
"Limit by what?" is one of the most consequential design questions and it's surprisingly nuanced. Each identifier has a different blast radius and different blind spots.
Pros: works before any login or auth — the only option for protecting /login, /signup, and anonymous endpoints. Catches attackers who haven't signed up.
Cons: NAT / shared IPs — an entire office building, college campus, or coffee shop sits behind one IP. Throttling that IP punishes hundreds of innocent users. IPv6 — the address space is huge, an attacker can rotate through millions of IPs trivially. Mobile carriers rotate IPs frequently.
Pros: precise and fair — tied to one human, regardless of network. The right choice for any authenticated endpoint where you can identify the user.
Cons: requires login first — useless on the auth endpoints themselves. An attacker brute-forcing POST /login {user: "sarah"} doesn't have their user_id; they have Sarah's email. You'd be limiting Sarah, not the attacker.
Pros: the right choice for B2B APIs — every paying customer gets a key, and the limit ties to billing. Perfect for SaaS pricing tiers.
Cons: useless for unauthenticated traffic. Also, customers may share keys across many services — you can't distinguish noisy-neighbor microservices behind one key.
Login is the perfect storm: you can't limit by user_id (the attacker doesn't have their own — they're guessing Sarah's), and limiting only by IP gives an attacker on a botnet free rein. The production answer: limit by both IP AND target username, simultaneously.
Login endpoint hybrid limit:

```
POST /login { "username": "sarah", "password": "..." }

// Two independent limiter checks — fail if either trips
checkRateLimit(key="login:ip:" + request.ip,       limit=20, window=60)
checkRateLimit(key="login:user:" + body.username,  limit=5,  window=60)

// IP cap = 20/min: stops one attacker from hammering many users
// User cap = 5/min: stops a botnet from each-trying-once at one target user
```
The IP limit catches credential-stuffing scripts. The username limit catches distributed brute-force where each bot tries Sarah's password once. Together they close both holes.
Not all customers are equal, and not all endpoints are equal. The configuration service holds a 2D matrix: tier × endpoint → limit.
| Tier | /search (cheap) | /api/orders (medium) | /generate-image (expensive) |
|---|---|---|---|
| Free | 100 / min | 20 / min | 5 / hour |
| Pro | 10K / min | 1K / min | 100 / hour |
| Enterprise | unmetered | 50K / min | 5K / hour |
Some requests cost more downstream than others. A search query touches Elasticsearch for ~5ms; an image-generation request burns 2 seconds of GPU. Model the difference with token costs: /search consumes 1 token, /generate-image consumes 50. The user has a single per-minute budget of N tokens, deducted per request. This is the same spirit as OpenAI's token-based API limits — usage is metered in cost, not raw request count.
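A sketch of how the cost table plugs into the budget check (the cost values here are illustrative — the real ones live in the Configuration Service):

```python
ENDPOINT_COSTS = {"/search": 1, "/api/orders": 5, "/generate-image": 50}

def weighted_allow(tokens_used_this_minute: int, token_budget: int, endpoint: str) -> bool:
    """One shared per-minute token budget; expensive endpoints drain it faster."""
    cost = ENDPOINT_COSTS.get(endpoint, 1)
    return tokens_used_this_minute + cost <= token_budget

# A Free-tier budget of 100 tokens/min buys 100 searches or two image generations.
```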
The middleware polls the Configuration Service every 10 seconds for updates (or subscribes to a pub/sub channel for instant invalidation). Admin changes a limit in the UI ⑨ → Configuration Service ④ writes new value → all middleware ③ instances pick it up within seconds. Critical: never restart gateway pods to push a limit change — that's an operational nightmare and the whole reason this service exists.
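A sketch of the in-process cache with a background refresh loop; fetch_all_limits is a stand-in for whatever call reads the Configuration Service (Postgres, etcd, or an internal API):

```python
import threading
import time

class LimitCache:
    """Caches the tier × endpoint limit matrix in-process, refreshed every ~10s."""
    def __init__(self, fetch_all_limits, refresh_sec: float = 10.0):
        self._fetch = fetch_all_limits
        self._limits = fetch_all_limits()        # e.g. {("free", "search"): (100, 60), ...}
        self._refresh_sec = refresh_sec
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._refresh_sec)
            try:
                self._limits = self._fetch()     # swap the whole dict; readers never block
            except Exception:
                pass                             # on failure, keep serving the last good copy

    def get(self, tier: str, endpoint_class: str):
        return self._limits[(tier, endpoint_class)]
```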
The rate limiter sits on the critical path of every request. If Redis goes down, the middleware can't check limits. You have two choices, and they're both painful.
If Redis is unreachable, the middleware logs a loud alarm and allows the request. The protected service handles the traffic without rate limiting for the duration of the outage.
Pros: users see no impact from a limiter outage. Your availability SLO is preserved. Your engineers can fix Redis without the entire site being down.
Cons: during the outage, you have no protection — an attacker who notices could pile on. A noisy customer could exhaust downstream capacity.
If Redis is unreachable, the middleware denies all requests with 429 (or 503). The protected service stays safe but the user-visible service is effectively down.
Pros: backend is protected from any abuse during the outage. Strong consistency with the "limits are sacred" promise.
Cons: a Redis blip takes down your entire API. Your availability SLO craters. Customers churn.
Almost every production system picks fail open. The reasoning: rate limiting is a defense, not a feature — losing it for 60 seconds is a manageable security risk; losing your entire API is a business disaster. Three additional safeguards make fail-open less scary:
Wrap Redis calls in a circuit breaker. After 5 consecutive failures, the breaker opens and the middleware stops calling Redis for 30 seconds — saving the latency of timeout-waiting on every request. Auto-recovery via probe requests.
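A sketch of a consecutive-failure breaker wrapped around the Redis call, failing open both on individual errors and while tripped (thresholds are illustrative):

```python
import time

class RedisBreaker:
    """After `threshold` consecutive Redis failures, skip Redis for `cooldown` seconds."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def allow_with(self, redis_check) -> bool:
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown:
            return True                          # breaker open: fail open, don't even try Redis
        try:
            allowed = redis_check()              # the normal sliding-window call
            self.failures, self.opened_at = 0, 0.0
            return allowed
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip; first call after cooldown is the probe
            return True                          # single failure: fail open for this request
```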
Each middleware also runs a tiny per-pod sliding-window counter as a fallback. When Redis is down, the middleware uses local counts only. The cluster-wide guarantee weakens to "each pod still enforces its share" — not perfect but much better than zero protection.
Every fail-open event pages on-call. The expectation is: "we're vulnerable right now, fix Redis NOW". Combined with the L1 fallback, the window of total exposure is short.
The Retry-After header includes random jitter (e.g., 17 ± 5 seconds) so even a million clients hitting 429 in the same second don't all retry in the same second — the standard backoff-with-jitter pattern.

A related support headache: a customer swears they're under their limit yet keeps seeing 429s. Common causes: weighted endpoint costs — /heavy-endpoint costs 5 tokens, so their "50 requests" are really 250/100 — and time skew: their dashboard polls every 60s while the sliding window updates every second, so the dashboard lags. Fix: surface "tokens remaining" rather than "requests remaining" in the dashboard, backed by a real-time counter read from the same Redis the limiter uses.
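A minimal sketch of the jittered Retry-After computation from the first paragraph above:

```python
import random

def retry_after_with_jitter(base_seconds: int, spread: int = 5) -> int:
    """Randomize Retry-After (e.g. 17 ± 5s) so throttled clients don't all retry in the same second."""
    return max(1, base_seconds + random.randint(-spread, spread))
```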