High-Level Design

OTP Validation System

Production-ready OTP architecture — hashing, 5-layer rate limiting, Redis + Postgres split, replay protection, and every use case from login to payment.

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

Clarify Requirements

Always open an HLD by aligning on scope. OTP is a single building block — but different products treat it very differently (2FA vs payments vs password reset). Pin these down first.

✅ Functional Requirements

  • Generate a random 4–6 digit numeric code
  • Deliver via SMS, email, or authenticator app
  • Store with a short TTL (2–10 min by use case)
  • Verify submitted OTP against stored value
  • Enforce single-use — delete after success
  • Invalidate after N failed attempts
  • Rate-limit generation & verification independently
  • Audit every OTP event (generation, attempt, success, failure)

⚡ Non-Functional Requirements

  • Latency: verify p99 < 50 ms (Redis in-memory)
  • Availability: 99.95% — auth blocks logins
  • Durability: audit log must survive Redis loss
  • Security: plaintext OTP never persisted, never logged
  • Compliance: PCI-DSS, GDPR, HIPAA, RBI-ready
  • Scale: handle SMS bombing & brute-force under load
Out of scope (confirm with interviewer): SMS provider failover strategy, push-notification OTP, hardware tokens (YubiKey/TOTP apps are adjacent), OAuth/SSO, and session management after OTP success.
Step 2

Capacity & Scale Estimates

Back-of-envelope first — it drives storage, rate-limit tiers, and the Redis/Postgres split.

| Metric | Assumption | Result |
|---|---|---|
| MAU | — | 50 M |
| OTPs / active user / day | ~3 (login + sensitive actions) | ~150 M/day |
| Peak QPS (3× avg) | 150 M ÷ 86 400 × 3 | ~5 200 QPS |
| OTP payload (hash + metadata) | — | ~200 B per record |
| Peak active OTPs in Redis (5-min TTL) | 5 200 × 300 | ~1.5 M keys ≈ 300 MB |
| Audit rows / year (Postgres) | 150 M/day × 365 | ~55 B rows/yr → partition monthly |
| SMS cost (even at ₹0.10/SMS) | 150 M/day | ₹15 M/day → rate limits matter economically, not just for security |
Takeaways: Redis easily fits active OTPs in RAM. Postgres audit needs partitioning (or TimescaleDB / Cassandra for very high scale). Rate limiting protects SMS spend as much as it protects users.
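The table's arithmetic is easy to sanity-check in a few lines. The constants (50 M MAU, 3 OTPs/user/day, ~200 B/record, 5-min TTL, ₹0.10/SMS) are the table's assumptions, not measured values:

```javascript
// Back-of-envelope check for the capacity table above.
const otpsPerDay = 50e6 * 3;               // 50 M MAU x ~3 OTPs -> 150 M/day
const avgQps     = otpsPerDay / 86_400;    // ~1 736 QPS average
const peakQps    = Math.round(avgQps * 3); // ~5 200 QPS at 3x peak
const activeKeys = peakQps * 300;          // 5-min TTL -> ~1.56 M live keys
const redisBytes = activeKeys * 200;       // ~200 B/record -> ~312 MB
const auditRowsPerYear = otpsPerDay * 365; // ~55 B rows/yr -> partition monthly
const smsCostPerDay    = otpsPerDay * 0.10; // Rs 15 M/day at Rs 0.10/SMS

console.log({ peakQps, activeKeys, redisMB: Math.round(redisBytes / 1e6) });
```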
Step 3

Actors & Use Cases

Who triggers OTP flows and what they're trying to accomplish. Color-coded by actor.

flowchart LR
  U([User])
  PS([Payment Service])
  AS([Auth Service])
  SEC([Security Team])
  U --> L[Login / 2FA]
  U --> PR[Password Reset]
  U --> AC[Account Change]
  PS --> PAY[Payment Confirm]
  AS --> REG[Registration Verify]
  SEC --> AUD[Audit & Investigate]
  style U fill:#e8743b,stroke:#e8743b,color:#fff
  style PS fill:#4a90d9,stroke:#4a90d9,color:#fff
  style AS fill:#38b265,stroke:#38b265,color:#fff
  style SEC fill:#9b72cf,stroke:#9b72cf,color:#fff

🟢 Login / 2FA

Phone number → OTP → JWT. 5-min TTL, 3 regenerations per 15 min.

🔐 Password Reset

Email + reset token + OTP. 10-min TTL. Invalidate other sessions on success.

💳 Payment Confirm

Tied to transaction_id. 2-min TTL (shorter MITM window).

✉️ Account Change

Sent to the OLD contact — defends a compromised account.

📝 Registration

Verify email/phone ownership before provisioning the account.

🛡️ Audit / Investigate

Security team queries the audit log by user_id / ip — never sees OTP values.

Step 4

High-Level Architecture

Six moving parts. Redis is the hot store, Postgres the cold/audit store, the OTP service owns all hashing and verification.

flowchart LR
  CL([Client App])
  GW[API Gateway<br/>per-IP rate limit<br/>CAPTCHA]
  OTP[OTP Service<br/>generate / hash / verify]
  RL[Rate Limiter<br/>Redis counters]
  R[(Redis<br/>hashed OTP + TTL)]
  PG[(PostgreSQL<br/>audit log)]
  SMS[SMS / Email<br/>Provider]
  MON[Monitoring<br/>anomaly alerts]
  CL -- HTTPS --> GW
  GW --> OTP
  OTP --> RL
  OTP --> R
  OTP --> SMS
  OTP -. async write .-> PG
  OTP -. metrics .-> MON
  RL --> R
  style CL fill:#e8743b,stroke:#e8743b,color:#fff
  style GW fill:#4a90d9,stroke:#4a90d9,color:#fff
  style OTP fill:#38b265,stroke:#38b265,color:#fff
  style R fill:#9b72cf,stroke:#9b72cf,color:#fff
  style PG fill:#9b72cf,stroke:#9b72cf,color:#fff
  style SMS fill:#d4a838,stroke:#d4a838,color:#000
  style MON fill:#3cbfbf,stroke:#3cbfbf,color:#000

🔷 Why Redis as primary

  • Native TTL — no cleanup cron needed
  • Sub-ms verify latency
  • Atomic DECR for attempts-remaining
  • Cluster-shardable by user_id

🗄️ Why Postgres for audit

  • Durability + compliance retention
  • Rich queries for fraud investigation
  • Survives Redis cluster loss
  • Partition by month for B-row scale
Key design call: the write to Postgres is async (Kafka or fire-and-forget). Auth latency cannot wait on an RDBMS round-trip.
Step 5

API Design

Two endpoints carry 99% of traffic. Keep the surface area small — fewer endpoints = fewer attack vectors.

POST /v1/otp/generate
// Request
{
  "identifier": "+919876543210",   // phone OR email
  "purpose":   "LOGIN",            // LOGIN | RESET | PAYMENT | UPDATE
  "context":   { "transaction_id": "txn_abc" }  // optional, binds OTP to a resource
}

// 200 — OTP queued for delivery
{
  "otp_id":       "uuid-v4",
  "expires_at":    1735128300,
  "attempts_left": 5,
  "cooldown_sec":  30    // client must wait before "resend"
}

// 429 — rate limited (generic, no user-enumeration)
{ "error": "TOO_MANY_REQUESTS", "retry_after": 120 }
POST /v1/otp/verify
// Request
{
  "otp_id": "uuid-v4",
  "code":   "847231",
  "context": { "transaction_id": "txn_abc" }
}

// 200 — success (one-shot; server deletes OTP immediately)
{ "verified": true, "session_token": "jwt..." }

// 401 — mismatch (generic; no "wrong code" vs "expired" leak)
{ "verified": false, "attempts_left": 3 }

// 410 — OTP gone (expired, used, or blown through attempt limit)
{ "error": "OTP_NOT_ACTIVE" }
API hardening notes: responses are deliberately uniform — the same error shape for "wrong code" and "expired code" prevents probing. otp_id is opaque (UUID, not the phone number) so the client request can be logged without leaking the identifier.
Step 6

Data Model — Redis + Postgres

Two stores, two very different jobs.

Redis (hot path)

Key / Value layout
# Active OTP record — stored as a Redis hash (HSET) so that
# attempts_left can be decremented atomically with HINCRBY
KEY   otp:{otp_id}
VALUE {
  "user_id":       "u_1234",
  "identifier_hash": "sha256(+919876543210)",
  "purpose":       "LOGIN",
  "hash":          "$2b$12$...",   # bcrypt hash of the OTP
  "attempts_left": 5,
  "context":       { "transaction_id": "txn_abc" },
  "created_at":    1735128000
}
TTL   300  # seconds — auto-deletes, no cron needed

# Rate-limit counters (fixed-window via INCR + EXPIRE; a true sliding window needs a ZSET or a Lua script)
KEY   rl:gen:user:{user_id}    TTL 900   # 3 per 15 min
KEY   rl:gen:ip:{ip}           TTL 3600  # 10 per hour
KEY   rl:ver:user:{user_id}    TTL 3600  # 10 per hour
KEY   rl:ver:ip:{ip}           TTL 3600  # 20 per hour
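The counter pattern above can be sketched with an in-memory Map standing in for Redis (the key name and the 3-per-15-min limit come from the layout; a real deployment would issue INCR and EXPIRE against Redis itself):

```javascript
// Fixed-window rate counter mimicking Redis INCR + EXPIRE semantics.
// A Map stands in for Redis here; this is a sketch, not a client library.
const windows = new Map(); // key -> { count, resetAt }

function hit(key, limit, ttlSec, now = Date.now()) {
  const w = windows.get(key);
  if (!w || now >= w.resetAt) {
    // first hit of a new window: INCR (to 1) + EXPIRE
    windows.set(key, { count: 1, resetAt: now + ttlSec * 1000 });
    return { allowed: true, remaining: limit - 1 };
  }
  w.count += 1; // subsequent hits are a plain INCR
  return { allowed: w.count <= limit, remaining: Math.max(0, limit - w.count) };
}

// Per-user generation limit: 3 per 15 minutes
for (let i = 0; i < 3; i++) hit('rl:gen:user:u_1234', 3, 900);
console.log(hit('rl:gen:user:u_1234', 3, 900).allowed); // 4th call in window: false
```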

Postgres (audit, durable)

erDiagram
  USER ||--o{ OTP_AUDIT : generates
  OTP_AUDIT ||--o{ OTP_ATTEMPT : "has many"
  USER {
    uuid user_id PK
    string email
    string phone_hash
    timestamp created_at
  }
  OTP_AUDIT {
    uuid otp_id PK
    uuid user_id FK
    string purpose
    string identifier_hash
    timestamp created_at
    timestamp expires_at
    string outcome "GENERATED|VERIFIED|EXPIRED|EXHAUSTED"
    string ip_address
    string user_agent
  }
  OTP_ATTEMPT {
    bigserial attempt_id PK
    uuid otp_id FK
    timestamp attempted_at
    boolean success
    string ip_address
    string failure_reason
  }
⚠️ Critical: neither the OTP code nor its hash is ever written to Postgres. The audit log tells you who tried to verify what and when — not what the secret was. This matters for GDPR/PCI auditors and for insider-threat containment.
Step 7

Generate OTP — Sequence Flow

Happy path + the three places it can fail loudly.

sequenceDiagram
  actor U as User
  participant GW as API Gateway
  participant S as OTP Service
  participant RL as Rate Limiter
  participant R as Redis
  participant SMS as SMS Provider
  participant Q as Audit Queue
  U->>GW: POST /otp/generate {phone, LOGIN}
  GW->>GW: per-IP rate limit
  GW->>S: forward
  S->>RL: INCR rl:gen:user + rl:gen:ip
  alt limit exceeded
    RL-->>S: over limit
    S-->>U: 429 TOO_MANY_REQUESTS
  else within limit
    S->>S: generate 6-digit OTP (SecureRandom)
    S->>S: bcrypt hash(otp, cost=12)
    S->>R: SET otp:{uuid} {hash, ...} EX 300
    S->>SMS: send plain OTP
    S-)Q: enqueue audit event
    S-->>U: 200 {otp_id, expires_at}
  end

🎲 Randomness

SecureRandom / crypto.randomInt / secrets.randbelow. Never Math.random().

🔒 Hash cost

bcrypt work factor 12 (~250 ms). Slow on purpose — defeats offline GPU cracking.

📮 Delivery

SMS send is synchronous for UX feedback, but provider timeouts fail fast (2 s budget).

Step 8

Verify OTP — Sequence Flow

Constant-time comparison + atomic attempt decrement are the two things people get wrong.

sequenceDiagram
  actor U as User
  participant GW as API Gateway
  participant S as OTP Service
  participant R as Redis
  participant Q as Audit Queue
  U->>GW: POST /otp/verify {otp_id, code}
  GW->>S: forward (rate limited)
  S->>R: GET otp:{otp_id}
  alt not found
    R-->>S: nil
    S-->>U: 410 OTP_NOT_ACTIVE
  else found
    R-->>S: {hash, attempts_left}
    S->>S: bcrypt.verify(submitted, hash) %% constant-time
    alt match
      S->>R: DEL otp:{otp_id} %% single-use
      S-)Q: audit VERIFIED
      S-->>U: 200 {verified:true, jwt}
    else mismatch
      S->>R: HINCRBY attempts_left -1
      alt attempts_left == 0
        S->>R: DEL otp:{otp_id}
        S-)Q: audit EXHAUSTED
        S-->>U: 410 OTP_NOT_ACTIVE
      else still alive
        S-)Q: audit FAILED_ATTEMPT
        S-->>U: 401 {attempts_left}
      end
    end
  end
⭐ The DEL on success is non-negotiable. If you only mark is_used=true and forget to delete, a race between two concurrent verifies can double-authenticate. Deleting is atomic; flags are not.
Step 9

Why Hash OTPs (and Why Plaintext Is a Career-Ender)

This is the single most common OTP mistake in production systems. Here's every way plaintext storage betrays you.

💣 Database breach

Redis exposed? Every live OTP is instantly usable. No cracking — just read and replay. Account-takeover at wire speed.

👤 Insider access

Any engineer with read access can impersonate any user, and the audit trail looks identical to a legitimate login.

📜 Log leaks

Plaintext OTPs show up in Sentry, Datadog, slow-query logs, CloudWatch, backup dumps. One log.debug(req.body) is all it takes.

📋 Compliance

PCI-DSS, GDPR, HIPAA, RBI all prohibit plaintext authentication data. Audit failure → fines + forced remediation.

🧠 Memory dumps

A core dump of the app server exposes any OTP sitting in RAM.

🌐 Side channels

Metrics and APM traces often capture request payloads. Hashing at the edge closes that leak surface.

The rule: the plaintext OTP exists in exactly two places — the SMS message and the user's head. The server persists only the bcrypt/Argon2 hash. Between generation and delivery it sits in memory for milliseconds; after that, only the hash remains.

Hash algorithm choice

| Algorithm | Verdict | Why |
|---|---|---|
| MD5 / SHA-1 | ❌ Never | Broken; GPU can brute-force the 6-digit space in seconds |
| SHA-256 (plain) | ❌ Never | Too fast — 1 M combinations cracked in <1 sec |
| bcrypt | ✅ Fine | Work factor 10–12 slows attackers; battle-tested |
| scrypt | ✅ Good | Memory-hard — resists custom ASIC attacks |
| Argon2id | ✅ Best | Modern, tunable memory + time + parallelism |
Step 10

5-Layer Brute Force Defense

A 6-digit OTP has only 1 000 000 combinations. Without layered defense, an attacker scripts 10 req/sec and is in within ~28 hours. Here's the layered wall.

| Layer | Scope | Limit | Action on breach |
|---|---|---|---|
| 1 — Per-OTP attempts | Single otp_id | 5 wrong tries | Invalidate OTP |
| 2 — Per-user rate | user_id | 10 verifies / hour | 30-min account lock |
| 3 — Per-IP rate | Source IP | 20 verifies / hour | Temporary IP block |
| 4 — Exponential backoff | Per-user per-OTP | 0 s → 5 s → 30 s → 2 m | Kills scripting speed |
| 5 — CAPTCHA | Gateway | After 3 failed attempts | Forces human proof |
Defense in depth: one layer can be bypassed (e.g., rotating IPs defeats Layer 3). All five together make brute force economically infeasible — attackers move on to softer targets.
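Layer 4's schedule can be a simple lookup table. A sketch, assuming the failed-attempt count is tracked alongside the OTP record:

```javascript
// Exponential-backoff sketch for Layer 4: delay before the next
// allowed verify attempt, capped at 2 minutes after three failures.
const SCHEDULE_SEC = [0, 5, 30, 120];

function backoffSec(failedAttempts) {
  const idx = Math.min(failedAttempts, SCHEDULE_SEC.length - 1);
  return SCHEDULE_SEC[idx];
}

console.log([0, 1, 2, 3, 4].map(backoffSec)); // [ 0, 5, 30, 120, 120 ]
```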

SMS bombing defense (generation side)

🙋 Per user

Max 3 OTP generations per 15 minutes.

🌐 Per IP

Max 10 generations per hour.

⏳ Resend cooldown

Reject "resend" < 30 s after previous send.

Step 11

OTP Lifecycle — State Machine

Every OTP follows exactly one path from birth to death. Three terminal states.

stateDiagram-v2
  [*] --> GENERATED : POST /generate
  GENERATED --> VERIFIED : code matches
  GENERATED --> FAILED_ATTEMPT : code mismatches (>0 attempts left)
  FAILED_ATTEMPT --> GENERATED : still alive
  FAILED_ATTEMPT --> EXHAUSTED : attempts_left == 0
  GENERATED --> EXPIRED : TTL elapses
  VERIFIED --> [*]
  EXHAUSTED --> [*]
  EXPIRED --> [*]
Why model it explicitly? Every outcome must map to exactly one audit row — no "in between" states. This is what makes fraud investigation tractable: every OTP that ever existed has a terminal state on record.
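The diagram's transitions can be encoded as data, which is also how you guarantee every outcome maps to exactly one audit row. A sketch with illustrative event names:

```javascript
// Lifecycle sketch: the state diagram's transition table, as data.
// Unknown transitions throw, so no OTP can drift into an unmodeled state.
const TRANSITIONS = {
  GENERATED:      { match: 'VERIFIED', mismatch: 'FAILED_ATTEMPT', ttl_elapsed: 'EXPIRED' },
  FAILED_ATTEMPT: { attempts_left: 'GENERATED', attempts_exhausted: 'EXHAUSTED' },
};
const TERMINAL = new Set(['VERIFIED', 'EXPIRED', 'EXHAUSTED']);

function next(state, event) {
  const to = (TRANSITIONS[state] || {})[event];
  if (!to) throw new Error(`illegal transition: ${state} --${event}-->`);
  return to;
}

let s = 'GENERATED';
s = next(s, 'mismatch');      // FAILED_ATTEMPT
s = next(s, 'attempts_left'); // back to GENERATED
s = next(s, 'match');         // VERIFIED
console.log(s, TERMINAL.has(s)); // VERIFIED true
```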
Step 12

Use Case Variants

Same machinery, four different policies. TTL and delivery channel are the main knobs.

| Use case | TTL | Channel | Binding | Special |
|---|---|---|---|---|
| Login / 2FA | 5 min | SMS | user_id | Issue JWT on success |
| Password reset | 10 min | Email | OTP + reset-token | Kill all other sessions |
| Payment confirm | 2 min | SMS | transaction_id | Lock against replay |
| Account change | 3 min | OLD email/phone | user_id | Notify old contact after success |

💳 Why Payment TTL is shortest

Short window = smaller MITM attack surface. Payment OTPs must also bind to transaction_id — an OTP generated for a ₹500 transfer cannot be replayed on a ₹50 000 one.
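Context binding comes down to one equality check at verify time. A sketch using the `context.transaction_id` field from the API section:

```javascript
// Context-binding sketch: a payment OTP is valid only for the
// transaction it was generated for; any mismatch rejects the verify.
function contextMatches(stored, submitted) {
  return Object.entries(stored || {}).every(
    ([k, v]) => submitted && submitted[k] === v
  );
}

const stored = { transaction_id: 'txn_small_transfer' };
console.log(contextMatches(stored, { transaction_id: 'txn_small_transfer' })); // true
console.log(contextMatches(stored, { transaction_id: 'txn_large_transfer' })); // false, replay blocked
```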

✉️ Why Account-Change OTP goes to OLD contact

If the attacker has already taken over the account, they control the current contact. Sending OTP to the previous email/phone forces them to also own that — raising the attack bar significantly.

Step 13

Common Mistakes & How to Dodge Them

Every single one of these has been shipped to production by a real team at some point.

❌ Predictable randomness

Math.random(), timestamp-seeded PRNGs, or incrementing counters. Use a CSPRNG.

❌ Long TTLs

Anything > 10 min is an attack window. Match TTL to use case.

❌ Same code on resend

Regenerate on every resend. Reusing the same code extends its effective lifetime past the TTL and gives the attacker repeated shots at a single target.

❌ No attempt cap

OTP must die after N failures — the attacker must not be able to retry forever within the TTL.

❌ Logging OTP values

Scrub request-body logging for any field named code, otp, pin. Add allow-list logging by default.

❌ Fast hash (SHA-256 raw)

Use bcrypt/scrypt/Argon2. A 6-digit space falls to fast hashes in milliseconds.

❌ Non-constant-time compare

== on hashes leaks timing. Use crypto.timingSafeEqual or the library's built-in verify.

❌ Forgetting single-use

DEL on success. Otherwise the attacker sniffs the OTP and replays it after the user consumes it.
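For the logging mistake above, a recursive scrubber is cheap insurance. A sketch: the deny-list here covers only the field names the text mentions, and a strict allow-list is safer still:

```javascript
// Log-scrubbing sketch: redact any field named code / otp / pin
// (case-insensitive), recursively, before a body reaches any logger.
const SECRET_FIELDS = /^(code|otp|pin)$/i;

function scrub(obj) {
  if (obj === null || typeof obj !== 'object') return obj;
  if (Array.isArray(obj)) return obj.map(scrub);
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) =>
      SECRET_FIELDS.test(k) ? [k, '[REDACTED]'] : [k, scrub(v)]
    )
  );
}

const body = { otp_id: 'uuid-v4', code: '847231', context: { pin: '1234' } };
console.log(scrub(body));
// { otp_id: 'uuid-v4', code: '[REDACTED]', context: { pin: '[REDACTED]' } }
```

Note that `otp_id` survives untouched: it is opaque by design, so it is safe to log.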

Step 14

Monitoring & Alerts

OTP systems are where anomalies appear first — monitoring them buys early warning for larger attacks.

📈 Signals worth paging on

  • OTP generation rate > 3× baseline → SMS bombing
  • Verify-failure rate > 10% per minute → brute force
  • > 50 OTPs / hour from one IP → bot farm
  • Verification from geography user never logs in from
  • SMS provider 5xx > 2% → rotate to secondary

📊 Signals worth dashboarding

  • Verify p50 / p99 latency
  • Generate → verify conversion funnel
  • Median time-to-verify (UX proxy)
  • Attempts-left histogram (brute-force smell test)
  • SMS cost per DAU
AI Layer ①

Risk-Based Authentication — Skip-OTP Intelligence

Before sending an OTP, first ask: is this login trustworthy? Trusted users skip the OTP entirely. Normal users get the standard OTP. Suspicious ones get extra checks (biometric + OTP). The OTP system itself stays exactly the same — we just put a smart filter in front of it.

💡 Think of it like a smart bouncer at a club. Today, everyone walking in has to show ID — even regulars (= an OTP for every login). A smart bouncer would: wave through known regulars (skip OTP), ask normal customers for ID (standard OTP), and demand extra proof from anyone who looks off (force step-up). The "AI" here is just a pattern-matcher that learned from past logins to score how risky each new one looks (0 = trusted, 100 = very suspicious).
🎓 New to AI? Plain-English glossary of terms used across all three AI sections:
  • AI / ML model — a really good pattern-matcher trained on past data. Show it thousands of real vs. fraudulent logins; it learns to tell them apart.
  • XGBoost / LightGBM — popular AI techniques that work like a giant decision flowchart with hundreds of branches. Fast and easy to debug — that's why we use them here.
  • Feature store (Feast / Tecton) — a database holding all the signals we collect about users: device, location, login history, typing patterns. The AI reads from here to make a decision.
  • Inference — fancy word for "running the AI to get a prediction." Each call takes ~20–30 ms.
  • SHAP values — the AI's reasoning for a decision. Like: "flagged because new country (40%) + new device (35%) + unusual time (25%)." Helps debug AND satisfies regulators who require explainability.
  • Isolation Forest — an AI technique great at finding "odd ones out" — unusual data points hidden among normal traffic.
  • Prophet — AI that predicts what normal traffic should look like (think weather forecasting). Alerts when reality deviates from the forecast.
  • Drift — when user behavior changes over time and the AI's old understanding becomes outdated. We monitor for it and retrain.
  • Shadow mode — running the AI silently in the background for 2–4 weeks before turning it on for real, to confirm its decisions would have been good.
  • p99 < 30 ms — 99% of AI decisions complete in under 30 milliseconds. Fast enough that users won't notice the AI is even there.
  • Behavioral biometrics — how a person types, swipes, holds their phone. Every person has a unique pattern, hard to fake.
  • Device fingerprint — a "digital signature" of a device made from browser, OS, screen size, timezone, fonts, etc. Lets us recognize the same device across sessions without a cookie.
🪡 Where AI plugs in — at the front door (API gateway). Every login request passes through the gateway first. We add one new step there: ask the AI "how risky is this login?" Based on the answer (a number from 0 to 100), the gateway routes the request:
  • score < 30 (looks safe) → log the user in directly, no OTP needed
  • 30 – 70 (medium) → send the usual OTP, business as usual
  • score ≥ 70 (looks suspicious) → demand more proof (OTP plus fingerprint or face)

Big picture — the AI sits between the user and the OTP service

flowchart LR
  CL([Client])
  GW[API Gateway]
  RS[Risk Scoring<br/>XGBoost · p99 < 30ms]
  FS[(Feature Store<br/>Feast / Tecton)]
  OTP[OTP Service]
  SESS[Issue Session<br/>skip OTP]
  STEP[Step-up<br/>Biometric + OTP]
  PG[(Postgres<br/>+ training labels)]
  CL --> GW
  GW --> RS
  RS --> FS
  RS -- score < 30 --> SESS
  RS -- 30 - 70 --> OTP
  RS -- > 70 --> STEP
  OTP -. async .-> PG
  PG -. daily retrain .-> FS
  style RS fill:#9b72cf,stroke:#9b72cf,color:#fff
  style FS fill:#3cbfbf,stroke:#3cbfbf,color:#000
  style OTP fill:#38b265,stroke:#38b265,color:#fff

Step-by-step — what happens during a login (with AI added)

sequenceDiagram
  actor U as User
  participant GW as API Gateway
  participant RS as Risk Service
  participant FS as Feature Store
  participant OTP as OTP Service
  participant SESS as Session Svc
  participant STEP as Step-up Svc
  U->>GW: POST /auth/start {phone, device, ctx}
  GW->>RS: score(user, device, ip, ua)
  RS->>FS: fetch features
  FS-->>RS: ~80 features (≤10ms)
  RS->>RS: XGBoost inference (≤20ms)
  RS-->>GW: {risk_score, shap}
  alt risk_score < 30 (trusted)
    GW->>SESS: issueSession(user_id)
    SESS-->>U: 200 {jwt} — OTP skipped
  else 30 ≤ score < 70 (medium)
    GW->>OTP: forward to /otp/generate
    Note over OTP: standard generate · hash · store
    OTP-->>U: 200 {otp_id, expires_at}
  else score ≥ 70 (high)
    GW->>STEP: requireStepUp([OTP, BIOMETRIC])
    STEP-->>U: 401 {challenge_id, factors}
  end

📥 What signals the AI looks at

  • Device fingerprint — a digital signature of the user's phone or laptop
  • Where they're coming from — IP address, network, distance from where they usually log in
  • How fast they're trying — number of logins in the last hour from this user / device / IP
  • Is the time normal? — e.g., 3 am login when they usually log in at 9 am is suspicious
  • How they type and tap — every person has a unique rhythm; very hard to fake
  • Where they're logging in from — web, mobile app, or external partner

🧠 What kind of AI we use

  • XGBoost — a popular, fast, accurate AI model that can explain its own decisions
  • Trained on 3 months of past login data — labelled as real or fraudulent
  • Output: a risk score from 0 to 100 (higher = more suspicious)
  • Hosted on AI-serving infrastructure — 99% of decisions in under 30 ms
  • Reads user signals from the feature store in under 10 ms
Decision policy
// Risk-tiered auth routing
async function startAuth(req) {
  const features = await featureStore.fetch(req.user_id, req.device_id);
  const { risk_score, shap } = await riskService.score(features, { timeout: 30 });

  // audit the decision (without leaking raw features)
  audit.emit({ user_id: req.user_id, risk_score, shap_top3: shap.slice(0, 3) });

  if (risk_score < 30) return issueSession(req.user_id);            // trusted → skip OTP
  if (risk_score < 70) return generateOtp(req.user_id, "LOGIN");     // medium → standard OTP
  return requireStepUp({ factors: ["OTP", "BIOMETRIC"] });            // high → step-up
}
🔁 The AI keeps learning over time. Every login outcome (succeeded, failed, user gave up, later reported as fraud) is fed back to the AI as a labelled example. Once a day, the model retrains so it stays sharp as attackers change their tactics.
🛟 What if the AI breaks? If the AI service is slow or down, we don't lock users out — we just fall back to send OTP to everyone (the basic flow). Rule: never let a broken AI block legitimate users, but also never let it pretend everything's safe when it isn't. Hard timeout — if the AI doesn't answer in 30 ms, we skip it for that login.
AI Layer ②

Adaptive Rate Limiting — Beyond Static Thresholds

Smart rate limiting that learns what normal traffic looks like at every hour of every day. Catches sneaky attacks (1000 different IPs each making 1 attempt — invisible to per-IP limits) and stays quiet during real traffic spikes like Black Friday or a viral app launch.

💡 Think of it like a smart spam filter. Old email rules say "block any sender who emails you 100 times." Spammers got smart — now they send 1 email each from 1000 different addresses, slipping under the limit. A modern spam filter spots the overall pattern (writing style, suspicious links, sending behavior) even when no single sender breaks the rule. Same idea here: the AI watches the shape of incoming OTP traffic and catches attacks even when no individual user crosses a static threshold.
🪡 Where AI plugs in — at two places.
  • Place 1 — at the front door (API gateway): right next to the existing rule that limits how many requests come from one IP. The AI looks at the bigger picture and says "allow / show captcha / slow down / block." The gateway uses both the old rule AND the AI's view.
  • Place 2 — inside the OTP service's rate limiter: right next to the per-user counters that say "this user has tried 3 times in 15 minutes." The AI's score is added as a second opinion.
Both places are additive — the old simple rules stay as a safety floor (always work, easy to debug). The AI catches the sneaky stuff the simple rules miss.

Step-by-step — what happens during an OTP request (with AI added)

sequenceDiagram
  actor U as User
  participant GW as API Gateway
  participant AE as Anomaly Engine
  participant OTP as OTP Service
  participant RL as Rate Limiter
  participant R as Redis
  U->>GW: POST /otp/generate
  Note over GW: Hook A
  GW->>GW: static per-IP check (5-layer floor)
  GW->>AE: score(req/min, IP entropy, UA diversity, fail ratio)
  AE-->>GW: {anomaly_score, action}
  GW->>GW: combine(static, anomaly) → verdict
  alt verdict = allow
    GW->>OTP: forward
    Note over OTP,RL: Hook B
    OTP->>RL: INCR rl:gen:user + rl:gen:ip
    OTP->>AE: score(per-user features)
    AE-->>OTP: {user_anomaly}
    OTP->>OTP: combine → admit
    OTP->>R: SET otp:{id} {hash} EX 300
    OTP-->>U: 200 {otp_id, expires_at}
  else verdict = captcha
    GW-->>U: 401 + captcha challenge
  else verdict = throttle / block
    GW-->>U: 429 / 403 + SOC alert
  end

🔍 Real-time "odd one out" detection

  • Uses Isolation Forest — an AI good at spotting unusual patterns
  • Watches signals like: requests per minute, how many different IPs / browsers, success vs failure ratio
  • Catches sneaky attacks where 1000 different IPs each try once — invisible to per-IP rules
  • Looks at the last 5 minutes of traffic, refreshing every 30 seconds

📈 "What's normal traffic right now?" forecasting

  • Uses Prophet — an AI that predicts traffic patterns (like weather forecasting)
  • Knows a Black Friday 8 pm spike is normal — won't trigger an alarm
  • Knows a Tuesday 3 am spike is not normal — flags it
  • Stops the chronic "3 am false alarm that wakes the on-call engineer"

Decision pipeline

flowchart LR
  REQ([Request]) --> SR[Static Rules<br/>deterministic floor]
  REQ --> AE[Anomaly Engine<br/>Isolation Forest<br/>+ Prophet baseline]
  SR --> JOIN{Combine}
  AE --> JOIN
  JOIN -- score < 0.3 --> ALLOW[Allow]
  JOIN -- 0.3 - 0.7 --> CAP[Invisible CAPTCHA]
  JOIN -- 0.7 - 0.9 --> THR[Throttle<br/>+ MFA prompt]
  JOIN -- > 0.9 --> BLK[Block + SOC alert]
  style AE fill:#9b72cf,stroke:#9b72cf,color:#fff
  style SR fill:#38b265,stroke:#38b265,color:#fff
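The combine step reduces to a small pure function. A sketch with the thresholds from the pipeline diagram, where the deterministic static floor always wins:

```javascript
// Combining the static floor with the anomaly score.
// The static rules are never overridden by a low AI score.
function verdict(staticBlocked, anomalyScore) {
  if (staticBlocked) return 'block';          // deterministic floor wins
  if (anomalyScore < 0.3) return 'allow';
  if (anomalyScore < 0.7) return 'captcha';   // invisible CAPTCHA
  if (anomalyScore < 0.9) return 'throttle';  // throttle + MFA prompt
  return 'block';                             // block + SOC alert
}

console.log(verdict(false, 0.1));  // 'allow'
console.log(verdict(false, 0.95)); // 'block'
console.log(verdict(true, 0.1));   // 'block', floor fires even on a low AI score
```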
📌 Important — the AI adds to the old rules, it doesn't replace them. The simple static rules stay as a safety floor: predictable, easy to debug, easy to explain to auditors. The AI sits on top and catches the things simple rules miss. If the AI breaks, the floor still works — nothing falls through.

Things to watch out for in production

📊 Watch for "drift"

User behavior changes over time, so the AI's old understanding can get stale. Alert if signals drift too far from what the AI learned. Run A/B tests every week comparing the live model to a newer one.

🧪 Run silently first ("shadow mode")

Before letting the AI block real users, run it for 2 weeks just watching — logging what it would have done. Then compare with what actually happened.

🔐 Protect user privacy

Never store sensitive info in plain text. Hash IPs and personal data before saving. Keep recent data 30 days, archived data 6 months — GDPR rules.

⚖️ Be able to explain decisions

For every high-risk block, log why the AI made that call (e.g., "new country + new device + odd time"). Required by GDPR if a user asks "why was I blocked?"

🧯 Have an off switch

One feature flag must instantly turn the AI off and fall back to the simple rules. Test this disaster scenario regularly so you know it works.

💰 Cost vs. benefit

Inference is cheap per decision: at ~150 M requests/day the bill comes to roughly ₹22 K/day, while cutting SMS volume by 30–50% saves ₹4.5–7.5 M/day. Pays for itself many times over.

📝 The whole AI layer in one line: one AI decides "should we even send an OTP?", another AI decides "how aggressively should we block this traffic?" Both sit on top of the existing OTP system — nothing about the OTP itself (how it's generated, hashed, stored, used once) changes. Needs about 3 months of past login data to train initially.
AI Layer ③

Why Add It — Use Cases & Security Wins

A 6-digit OTP is a thin line of defense. Once a user is tricked into reading their OTP to an attacker, it's already over. Static rate limits are easy to map and bypass. The AI layer turns OTP from "challenge everyone the same way" into "challenge intelligently."

💡 In plain words: the basic OTP system is like a single locked front door — fine against casual intruders. The AI layer adds a security camera, a smart doorbell that recognizes regulars, and a doorman who notices when someone's casing the place. Each one alone helps; together they stop attackers your front door can't.

🛡️ Attacks defeated that base OTP cannot stop

🎣 OTP Phishing / Vishing

User is socially engineered into reading the OTP to an attacker on a fake support call. Plain OTP only checks "code matches?" and lets it through. Risk scoring catches the surrounding signal — new device, unusual geo, screen-share active, abnormal session timing — and demands step-up.

📱 SIM Swap

Attacker convinces the carrier to port the user's number. OTP delivers right into their hand. The AI layer notices: device fingerprint changed, IP from a new ASN, behavioral biometrics don't match → blocks the login and alerts the user via the old contact.

🌍 Credential Stuffing

Attacker has 1000 leaked passwords from another site and tries them all on your login from 500 rotating IPs. No single IP makes enough attempts to trigger the static limit. But the AI notices the bigger picture: way too many different IPs / devices, low success rate → blocks the whole pattern.

💣 Distributed SMS Bombing

Attacker hammers the "resend OTP" button from many rotating IPs — to drain your SMS budget or harass a specific user. Per-IP rules don't catch it. The AI's traffic forecast spots the unusual spike compared to normal hours → blocks and alerts the security team.

🐢 Slow-and-Low Brute Force

A patient attacker spreads guesses over many days to fly under any static "X attempts per hour" rule. The AI's forecast model knows that "20 attempts per hour at 3 am from this country" isn't normal — even when no single time window technically breaks any rule.

🤖 Bot Signup Fraud

Bots register thousands of fake accounts to claim signup bonuses or referral payouts. Behavioral biometrics catch them — the typing rhythm of a bot looks robotic and identical across accounts. Device fingerprints reveal the same cluster of fake "phones" being reused.

💡 Where the AI layer pays off most

🏦 Banking / Fintech

High-value transfers force step-up; small recurring ones skip OTP. Cuts SMS spend ~40% while reducing fraud losses. Aligns with the RBI risk-based AFA mandate.

🛒 E-commerce

Trusted devices skip OTP at checkout → fewer cart abandonments. Suspicious order velocity (gift cards, multiple shipping addresses) auto-escalates.

🏥 Healthcare

PHI access from unfamiliar devices/IPs gets stronger verification automatically — meeting the HIPAA "context-aware access" expectation.

💼 B2B SaaS

Admin actions and bulk-export operations get stepped up. Service-account login from a new IP triggers alert, not just OTP.

₿ Crypto / Exchanges

Withdrawals to a new address, above-threshold amounts, or first-time external wallets demand step-up — regardless of OTP success.

📞 Telco / Number Portability

SIM-swap defense for the carrier flow itself. Catches fraudsters attempting to take over the carrier account that delivers OTPs.

📊 Measured impact at scale

| Metric | Base OTP only | + AI layer | Delta |
|---|---|---|---|
| SMS volume / day | 150 M | ~75–105 M | −30 to −50% |
| SMS spend (@ ₹0.10/msg) | ₹15 M/day | ₹7.5–10.5 M/day | −₹4.5–7.5 M/day |
| Account takeover incidents | baseline | 0.2–0.4× baseline | −60 to −80% |
| OTP screens per active session | 1.0 | 0.5–0.7 | ~40% less friction |
| False-positive rate-limit pages | baseline | 0.1–0.3× baseline | −70 to −90% |
| ML inference cost | — | ~₹22 K/day @ 5 200 QPS | new opex |
ROI in one line: ~₹22 K/day inference cost vs ~₹4.5–7.5 M/day SMS savings → the AI layer pays for itself 200× on SMS alone, before counting reduced fraud losses, lower support volume, and higher conversion from less friction.

📋 Regulatory tailwinds — risk-based auth is becoming required

Laws & standards already requiring this

  • PCI-DSS 4.0 — payment card security standard. Now requires risk-based auth for online card payments.
  • PSD2 SCA — Europe's payment law. Allows skipping a step only if the bank can show it's a low-risk transaction.
  • RBI — India's central bank. Mandates risk-based auth for digital payments.
  • NIST 800-63B — US government's authentication guidelines. Recommends "context-aware" (= risk-based) authentication.

Built-in support for audits & user rights

  • For every high-risk block, log why the model made that call (using SHAP feature attributions — the per-feature breakdown behind the score)
  • Track when the AI's accuracy starts drifting and retrain regularly — keeps auditors happy
  • If a user asks "why was I blocked?" (a right under GDPR), we can answer in plain language
The bottom line: the base OTP system protects against opportunistic attackers using common tools. The AI layer protects against motivated attackers who've studied your system — phishers, SIM-swappers, credential-stuffing operators, SMS-bombing fraudsters, and patient slow-burn brute-forcers. If your product handles money, identity, or PHI, this layer is no longer optional.
Step 15

Trade-offs & Interview Talking Points

The bits an interviewer wants to hear you reason about out loud.

| Decision | Alternative | Why this choice |
|---|---|---|
| Redis as primary store | Postgres only | Sub-ms latency + native TTL; audit still goes to Postgres async |
| Hash (bcrypt) | Plaintext / fast hash | Limits blast radius of any leak; slow hash beats GPU brute force |
| 6-digit numeric | Alphanumeric | UX on phone keypads; brute-force mitigated by rate limits, not entropy |
| UUID otp_id | Use phone as key | Opaque ID — safer to log, no user enumeration |
| DEL on success | is_used = true flag | Atomic; avoids double-verify race |
| Async audit write | Sync write | Keeps verify off the RDBMS critical path |
| Short TTL (2–10 min) | 24-hour validity | Dramatically smaller attack window; fine for UX |
| Generic error responses | Distinguish wrong/expired | Prevents user & state enumeration attacks |
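To ground the "DEL on success" row — a minimal sketch of the verify-and-consume path, with a plain Map standing in for Redis and `matches` standing in for a bcrypt compare (both names are illustrative):

```javascript
// Single-use enforcement: compare first, then DEL immediately on success.
// The tiny window between compare and delete is closed in production with
// GETDEL (Redis >= 6.2) or a Lua script; a Map keeps this sketch runnable.
function verifyAndConsume(store, otpId, matches /* (hash) => boolean */) {
  const hash = store.get(otpId);
  if (hash === undefined) return 'INVALID'; // expired, used, or never issued
  if (!matches(hash)) return 'INVALID';     // generic error — no enumeration
  store.delete(otpId);                      // DEL on success — single use
  return 'OK';
}
```

A second verify with the same correct code hits the `undefined` branch and gets the same generic `INVALID`, so a replayed OTP is indistinguishable from a wrong one.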
Step 16

Interview Q&A

The follow-ups that actually come up on system-design loops.

Why can't we just use HMAC(user_id, timestamp) as the OTP?
TOTP-style codes work when the client holds a shared secret (Google Authenticator). For SMS delivery the user holds no secret, so the server must transmit the code anyway — a deterministic formula buys nothing, and a leaked HMAC key would make every future OTP predictable. Random + hash + TTL is the right primitive for SMS/email.
How do you handle Redis going down?
Run Redis in cluster mode with replicas; use WAIT to require replica ack before replying. On total cluster loss: fail-closed (reject new verifies) and rotate existing users through password re-verification — never fall back to Postgres as primary, because the audit store doesn't have hashes by design.
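A sketch of that fail-closed write, assuming a client object that exposes `set`/`wait`/`del` synchronously (real Redis clients are async; `WAIT` returns the number of replicas that acknowledged the preceding writes):

```javascript
// Fail-closed durability: after writing the OTP hash, require at least one
// replica ack before telling the user a code was sent. All names here are
// illustrative; the stub signature is set(key, value, ttlSec).
function storeOtpDurably(redis, otpId, hash, ttlSec) {
  redis.set(`otp:${otpId}`, hash, ttlSec);
  const acked = redis.wait(1 /* replicas */, 50 /* ms timeout */);
  if (acked < 1) {
    redis.del(`otp:${otpId}`);     // don't leave a code only the primary saw
    throw new Error('OTP_STORE_DEGRADED'); // fail closed — caller alerts/retries
  }
  return otpId;
}
```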
An attacker keeps generating OTPs for a victim's phone. What do you do?
Layered: (a) Layer-1 per-user 3-per-15-min cap, (b) client-side CAPTCHA after the 2nd request, (c) progressive cooldowns (30s → 2m → 10m) between resends, (d) trust-score on the IP — block IPs with high OTP-request/verify ratios.
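The progressive cooldowns in (c) reduce to a tiny lookup — a sketch with the 30s/2m/10m schedule from above (function name is illustrative):

```javascript
// Resend cooldown schedule: 30s after the 1st resend, 2m after the 2nd,
// 10m for every resend beyond that. Pairs with the 3-per-15-min hard cap.
const COOLDOWNS_SEC = [30, 120, 600];

function cooldownSeconds(resendCount) {
  if (resendCount <= 0) return 0;           // first send: no wait
  const idx = Math.min(resendCount, COOLDOWNS_SEC.length) - 1;
  return COOLDOWNS_SEC[idx];
}
```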
Why bcrypt and not just PBKDF2-SHA256 with 100k iterations?
Functionally close, but bcrypt's Blowfish key schedule works over a 4 KB S-box table, which hurts GPU parallelism and raises the cost of custom ASICs. Argon2id is even better — memory-hard by design. PBKDF2 is fine if it's what your stack already uses; don't rewrite.
Single Redis or multi-region?
OTPs are short-lived and tied to a user's active session — region-local is fine. Don't pay cross-region replication cost. If a user roams mid-flow (rare), force regeneration in the new region.
What prevents an insider from extracting bcrypt hashes and brute-forcing offline?
bcrypt cost-12 ≈ 250 ms/attempt, so 1M guesses ≈ 3 days single-threaded — and bcrypt parallelizes poorly on GPUs. Even that only matters if the dump happened within the 5-min TTL; the TTL, not the hash, is the dominant defense here. Combine with: access audits on Redis, at-rest encryption, and alerting on KEYS otp:* scans.
How do you support "voice call" OTP delivery as an option?
Same generate/verify flow — only the delivery adapter changes. Add a channel field (SMS | EMAIL | VOICE | WHATSAPP) and a Strategy interface per channel. Rate limits may need per-channel tuning (voice is expensive).
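A sketch of that Strategy shape — the adapters here are stubs, and real ones would wrap an SMS gateway, email provider, or voice API behind the same `send` signature:

```javascript
// One Strategy per delivery channel; generate/verify never change.
// Stub adapters return a string so the dispatch is observable in tests.
const channels = {
  SMS:   { send: (to, code) => `sms:${to}:${code}` },
  EMAIL: { send: (to, code) => `email:${to}:${code}` },
  VOICE: { send: (to, code) => `voice:${to}:${code}` },
};

function deliverOtp(channel, to, code) {
  const adapter = channels[channel];
  if (!adapter) throw new Error(`Unsupported channel: ${channel}`);
  return adapter.send(to, code);
}
```

Per-channel rate limits slot in naturally here: the dispatcher is the one place that knows which (expensive or cheap) channel is about to be used.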
What about TOTP (Google Authenticator)?
Different trust model — shared secret installed at enrollment, no server-side per-request state. Typically offered alongside SMS OTP as a stronger option. Our system handles both by routing verification to different modules based on the factor the user has configured.
Step 17

Production Checklist

Before shipping any OTP system — tick every box.

  • OTPs hashed with bcrypt (cost 12) or Argon2id — never plaintext
  • TTL tuned per use case: login 5m, reset 10m, payment 2m, update 3m
  • 5-layer rate limiting: per-OTP, per-user, per-IP, backoff, CAPTCHA
  • Constant-time hash comparison (crypto.timingSafeEqual or library verify)
  • Audit log written to Postgres (without OTP values) via async queue
  • Single-use enforcement via DEL immediately on success
  • Attempt counter drains via atomic HINCRBY; OTP dies at 0
  • Exponential backoff between failed attempts
  • CAPTCHA triggered after 3 failures
  • Redis with TLS in transit + AUTH enabled + at-rest encryption
  • Notifications to user's old contact on sensitive changes
  • Zero OTP values in app logs, APM, error trackers, slow-query logs
  • Monitoring on generation spikes, failure rates, geography drift
  • SMS provider health checks + automatic failover to secondary
  • Generic error responses — no user/state enumeration leaks
  • CSPRNG for OTP generation — never Math.random()
The principle: OTP security isn't about cleverness — it's about discipline. Hash everything. Rate-limit everywhere. Keep TTLs short. Delete on success. The gap between a secure OTP system and a vulnerable one is usually 50 lines of careful code.
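Several of the boxes above meet in the failure branch of verify — a runnable sketch of the attempt-counter drain, with a Map standing in for Redis (production would use an atomic HINCRBY -1 inside a Lua script; names are illustrative):

```javascript
// Drain the OTP's attempt budget on each failed verify; kill the OTP at 0.
// The status strings are deliberately generic — no wrong-vs-expired leak.
function recordFailedAttempt(store, otpId) {
  const rec = store.get(otpId);
  if (!rec) return 'INVALID';            // expired, used, or never issued
  rec.attemptsLeft -= 1;                 // HINCRBY attempts -1 in Redis
  if (rec.attemptsLeft <= 0) {
    store.delete(otpId);                 // OTP dies at 0 — DEL
    return 'INVALID';
  }
  return 'INVALID';                      // same answer; internal state differs
}
```

The caller still applies exponential backoff and the CAPTCHA-after-3-failures trigger; the point of this sketch is that the budget drain and the kill-at-zero are one code path, not two.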

Did this rewire how you think about OTP security? If it landed, tap the ❤️ — that's how I know it hit.