Read this with the framework in mind
This deep-dive applies the five-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Step 1
Clarify Requirements
Always open an HLD by aligning on scope. OTP is a single building block — but different products treat it very differently (2FA vs payments vs password reset). Pin these down first.
✅ Functional Requirements
- Generate a random 4–6 digit numeric code
- Deliver via SMS, email, or authenticator app
- Store with a short TTL (2–10 min by use case)
- Verify submitted OTP against stored value
- Enforce single-use — delete after success
- Invalidate after N failed attempts
- Rate-limit generation & verification independently
- Audit every OTP event (generation, attempt, success, failure)
⚡ Non-Functional Requirements
- Latency: verify p99 < 50 ms (Redis in-memory)
- Availability: 99.95% — auth blocks logins
- Durability: audit log must survive Redis loss
- Security: plaintext OTP never persisted, never logged
- Compliance: PCI-DSS, GDPR, HIPAA, RBI-ready
- Scale: handle SMS bombing & brute-force under load
Out of scope (confirm with interviewer): SMS provider failover strategy, push-notification OTP, hardware tokens (YubiKey/TOTP apps are adjacent), OAuth/SSO, and session management after OTP success.
Step 2
Capacity & Scale Estimates
Back-of-envelope first — it drives storage, rate-limit tiers, and the Redis/Postgres split.
| Metric | Assumption | Result |
|---|---|---|
| MAU | 50 M | — |
| OTPs / active user / day | ~3 (login + sensitive actions) | ~150 M/day |
| Peak QPS (3× avg) | 150 M ÷ 86 400 × 3 | ~5 200 QPS |
| OTP payload (hash + metadata) | ~200 B per record | — |
| Peak active OTPs in Redis (5-min TTL) | 5 200 × 300 | ~1.5 M keys ≈ 300 MB |
| Audit rows / year (Postgres) | 150 M × 365 | ~55 B rows/yr → partition monthly |
| SMS cost (even at ₹0.10/SMS) | 150 M/day | ₹15 M/day → rate limits matter economically, not just for security |
Takeaways: Redis easily fits active OTPs in RAM. Postgres audit needs partitioning (or TimescaleDB / Cassandra for very high scale). Rate limiting protects SMS spend as much as it protects users.
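The table's arithmetic can be checked in a few lines (every input below is an assumption from the table, not a measurement):

```javascript
// Back-of-envelope check of the capacity table above.
const MAU = 50e6;
const otpsPerUserPerDay = 3;
const otpsPerDay = MAU * otpsPerUserPerDay;        // 150 M/day

const avgQps = otpsPerDay / 86_400;                // ~1 736
const peakQps = avgQps * 3;                        // ~5 200

const recordBytes = 200;                           // hash + metadata
const ttlSec = 300;                                // 5-min TTL
const activeOtps = peakQps * ttlSec;               // ~1.5 M keys live at once
const redisMB = (activeOtps * recordBytes) / 1e6;  // ~300 MB — fits in RAM

const smsCostPerDay = otpsPerDay * 0.10;           // ₹15 M/day at ₹0.10/SMS
```

The Redis figure is the key one: peak QPS × TTL bounds the live key count, so a short TTL directly caps memory.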
Step 3
Actors & Use Cases
Who triggers OTP flows and what they're trying to accomplish. Color-coded by actor.
flowchart LR
U([User])
PS([Payment Service])
AS([Auth Service])
SEC([Security Team])
U --> L[Login / 2FA]
U --> PR[Password Reset]
U --> AC[Account Change]
PS --> PAY[Payment Confirm]
AS --> REG[Registration Verify]
SEC --> AUD[Audit & Investigate]
style U fill:#e8743b,stroke:#e8743b,color:#fff
style PS fill:#4a90d9,stroke:#4a90d9,color:#fff
style AS fill:#38b265,stroke:#38b265,color:#fff
style SEC fill:#9b72cf,stroke:#9b72cf,color:#fff
🟢 Login / 2FA
Phone number → OTP → JWT. 5-min TTL, 3 regenerations per 15 min.
🔐 Password Reset
Email + reset token + OTP. 10-min TTL. Invalidate other sessions on success.
💳 Payment Confirm
Tied to transaction_id. 2-min TTL (shorter MITM window).
✉️ Account Change
Sent to the OLD contact — defends a compromised account.
📝 Registration
Verify email/phone ownership before provisioning the account.
🛡️ Audit / Investigate
Security team queries the audit log by user_id / ip — never sees OTP values.
Step 4
High-Level Architecture
Six moving parts. Redis is the hot store, Postgres the cold/audit store, the OTP service owns all hashing and verification.
flowchart LR
CL([Client App])
GW[API Gateway
per-IP rate limit
CAPTCHA]
OTP[OTP Service
generate / hash / verify]
RL[Rate Limiter
Redis counters]
R[(Redis
hashed OTP + TTL)]
PG[(PostgreSQL
audit log)]
SMS[SMS / Email
Provider]
MON[Monitoring
anomaly alerts]
CL -- HTTPS --> GW
GW --> OTP
OTP --> RL
OTP --> R
OTP --> SMS
OTP -. async write .-> PG
OTP -. metrics .-> MON
RL --> R
style CL fill:#e8743b,stroke:#e8743b,color:#fff
style GW fill:#4a90d9,stroke:#4a90d9,color:#fff
style OTP fill:#38b265,stroke:#38b265,color:#fff
style R fill:#9b72cf,stroke:#9b72cf,color:#fff
style PG fill:#9b72cf,stroke:#9b72cf,color:#fff
style SMS fill:#d4a838,stroke:#d4a838,color:#000
style MON fill:#3cbfbf,stroke:#3cbfbf,color:#000
🔷 Why Redis as primary
- Native TTL — no cleanup cron needed
- Sub-ms verify latency
- Atomic HINCRBY for attempts-remaining
- Cluster-shardable by user_id
🗄️ Why Postgres for audit
- Durability + compliance retention
- Rich queries for fraud investigation
- Survives Redis cluster loss
- Partition by month for B-row scale
Key design call: the write to Postgres is async (Kafka or fire-and-forget). Auth latency cannot wait on an RDBMS round-trip.
Step 5
API Design
Two endpoints carry 99% of traffic. Keep the surface area small — fewer endpoints = fewer attack vectors.
POST /v1/otp/generate
// Request
{
"identifier": "+919876543210", // phone OR email
"purpose": "LOGIN", // LOGIN | RESET | PAYMENT | UPDATE
"context": { "transaction_id": "txn_abc" } // optional, binds OTP to a resource
}
// 200 — OTP queued for delivery
{
"otp_id": "uuid-v4",
"expires_at": 1735128300,
"attempts_left": 5,
"cooldown_sec": 30 // client must wait before "resend"
}
// 429 — rate limited (generic, no user-enumeration)
{ "error": "TOO_MANY_REQUESTS", "retry_after": 120 }
POST /v1/otp/verify
// Request
{
"otp_id": "uuid-v4",
"code": "847231",
"context": { "transaction_id": "txn_abc" }
}
// 200 — success (one-shot; server deletes OTP immediately)
{ "verified": true, "session_token": "jwt..." }
// 401 — mismatch (generic; no "wrong code" vs "expired" leak)
{ "verified": false, "attempts_left": 3 }
// 410 — OTP gone (expired, used, or blown through attempt limit)
{ "error": "OTP_NOT_ACTIVE" }
API hardening notes: responses are deliberately uniform — the same error shape for "wrong code" and "expired code" prevents probing. otp_id is opaque (UUID, not the phone number) so the client request can be logged without leaking the identifier.
Step 6
Data Model — Redis + Postgres
Two stores, two very different jobs.
Redis (hot path)
Key / Value layout
# Active OTP record — stored as a Redis hash (HSET) so the service can
# decrement attempts_left atomically with HINCRBY
KEY otp:{otp_id}
VALUE {
"user_id": "u_1234",
"identifier_hash": "sha256(+919876543210)",
"purpose": "LOGIN",
"hash": "$2b$12$...", # bcrypt hash of the OTP
"attempts_left": 5,
"context": { "transaction_id": "txn_abc" },
"created_at": 1735128000
}
TTL 300 # seconds — auto-deletes, no cron needed
# Rate-limit counters (sliding-window via INCR + EXPIRE)
KEY rl:gen:user:{user_id} TTL 900 # 3 per 15 min
KEY rl:gen:ip:{ip} TTL 3600 # 10 per hour
KEY rl:ver:user:{user_id} TTL 3600 # 10 per hour
KEY rl:ver:ip:{ip} TTL 3600 # 20 per hour
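A minimal sketch of the INCR + EXPIRE counter pattern behind those keys, with a plain Map standing in for Redis (function and key names here are illustrative — in production the same two operations run against the cluster):

```javascript
// Fixed-window rate limiter mirroring Redis INCR + EXPIRE.
const store = new Map(); // key -> { count, resetAt }

function hit(key, limit, windowSec, now = Date.now()) {
  const entry = store.get(key);
  if (!entry || now >= entry.resetAt) {
    // first request in a fresh window — INCR on a new key, then EXPIRE
    store.set(key, { count: 1, resetAt: now + windowSec * 1000 });
    return { allowed: true, remaining: limit - 1 };
  }
  entry.count += 1; // INCR on an existing key keeps its TTL
  return {
    allowed: entry.count <= limit,
    remaining: Math.max(0, limit - entry.count),
  };
}

// e.g. generation limit, 3 per 15 min per user:
// hit("rl:gen:user:u_1234", 3, 900)
```

In real Redis the two commands should be pipelined or wrapped in a Lua script so the INCR and EXPIRE land atomically.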
Postgres (audit, durable)
erDiagram
USER ||--o{ OTP_AUDIT : generates
OTP_AUDIT ||--o{ OTP_ATTEMPT : "has many"
USER {
uuid user_id PK
string email
string phone_hash
timestamp created_at
}
OTP_AUDIT {
uuid otp_id PK
uuid user_id FK
string purpose
string identifier_hash
timestamp created_at
timestamp expires_at
string outcome "GENERATED|VERIFIED|EXPIRED|EXHAUSTED"
string ip_address
string user_agent
}
OTP_ATTEMPT {
bigserial attempt_id PK
uuid otp_id FK
timestamp attempted_at
boolean success
string ip_address
string failure_reason
}
⚠️ Critical: neither the OTP code nor its hash is ever written to Postgres. The audit log tells you who tried to verify what and when — not what the secret was. This matters for GDPR/PCI auditors and for insider-threat containment.
Step 7
Generate OTP — Sequence Flow
Happy path + the three places it can fail loudly.
sequenceDiagram
actor U as User
participant GW as API Gateway
participant S as OTP Service
participant RL as Rate Limiter
participant R as Redis
participant SMS as SMS Provider
participant Q as Audit Queue
U->>GW: POST /otp/generate {phone, LOGIN}
GW->>GW: per-IP rate limit
GW->>S: forward
S->>RL: INCR rl:gen:user + rl:gen:ip
alt limit exceeded
RL-->>S: over limit
S-->>U: 429 TOO_MANY_REQUESTS
else within limit
S->>S: generate 6-digit OTP (SecureRandom)
S->>S: bcrypt hash(otp, cost=12)
S->>R: SET otp:{uuid} {hash, ...} EX 300
S->>SMS: send plain OTP
S-)Q: enqueue audit event
S-->>U: 200 {otp_id, expires_at}
end
🎲 Randomness
SecureRandom / crypto.randomInt / secrets.randbelow. Never Math.random().
🔒 Hash cost
bcrypt work factor 12 (~250 ms). Slow on purpose — defeats offline GPU cracking.
📮 Delivery
SMS send is synchronous for UX feedback, but provider timeouts fail fast (2 s budget).
Step 8
Verify OTP — Sequence Flow
Constant-time comparison + atomic attempt decrement are the two things people get wrong.
sequenceDiagram
actor U as User
participant GW as API Gateway
participant S as OTP Service
participant R as Redis
participant Q as Audit Queue
U->>GW: POST /otp/verify {otp_id, code}
GW->>S: forward (rate limited)
S->>R: GET otp:{otp_id}
alt not found
R-->>S: nil
S-->>U: 410 OTP_NOT_ACTIVE
else found
R-->>S: {hash, attempts_left}
S->>S: bcrypt.verify(submitted, hash) %% constant-time
alt match
S->>R: DEL otp:{otp_id} %% single-use
S-)Q: audit VERIFIED
S-->>U: 200 {verified:true, jwt}
else mismatch
S->>R: HINCRBY otp:{otp_id} attempts_left -1
alt attempts_left == 0
S->>R: DEL otp:{otp_id}
S-)Q: audit EXHAUSTED
S-->>U: 410 OTP_NOT_ACTIVE
else still alive
S-)Q: audit FAILED_ATTEMPT
S-->>U: 401 {attempts_left}
end
end
end
⭐ The DEL on success is non-negotiable. If you only mark is_used=true and forget to delete, a race between two concurrent verifies can double-authenticate. Deleting is atomic; flags are not.
Step 9
Why Hash OTPs (and Why Plaintext Is a Career-Ender)
This is the single most common OTP mistake in production systems. Here's every way plaintext storage betrays you.
💣 Database breach
Redis exposed? Every live OTP is instantly usable. No cracking — just read and replay. Account-takeover at wire speed.
👤 Insider access
Any engineer with read access can impersonate any user, and the audit trail looks identical to a legitimate login.
📜 Log leaks
Plaintext OTPs show up in Sentry, Datadog, slow-query logs, CloudWatch, backup dumps. One log.debug(req.body) is all it takes.
📋 Compliance
PCI-DSS, GDPR, HIPAA, RBI all prohibit plaintext authentication data. Audit failure → fines + forced remediation.
🧠 Memory dumps
A core dump of the app server exposes any OTP sitting in RAM.
🌐 Side channels
Metrics and APM traces often capture request payloads. Hashing at the edge closes that leak surface.
The rule: the plaintext OTP exists in exactly two places — the SMS message and the user's head. The server persists only the bcrypt/Argon2 hash. Between generation and delivery it sits in memory for milliseconds; after that, only the hash remains.
Hash algorithm choice
| Algorithm | Verdict | Why |
|---|---|---|
| MD5 / SHA-1 | ❌ Never | Broken; GPU can brute-force the 6-digit space in seconds |
| SHA-256 (plain) | ❌ Never | Too fast — 1 M combinations cracked in < 1 sec |
| bcrypt | ✅ Fine | Work factor 10–12 slows attackers; battle-tested |
| scrypt | ✅ Good | Memory-hard — resists custom ASIC attacks |
| Argon2id | ✅ Best | Modern, tunable memory + time + parallelism |
Step 10
5-Layer Brute Force Defense
A 6-digit OTP has only 1 000 000 combinations. Without layered defense, an attacker scripting 10 req/sec gets in within ~28 hours. Here's the layered wall.
| Layer | Scope | Limit | Action on breach |
|---|---|---|---|
| 1 — Per-OTP attempts | Single otp_id | 5 wrong tries | Invalidate OTP |
| 2 — Per-user rate | user_id | 10 verifies / hour | 30-min account lock |
| 3 — Per-IP rate | Source IP | 20 verifies / hour | Temporary IP block |
| 4 — Exponential backoff | Per-user per-OTP | 0s → 5s → 30s → 2m | Kills scripting speed |
| 5 — CAPTCHA | Gateway | After 3 failed attempts | Forces human proof |
Defense in depth: one layer can be bypassed (e.g., rotating IPs defeats Layer 3). All five together make brute force economically infeasible — attackers move on to softer targets.
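Layer 4's schedule reduces to a lookup keyed by prior failures (values from the table; the helper name is hypothetical):

```javascript
// Backoff delay in seconds before the next verify attempt is accepted,
// indexed by how many failures this user/OTP pair already has.
const BACKOFF_SEC = [0, 5, 30, 120]; // 0s → 5s → 30s → 2m

function nextAttemptDelay(failedAttempts) {
  // cap at the last tier — every attempt past the table waits 2 minutes
  return BACKOFF_SEC[Math.min(failedAttempts, BACKOFF_SEC.length - 1)];
}
```

Even the 2-minute cap matters: with 5 attempts per OTP and a 5-minute TTL, a scripted attacker gets at most a handful of guesses per code.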
SMS bombing defense (generation side)
🙋 Per user
Max 3 OTP generations per 15 minutes.
🌐 Per IP
Max 10 generations per hour.
⏳ Resend cooldown
Reject "resend" < 30 s after previous send.
Step 11
OTP Lifecycle — State Machine
Every OTP follows exactly one path from birth to death. Four terminal states.
stateDiagram-v2
[*] --> GENERATED : POST /generate
GENERATED --> VERIFIED : code matches
GENERATED --> FAILED_ATTEMPT : code mismatches (>0 attempts left)
FAILED_ATTEMPT --> GENERATED : still alive
FAILED_ATTEMPT --> EXHAUSTED : attempts_left == 0
GENERATED --> EXPIRED : TTL elapses
VERIFIED --> [*]
EXHAUSTED --> [*]
EXPIRED --> [*]
Why model it explicitly? Every outcome must map to exactly one audit row — no "in between" states. This is what makes fraud investigation tractable: every OTP that ever existed has a terminal state on record.
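One way to make that explicit in code — a transition table where every (state, event) pair has exactly one successor, and anything else is an error worth alerting on (event names here are illustrative):

```javascript
// The OTP state machine as a transition table.
const TRANSITIONS = {
  GENERATED: {
    match: 'VERIFIED',
    mismatch: 'FAILED_ATTEMPT',
    ttl_expired: 'EXPIRED',
  },
  FAILED_ATTEMPT: {
    attempts_left: 'GENERATED',       // still alive, back to waiting
    attempts_exhausted: 'EXHAUSTED',  // attempts_left == 0
  },
};
const TERMINAL = new Set(['VERIFIED', 'EXPIRED', 'EXHAUSTED']);

function step(state, event) {
  const next = TRANSITIONS[state]?.[event];
  // terminal states have no outgoing edges — any event there is a bug
  if (!next) throw new Error(`illegal transition: ${state} + ${event}`);
  return next;
}
```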
Step 12
Use Case Variants
Same machinery, four different policies. TTL and delivery channel are the main knobs.
| Use case | TTL | Channel | Binding | Special |
|---|---|---|---|---|
| Login / 2FA | 5 min | SMS | user_id | Issue JWT on success |
| Password reset | 10 min | Email | OTP + reset-token | Kill all other sessions |
| Payment confirm | 2 min | SMS | transaction_id | Lock against replay |
| Account change | 3 min | OLD email/phone | user_id | Notify old contact after success |
💳 Why Payment TTL is shortest
Short window = smaller MITM attack surface. Payment OTPs must also bind to transaction_id — an OTP generated for a ₹500 transfer cannot be replayed on a ₹50 000 one.
✉️ Why Account-Change OTP goes to OLD contact
If the attacker has already taken over the account, they control the current contact. Sending OTP to the previous email/phone forces them to also own that — raising the attack bar significantly.
Step 13
Common Mistakes & How to Dodge Them
Every single one of these has been shipped to production by a real team at some point.
❌ Predictable randomness
Math.random(), timestamp-seeded PRNGs, or incrementing counters. Use a CSPRNG.
❌ Long TTLs
Anything > 10 min is an attack window. Match TTL to use case.
❌ Same code on resend
Regenerate on every resend. Reusing the same OTP halves the rotation benefit.
❌ No attempt cap
OTP must die after N failures — the attack must not be able to retry forever within the TTL.
❌ Logging OTP values
Scrub request-body logging for any field named code, otp, or pin — and prefer allow-list logging by default.
❌ Fast hash (SHA-256 raw)
Use bcrypt/scrypt/Argon2. A 6-digit space falls to fast hashes in milliseconds.
❌ Non-constant-time compare
== on hashes leaks timing. Use crypto.timingSafeEqual or the library's built-in verify.
❌ Forgetting single-use
DEL on success. Otherwise the attacker sniffs the OTP and replays it after the user consumes it.
Step 14
Monitoring & Alerts
OTP systems are where anomalies appear first — monitoring them buys early warning for larger attacks.
📈 Signals worth paging on
- OTP generation rate > 3× baseline → SMS bombing
- Verify-failure rate > 10% per minute → brute force
- > 50 OTP requests / hour from one IP (counting rate-limited ones) → bot farm
- Verification from geography user never logs in from
- SMS provider 5xx > 2% → rotate to secondary
📊 Signals worth dashboarding
- Verify p50 / p99 latency
- Generate → verify conversion funnel
- Median time-to-verify (UX proxy)
- Attempts-left histogram (brute-force smell test)
- SMS cost per DAU
AI Layer ①
Risk-Based Authentication — Skip-OTP Intelligence
Before sending an OTP, first ask: is this login trustworthy? Trusted users skip the OTP entirely. Normal users get the standard OTP. Suspicious ones get extra checks (biometric + OTP). The OTP system itself stays exactly the same — we just put a smart filter in front of it.
💡 Think of it like a smart bouncer at a club. Today, everyone walking in has to show ID — even regulars (= an OTP for every login). A smart bouncer would: wave through known regulars (skip OTP), ask normal customers for ID (standard OTP), and demand extra proof from anyone who looks off (force step-up). The "AI" here is just a pattern-matcher that learned from past logins to score how risky each new one looks (0 = trusted, 100 = very suspicious).
🎓 New to AI? Plain-English glossary of terms used across all three AI sections:
- AI / ML model — a really good pattern-matcher trained on past data. Show it thousands of real vs. fraudulent logins; it learns to tell them apart.
- XGBoost / LightGBM — popular AI techniques that work like a giant decision flowchart with hundreds of branches. Fast and easy to debug — that's why we use them here.
- Feature store (Feast / Tecton) — a database holding all the signals we collect about users: device, location, login history, typing patterns. The AI reads from here to make a decision.
- Inference — fancy word for "running the AI to get a prediction." Each call takes ~20–30 ms.
- SHAP values — the AI's reasoning for a decision. Like: "flagged because new country (40%) + new device (35%) + unusual time (25%)." Helps debug AND satisfies regulators who require explainability.
- Isolation Forest — an AI technique great at finding "odd ones out" — unusual data points hidden among normal traffic.
- Prophet — AI that predicts what normal traffic should look like (think weather forecasting). Alerts when reality deviates from the forecast.
- Drift — when user behavior changes over time and the AI's old understanding becomes outdated. We monitor for it and retrain.
- Shadow mode — running the AI silently in the background for 2–4 weeks before turning it on for real, to confirm its decisions would have been good.
- p99 < 30 ms — 99% of AI decisions complete in under 30 milliseconds. Fast enough that users won't notice the AI is even there.
- Behavioral biometrics — how a person types, swipes, holds their phone. Every person has a unique pattern, hard to fake.
- Device fingerprint — a "digital signature" of a device made from browser, OS, screen size, timezone, fonts, etc. Lets us recognize the same device across sessions without a cookie.
🪡 Where AI plugs in — at the front door (API gateway). Every login request passes through the gateway first. We add one new step there: ask the AI "how risky is this login?" Based on the answer (a number from 0 to 100), the gateway routes the request:
- score < 30 (looks safe) → log the user in directly, no OTP needed
- score 30–70 (medium) → send the usual OTP, business as usual
- score ≥ 70 (looks suspicious) → demand more proof (OTP plus fingerprint or face)
Big picture — the AI sits between the user and the OTP service
flowchart LR
CL([Client])
GW[API Gateway]
RS[Risk Scoring
XGBoost · p99 < 30ms]
FS[(Feature Store
Feast / Tecton)]
OTP[OTP Service]
SESS[Issue Session
skip OTP]
STEP[Step-up
Biometric + OTP]
PG[(Postgres
+ training labels)]
CL --> GW
GW --> RS
RS --> FS
RS -- score < 30 --> SESS
RS -- 30 - 70 --> OTP
RS -- score ≥ 70 --> STEP
OTP -. async .-> PG
PG -. daily retrain .-> FS
style RS fill:#9b72cf,stroke:#9b72cf,color:#fff
style FS fill:#3cbfbf,stroke:#3cbfbf,color:#000
style OTP fill:#38b265,stroke:#38b265,color:#fff
Step-by-step — what happens during a login (with AI added)
sequenceDiagram
actor U as User
participant GW as API Gateway
participant RS as Risk Service
participant FS as Feature Store
participant OTP as OTP Service
participant SESS as Session Svc
participant STEP as Step-up Svc
U->>GW: POST /auth/start {phone, device, ctx}
GW->>RS: score(user, device, ip, ua)
RS->>FS: fetch features
FS-->>RS: ~80 features (≤10ms)
RS->>RS: XGBoost inference (≤20ms)
RS-->>GW: {risk_score, shap}
alt risk_score < 30 (trusted)
GW->>SESS: issueSession(user_id)
SESS-->>U: 200 {jwt} — OTP skipped
else 30 ≤ score < 70 (medium)
GW->>OTP: forward to /otp/generate
Note over OTP: standard generate · hash · store
OTP-->>U: 200 {otp_id, expires_at}
else score ≥ 70 (high)
GW->>STEP: requireStepUp([OTP, BIOMETRIC])
STEP-->>U: 401 {challenge_id, factors}
end
📥 What signals the AI looks at
- Device fingerprint — a digital signature of the user's phone or laptop
- Where they're coming from — IP address, network, distance from where they usually log in
- How fast they're trying — number of logins in the last hour from this user / device / IP
- Is the time normal? — e.g., 3 am login when they usually log in at 9 am is suspicious
- How they type and tap — every person has a unique rhythm; very hard to fake
- Where they're logging in from — web, mobile app, or external partner
🧠 What kind of AI we use
- XGBoost — a popular, fast, accurate AI model that can explain its own decisions
- Trained on 3 months of past login data — labelled as real or fraudulent
- Output: a risk score from 0 to 100 (higher = more suspicious)
- Hosted on AI-serving infrastructure — 99% of decisions in under 30 ms
- Reads user signals from the feature store in under 10 ms
Decision policy
// Risk-tiered auth routing
async function startAuth(req) {
const features = await featureStore.fetch(req.user_id, req.device_id);
const { risk_score, shap } = await riskService.score(features, { timeout: 30 });
// audit the decision (without leaking raw features)
audit.emit({ user_id: req.user_id, risk_score, shap_top3: shap.slice(0, 3) });
if (risk_score < 30) return issueSession(req.user_id); // trusted → skip OTP
if (risk_score < 70) return generateOtp(req.user_id, "LOGIN"); // medium → standard OTP
return requireStepUp({ factors: ["OTP", "BIOMETRIC"] }); // high → step-up
}
🔁 The AI keeps learning over time. Every login outcome (succeeded, failed, user gave up, later reported as fraud) is fed back to the AI as a labelled example. Once a day, the model retrains so it stays sharp as attackers change their tactics.
🛟 What if the AI breaks? If the AI service is slow or down, we don't lock users out — we just fall back to send OTP to everyone (the basic flow). Rule: never let a broken AI block legitimate users, but also never let it pretend everything's safe when it isn't. Hard timeout — if the AI doesn't answer in 30 ms, we skip it for that login.
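That timeout-and-fall-back rule can be sketched with Promise.race — scoreFn, the 30 ms budget parameter, and the medium-risk default of 50 (which routes to the standard OTP flow) are assumptions, not a real API:

```javascript
// Fail-open risk scoring: if the model doesn't answer inside the budget,
// return a medium score so the caller sends a standard OTP (basic flow).
async function riskScoreOrFallback(scoreFn, budgetMs = 30) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve({ risk_score: 50, fallback: true }), budgetMs));
  try {
    // whichever settles first wins — a slow model never blocks a login
    return await Promise.race([scoreFn(), timeout]);
  } catch {
    return { risk_score: 50, fallback: true }; // model error → basic OTP flow
  }
}
```

Note the fallback is deliberately 50, not 0: a broken AI must not silently wave everyone through the skip-OTP path.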
AI Layer ②
Adaptive Rate Limiting — Beyond Static Thresholds
Smart rate limiting that learns what normal traffic looks like at every hour of every day. Catches sneaky attacks (1000 different IPs each making 1 attempt — invisible to per-IP limits) and stays quiet during real traffic spikes like Black Friday or a viral app launch.
💡 Think of it like a smart spam filter. Old email rules say "block any sender who emails you 100 times." Spammers got smart — now they send 1 email each from 1000 different addresses, slipping under the limit. A modern spam filter spots the overall pattern (writing style, suspicious links, sending behavior) even when no single sender breaks the rule. Same idea here: the AI watches the shape of incoming OTP traffic and catches attacks even when no individual user crosses a static threshold.
🪡 Where AI plugs in — at two places.
- Place 1 — at the front door (API gateway): right next to the existing rule that limits how many requests come from one IP. The AI looks at the bigger picture and says "allow / show captcha / slow down / block." The gateway uses both the old rule AND the AI's view.
- Place 2 — inside the OTP service's rate limiter: right next to the per-user counters that say "this user has tried 3 times in 15 minutes." The AI's score is added as a second opinion.
Both places are additive — the old simple rules stay as a safety floor (always work, easy to debug). The AI catches the sneaky stuff the simple rules miss.
Step-by-step — what happens during an OTP request (with AI added)
sequenceDiagram
actor U as User
participant GW as API Gateway
participant AE as Anomaly Engine
participant OTP as OTP Service
participant RL as Rate Limiter
participant R as Redis
U->>GW: POST /otp/generate
Note over GW: Hook A
GW->>GW: static per-IP check (5-layer floor)
GW->>AE: score(req/min, IP entropy, UA diversity, fail ratio)
AE-->>GW: {anomaly_score, action}
GW->>GW: combine(static, anomaly) → verdict
alt verdict = allow
GW->>OTP: forward
Note over OTP,RL: Hook B
OTP->>RL: INCR rl:gen:user + rl:gen:ip
OTP->>AE: score(per-user features)
AE-->>OTP: {user_anomaly}
OTP->>OTP: combine → admit
OTP->>R: SET otp:{id} {hash} EX 300
OTP-->>U: 200 {otp_id, expires_at}
else verdict = captcha
GW-->>U: 401 + captcha challenge
else verdict = throttle / block
GW-->>U: 429 / 403 + SOC alert
end
🔍 Real-time "odd one out" detection
- Uses Isolation Forest — an AI good at spotting unusual patterns
- Watches signals like: requests per minute, how many different IPs / browsers, success vs failure ratio
- Catches sneaky attacks where 1000 different IPs each try once — invisible to per-IP rules
- Looks at the last 5 minutes of traffic, refreshing every 30 seconds
📈 "What's normal traffic right now?" forecasting
- Uses Prophet — an AI that predicts traffic patterns (like weather forecasting)
- Knows a Black Friday 8 pm spike is normal — won't trigger an alarm
- Knows a Tuesday 3 am spike is not normal — flags it
- Stops the chronic "3 am false alarm that wakes the on-call engineer"
Decision pipeline
flowchart LR
REQ([Request]) --> SR[Static Rules
deterministic floor]
REQ --> AE[Anomaly Engine
Isolation Forest
+ Prophet baseline]
SR --> JOIN{Combine}
AE --> JOIN
JOIN -- score < 0.3 --> ALLOW[Allow]
JOIN -- 0.3 - 0.7 --> CAP[Invisible CAPTCHA]
JOIN -- 0.7 - 0.9 --> THR[Throttle
+ MFA prompt]
JOIN -- > 0.9 --> BLK[Block + SOC alert]
style AE fill:#9b72cf,stroke:#9b72cf,color:#fff
style SR fill:#38b265,stroke:#38b265,color:#fff
📌 Important — the AI adds to the old rules, it doesn't replace them. The simple static rules stay as a safety floor: predictable, easy to debug, easy to explain to auditors. The AI sits on top and catches the things simple rules miss. If the AI breaks, the floor still works — nothing falls through.
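A sketch of the combine step under that rule — the AI may only tighten the verdict, never loosen it. Verdict names and thresholds come from the pipeline above; the function names are illustrative:

```javascript
// Verdicts ordered from most to least permissive.
const ORDER = ['allow', 'captcha', 'throttle', 'block'];

// Map the anomaly score onto the pipeline's bands.
function aiVerdict(anomalyScore) {
  if (anomalyScore > 0.9) return 'block';
  if (anomalyScore > 0.7) return 'throttle';
  if (anomalyScore >= 0.3) return 'captcha';
  return 'allow';
}

function combine(staticVerdict, anomalyScore) {
  const ai = aiVerdict(anomalyScore);
  // take the stricter of the two — the static floor always holds,
  // so a broken or over-lenient model can never override a rule-based block
  return ORDER.indexOf(ai) > ORDER.indexOf(staticVerdict) ? ai : staticVerdict;
}
```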
Things to watch out for in production
📊 Watch for "drift"
User behavior changes over time, so the AI's old understanding can get stale. Alert if signals drift too far from what the AI learned. Run A/B tests every week comparing the live model to a newer one.
🧪 Run silently first ("shadow mode")
Before letting the AI block real users, run it for 2 weeks just watching — logging what it would have done. Then compare with what actually happened.
🔐 Protect user privacy
Never store sensitive info in plain text. Hash IPs and personal data before saving. Keep recent data 30 days, archived data 6 months — GDPR rules.
⚖️ Be able to explain decisions
For every high-risk block, log why the AI made that call (e.g., "new country + new device + odd time"). Required by GDPR if a user asks "why was I blocked?"
🧯 Have an off switch
One feature flag must instantly turn the AI off and fall back to the simple rules. Test this disaster scenario regularly so you know it works.
💰 Cost vs. benefit
XGBoost inference is cheap — assume ~₹0.15 per 1,000 decisions. At ~150 M scored requests/day (≈5,200 QPS at peak), that's ~₹22 K/day. Cutting SMS volume by 30–50% saves millions per day, so it pays for itself many times over.
📝 The whole AI layer in one line: one AI decides "should we even send an OTP?", another AI decides "how aggressively should we block this traffic?" Both sit on top of the existing OTP system — nothing about the OTP itself (how it's generated, hashed, stored, used once) changes. Needs about 3 months of past login data to train initially.
AI Layer ③
Why Add It — Use Cases & Security Wins
A 6-digit OTP is a thin line of defense. Once a user is tricked into reading their OTP to an attacker, it's already over. Static rate limits are easy to map and bypass. The AI layer turns OTP from "challenge everyone the same way" into "challenge intelligently."
💡 In plain words: the basic OTP system is like a single locked front door — fine against casual intruders. The AI layer adds a security camera, a smart doorbell that recognizes regulars, and a doorman who notices when someone's casing the place. Each one alone helps; together they stop attackers your front door can't.
🛡️ Attacks defeated that base OTP cannot stop
🎣 OTP Phishing / Vishing
User is socially engineered into reading the OTP to an attacker on a fake support call. Plain OTP only checks "code matches?" and lets it through. Risk scoring catches the surrounding signal — new device, unusual geo, screen-share active, abnormal session timing — and demands step-up.
📱 SIM Swap
Attacker convinces the carrier to port the user's number. OTP delivers right into their hand. The AI layer notices: device fingerprint changed, IP from a new ASN, behavioral biometrics don't match → blocks the login and alerts the user via the old contact.
🌍 Credential Stuffing
Attacker has 1000 leaked passwords from another site and tries them all on your login from 500 rotating IPs. No single IP makes enough attempts to trigger the static limit. But the AI notices the bigger picture: way too many different IPs / devices, low success rate → blocks the whole pattern.
💣 Distributed SMS Bombing
Attacker hammers the "resend OTP" button from many rotating IPs — to drain your SMS budget or harass a specific user. Per-IP rules don't catch it. The AI's traffic forecast spots the unusual spike compared to normal hours → blocks and alerts the security team.
🐢 Slow-and-Low Brute Force
A patient attacker spreads guesses over many days to fly under any static "X attempts per hour" rule. The AI's forecast model knows that "20 attempts per hour at 3 am from this country" isn't normal — even when no single time window technically breaks any rule.
🤖 Bot Signup Fraud
Bots register thousands of fake accounts to claim signup bonuses or referral payouts. Behavioral biometrics catch them — the typing rhythm of a bot looks robotic and identical across accounts. Device fingerprints reveal the same cluster of fake "phones" being reused.
💡 Where the AI layer pays off most
🏦 Banking / Fintech
High-value transfers force step-up; small recurring ones skip OTP. Cuts SMS spend ~40% while reducing fraud losses. Aligns with the RBI risk-based AFA mandate.
🛒 E-commerce
Trusted devices skip OTP at checkout → fewer cart abandonments. Suspicious order velocity (gift cards, multiple shipping addresses) auto-escalates.
🏥 Healthcare
PHI access from unfamiliar devices/IPs gets stronger verification automatically — meeting the HIPAA "context-aware access" expectation.
💼 B2B SaaS
Admin actions and bulk-export operations get stepped up. Service-account login from a new IP triggers alert, not just OTP.
₿ Crypto / Exchanges
Withdrawals to a new address, above-threshold amounts, or first-time external wallets demand step-up — regardless of OTP success.
📞 Telco / Number Portability
SIM-swap defense for the carrier flow itself. Catches fraudsters attempting to take over the carrier account that delivers OTPs.
📊 Measured impact at scale
| Metric | Base OTP only | + AI layer | Delta |
|---|---|---|---|
| SMS volume / day | 150 M | ~75–105 M | −30 to −50% |
| SMS spend (@ ₹0.10/msg) | ₹15 M/day | ₹7.5–10.5 M/day | −₹4.5–7.5 M/day |
| Account takeover incidents | baseline | 0.2–0.4× baseline | −60 to −80% |
| OTP screens per active session | 1.0 | 0.5–0.7 | −30 to −50% friction |
| False-positive rate-limit pages | baseline | 0.1–0.3× baseline | −70 to −90% |
| ML inference cost | — | ~₹22 K/day at ~150 M decisions/day | new opex |
ROI in one line: ~₹22 K/day inference cost vs ~₹4.5–7.5 M/day SMS savings → the AI layer pays for itself 200× on SMS alone, before counting reduced fraud losses, lower support volume, and higher conversion from less friction.
📋 Regulatory tailwinds — risk-based auth is becoming required
Laws & standards already requiring this
- PCI-DSS 4.0 — payment card security standard. Now requires risk-based auth for online card payments.
- PSD2 SCA — Europe's payment law. Allows skipping a step only if the bank can show it's a low-risk transaction.
- RBI — India's central bank. Mandates risk-based auth for digital payments.
- NIST 800-63B — US government's authentication guidelines. Recommends "context-aware" (= risk-based) authentication.
Built-in support for audits & user rights
- For every high-risk block, log why the AI made that call (using SHAP — the AI's reasoning)
- Track when the AI's accuracy starts drifting and retrain regularly — keeps auditors happy
- If a user asks "why was I blocked?" (a right under GDPR), we can answer in plain language
The bottom line: the base OTP system protects against opportunistic attackers using common tools. The AI layer protects against motivated attackers who've studied your system — phishers, SIM-swappers, credential-stuffing operators, SMS-bombing fraudsters, and patient slow-burn brute-forcers. If your product handles money, identity, or PHI, this layer is no longer optional.
Step 15
Trade-offs & Interview Talking Points
The bits an interviewer wants to hear you reason about out loud.
| Decision | Alternative | Why this choice |
| Redis as primary store | Postgres only | Sub-ms latency + native TTL; audit still goes to Postgres async |
| Hash (bcrypt) | Plaintext / fast hash | Limits blast radius of any leak; slow hash beats GPU brute force |
| 6-digit numeric | Alphanumeric | UX on phone keypads; brute-force mitigated by rate limits, not entropy |
| UUID otp_id | Use phone as key | Opaque ID — safer to log, no user enumeration |
| DEL on success | is_used = true flag | Atomic; avoids double-verify race |
| Async audit write | Sync write | Keeps verify off the RDBMS critical path |
| Short TTL (2–10 min) | 24-hour validity | Dramatically smaller attack window; fine for UX |
| Generic error responses | Distinguish wrong/expired | Prevents user & state enumeration attacks |
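Several of these rows — DEL on success, atomic attempt draining, and generic errors — combine into one small verify routine. A minimal sketch, using an in-memory `Map` as a stand-in for Redis and SHA-256 as a dependency-free stand-in for bcrypt/Argon2id (record shape and names are illustrative, not the system's actual schema):

```typescript
import * as crypto from "crypto";

// In-memory stand-in for the Redis record (hash + attempts + expiry).
interface OtpRecord { hash: string; attemptsLeft: number; expiresAt: number; }
const store = new Map<string, OtpRecord>();

// SHA-256 stands in for bcrypt/Argon2id to keep this sketch dependency-free.
const hashOtp = (otp: string) =>
  crypto.createHash("sha256").update(otp).digest("hex");

function verifyOtp(otpId: string, submitted: string, now = Date.now()): "ok" | "invalid" {
  const rec = store.get(otpId);
  // Generic "invalid" for missing, expired, and exhausted records alike — no enumeration.
  if (!rec || now > rec.expiresAt || rec.attemptsLeft <= 0) return "invalid";
  rec.attemptsLeft -= 1;                    // drain before comparing (the HINCRBY analogue)
  const a = Buffer.from(hashOtp(submitted));
  const b = Buffer.from(rec.hash);
  if (a.length === b.length && crypto.timingSafeEqual(a, b)) {
    store.delete(otpId);                    // single-use: DEL on success, no is_used flag
    return "ok";
  }
  return "invalid";
}
```

In production the whole read–drain–delete sequence would run atomically inside Redis (e.g. a Lua script) rather than in application memory; the sketch only shows the decision order.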
Step 16
Interview Q&A
The follow-ups that actually come up on system-design loops.
Why can't we just use HMAC(user_id, timestamp) as the OTP?
A TOTP-style derivation works when the client holds a shared secret (Google Authenticator). For SMS delivery the user has no secret, so any formula computable on the server is, in principle, recoverable from observed OTPs. Random generation + slow hash + short TTL is the right primitive for SMS/email.
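The "random" half of that primitive is a one-liner if you use the CSPRNG correctly — a sketch (function name is illustrative):

```typescript
import * as crypto from "crypto";

// Uniform n-digit code from the CSPRNG — never Math.random().
// crypto.randomInt rejects biased samples internally, unlike naive
// modulo arithmetic over random bytes.
function generateOtp(digits = 6): string {
  const max = 10 ** digits;                              // exclusive upper bound
  return crypto.randomInt(0, max).toString().padStart(digits, "0");
}
```

The `padStart` matters: roughly 10% of draws start with a leading zero, and dropping it would both shrink the keyspace and break string comparison against the user's input.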
How do you handle Redis going down?
Run Redis in cluster mode with replicas; use WAIT to require replica ack before replying. On total cluster loss: fail-closed (reject new verifies) and rotate existing users through password re-verification — never fall back to Postgres as primary, because the audit store doesn't have hashes by design.
An attacker keeps generating OTPs for a victim's phone. What do you do?
Layered: (a) Layer-1 per-user 3-per-15-min cap, (b) client-side CAPTCHA after the 2nd request, (c) progressive cooldowns (30s → 2m → 10m) between resends, (d) trust-score on the IP — block IPs with high OTP-request/verify ratios.
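Layers (a) and (c) can be sketched together. A minimal in-process version, assuming the 3-per-15-min cap and 30s → 2m → 10m cooldowns from above (production would keep this state in Redis via INCR/EXPIRE, not process memory):

```typescript
const WINDOW_MS = 15 * 60 * 1000;                    // per-user cap window
const COOLDOWNS_MS = [0, 30_000, 120_000, 600_000];  // 0s, 30s, 2m, 10m between resends

const history = new Map<string, number[]>();         // userId -> granted-request timestamps

function mayRequestOtp(userId: string, now = Date.now()): boolean {
  const recent = (history.get(userId) ?? []).filter(t => now - t < WINDOW_MS);
  if (recent.length >= 3) return false;              // (a) hard per-user cap
  const cooldown = COOLDOWNS_MS[recent.length];      // (c) grows with each resend
  if (recent.length > 0 && now - recent[recent.length - 1] < cooldown) return false;
  recent.push(now);                                  // only granted requests count
  history.set(userId, recent);
  return true;
}
```

Note that denied requests are not recorded, so a bomber hammering the endpoint cannot push the victim's cooldown out indefinitely.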
Why bcrypt and not just PBKDF2-SHA256 with 100k iterations?
Functionally close, but bcrypt's Blowfish key setup works over a ~4 KB state, which raises the cost of custom GPU/ASIC implementations relative to PBKDF2's tiny working set. Argon2id is better still — memory-hard by design. PBKDF2 is fine if it's what your stack already uses; don't rewrite.
Single Redis or multi-region?
OTPs are short-lived and tied to a user's active session — region-local is fine. Don't pay cross-region replication cost. If a user roams mid-flow (rare), force regeneration in the new region.
What prevents an insider from extracting bcrypt hashes and brute-forcing offline?
bcrypt cost-12 ≈ 250 ms/attempt → the full 6-digit space (1 M guesses) ≈ 3 days of serial compute — and only if the dump lands within the 5-min TTL. The TTL is the dominant defense here, not the hash. Combine with: access audits on Redis, at-rest encryption, and alerting on KEYS otp:* scans.
How do you support "voice call" OTP delivery as an option?
Same generate/verify flow — only the delivery adapter changes. Add a channel field (SMS | EMAIL | VOICE | WHATSAPP) and a Strategy interface per channel. Rate limits may need per-channel tuning (voice is expensive).
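A sketch of that Strategy interface, assuming two channels for brevity (class and function names are illustrative; real provider SDK calls are stubbed out):

```typescript
type Channel = "SMS" | "EMAIL" | "VOICE" | "WHATSAPP";

interface DeliveryStrategy {
  readonly channel: Channel;
  send(destination: string, otp: string): Promise<void>;
}

class SmsDelivery implements DeliveryStrategy {
  readonly channel = "SMS" as const;
  async send(destination: string, otp: string): Promise<void> {
    // Provider stub — note we log the destination, never the OTP value.
    console.log(`[sms] would deliver code to ${destination}`);
  }
}

class VoiceDelivery implements DeliveryStrategy {
  readonly channel = "VOICE" as const;
  async send(destination: string, otp: string): Promise<void> {
    // Voice is expensive per call — this channel gets tighter rate limits.
    console.log(`[voice] would call ${destination}`);
  }
}

const strategies = new Map<Channel, DeliveryStrategy>(
  [new SmsDelivery(), new VoiceDelivery()]
    .map(s => [s.channel, s] as [Channel, DeliveryStrategy]),
);

async function deliver(channel: Channel, destination: string, otp: string): Promise<void> {
  const s = strategies.get(channel);
  if (!s) throw new Error("channel not configured");
  await s.send(destination, otp);
}
```

The generate/verify core never touches a provider SDK; adding WhatsApp delivery later is one new class plus one Map entry.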
What about TOTP (Google Authenticator)?
Different trust model — shared secret installed at enrollment, no server-side per-request state. Typically offered alongside SMS OTP as a stronger option. Our system handles both by routing verification to different modules based on the factor the user has configured.
Step 17
Production Checklist
Before shipping any OTP system — tick every box.
- OTPs hashed with bcrypt (cost 12) or Argon2id — never plaintext
- TTL tuned per use case: login 5m, reset 10m, payment 2m, update 3m
- 5-layer rate limiting: per-OTP, per-user, per-IP, backoff, CAPTCHA
- Constant-time hash comparison (crypto.timingSafeEqual or a library's verify)
- Audit log written to Postgres (without OTP values) via async queue
- Single-use enforcement via DEL immediately on success
- Attempt counter drains via atomic HINCRBY; OTP dies at 0
- Exponential backoff between failed attempts
- CAPTCHA triggered after 3 failures
- Redis with TLS in transit + AUTH enabled + at-rest encryption
- Notifications to user's old contact on sensitive changes
- Zero OTP values in app logs, APM, error trackers, slow-query logs
- Monitoring on generation spikes, failure rates, geography drift
- SMS provider health checks + automatic failover to secondary
- Generic error responses — no user/state enumeration leaks
- CSPRNG for OTP generation — never Math.random()
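The constant-time comparison item in code — a minimal sketch (the helper name is illustrative), assuming both sides are hex digests of equal intended length:

```typescript
import * as crypto from "crypto";

// A naive aHex === bHex string compare can return early at the first
// mismatched character, leaking position information through timing.
// crypto.timingSafeEqual always scans the full buffer.
function safeEqualHex(aHex: string, bHex: string): boolean {
  const a = Buffer.from(aHex, "hex");
  const b = Buffer.from(bHex, "hex");
  if (a.length !== b.length) return false;  // timingSafeEqual throws on length mismatch
  return crypto.timingSafeEqual(a, b);
}
```

The length guard is required because `timingSafeEqual` throws rather than returns `false` on buffers of different lengths; hash digests of a fixed algorithm make the lengths equal in the normal path anyway.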
The principle: OTP security isn't about cleverness — it's about discipline. Hash everything. Rate-limit everywhere. Keep TTLs short. Delete on success. The gap between a secure OTP system and a vulnerable one is usually 50 lines of careful code.