High-Level Design

Payment System

Stripe / PayPal style — idempotency, sagas, and a double-entry ledger that never loses a cent. The architecture, the failure modes, and why every box is non-negotiable when real money is on the wire.

Read this with the framework in mind

This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is a Payment System?

Sarah is checking out at 14:02:06 on a Tuesday. The cart says one pair of sneakers, $89.00. She taps "Buy". Within two seconds, four things must happen and they must happen together: her credit card gets charged $89, the merchant's escrow balance goes up by $89, the inventory drops by one pair, and a confirmation email lands in her inbox. If any of those four steps fails — and the rest succeed — somebody loses money. If we charge her card but never tell the merchant, Sarah paid for sneakers that won't ship. If we tell the merchant but the charge fails, the merchant ships sneakers we never collected for. If Sarah retries because her Wi-Fi blinked, we must not charge her twice.

That's what a payment system is: a piece of software whose entire purpose is moving money between three parties — the buyer's bank, the merchant's bank, and the platform's escrow account — without ever creating, destroying, or duplicating a single cent. Stripe, PayPal, Adyen, Razorpay all solve this same problem at scale.

The two questions that drive every design decision below: (1) How do we make a multi-step money transfer atomic across systems we don't control (card networks, banks, our own DB)? (2) How do we guarantee that retries — from network drops, server crashes, double-clicks — never result in a double-charge or a leak?
Why this is harder than a normal CRUD app: in a normal app, a failure means "user sees an error, retries, no harm done". In a payment system, a failure between step 2 (card charged) and step 3 (merchant credited) means real money has left Sarah's bank and is sitting in nobody's account. That's not a bug — that's a regulator-visible incident. So the architecture is built to make those middle-state losses literally impossible.
Step 2

Requirements & Goals

Before drawing a single box, pin down what the system must do — and explicitly what it does not. In an interview, asking these questions is half the score.

✅ Functional Requirements

  • Charge cards — accept a payment from a customer's card / wallet / bank
  • Refunds — partial or full reversal of a previous charge
  • Split payments — one charge, money goes to multiple recipients (marketplace use case)
  • Recurring subscriptions — auto-charge on a schedule
  • Payouts to merchants — move escrow money to merchants' bank accounts on a schedule
  • Multi-currency — accept USD, EUR, INR, etc.; FX conversion when needed

⚙️ Non-Functional Requirements

  • ACID-strict — no money ever lost or duplicated, period
  • Idempotent — every mutation safe to retry, same result every time
  • Highly available — payment downtime is revenue lost forever
  • Sub-second response — Sarah cannot wait 5 seconds at the checkout page
  • Audit-ready — every cent traceable to a state change with a timestamp
  • PCI-DSS compliant — card data handled per industry regulations

🚫 Out of Scope

  • Card-network internals — we integrate with Stripe / Visa / Mastercard; we don't rebuild them
  • Fraud-model training pipeline — we consume the model, not build it
The non-functional requirements are the hard part. Charging a card via Stripe's API is a 5-line problem. Making the charge survive a network drop mid-flight, never double-charge on retry, balance to the cent against Stripe's daily statement, and stay PCI-compliant — that's the architecture.
Step 3

Capacity Estimation & Constraints

Numbers are not optional in HLD. They drive sharding, the ledger's partitioning strategy, and how much storage we provision. Let's pick a Stripe-like mid-size scale.

Traffic estimates

Assume 1 million transactions per day, average ticket size $50, peak load is Black Friday at roughly 40× the daily average concentrated into a few hours.

Avg TPS

~12 TPS

1M / 86400

Peak TPS

~500 TPS

Black Friday spike

GMV / day

~$50M

1M × $50 avg

API latency

p99 < 1s

End-to-end checkout

Storage estimate (5-year regulatory retention)

Each transaction record (transaction + payment + ledger entries + audit log) is roughly 1 KB across the relevant tables. 1M × 1KB = 1 GB/day. Over 5 years (the typical regulatory retention window for financial records): ~2 TB total. With indices, replicas, and audit logs: ~6 TB provisioned.

Ledger entry volume

Every transaction writes at least 2 ledger entries (debit + credit), often more for split payments and FX. Realistic average is 3 entries per transaction. 1M × 3 = 3M ledger rows/day = ~1B rows/year. This is the dominant table by row count and drives the partitioning strategy in §14.
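The arithmetic above can be sanity-checked in a few lines, under the same assumptions (1M transactions/day, $50 average ticket, ~1 KB per transaction, 3 ledger rows per transaction):

```python
TXNS_PER_DAY = 1_000_000
AVG_TICKET_DOLLARS = 50
BYTES_PER_TXN = 1_024
LEDGER_ROWS_PER_TXN = 3

avg_tps = TXNS_PER_DAY / 86_400            # ~11.6, rounds to the "~12 TPS" above
peak_tps = avg_tps * 40                    # Black Friday spike: ~463, call it ~500
gmv_per_day = TXNS_PER_DAY * AVG_TICKET_DOLLARS          # $50M
cost_per_downtime_minute = gmv_per_day / (24 * 60)       # ~$35K / minute
storage_5yr_tb = TXNS_PER_DAY * BYTES_PER_TXN * 365 * 5 / 1e12   # ~1.9 TB raw
ledger_rows_per_year = TXNS_PER_DAY * LEDGER_ROWS_PER_TXN * 365  # ~1.1B rows

print(f"{avg_tps:.0f} TPS avg, {peak_tps:.0f} TPS peak, "
      f"${gmv_per_day/1e6:.0f}M GMV/day, {storage_5yr_tb:.1f} TB raw over 5y")
```

The ~2 TB raw figure triples to the ~6 TB provisioned estimate once indices, replicas, and audit logs are layered on.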

Summary (Metric / Value / Why it matters):

  • Avg TPS: 12/s. Drives base sizing; a small box can do this.
  • Peak TPS: 500/s. Drives autoscale ceilings & rate-limit budgets.
  • GMV / day: $50M. Defines the cost of error; every minute of downtime is ~$35K.
  • 5-yr storage: 2–6 TB. Forces partitioning of the ledger by account_id.
  • Ledger rows / yr: ~1B. Append-only; no UPDATE, ever, only INSERT.
Step 4

System APIs

Three mutating endpoints carry the bulk of the value, plus a read endpoint and webhooks for the async events the merchant cares about. Note the Idempotency-Key header is mandatory on every mutation — this is the contract that lets clients retry safely.

REST API surface
// Create a payment — mutation, requires Idempotency-Key
POST /v1/payments
Headers: { "Idempotency-Key": "1f3b9c2a-..." }
{
  "amount":            8900,           // in cents
  "currency":          "USD",
  "payment_method_id": "pm_abc123",    // Stripe-style token, no PAN
  "customer_id":       "cus_sarah42",
  "merchant_id":       "mer_zappos99",
  "description":       "Order #5512 — sneakers"
}
→ 201 Created  { "id": "pay_...", "status": "succeeded", ... }

// Refund — mutation, requires Idempotency-Key
POST /v1/refunds
Headers: { "Idempotency-Key": "..." }
{ "payment_id": "pay_...", "amount": 8900, "reason": "requested_by_customer" }
→ 201 Created

// Read a payment — safe, no key needed
GET /v1/payments/:id
→ 200 OK { ... }

// Payout to a merchant's bank — mutation, requires Idempotency-Key
POST /v1/payouts
Headers: { "Idempotency-Key": "..." }
{ "merchant_id": "mer_zappos99", "amount": 50000, "currency": "USD" }
→ 201 Created

// Webhooks — async events fired to merchant's HTTPS endpoint
POST → merchant.example.com/webhook
{ "type": "payment.succeeded", "data": { ... } }
   // Retried with exponential backoff for up to 3 days on non-2xx
Why Idempotency-Key is mandatory on every mutation: Sarah's phone might lose signal right after sending the request. Her browser auto-retries. Without an idempotency key, the second request is indistinguishable from a fresh charge, and Sarah pays $178 for one pair of sneakers. With the key, the second request is recognized as a replay, the original result is returned, and she pays $89 once. The key is a UUID generated by the client and included in the request header — analogous to a receipt number on a paper invoice.
Tokens, not PANs: the API never accepts raw card numbers (the PAN — Primary Account Number, the 16-digit number on the front of the card). The client SDK posts the card directly to Stripe's hosted iframe and gets back an opaque pm_... token. Our backend only ever sees the token. This is the single biggest PCI-DSS scope reducer (more in §12).
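The client side of the idempotency contract can be sketched in a few lines. This is illustrative, not a real HTTP client: `flaky_send` is a stand-in transport, and the server-side deduplication is assumed rather than shown.

```python
import uuid

def charge_with_retries(send, payload, max_attempts=3):
    # The key is generated ONCE, before the first attempt, so every
    # retry of this logical charge replays the same receipt number.
    key = str(uuid.uuid4())
    last_err = None
    for _ in range(max_attempts):
        try:
            return send({"Idempotency-Key": key}, payload)
        except TimeoutError as err:
            last_err = err              # transient failure: retry, SAME key
    raise last_err

# Fake transport: times out once, then succeeds. A real server would
# dedupe on the key; here we only verify the key never changes.
seen_keys, calls = [], {"n": 0}
def flaky_send(headers, payload):
    seen_keys.append(headers["Idempotency-Key"])
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("network blip")
    return {"id": "pay_001", "status": "succeeded"}

result = charge_with_retries(flaky_send, {"amount": 8900, "currency": "USD"})
assert result["status"] == "succeeded"
assert len(set(seen_keys)) == 1         # both attempts carried the same key
```

Generating the key inside the retry loop, instead of before it, is the classic bug: each retry would then look like a fresh charge.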
Step 5

Database Schema

A payment system has four core tables, and the relationships matter enormously. Account represents any party that holds a balance — customer, merchant, the platform's own escrow, the platform's revenue. Transaction is a single business event ("Sarah pays $89 to Zappos"). Payment attaches the payment-method specifics. LedgerEntry is the append-only double-entry record — the source of truth for every cent. We will use Postgres with strict serializability here; the trade-offs against NoSQL are covered in §15.

erDiagram
  ACCOUNT {
    string id PK
    bigint balance_cents
    string currency
    string customer_id FK
    string type
  }
  TRANSACTION {
    string id PK
    string idempotency_key UK
    string source_account_id FK
    string dest_account_id FK
    bigint amount_cents
    string status
    timestamp created_at
  }
  PAYMENT {
    string id PK
    string transaction_id FK
    string payment_method_id
    string status
    string error_code
  }
  LEDGER_ENTRY {
    string id PK
    string account_id FK
    string transaction_id FK
    bigint amount_cents
    string entry_type
    timestamp created_at
  }
  ACCOUNT ||--o{ TRANSACTION : "source"
  ACCOUNT ||--o{ TRANSACTION : "dest"
  TRANSACTION ||--|| PAYMENT : "has"
  TRANSACTION ||--o{ LEDGER_ENTRY : "produces"
  ACCOUNT ||--o{ LEDGER_ENTRY : "owns"

Why each table looks the way it does

🔑 idempotency_key UNIQUE on Transaction

The single most important constraint in the schema. Even if the Redis idempotency cache is down or evicted, the database will reject a duplicate insertion with a unique-constraint violation. This is your last line of defense against double-charges, and it is enforced at the storage layer, not the application — meaning even a buggy application server cannot bypass it.

📒 LedgerEntry is append-only

No UPDATE, no DELETE. Ever. Reversing a transaction does not delete the original entries — it adds new entries that compensate. This gives us a perfect audit trail: the entire history of every account is reconstructable by replaying the ledger from the first day. Regulators love this; engineers learn to love it after their first incident.

💰 balance_cents as a derived snapshot

The balance on Account is not the source of truth — it's a cached snapshot derived from SUM(amount_cents) FROM ledger_entry WHERE account_id = ?. The ledger is truth; the balance is convenience. This is what makes the system reconcilable.

💳 Payment ≠ Transaction

A transaction is a business event ("$89 from Sarah to Zappos"); a payment is the mechanism ("Visa card ending 4242, auth code XYZ, captured at 14:02:06"). One transaction has exactly one payment, but separating them lets us swap the payment mechanism (card → wallet → bank transfer) without changing the business semantics.

The audit invariant — write it on the wall: for every transaction, the sum of its ledger entries must equal zero (debits and credits balance). Globally, SUM(amount_cents) FROM ledger_entry across all accounts must equal zero too. If that sum ever diverges from zero, money has been created or destroyed in our system — that is a P0 incident that wakes up the on-call engineer.
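The schema and its audit invariant can be exercised in a runnable sketch, using SQLite as a stand-in for the Postgres tables above; the table and column names mirror the schema but are simplified:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE ledger_entry (
    id INTEGER PRIMARY KEY,
    account_id TEXT NOT NULL,
    transaction_id TEXT NOT NULL,
    amount_cents INTEGER NOT NULL)""")

def post_transaction(tx_id, entries):
    """All entries for one transaction commit atomically and must sum to 0."""
    assert sum(cents for _, cents in entries) == 0, "double-entry violated"
    with db:  # one atomic DB transaction: all rows land or none do
        db.executemany(
            "INSERT INTO ledger_entry (account_id, transaction_id, amount_cents) "
            "VALUES (?, ?, ?)",
            [(acct, tx_id, cents) for acct, cents in entries])

# Sarah's $89 purchase: debit her payment-method account, credit escrow.
post_transaction("tx_001", [("sarah_card", -8900), ("zappos_escrow", +8900)])

# The global invariant: SUM over ALL entries is exactly zero.
(total,) = db.execute("SELECT SUM(amount_cents) FROM ledger_entry").fetchone()
assert total == 0

# A balance is a per-account sum, derived on demand, never stored as truth.
(bal,) = db.execute("SELECT SUM(amount_cents) FROM ledger_entry "
                    "WHERE account_id = 'zappos_escrow'").fetchone()
assert bal == 8900
```

Note there is no UPDATE anywhere: a reversal would call `post_transaction` again with the signs flipped, adding compensating rows rather than touching history.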
Step 6 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it shatters the moment real money flows through it, and the production shape where every box exists to plug a specific failure mode.

Pass 1 — The naive design (and why it breaks)

One app server. It receives the checkout POST, charges the card via Stripe's API, then in the same request handler updates the merchant's balance in the DB, then sends a confirmation email, then returns 200 to the browser. Three calls in series, one happy path.

flowchart LR
  C["Sarah's Browser"] --> APP["App Server"]
  APP -->|"1. charge"| STR["Stripe API"]
  APP -->|"2. update balance"| DB[("MySQL — merchant balance")]
  APP -->|"3. email"| EM["SMTP"]
  APP --> C

Four catastrophic failure modes show up the moment this hits production:

💥 Network drop after step 1, before step 2

Stripe charged Sarah's card for $89. Our app server crashed before writing the merchant balance. Sarah's bank statement shows the charge. Our DB shows nothing. The merchant has no idea Sarah paid and never ships sneakers. Sarah is angry, Zappos is angry, and nobody can find the $89 — it's sitting in our Stripe account with no internal record. Money has effectively leaked out of the system.

💥 Sarah retries — double charge

Browser timed out after step 1, Sarah hits "Buy" again. Our app receives a fresh request that looks identical. We run the whole flow again — Stripe charges Sarah twice. Now $178 has left her account for one pair of sneakers. She files a chargeback, the bank yanks back $89 plus a $25 dispute fee, and we still owe Zappos for the sneakers they shipped.

💥 Hot-row contention on merchant balance

Black Friday: Zappos receives 500 orders per second. Every order does UPDATE merchant SET balance = balance + amount WHERE id = 'zappos'. Postgres serializes these updates on the same row. Lock queue grows, p99 latency climbs from 50ms to 5 seconds, and Sarah sees a spinner on her checkout page.

💥 No audit trail, no recovery path

A single UPDATE balance overwrites history. We can never answer "what was Sarah's balance at 14:02:05?" or "did this charge actually happen?" Regulators want a ledger of every state transition. Without one, we cannot pass an audit and we cannot debug incidents — we can only guess.

Pass 2 — The mental model: Idempotency + Saga + Double-Entry Ledger

The production design is built on three ideas. Each one solves exactly one of the failure modes above. Get these three right and the architecture writes itself.

🎟️ Idempotency Key

Every mutation request carries a UUID generated by the client — a receipt number. The server keeps a record of every key it has seen. If the same key shows up twice, the server returns the original result instead of executing the operation again. Same as a hotel handing back the same room key when you re-show the same booking confirmation — they don't re-book the room.

Solves: double-charge on retry. Sarah can mash the Buy button 100 times — the first request runs, the next 99 return the same result, no extra money moves.

🔄 Saga Pattern

A payment is not one operation — it is a workflow: authorize, then capture, then ledger update, then notify. Each step has a paired compensating action (auth → reverse-auth, capture → refund, ledger → reversing entry). An orchestrator drives the workflow, retrying transient failures, and on permanent failure runs the compensating actions in reverse to undo the partial work. Like a recipe with explicit "if you've already cracked the eggs but ran out of flour, throw the eggs out" instructions.

Solves: partial-failure leaks (Pass 1 problem #1). No state where the card was charged but the merchant wasn't credited.

📒 Double-Entry Ledger

Every transaction writes two entries that sum to zero — a debit on one account and a credit on another. Money is never created or destroyed; only moved. The system's invariant is mathematical: SUM(all ledger entries) = 0. Borrowed directly from 700-year-old accounting practice, because accountants solved this problem long before computers existed. Reconciliation becomes a single SQL query.

Solves: hot-row contention (Pass 1 problem #3) and audit gaps (Pass 1 problem #4). Append-only, no UPDATE locks; balances computed on demand from append history.

Crucially, these three ideas compose. Idempotency makes individual saga steps safe to retry. The ledger gives the saga's compensating actions a clean way to record reversals. The saga orchestrator commits ledger writes in transactions to guarantee step atomicity. Take any one out and the other two break — that is why all three appear in every serious payment system on the planet.
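The saga's compensation ordering can be shown in miniature. This toy orchestrator is not Temporal (which persists workflow state durably and retries transient failures); it only demonstrates the undo-in-reverse rule, and all step and function names are illustrative:

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order; on failure,
    undo completed steps in REVERSE order."""
    done = []
    try:
        for name, action, compensate in steps:
            action()
            done.append((name, compensate))
    except Exception:
        for name, compensate in reversed(done):
            compensate()                 # undo partial work, newest first
        return "compensated"
    return "committed"

log = []
def auth():         log.append("auth")
def reverse_auth(): log.append("reverse-auth")
def capture():      log.append("capture")
def refund():       log.append("refund")
def ledger_write(): raise RuntimeError("ledger DB down")  # permanent failure

status = run_saga([
    ("authorize", auth,         reverse_auth),
    ("capture",   capture,      refund),
    ("ledger",    ledger_write, lambda: None),
])
assert status == "compensated"
# Card was authorized AND captured, so both get undone, refund first:
assert log == ["auth", "capture", "refund", "reverse-auth"]
```

For the compensations themselves to be safe, each must also be idempotent, which is exactly why the three ideas compose rather than stand alone.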

Pass 3 — The production shape

Now the full picture. Every node is numbered ①–⑬ — find its matching card below for what it does and what would break without it. The architecture is split into four planes by responsibility: Ingest (accept the request), Orchestration (drive the workflow), Ledger (record the truth), and Risk (decide whether to allow the transaction).

flowchart TB
  CL["① Client SDK — web / mobile checkout"]
  subgraph INGEST["Ingest Plane"]
    GW["② API Gateway"]
    API["③ Payment API Server"]
    IDC[("④ Idempotency Cache — Redis")]
  end
  subgraph ORCH["Orchestration Plane"]
    ORC["⑤ Payment Orchestrator — Temporal"]
    GA["⑥ Payment Gateway Adapter"]
  end
  subgraph LEDGER["Ledger Plane"]
    LS["⑦ Ledger Service"]
    LDB[("⑧ Ledger DB — Postgres serializable")]
    SNAP[("⑨ Account Snapshot Cache")]
  end
  subgraph RISK["Risk Plane"]
    FRAUD["⑩ Fraud Detection"]
    NOTIF["⑪ Notification Service"]
    REC["⑫ Reconciliation Service"]
    AUD[("⑬ Audit Log — S3 append-only")]
  end
  CL --> GW
  GW --> API
  API --> IDC
  API --> ORC
  ORC --> FRAUD
  ORC --> GA
  GA --> STRIPE["Stripe / Card Networks"]
  ORC --> LS
  LS --> LDB
  LS --> SNAP
  ORC --> NOTIF
  ORC -.events.-> AUD
  REC -.daily.-> LDB
  REC -.daily.-> STRIPE
  style CL fill:#e8743b,stroke:#e8743b,color:#fff
  style GW fill:#171d27,stroke:#9b72cf,color:#d4dae5
  style API fill:#171d27,stroke:#e8743b,color:#d4dae5
  style IDC fill:#171d27,stroke:#3cbfbf,color:#d4dae5
  style ORC fill:#171d27,stroke:#4a90d9,color:#d4dae5
  style GA fill:#171d27,stroke:#4a90d9,color:#d4dae5
  style LS fill:#171d27,stroke:#38b265,color:#d4dae5
  style LDB fill:#171d27,stroke:#38b265,color:#d4dae5
  style SNAP fill:#171d27,stroke:#3cbfbf,color:#d4dae5
  style FRAUD fill:#171d27,stroke:#e05252,color:#d4dae5
  style NOTIF fill:#171d27,stroke:#d4a838,color:#d4dae5
  style REC fill:#171d27,stroke:#d4a838,color:#d4dae5
  style AUD fill:#171d27,stroke:#9b72cf,color:#d4dae5
  style STRIPE fill:#0f1520,stroke:#7b8599,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram to find the matching card below. Each one answers what is it, why is it here, and what would break without it.

Client SDK

The Stripe.js or PayPal SDK loaded in Sarah's browser or mobile app. It does two critical jobs: (1) it tokenizes her card by posting the raw PAN directly to Stripe's hosted iframe, never to our servers — getting back a token like pm_abc123 that our backend can use without ever touching real card data. (2) It generates a fresh UUID for the Idempotency-Key header, locking in the receipt number before the user even clicks "Buy".

Solves: two huge problems at once — keeps PCI scope off our servers (§12) and gives us the idempotency primitive we depend on for safe retries.

API Gateway

The first thing inbound traffic hits. Terminates TLS, enforces rate limits per API key (e.g., a customer can't fire 1000 charges/sec from the same key), validates auth tokens, and forwards clean requests to the payment API server. AWS API Gateway, Kong, or Envoy all fit.

Solves: a misbehaving or malicious client trying to brute-force a fraud attempt. Without the gateway, every bad actor's request reaches our application logic — wasting CPU and risking DB exhaustion. With it, 99% of abuse is rejected at the edge.

Payment API Server

Stateless service. Validates the request body, looks up the customer and merchant, checks the idempotency key in Redis ④, and if it's a new key kicks off a workflow on the Orchestrator ⑤. It returns a synchronous response within ~200ms: it waits on the orchestrator's first decisive result to tell the client whether the payment succeeded, even though downstream work (notifications, snapshot refreshes) continues asynchronously.

Solves: isolating the synchronous request/response contract from the multi-step workflow. The API server's job is "answer the client cleanly"; the saga's job is "finish the work durably". Splitting them lets each scale on its own dimension.

Idempotency Cache (Redis)

Before starting any workflow, the API server runs SET idempotency:<key> <lock-token> NX EX 86400 — an atomic set-if-absent with a 24-hour TTL (plain SETNX cannot attach a TTL, so the NX and EX options go in a single SET command). First time the key is seen, it gets the lock and proceeds. Second time, Redis says "key already exists" and the server returns the original cached response. The cache is the fast path; the database UNIQUE constraint on Transaction.idempotency_key is the slow but bulletproof backup.

Solves: double-charge on retry. Without this, every retry is a fresh charge. With it, retries are safe — you can build clients that retry aggressively without ever fearing a duplicate.

Payment Orchestrator (Temporal)

The brain. A Temporal workflow that codifies the payment saga as a sequence of steps: fraud-check → authorize → capture → write-ledger → notify-merchant. Temporal persists the workflow state at every step boundary, automatically retries transient failures, and runs compensating actions in reverse if a step permanently fails. If the orchestrator pod dies mid-workflow, another picks up exactly where it left off — no work is lost, no work is duplicated.

Solves: the partial-failure leak from Pass 1. Without an orchestrator, your app server crashes between "card charged" and "ledger updated" leave money in limbo. With Temporal, the workflow resumes after the crash and either finishes the work or undoes it.

Payment Gateway Adapter

A thin wrapper around Stripe, Adyen, or whichever upstream processor we use. It exposes a uniform internal API (authorize, capture, refund) so the orchestrator does not need to know which gateway it's talking to. This is also where we do gateway-specific retries with exponential backoff and circuit-breaker logic — if Stripe is timing out, we fail fast instead of pile-driving more requests.

Solves: vendor lock-in and gateway outages. Without an adapter layer, Stripe-specific code is sprinkled through the codebase — making "fail over to Adyen on Stripe outage" a multi-month project. With it, we change one adapter implementation.
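The fail-fast behavior can be sketched with a toy circuit breaker; the threshold, cooldown, and half-open probe here are illustrative choices, not a production policy:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for `cooldown`
    seconds instead of piling more requests onto a struggling upstream."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                # success resets the counter
        return result

cb = CircuitBreaker(threshold=2, cooldown=60)
def stripe_timeout():
    raise TimeoutError("upstream slow")

for _ in range(2):                       # two consecutive failures trip it
    try:
        cb.call(stripe_timeout)
    except TimeoutError:
        pass

failed_fast = False
try:
    cb.call(lambda: "would hit Stripe")  # breaker is open: no upstream call
except RuntimeError:
    failed_fast = True
assert failed_fast
```

In the adapter, the "fail fast" branch is also where a failover to a second gateway (e.g., Adyen) would hook in.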

Ledger Service

The most carefully-guarded service in the system. Its only job is to accept a transaction and write the corresponding double-entry ledger rows in a strict-serializable Postgres transaction. Validates the audit invariant (sum of entries = 0) before committing. Refuses to write anything that violates double-entry. Single-writer per partition for max consistency.

Solves: the source of truth for all money. Without a dedicated ledger service, ledger writes happen scattered across business-logic code paths — meaning a single bug in any caller can violate the audit invariant. With a dedicated service, every write goes through one validated path.

Ledger DB (Postgres serializable)

Postgres in SERIALIZABLE isolation level, with synchronous replication to a hot standby in another availability zone. Append-only LedgerEntry table partitioned by account_id (§14). Every transaction is a single Postgres TX that writes 2+ rows atomically — either all entries land or none do, and the database guarantees this even under crash.

Solves: ACID for money. A NoSQL store with eventual consistency would let us briefly observe a state where the debit landed but the credit hadn't — and a balance read in that window would lie. Postgres serializable says: no, you cannot ever observe such a state.

Account Snapshot Cache

Computing SUM(amount_cents) FROM ledger_entry WHERE account_id = ? is slow once an account has a million entries. We materialize a periodic snapshot — every N minutes a job rolls up the ledger into a per-account balance row, and reads hit the snapshot first. If a snapshot is stale, we read the snapshot plus the small delta of entries since the snapshot timestamp. Truth is still the ledger; this is just a fast cached read.

Solves: read latency on hot accounts. Without snapshots, fetching the platform escrow balance — a single account that touches every transaction — would scan billions of rows. With snapshots, it's an O(1) lookup plus a tiny tail.
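The snapshot-plus-delta read can be sketched as follows. The in-memory `ledger` list and `snapshot` dict stand in for the ledger DB and the snapshot cache, and the watermark is an entry id rather than a timestamp, for simplicity:

```python
# Append-only, id-ordered entries: (entry_id, account_id, amount_cents)
ledger = [
    (1, "escrow", 8900),
    (2, "escrow", 4500),
    (3, "escrow", -2000),
    (4, "escrow", 120),
]
# Rolled-up snapshot: account -> (balance through watermark, watermark id)
snapshot = {"escrow": (13400, 2)}

def read_balance(account_id):
    base, watermark = snapshot.get(account_id, (0, 0))
    delta = sum(amount for entry_id, acct, amount in ledger
                if acct == account_id and entry_id > watermark)
    return base + delta                  # O(1) lookup plus a tiny tail scan

assert read_balance("escrow") == 11520   # 13400 - 2000 + 120
```

The ledger stays the source of truth: deleting the snapshot loses nothing, because the roll-up job can always recompute it from scratch.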

Fraud Detection

Two-stage gate the orchestrator runs before authorizing the card. Stage 1: real-time deterministic rules — velocity checks (this card just tried 10 charges in 60 seconds), IP blacklists, BIN risk. Sub-100ms, blocks the obvious. Stage 2: an ML model (gradient-boosted trees scoring features like amount-vs-customer-history, geo mismatch, device fingerprint) runs in ~100ms and either approves, declines, or flags for manual review.

Solves: chargebacks. A chargeback costs us the disputed amount plus a $25 fee and counts against our gateway's risk score. Stripe will throttle or terminate accounts whose chargeback rate exceeds 1%. Catching even 50% of fraud before authorization pays for the fraud team many times over.
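The stage-1 velocity rule ("this card just tried 10 charges in 60 seconds") is a sliding-window counter; this sketch is illustrative, with made-up thresholds:

```python
from collections import deque

class VelocityCheck:
    """Decline a card that attempts more than `limit` charges
    inside a sliding `window_s`-second window."""
    def __init__(self, limit=10, window_s=60):
        self.limit, self.window_s = limit, window_s
        self.attempts = {}   # card fingerprint -> deque of attempt times

    def allow(self, card, now):
        q = self.attempts.setdefault(card, deque())
        while q and now - q[0] > self.window_s:
            q.popleft()                  # drop attempts outside the window
        if len(q) >= self.limit:
            return False                 # 11th attempt in 60s: decline
        q.append(now)
        return True

vc = VelocityCheck(limit=10, window_s=60)
results = [vc.allow("card_4242", t) for t in range(11)]  # 11 tries in 11s
assert results[:10] == [True] * 10 and results[10] is False
```

In production this state lives in Redis rather than process memory, so every API server sees the same counts.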

Notification Service

The fan-out for downstream effects. Sends Sarah's confirmation email, fires the merchant webhook (POST merchant.example.com/webhook with the payment payload), pushes a real-time event to the merchant dashboard. Critically, this is async — the orchestrator queues notifications and moves on; if Zappos's webhook endpoint is slow, Sarah's checkout doesn't wait. Webhooks retry with exponential backoff for 3 days.

Solves: coupling latency. Without async notifications, every checkout's p99 includes the slowest merchant webhook in the system. With them, the orchestrator commits the ledger and returns success in ~500ms regardless of how slow the merchant's server is.
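One way to sketch the webhook retry schedule; the one-minute base and six-hour cap are assumed values for illustration, not the exact policy stated above:

```python
def backoff_schedule(base=60, cap=6 * 3600, horizon=3 * 86_400):
    """Exponential backoff delays (seconds), doubling each attempt,
    capped at `cap`, with all retries fitting inside `horizon`."""
    delays, elapsed, delay = [], 0, base
    while elapsed + delay <= horizon:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap)
    return delays

sched = backoff_schedule()
assert sched[0] == 60 and sched[1] == 120   # 1 min, 2 min, 4 min, ...
assert max(sched) <= 6 * 3600               # capped at 6 hours
assert sum(sched) <= 3 * 86_400             # all retries inside 3 days
```

Real schedulers usually add jitter to each delay so thousands of failed webhooks don't retry in lockstep.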

Reconciliation Service

A daily batch job. Pulls Stripe's settlement statement (every charge they processed for us yesterday) and joins it against our internal ledger. Every Stripe row should match exactly one of our LedgerEntry rows; every one of our captured payments should match exactly one Stripe row. Discrepancies get flagged for a human and must be resolved within 24 hours per regulator expectations.

Solves: the "did we actually move the money we think we moved" question. Bugs happen. Network glitches happen. Reconciliation is how we catch them within a day instead of discovering at year-end audit that we've been off by $50K for six months.
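The daily join can be sketched as a set comparison keyed by the gateway's charge id; all ids and amounts below are made up for illustration:

```python
# Their side: Stripe's settlement statement. Our side: the internal ledger.
stripe_statement = {"ch_1": 8900, "ch_2": 4500, "ch_3": 1200}
our_ledger       = {"ch_1": 8900, "ch_2": 4400}   # ch_2 drifted, ch_3 missing

def reconcile(theirs, ours):
    """Flag every charge id that is missing on one side or whose
    amounts disagree: (charge_id, their_amount, our_amount)."""
    issues = []
    for cid in theirs.keys() | ours.keys():
        t, o = theirs.get(cid), ours.get(cid)
        if t != o:
            issues.append((cid, t, o))
    return sorted(issues)

flagged = reconcile(stripe_statement, our_ledger)
assert flagged == [("ch_2", 4500, 4400), ("ch_3", 1200, None)]
```

An empty `flagged` list is the daily proof that the ledger and the outside world agree; a non-empty one pages a human.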

Audit Log (S3 append-only)

Every state transition in the system — workflow started, fraud checked, card authorized, ledger committed, notification sent — emits an immutable event record to S3 with object-lock turned on. The bucket is configured so even an admin with root credentials cannot delete or modify written objects. Mirrored to a separate DB for fast queries.

Solves: regulatory and forensic requirements. After any incident, we need to reconstruct exactly what happened. After any audit, regulators need proof that records cannot have been tampered with. S3 object-lock plus append-only is the cheapest, most defensible answer.

Concrete walkthrough — Sarah buys $89 sneakers at 14:02:06

Two scenarios, mapped to the numbered components: the happy path, and a network drop mid-workflow that demonstrates the saga's compensating logic.

✅ Happy path — everything works

  1. Client SDK ① in Sarah's browser tokenizes her Visa, generates idempotency key XYZ-abc-123, POSTs to our API.
  2. API Gateway ② validates auth, rate-limits, forwards to Payment API Server ③.
  3. API server checks Redis ④ — SETNX succeeds (new key), starts a Temporal workflow on the Orchestrator ⑤ with the request as input.
  4. Orchestrator step 1: call Fraud Detection ⑩ — score 0.05, approve.
  5. Orchestrator step 2: call Gateway Adapter ⑥ → Stripe with authorize($89) → success, auth_id=auth_99.
  6. Orchestrator step 3: Gateway Adapter calls capture(auth_99) on Stripe → success, money is on the wire.
  7. Orchestrator step 4: call Ledger Service ⑦ to write two entries — -$89 from Sarah's payment-method account, +$89 to Zappos's escrow account — in a single Postgres TX on Ledger DB ⑧. Snapshot ⑨ refreshed lazily.
  8. Orchestrator step 5: enqueue notifications on Notification Service ⑪ — confirmation email to Sarah, webhook to Zappos.
  9. API server returns 200 { id: pay_..., status: succeeded } to Sarah's browser. Total elapsed: ~600ms. Every state transition logged to Audit Log ⑬.

⚠️ Failure path — network drop after capture, before ledger write

  1. Steps 1-6 run as above. Sarah's card has been captured for $89 — real money has moved.
  2. Right before step 7, the orchestrator pod hosting our workflow loses its network connection.
  3. Temporal's heartbeat fails; the workflow is rescheduled on a different pod within ~10s.
  4. The new pod resumes the workflow at step 7 — not from scratch. Temporal knows steps 1-6 already completed.
  5. Step 7 retries: write ledger entries. Succeeds this time. Workflow continues.
  6. Sarah's browser, which timed out at 30s, hits "Buy" again with the same idempotency key XYZ-abc-123. API server sees the key in Redis, returns the in-flight workflow's eventual result. No second charge, no second ledger entry.
  7. Alternate path: if step 7 keeps failing (say, a ledger DB outage), Temporal eventually escalates to the workflow's compensating actions: it calls Stripe to refund the $89 capture from step 6, undoing it. Sarah sees a "payment failed" error, and the refund means her card ends up not charged. No money lost, no money duplicated.
So what: the architecture exists because money is allergic to "almost". Idempotency means retries are free. The saga means partial failures are recoverable. The double-entry ledger means we can prove every cent. Take any one of those out and you ship a system that, on its first bad Tuesday, charges someone twice or loses $89 into the void — and you find out about it a month later from an angry customer or, worse, a regulator.
Step 7

Idempotency — The Foundational Property

If you take only one idea from this page, take this one. Every other guarantee in the system depends on idempotency working correctly.

An operation is idempotent if executing it twice has the same effect as executing it once. SET balance = 100 is idempotent (no matter how many times you run it, balance = 100). balance = balance + 89 is not idempotent — running it twice charges $178. The entire point of the idempotency-key contract is to make non-idempotent operations look idempotent to the caller, so retries are safe.
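The distinction, runnable — three "retries" of each operation:

```python
balance = 0
for _ in range(3):
    balance = 100            # SET: idempotent, three runs, same end state
assert balance == 100

balance = 0
for _ in range(3):
    balance = balance + 89   # increment: NOT idempotent, effect compounds
assert balance == 267        # three retries tripled the charge
```

Charging a card is inherently the second kind of operation, which is why the key mechanism below exists: it wraps a non-idempotent effect in an idempotent envelope.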

How idempotency keys work end-to-end

sequenceDiagram
  participant CL as Client SDK
  participant API as Payment API
  participant R as Redis
  participant DB as Postgres
  participant ORC as Orchestrator
  Note over CL: Generate UUID once, before request
  CL->>API: POST /v1/payments<br/>Idempotency-Key: XYZ-abc-123
  API->>R: SETNX idempotency:XYZ-abc-123<br/>TTL=24h
  alt Key is new
    R-->>API: OK (lock acquired)
    API->>DB: INSERT Transaction(idem_key=XYZ-abc-123)<br/>UNIQUE constraint
    DB-->>API: OK
    API->>ORC: Start workflow
    ORC-->>API: Result
    API->>R: SET idempotency:XYZ-abc-123 = result
    API-->>CL: 200 + result
  else Key already seen
    R-->>API: EXISTS — return cached result
    API-->>CL: 200 + cached result (no re-execute)
  end

Two layers of defense: Redis is the fast path (1ms lookup), Postgres UNIQUE constraint on Transaction.idempotency_key is the bulletproof backup. Even if Redis is wiped, the database INSERT will fail with a unique-violation error on the duplicate, and our error handler reads the existing row and returns the original result. There is no race condition where a duplicate gets through both layers.
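Both layers in one runnable sketch: a dict stands in for Redis, a SQLite UNIQUE constraint stands in for Postgres, and wiping the dict mid-test shows the backup layer catching the duplicate anyway:

```python
import sqlite3

cache = {}   # "Redis": idempotency_key -> cached response (fast path)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE txn (idempotency_key TEXT UNIQUE, amount_cents INTEGER)")
charges_executed = []

def create_payment(key, amount_cents):
    if key in cache:                       # layer 1: fast-path replay
        return cache[key]
    try:
        with db:
            db.execute("INSERT INTO txn VALUES (?, ?)", (key, amount_cents))
    except sqlite3.IntegrityError:         # layer 2: DB rejects the duplicate
        row = db.execute("SELECT amount_cents FROM txn WHERE idempotency_key=?",
                         (key,)).fetchone()
        return {"status": "succeeded", "amount": row[0]}
    charges_executed.append(amount_cents)  # only reached for a brand-new key
    resp = {"status": "succeeded", "amount": amount_cents}
    cache[key] = resp
    return resp

first = create_payment("XYZ-abc-123", 8900)
cache.clear()                              # simulate Redis being wiped
second = create_payment("XYZ-abc-123", 8900)
assert first == second
assert charges_executed == [8900]          # the card was charged exactly once
```

The key property: even with the cache gone, the duplicate never reaches the charge path, because the UNIQUE constraint turns the second INSERT into a lookup.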

Rules every idempotency-aware service must follow

✅ Do

  • Generate the key on the client side before the first attempt — so all retries reuse it
  • Use a UUID v4 — sufficient entropy, no coordination needed
  • Store the full response, not just "yes done" — so retries see the same payload as the original
  • Tie the key to the exact request body — if the body differs, treat it as a different request and reject
  • TTL the key (24h is typical) — keys aren't kept forever

❌ Don't

  • Generate the key on the server — defeats the purpose; client's retry would generate a new key
  • Use the request body hash as the key — if the user genuinely wants to charge twice for the same amount, you've blocked them
  • Allow same key with different bodies — that's a bug, not an idempotent retry; return 422
  • Forget to make every step in the workflow itself idempotent — Stripe's API also takes idempotency keys; pass them through
End-to-end idempotency is a chain. The client passes a key to our API; our API passes a derived key to Stripe; the orchestrator's ledger-write step uses a third derived key. If any link in the chain is non-idempotent, retries can produce duplicates somewhere in the system. Audit each integration explicitly — Stripe takes Idempotency-Key headers, Postgres takes UNIQUE constraints, internal services take UUIDs in the request body.
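One way to derive the per-link keys deterministically from the client's key, so each downstream call retries with its own stable key. Deterministic uuid5 hashing over a fixed namespace is one common choice, shown here as an illustration rather than any specific system's scheme:

```python
import uuid

# Fixed, app-chosen namespace (illustrative; any stable UUID works).
NAMESPACE = uuid.NAMESPACE_URL

def derive_key(client_key: str, step: str) -> str:
    """Same (client_key, step) always yields the same derived key,
    so a retried step replays with the key it used the first time."""
    return str(uuid.uuid5(NAMESPACE, f"{client_key}:{step}"))

client_key = "XYZ-abc-123"
stripe_key = derive_key(client_key, "stripe.capture")
ledger_key = derive_key(client_key, "ledger.write")

assert stripe_key != ledger_key                       # each link has its own key
assert stripe_key == derive_key(client_key, "stripe.capture")  # stable on retry
```

Hashing instead of generating fresh UUIDs per step is the point: a crashed-and-resumed workflow recomputes exactly the keys its first incarnation sent downstream.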
Step 8

Double-Entry Ledger

The ledger is the single most important piece of the system, and the idea behind it is over 600 years old. Italian merchants in the 1400s figured out that if you record every transaction as two equal-and-opposite entries, errors become detectable and money becomes traceable. Modern payment systems are doing the same thing, just with Postgres instead of leather-bound books.

The mechanic

Every transaction generates at least two LedgerEntry rows. One DEBITs an account, one CREDITs another, and the amounts sum to zero. Sarah's $89 sneaker purchase looks like this:

flowchart LR
  TX["Transaction tx_001<br/>Sarah buys sneakers $89"]
  TX --> E1["Entry A — DEBIT<br/>account: sarah_card<br/>amount: −$89"]
  TX --> E2["Entry B — CREDIT<br/>account: zappos_escrow<br/>amount: +$89"]
  E1 --> SUM["SUM = $0 ✓<br/>audit invariant holds"]
  E2 --> SUM
  style TX fill:#171d27,stroke:#e8743b,color:#d4dae5
  style E1 fill:#171d27,stroke:#e05252,color:#d4dae5
  style E2 fill:#171d27,stroke:#38b265,color:#d4dae5
  style SUM fill:#171d27,stroke:#38b265,color:#d4dae5

The two entries land in the same Postgres transaction, so either both commit or neither does. After the commit: Sarah's payment-method account is $89 lower, Zappos's escrow is $89 higher, and the system as a whole has the same total amount of money it had before.
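The mechanic can be sketched concretely. This is a toy with SQLite standing in for Postgres and illustrative names: both legs go through one transaction, and legs that break the sum-to-zero invariant are rejected before anything is written.

```python
import sqlite3

# Sketch: both legs of a transaction commit atomically, or neither does.
# SQLite stands in for Postgres; schema and names are illustrative.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ledger_entry (tx_id TEXT, account_id TEXT, amount_cents INTEGER)")

def record(tx_id, legs):
    if sum(cents for _, cents in legs) != 0:
        raise ValueError("legs must sum to zero")  # reject before writing
    with db:  # one atomic transaction: both rows commit, or neither does
        for account_id, cents in legs:
            db.execute("INSERT INTO ledger_entry VALUES (?, ?, ?)",
                       (tx_id, account_id, cents))

record("tx_001", [("sarah_card", -8900), ("zappos_escrow", +8900)])
total = db.execute("SELECT SUM(amount_cents) FROM ledger_entry").fetchone()[0]
assert total == 0  # the global invariant holds after commit
```

Amounts are integer cents throughout: money should never be stored as floats.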

Why this beats a simple balance column

📊 Auditability

Every cent ever moved is a row. You can answer "what was Zappos's balance at 14:02:05" by summing entries up to that timestamp. With a single balance column, that history is gone the moment the next transaction overwrites it.

🔐 The mathematical invariant

SELECT account_id, SUM(amount_cents) FROM ledger_entry GROUP BY account_id gives every account's balance. SELECT SUM(amount_cents) FROM ledger_entry across all accounts must equal zero. If it doesn't, money was created or destroyed — and you have a P0 incident regardless of which row caused it.

⚡ No hot-row contention

Updating Zappos's balance with UPDATE merchant SET balance = balance + 89 serializes every Zappos transaction on one row. Inserting a new ledger entry serializes nothing — Postgres can append in parallel. Black Friday goes from "spinner of death" to "actually responsive".

🔄 Reversals are clean

Refunding Sarah doesn't UPDATE or DELETE the original entries. It writes new entries: +$89 to her account, -$89 from Zappos's escrow, with a reference to the original transaction. The original history is preserved; the reversal is a separate audit event.

More complex example — split payment with platform fee

Sarah pays $100 for a service. The platform takes a 3% fee. Three ledger entries, sum still zero:

Entry | Account | Type | Amount
1 | sarah_card | DEBIT | −$100.00
2 | provider_escrow | CREDIT | +$97.00
3 | platform_revenue | CREDIT | +$3.00
Sum | | | $0.00 ✓
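The same three legs in integer cents, with the invariant checked programmatically; account names are illustrative:

```python
# Split payment with a 3% platform fee, in integer cents.
legs = [
    ("sarah_card",       "DEBIT",  -10000),
    ("provider_escrow",  "CREDIT",  +9700),
    ("platform_revenue", "CREDIT",   +300),  # 3% platform fee
]
assert sum(cents for _, _, cents in legs) == 0  # sum is $0.00
```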
The on-call sanity check: a payment system is healthy when SELECT SUM(amount_cents) FROM ledger_entry returns zero. We run that query as a Datadog metric every minute. If it ever drifts, an alarm fires within a minute — because something is fundamentally broken and every additional transaction makes it worse.
Step 9

Saga Pattern with Temporal

A payment is not an atomic operation — it spans multiple systems we don't control. We can't take a global lock across Stripe, our DB, and the merchant's webhook endpoint. The saga pattern is how we get atomicity-like guarantees without distributed transactions.

The saga as code (Temporal workflow)

payment workflow — pseudocode
workflow processPayment(req):
  // each step is an "activity" — auto-retried, persistent
  fraudResult = checkFraud(req)
  if fraudResult.declined: return failure("fraud")

  authId = authorize(req.amount, req.payment_method_id)   // step 1
  try:
    captureId = capture(authId)                             // step 2
    try:
      writeLedger(req.amount, req.source, req.dest)        // step 3
      try:
        notifyMerchant(req)                                // step 4 — async, best-effort
      catch:
        // step 4 failure does NOT roll back; webhooks retried separately
        log("webhook will retry")
    catch e:
      refund(captureId)                                    // compensate step 2
      throw
  catch e:
    reverseAuth(authId)                                    // compensate step 1
    throw

  return success

What Temporal gives us for free

💾 Durable state

Every step's input and output is persisted before the next step runs. If the orchestrator pod dies between step 2 and step 3, a different pod resumes at step 3 — never re-running step 2 (which already moved real money).

🔁 Auto-retry with backoff

Transient failures (Stripe 502, network blip, DB timeout) are retried with exponential backoff up to a configured max. Permanent failures (declined card, validation error) escalate immediately to compensating actions.

🧯 Compensating actions

Each step has an inverse. If we capture a charge but can't write the ledger, the orchestrator runs refund against Stripe to undo the capture. Sarah sees an error; her card is not charged. No half-completed state survives.
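The core compensation logic can be sketched as a generic saga runner. This is a toy stand-in for what Temporal does durably (no persistence, no retries), with illustrative step names: completed steps are undone in reverse order when a later step fails.

```python
# Toy saga runner: run steps forward; on failure, compensate completed
# steps in reverse order. Step and function names are illustrative.
def run_saga(steps):
    done = []  # (name, compensate, output) of completed steps
    try:
        for name, action, compensate in steps:
            out = action()
            done.append((name, compensate, out))
        return "success"
    except Exception:
        for name, compensate, out in reversed(done):
            if compensate is not None:
                compensate(out)  # undo work that already happened
        return "compensated"

log = []

def auth():
    log.append("auth")
    return "auth_1"

def capture():
    log.append("capture")
    return "cap_1"

def ledger_write():
    raise RuntimeError("ledger db down")  # simulate a permanent failure

steps = [
    ("authorize", auth,         lambda out: log.append("reverse_" + out)),
    ("capture",   capture,      lambda out: log.append("refund_" + out)),
    ("ledger",    ledger_write, None),
]
result = run_saga(steps)
assert result == "compensated"
assert log == ["auth", "capture", "refund_cap_1", "reverse_auth_1"]
```

Note the compensation order: the capture is refunded before the authorization is reversed, mirroring the nested try/catch in the pseudocode above.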

Saga vs. 2-Phase Commit (2PC)

2PC is the classical answer to multi-system atomicity: a coordinator asks every participant "can you commit?", then if all say yes, says "commit". It does not work for us for two reasons:

🚫 Stripe doesn't support 2PC

Card networks have no "prepare-to-commit" stage. Once we tell Stripe to capture, the money moves. You cannot ask the world's payment networks to please pause their transaction until our other systems are ready.

🚫 2PC blocks under coordinator failure

If the 2PC coordinator dies after sending "prepare" but before "commit", participants hold locks indefinitely waiting for the verdict. In a payment context, that means the merchant's account row is locked for hours. Saga has no global lock, so failures degrade gracefully.

Saga vs. 2PC in one line: 2PC tries to give you atomicity by holding everything hostage until everyone agrees; saga gives you atomicity by being willing to undo what's already been done. The first works only when every participant speaks the protocol; the second is the only thing that works across systems you don't own.
Step 10

Fraud Detection

Fraud is the single biggest non-engineering risk to a payment platform. A 1% chargeback rate gets you throttled by Stripe; a 2% chargeback rate gets your account terminated and you lose your business overnight. Catching fraud before the charge is therefore worth a lot of latency budget.

Two-stage gate

Stage 1 — Real-time deterministic rules

Sub-100ms. Cheap, fast, blocks the obvious. Examples:

  • Velocity check — same card seen 10 times in 60 seconds across our system → block
  • IP blacklist — known fraudster IP / Tor exit node / data-center IP
  • BIN risk — the card-issuer prefix tells us if it's a prepaid card from a high-fraud country
  • Geo mismatch — billing address in California, IP in Nigeria, shipping address in Russia
  • Amount threshold — first-ever charge on a brand-new account for $9999 → manual review

Implemented as a Redis-backed counter set + hot rule list, evaluated in <50ms.
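The velocity rule can be sketched with a sliding window. Here a dict of timestamp deques stands in for the Redis counter set, the 10-per-60-seconds threshold is the example rule above, and the function and variable names are illustrative:

```python
import time
from collections import defaultdict, deque

# Sliding-window velocity check: block the 11th sighting of a card in 60s.
# A dict of deques stands in for Redis; names are illustrative.
seen = defaultdict(deque)  # card fingerprint -> recent event timestamps

def velocity_ok(card, now=None, limit=10, window=60.0):
    now = time.time() if now is None else now
    q = seen[card]
    while q and q[0] <= now - window:  # evict events older than the window
        q.popleft()
    q.append(now)
    return len(q) <= limit

base = 1_000_000.0
results = [velocity_ok("card_42", now=base + i) for i in range(11)]
assert results[:10] == [True] * 10  # first 10 within 60s pass
assert results[10] is False         # the 11th is blocked
```

In production the same shape is usually a Redis key per card with INCR plus an expiry, which avoids unbounded memory.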

Stage 2 — ML scoring

~100ms. Catches the subtle. A gradient-boosted-tree model trained on years of historical fraud data scores every transaction on a 0-1 risk scale based on ~200 features:

  • Customer history — average transaction size, time-since-signup, prior chargebacks
  • Merchant profile — chargeback rate, vertical, whether this is a typical purchase
  • Device signals — fingerprint, browser, OS, language
  • Transaction shape — amount-vs-typical, time-of-day, items in cart

Score > 0.9 → auto-decline. Score 0.5-0.9 → manual review queue. Score < 0.5 → approve.
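The thresholds translate directly into a routing function; a sketch of the decision boundaries above:

```python
# Route an ML risk score (0-1) into the three outcomes described above.
def route(score):
    if score > 0.9:
        return "decline"        # auto-decline
    if score >= 0.5:
        return "manual_review"  # human review queue
    return "approve"

assert route(0.95) == "decline"
assert route(0.70) == "manual_review"
assert route(0.10) == "approve"
```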

The latency-vs-precision trade-off: we pay roughly 150ms of fraud-check latency on every transaction, which gets us a chargeback rate near 0.1% (industry-leading). Skipping fraud entirely would let us return responses in 400ms instead of 550ms — and would also kill the business inside a year. There is no useful version of "skip fraud to go faster".
Step 11

Reconciliation

"Trust but verify" is the entire job. Every day, a batch process compares our internal ledger against the source-of-truth statements from Stripe, the card networks, and the banks we settle with. The two views must match to the cent. If they don't, an engineer is paged.

The daily reconciliation pipeline

  1. 02:00 UTC — pull yesterday's settlement statement from Stripe via their API. Lists every charge they processed for us, with their internal IDs and our metadata.
  2. 02:15 UTC — query our LedgerEntry table for every successful payment from yesterday.
  3. 02:30 UTC — full outer join on (payment_id, amount, currency). Three buckets: (a) match on both sides — green; (b) in Stripe but not in our ledger — red, money received but unrecorded; (c) in our ledger but not in Stripe — red, we think we got paid but Stripe disagrees.
  4. 02:45 UTC — non-empty red buckets page the on-call. Issue must be triaged within 24 hours per regulator expectations.
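Once both sides are normalized to (payment_id, amount, currency) tuples, the 02:30 join reduces to set operations. A sketch with made-up rows:

```python
# Normalized rows from each side; the values are made up for illustration.
stripe_rows = {("pay_1", 8900, "USD"), ("pay_2", 500, "USD"),
               ("pay_3", 1200, "EUR")}
ledger_rows = {("pay_1", 8900, "USD"), ("pay_2", 500, "USD"),
               ("pay_4", 300, "USD")}

matched        = stripe_rows & ledger_rows  # bucket (a): green
missing_ledger = stripe_rows - ledger_rows  # bucket (b): received, unrecorded
missing_stripe = ledger_rows - stripe_rows  # bucket (c): recorded, unconfirmed

assert missing_ledger == {("pay_3", 1200, "EUR")}
assert missing_stripe == {("pay_4", 300, "USD")}
page_on_call = bool(missing_ledger or missing_stripe)  # red buckets page
assert page_on_call
```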

Common discrepancy patterns

⏱️ Timing skew

A transaction captured at 23:59 UTC might land in our ledger today and Stripe's statement tomorrow. Resolved by widening the window — match against today's and yesterday's Stripe statement.

🔄 Async webhook lag

Stripe webhooks for capture-success can arrive after the next-day statement. We use the workflow's own state, not just webhooks, as the truth — webhooks are a hint, not the source.

🐞 Real bug

Rare but real: an idempotency-key collision, a saga that didn't compensate properly, a manual UPDATE someone ran in production. All-hands-on-deck investigation; the ledger gets corrective entries (existing rows are never modified) once the root cause is found.

The recon dashboard is the heartbeat of trust. Investors, auditors, and regulators all look at the same number: yesterday's reconciliation green-rate. 100% green every day is not a luxury — it is the whole reason a payment platform is allowed to exist.
Step 12

PCI-DSS Compliance

PCI-DSS is the payment-card industry's data-security standard. It is non-optional for anyone handling card data, and the cost of compliance scales sharply with how much of our infrastructure touches card numbers. The goal is therefore to never see a real card number anywhere on our servers.

Tokenization — the single most important PCI scope reducer

sequenceDiagram
  participant U as Sarah's Browser
  participant SDK as Stripe.js iframe
  participant ST as Stripe Vault
  participant API as Our API
  participant DB as Our DB
  U->>SDK: Types card 4242-4242-4242-4242
  Note over SDK: Card data NEVER leaves the iframe<br/>Posted directly to Stripe
  SDK->>ST: Vault the card
  ST-->>SDK: token = "pm_abc123"
  SDK->>U: Page receives only token
  U->>API: POST /v1/payments { payment_method_id: "pm_abc123" }
  Note over API: Backend never sees PAN —<br/>only the opaque token
  API->>DB: INSERT (..., payment_method_id="pm_abc123")
  API->>ST: charge using pm_abc123

The Stripe.js iframe is hosted on Stripe's domain, so it doesn't even share the same browsing context as our checkout page. Card data is captured by Stripe, vaulted by Stripe, and we receive an opaque token like pm_abc123 that we can use to charge but cannot reverse-engineer back into a real card number. This single design decision moves our PCI scope from "we are a processor" (PCI Level 1, hundreds of pages of audit) to "we are a tokenized merchant" (PCI SAQ-A, a checklist).

The other PCI guardrails

🔐 Network isolation

Servers that handle payment tokens live in a tightly firewalled VPC with no inbound internet access. Egress is whitelisted to Stripe's IPs and our other services only, and all egress is logged for audit review.

📜 Logged everything, redacted forever

Application logs run through a redaction pipeline that strips anything matching a card-number pattern (Luhn-checkable digit sequences) before writing to the log store. Even if a developer accidentally logs req.body, the PAN never lands on disk.
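A minimal sketch of such a redaction filter, assuming a regex for 13-19 digit runs plus a Luhn check; real pipelines also handle separators like spaces and dashes:

```python
import re

def luhn_ok(digits):
    # Standard Luhn: double every second digit counting from the right
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(line):
    # Mask only digit runs that pass the Luhn check, so order IDs survive
    def mask(m):
        return "[REDACTED-PAN]" if luhn_ok(m.group(0)) else m.group(0)
    return re.sub(r"\d{13,19}", mask, line)

assert redact("card=4242424242424242 amount=8900") == \
       "card=[REDACTED-PAN] amount=8900"
assert redact("order 1234567890123") == "order 1234567890123"  # fails Luhn
```

The Luhn gate is what keeps the filter from shredding innocent 13-digit identifiers while still catching any plausible PAN.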

🔑 Secrets via vault, rotated quarterly

Stripe API keys, signing secrets, and DB credentials are held in HashiCorp Vault or AWS Secrets Manager. Apps fetch at boot, no secrets in env files or git. Rotated every 90 days minimum.

📋 Annual audit + quarterly scans

An external QSA (Qualified Security Assessor) audits us yearly. Quarterly ASV (Approved Scanning Vendor) scans probe our public surface for vulnerabilities. Findings are tracked to closure within 30-90 days depending on severity.

The one-line summary an auditor wants to hear: "We don't store, process, or transmit primary account numbers — we tokenize at the client and only handle opaque tokens server-side." If that sentence is true and provable, you've reduced ~80% of the PCI burden.
Step 13

Multi-Currency

Sarah pays $89 USD; the merchant settles in EUR. Naïvely you might think "convert the amount, save the converted number". That breaks the audit invariant — the FX spread has to live somewhere too, and the conversion rate at transaction time has to be locked or you can't reconcile.

The rule — entries are always in the account's native currency

An account has a fixed currency. sarah_card is USD. zappos_escrow_eur is EUR. A ledger entry's amount is always denominated in that account's currency. FX conversion is itself a transaction with multiple legs — and the spread is credited to a platform revenue account.

Sarah pays $89 USD, merchant settles EUR at rate 1 USD = 0.92 EUR

# | Account | Currency | Amount | Note
1 | sarah_card | USD | −$89.00 | Card debited
2 | platform_fx_pool_usd | USD | +$89.00 | USD enters platform
3 | platform_fx_pool_eur | EUR | −€81.88 | EUR leaves platform pool
4 | zappos_escrow_eur | EUR | +€81.88 | Merchant credited at 1 USD = 0.92 EUR
5 | platform_fx_revenue_eur | EUR | +€0.00 | Spread / margin (if applicable)

Each currency's entries sum to zero on their own — USD entries (1, 2) sum to zero, EUR entries (3, 4, 5) sum to zero. The platform takes on the FX risk; we hedge by holding currency pools and rebalancing them periodically with our banking partners.
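The per-currency invariant is easy to check mechanically. A sketch over the five legs above, in minor units:

```python
from collections import defaultdict

# The five FX legs above, in minor units (cents / euro cents).
legs = [
    ("sarah_card",              "USD", -8900),
    ("platform_fx_pool_usd",    "USD", +8900),
    ("platform_fx_pool_eur",    "EUR", -8188),
    ("zappos_escrow_eur",       "EUR", +8188),
    ("platform_fx_revenue_eur", "EUR",     0),
]
net = defaultdict(int)
for _account, currency, amount in legs:
    net[currency] += amount
assert all(v == 0 for v in net.values())  # each currency nets to zero alone
```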

The locked-rate rule: the FX rate used for entries 3 and 4 is captured at transaction time, written into the Transaction row, and never recalculated. If the rate later moves, that's our P&L — but the ledger never changes. This is what makes multi-currency reconcilable.
Step 14

Data Partitioning

1 billion ledger rows per year does not fit on one box once we factor in indices, replicas, and operational headroom. We partition the LedgerEntry table by account_id.

Why account_id, not transaction_id or time

✅ Account_id (chosen)

Most queries are "give me all entries for account X" — balance lookup, statement generation, audit. Sharding by account_id keeps an account's full history co-located on one shard, so SUM queries are local and fast.

❌ Transaction_id

Spreads each transaction's debit and credit entries across different shards — meaning every transaction commit becomes a distributed write. Atomic ledger writes become hard.

❌ Time

Today's shard is hot; yesterday's is cold. Hot shard becomes the bottleneck. Better used as a secondary partition (sub-partition by month within each account-shard) for archival.

Hot accounts — the platform escrow problem

Most accounts are tiny — a customer might have a few transactions per year. But the platform escrow account touches every single transaction we process. At 1M tx/day, escrow has 2M+ entries/day on a single shard. That shard becomes a write bottleneck.

Solution: sub-shard hot accounts. The platform escrow is virtually represented as N "shards" (escrow_001, escrow_002, …, escrow_032). Writes are randomly assigned to one of the N. Reads roll up across all N. The balance is SUM over the sub-shards. Net effect: we trade a tiny amount of read complexity for 32× write throughput on the hot account.
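The scatter-write / gather-read pattern in miniature; N=32 and the shard naming are the illustrative values from above:

```python
import random

# Sub-sharding a hot account: writes scatter across N virtual shards,
# reads gather by summing them. N and the naming are illustrative.
N = 32
balances = {f"escrow_{i:03d}": 0 for i in range(N)}

def credit_escrow(amount_cents):
    shard = f"escrow_{random.randrange(N):03d}"  # random sub-shard per write
    balances[shard] += amount_cents

def escrow_balance():
    return sum(balances.values())  # roll up across all sub-shards

for _ in range(1000):
    credit_escrow(8900)
assert escrow_balance() == 1000 * 8900  # total is exact despite scattering
```

The balance read is now 32 row lookups instead of one, which is the "tiny amount of read complexity" traded for parallel writes.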

Partition by access pattern, not data structure. Account_id is the right key because that's how we query. If our access pattern were "give me yesterday's transactions across all accounts" (e.g., for reconciliation), time-partitioning would win. Pick the partition key by looking at the queries, not the schema.
Step 15

Fault Tolerance

Every component in the system can fail. The interesting question is: does that failure cause money to be lost, money to be duplicated, or just a temporary outage? The first two are unacceptable; the third is recoverable. The architecture is built so that every plausible failure mode lands in bucket three.

What fails | What happens | How we recover
Payment API pod crashes mid-request | Client times out, retries with same idempotency key | Retry hits Redis or DB unique-constraint, returns original result
Orchestrator pod dies between steps | Workflow paused | Temporal reschedules workflow on a healthy pod, resumes at next step
Stripe gateway has 30-min outage | New payments fail on authorize step | Circuit breaker fails fast; client sees error; no money moved; saga not started
Stripe drops mid-capture | Workflow doesn't know if capture succeeded | Idempotent retry to Stripe; if still uncertain, query Stripe API for capture status
Ledger DB primary loses a disk | Sync replica promoted; ~30s of write blockage | Postgres synchronous replication; orchestrator retries blocked writes
Redis idempotency cache wiped | First retry would re-execute | DB UNIQUE constraint on Transaction.idempotency_key catches it; original row read and returned
Notification service down | Webhooks not delivered | Async retry with exponential backoff for 3 days; payment itself unaffected
Whole AZ goes down | ~30% of capacity lost | Multi-AZ deployment: traffic shifted to remaining AZs; sync replicas promoted; degraded for <5min
The mental model: failures are loud, not silent. Every failure surface is either (a) auto-retried by Temporal, (b) blocked by an idempotency-key, (c) caught by reconciliation the next morning, or (d) raised as an alert. There is no quiet failure path where money disappears and nobody knows for a week.
Step 16

Interview Q&A

How do you ensure the same payment isn't charged twice if the user double-clicks?
Idempotency keys, two layers of defense. The client SDK generates a UUID before the first request and reuses it on every retry. Server-side: Redis SETNX idempotency:<key> on the fast path, plus a UNIQUE constraint on Transaction.idempotency_key in Postgres as the bulletproof backup. Even if Redis is wiped, the DB rejects the duplicate insert and we return the original cached response. The client can mash "Buy" 100 times — only the first request runs, the rest return the same result.
What happens if we capture the card but our DB write fails?
Temporal saga compensating action. The orchestrator persisted "capture succeeded" before attempting the ledger write. When the ledger write fails permanently (after retries), the saga runs the compensating action — refund(captureId) against Stripe — undoing the capture. Sarah sees a payment-failed error; her card is not charged. The audit log records all five state transitions: authorized → captured → ledger-failed → refund-issued → refund-confirmed. Money is never lost or duplicated; only briefly in flight.
Why double-entry ledger over a simple balance column?
Auditability, the math invariant, and zero hot-row contention. A balance column overwrites history (you can't answer "what was the balance at 2pm yesterday?"). Double-entry is append-only — every cent's movement is a row, queryable forever. The mathematical invariant SUM(amount_cents) FROM ledger_entry = 0 globally is a free correctness check; if it ever drifts from zero, money has been created or destroyed and we can detect it within a minute. And parallel appends scale horizontally; updating one balance row serializes everything on Postgres row locks.
How would you build the refund flow?
It's a brand-new transaction, not a mutation of the original. A refund request creates a new Transaction with its own idempotency key, a reference to the original payment, and writes new ledger entries that are equal-and-opposite to the original (debit Zappos's escrow, credit Sarah's payment-method-account, both for $89). The orchestrator then calls Stripe's refund API. Original transaction stays in the ledger forever — refund is recorded as a separate event with full audit trail. Partial refunds work the same way with a smaller amount.
How does a saga differ from a 2-phase commit?
2PC takes locks; saga takes responsibility. 2PC asks every participant "can you commit?", holds locks, then says "commit". This requires every participant to support the protocol — and Stripe / Visa / Mastercard do not. 2PC also blocks indefinitely if the coordinator dies after "prepare". Saga gives up on upfront atomicity and instead defines a compensating action for every step. When something fails, the saga undoes the partial work in reverse. Saga works across systems we don't own; 2PC only works among participants that all implement the protocol.
How do you keep PCI scope small?
Tokenize at the client, never see the PAN server-side. Stripe.js (a Stripe-hosted iframe) collects the card number directly in the browser and ships it to Stripe's vault, returning an opaque pm_... token. Our backend only ever stores tokens. We don't store, process, or transmit primary account numbers — which moves us from PCI Level 1 (full audit) to PCI SAQ-A (a checklist). Backed up by network isolation, secret rotation, log redaction, quarterly ASV scans, and an annual QSA audit.
How do you handle a Stripe outage mid-transaction?
It depends where in the saga we were. If we hadn't called Stripe yet (before authorize), the circuit breaker fails fast — client sees error, no money moved, no compensation needed. If we'd authorized but not captured, Temporal retries capture with exponential backoff; if Stripe stays down for hours we run the reverse-auth compensating action and tell the client we can't fulfill. If we'd captured but couldn't confirm, we use Stripe's idempotency-key feature to safely re-query — never re-charge. After Stripe recovers, the reconciliation job verifies our internal ledger matches Stripe's settlement to the cent.
How do you scale the ledger to a billion rows per year?
Partition by account_id; sub-shard hot accounts. Most accounts have tiny histories so partitioning by account_id keeps each account's full record co-located for fast SUM queries. The platform escrow account — which touches every single transaction — gets sub-sharded into 32 virtual shards (escrow_01..escrow_32) so writes don't bottleneck on one row. Reads roll up by summing across the sub-shards. Snapshot caches materialize per-account balances every few minutes so balance reads are O(1) instead of O(N) over the entire ledger.
The one-line summary the interviewer remembers: "It's a saga-orchestrated workflow with strict idempotency keys writing into a double-entry ledger — the saga handles partial-failure recovery, the keys make retries safe, and the ledger guarantees that SUM(all entries) = 0 always holds. Every other component exists to support those three properties."