Stripe / PayPal style — idempotency, sagas, and a double-entry ledger that never loses a cent. The architecture, the failure modes, and why every box is non-negotiable when real money is on the wire.
This deep dive applies the standard five-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Sarah is checking out at 14:02:06 on a Tuesday. The cart says one pair of sneakers, $89.00. She taps "Buy". Within two seconds, four things must happen and they must happen together: her credit card gets charged $89, the merchant's escrow balance goes up by $89, the inventory drops by one pair, and a confirmation email lands in her inbox. If any of those four steps fails — and the rest succeed — somebody loses money. If we charge her card but never tell the merchant, Sarah paid for sneakers that won't ship. If we tell the merchant but the charge fails, the merchant ships sneakers we never collected for. If Sarah retries because her Wi-Fi blinked, we must not charge her twice.
That's what a payment system is: a piece of software whose entire purpose is moving money between three parties — the buyer's bank, the merchant's bank, and the platform's escrow account — without ever creating, destroying, or duplicating a single cent. Stripe, PayPal, Adyen, Razorpay all solve this same problem at scale.
Before drawing a single box, pin down what the system must do — and explicitly what it does not. In an interview, asking these questions is half the score.
Numbers are not optional in HLD. They drive the sharding strategy, ledger sizing, and overall capacity planning. Let's pick a Stripe-like mid-size scale.
Assume 1 million transactions per day, average ticket size $50, peak load is Black Friday at roughly 40× the daily average concentrated into a few hours.
- **~12 TPS** average (1M / 86,400)
- **~500 TPS** peak (Black Friday spike)
- **~$50M** GMV per day (1M × $50 avg)
- **p99 < 1s** end-to-end checkout latency
Each transaction record (transaction + payment + ledger entries + audit log) is roughly 1 KB across the relevant tables. 1M × 1KB = 1 GB/day. Over 5 years (the typical regulatory retention window for financial records): ~2 TB total. With indices, replicas, and audit logs: ~6 TB provisioned.
Every transaction writes at least 2 ledger entries (debit + credit), often more for split payments and FX. Realistic average is 3 entries per transaction. 1M × 3 = 3M ledger rows/day = ~1B rows/year. This is the dominant table by row count and drives the partitioning strategy in §14.
| Metric | Value | Why it matters |
|---|---|---|
| Avg TPS | 12/s | Drives base sizing — small box can do this |
| Peak TPS | 500/s | Drives autoscale ceilings & rate-limit budgets |
| GMV / day | $50M | Defines the cost-of-error — every minute of downtime is ~$35K |
| 5-yr storage | 2-6 TB | Forces partitioning of the ledger by account_id |
| Ledger rows / yr | ~1 B | Append-only — no UPDATE, ever; only INSERT |
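The table's numbers can be re-derived in a few lines; this sketch simply re-runs the back-of-envelope arithmetic from the assumptions above (1M tx/day, $50 ticket, 1 KB/tx, 3 ledger entries/tx):

```python
# Back-of-envelope check of the capacity table (pure arithmetic;
# the inputs are the article's own assumptions).
TX_PER_DAY = 1_000_000
AVG_TICKET_USD = 50
PEAK_MULTIPLIER = 40          # Black Friday vs. daily average
BYTES_PER_TX = 1_000          # ~1 KB across all tables
LEDGER_ENTRIES_PER_TX = 3     # debit + credit + splits/FX on average

avg_tps = TX_PER_DAY / 86_400
peak_tps = avg_tps * PEAK_MULTIPLIER
gmv_per_day = TX_PER_DAY * AVG_TICKET_USD
raw_storage_5y_tb = TX_PER_DAY * BYTES_PER_TX * 365 * 5 / 1e12
ledger_rows_per_year = TX_PER_DAY * LEDGER_ENTRIES_PER_TX * 365

print(f"avg TPS          ~{avg_tps:.0f}")                      # ~12
print(f"peak TPS         ~{peak_tps:.0f}")                     # ~463, budget 500
print(f"GMV / day        ${gmv_per_day / 1e6:.0f}M")           # $50M
print(f"5-yr raw storage ~{raw_storage_5y_tb:.1f} TB")         # ~1.8, before indices/replicas
print(f"ledger rows / yr ~{ledger_rows_per_year / 1e9:.1f}B")  # ~1.1B
```

Note the raw 5-year figure is ~1.8 TB; the 6 TB provisioned number in the table comes from adding indices, replicas, and audit logs on top.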
Four mutating endpoints carry the bulk of the value, plus webhooks for the async events the merchant cares about. Note the Idempotency-Key header is mandatory on every mutation — this is the contract that lets clients retry safely.
```
// Create a payment — mutation, requires Idempotency-Key
POST /v1/payments
Headers: { "Idempotency-Key": "1f3b9c2a-..." }
{
  "amount": 8900,                      // in cents
  "currency": "USD",
  "payment_method_id": "pm_abc123",    // Stripe-style token, no PAN
  "customer_id": "cus_sarah42",
  "merchant_id": "mer_zappos99",
  "description": "Order #5512 — sneakers"
}
→ 201 Created { "id": "pay_...", "status": "succeeded", ... }

// Refund — mutation, requires Idempotency-Key
POST /v1/refunds
Headers: { "Idempotency-Key": "..." }
{ "payment_id": "pay_...", "amount": 8900, "reason": "requested_by_customer" }
→ 201 Created

// Read a payment — safe, no key needed
GET /v1/payments/:id
→ 200 OK { ... }

// Payout to a merchant's bank — mutation, requires Idempotency-Key
POST /v1/payouts
Headers: { "Idempotency-Key": "..." }
{ "merchant_id": "mer_zappos99", "amount": 50000, "currency": "USD" }
→ 201 Created

// Webhooks — async events fired to merchant's HTTPS endpoint
POST → merchant.example.com/webhook
{ "type": "payment.succeeded", "data": { ... } }
// Retried with exponential backoff for up to 3 days on non-2xx
```
Idempotency-Key is mandatory on every mutation: Sarah's phone might lose signal right after sending the request. Her browser auto-retries. Without an idempotency key, the second request is indistinguishable from a fresh charge, and Sarah pays $178 for one pair of sneakers. With the key, the second request is recognized as a replay, the original result is returned, and she pays $89 once. The key is a UUID generated by the client and included in the request header — analogous to a receipt number on a paper invoice.

Card data never touches our servers: the client SDK exchanges the raw card number for a `pm_...` token, and our backend only ever sees the token. This is the single biggest PCI-DSS scope reducer (more in §12).

A payment system has four core tables, and the relationships matter enormously. Account represents any party that holds a balance — customer, merchant, the platform's own escrow, the platform's revenue. Transaction is a single business event ("Sarah pays $89 to Zappos"). Payment attaches the payment-method specifics. LedgerEntry is the append-only double-entry record — the source of truth for every cent. We will use Postgres with strict serializability here; the trade-offs against NoSQL are covered in §15.
**`idempotency_key` UNIQUE on `Transaction`.** The single most important constraint in the schema. Even if the Redis idempotency cache is down or evicted, the database will reject a duplicate insertion with a unique-constraint violation. This is your last line of defense against double-charges, and it is enforced at the storage layer, not the application — meaning even a buggy application server cannot bypass it.
**`LedgerEntry` is append-only.** No UPDATE, no DELETE. Ever. Reversing a transaction does not delete the original entries — it adds new entries that compensate. This gives us a perfect audit trail: the entire history of every account is reconstructable by replaying the ledger from the first day. Regulators love this; engineers learn to love it after their first incident.
**`balance_cents` is a derived snapshot.** The balance on Account is not the source of truth — it's a cached snapshot derived from `SUM(amount_cents) FROM ledger_entry WHERE account_id = ?`. The ledger is truth; the balance is convenience. This is what makes the system reconcilable.
A transaction is a business event ("$89 from Sarah to Zappos"); a payment is the mechanism ("Visa card ending 4242, auth code XYZ, captured at 14:02:06"). One transaction has exactly one payment, but separating them lets us swap the payment mechanism (card → wallet → bank transfer) without changing the business semantics.
`SUM(amount_cents) FROM ledger_entry` across all accounts must equal zero, too. If that sum ever diverges from zero, money has been created or destroyed in our system — that is a P0 incident that wakes up the on-call engineer.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it shatters the moment real money flows through it, and the production shape where every box exists to plug a specific failure mode.
One app server. It receives the checkout POST, charges the card via Stripe's API, then in the same request handler updates the merchant's balance in the DB, then sends a confirmation email, then returns 200 to the browser. Three calls in series, one happy path.
Four catastrophic failure modes show up the moment this hits production:
Stripe charged Sarah's card for $89. Our app server crashed before writing the merchant balance. Sarah's bank statement shows the charge. Our DB shows nothing. The merchant has no idea Sarah paid and never ships sneakers. Sarah is angry, Zappos is angry, and nobody can find the $89 — it's sitting in our Stripe account with no internal record. Money has effectively leaked out of the system.
Browser timed out after step 1, Sarah hits "Buy" again. Our app receives a fresh request that looks identical. We run the whole flow again — Stripe charges Sarah twice. Now $178 has left her account for one pair of sneakers. She files a chargeback, the bank yanks back $89 plus a $25 dispute fee, and we still owe Zappos for the sneakers they shipped.
Black Friday: Zappos receives 500 orders per second. Every order does UPDATE merchant SET balance = balance + amount WHERE id = 'zappos'. Postgres serializes these updates on the same row. Lock queue grows, p99 latency climbs from 50ms to 5 seconds, and Sarah sees a spinner on her checkout page.
A single UPDATE balance overwrites history. We can never answer "what was Sarah's balance at 14:02:05?" or "did this charge actually happen?" Regulators want a ledger of every state transition. Without one, we cannot pass an audit and we cannot debug incidents — we can only guess.
The production design is built on three ideas. Each one solves exactly one of the failure modes above. Get these three right and the architecture writes itself.
Every mutation request carries a UUID generated by the client — a receipt number. The server keeps a record of every key it has seen. If the same key shows up twice, the server returns the original result instead of executing the operation again. Same as a hotel handing back the same room key when you re-show the same booking confirmation — they don't re-book the room.
Solves: double-charge on retry. Sarah can mash the Buy button 100 times — the first request runs, the next 99 return the same result, no extra money moves.
A payment is not one operation — it is a workflow: authorize, then capture, then ledger update, then notify. Each step has a paired compensating action (auth → reverse-auth, capture → refund, ledger → reversing entry). An orchestrator drives the workflow, retrying transient failures, and on permanent failure runs the compensating actions in reverse to undo the partial work. Like a recipe with explicit "if you've already cracked the eggs but ran out of flour, throw the eggs out" instructions.
Solves: partial-failure leaks (Pass 1 problem #1). No state where the card was charged but the merchant wasn't credited.
Every transaction writes two entries that sum to zero — a debit on one account and a credit on another. Money is never created or destroyed; only moved. The system's invariant is mathematical: SUM(all ledger entries) = 0. Borrowed directly from 700-year-old accounting practice, because accountants solved this problem long before computers existed. Reconciliation becomes a single SQL query.
Solves: hot-row contention (Pass 1 problem #3) and audit gaps (Pass 1 problem #4). Append-only, no UPDATE locks; balances computed on demand from append history.
Crucially, these three ideas compose. Idempotency makes individual saga steps safe to retry. The ledger gives the saga's compensating actions a clean way to record reversals. The saga orchestrator commits ledger writes in transactions to guarantee step atomicity. Take any one out and the other two break — that is why all three appear in every serious payment system on the planet.
Now the full picture. Every node is numbered ①–⑬ — find its matching card below for what it does and what would break without it. The architecture is split into four planes by responsibility: Ingest (accept the request), Orchestration (drive the workflow), Ledger (record the truth), and Risk (decide whether to allow the transaction).
Use the numbers in the diagram to find the matching card below. Each one answers what is it, why is it here, and what would break without it.
The Stripe.js or PayPal SDK loaded in Sarah's browser or mobile app. It does two critical jobs: (1) it tokenizes her card by posting the raw PAN directly to Stripe's hosted iframe, never to our servers — getting back a token like pm_abc123 that our backend can use without ever touching real card data. (2) It generates a fresh UUID for the Idempotency-Key header, locking in the receipt number before the user even clicks "Buy".
Solves: two huge problems at once — keeps PCI scope off our servers (§12) and gives us the idempotency primitive we depend on for safe retries.
The first thing inbound traffic hits. Terminates TLS, enforces rate limits per API key (e.g., a customer can't fire 1000 charges/sec from the same key), validates auth tokens, and forwards clean requests to the payment API server. AWS API Gateway, Kong, or Envoy all fit.
Solves: a misbehaving or malicious client trying to brute-force a fraud attempt. Without the gateway, every bad actor's request reaches our application logic — wasting CPU and risking DB exhaustion. With it, 99% of abuse is rejected at the edge.
Stateless service. Validates the request body, looks up the customer and merchant, checks the idempotency key in Redis ④, and if it's a new key kicks off a workflow on the Orchestrator ⑤. It returns a synchronous response within ~200ms: although downstream work may continue asynchronously, the server waits on the orchestrator's first decisive result so the client learns immediately whether the payment succeeded.
Solves: isolating the synchronous request/response contract from the multi-step workflow. The API server's job is "answer the client cleanly"; the saga's job is "finish the work durably". Splitting them lets each scale on its own dimension.
Before starting any workflow, the API server takes a lock with `SET idempotency:<key> <value> NX EX 86400` — an atomic set-if-absent with a 24-hour TTL (the modern, atomic replacement for `SETNX` plus a separate `EXPIRE`). First time the key is seen, the set succeeds and the request proceeds. Second time, Redis reports the key already exists and the server returns the original cached response. The cache is the fast path; the database UNIQUE constraint on Transaction.idempotency_key is the slow but bulletproof backup.
Solves: double-charge on retry. Without this, every retry is a fresh charge. With it, retries are safe — you can build clients that retry aggressively without ever fearing a duplicate.
The brain. A Temporal workflow that codifies the payment saga as a sequence of steps: fraud-check → authorize → capture → write-ledger → notify-merchant. Temporal persists the workflow state at every step boundary, automatically retries transient failures, and runs compensating actions in reverse if a step permanently fails. If the orchestrator pod dies mid-workflow, another picks up exactly where it left off — no work is lost, no work is duplicated.
Solves: the partial-failure leak from Pass 1. Without an orchestrator, your app server crashes between "card charged" and "ledger updated" leave money in limbo. With Temporal, the workflow resumes after the crash and either finishes the work or undoes it.
A thin wrapper around Stripe, Adyen, or whichever upstream processor we use. It exposes a uniform internal API (authorize, capture, refund) so the orchestrator does not need to know which gateway it's talking to. This is also where we do gateway-specific retries with exponential backoff and circuit-breaker logic — if Stripe is timing out, we fail fast instead of pile-driving more requests.
Solves: vendor lock-in and gateway outages. Without an adapter layer, Stripe-specific code is sprinkled through the codebase — making "fail over to Adyen on Stripe outage" a multi-month project. With it, we change one adapter implementation.
The most carefully-guarded service in the system. Its only job is to accept a transaction and write the corresponding double-entry ledger rows in a strict-serializable Postgres transaction. Validates the audit invariant (sum of entries = 0) before committing. Refuses to write anything that violates double-entry. Single-writer per partition for max consistency.
Solves: the source of truth for all money. Without a dedicated ledger service, ledger writes happen scattered across business-logic code paths — meaning a single bug in any caller can violate the audit invariant. With a dedicated service, every write goes through one validated path.
Postgres in SERIALIZABLE isolation level, with synchronous replication to a hot standby in another availability zone. Append-only LedgerEntry table partitioned by account_id (§14). Every transaction is a single Postgres TX that writes 2+ rows atomically — either all entries land or none do, and the database guarantees this even under crash.
Solves: ACID for money. A NoSQL store with eventual consistency would let us briefly observe a state where the debit landed but the credit hadn't — and a balance read in that window would lie. Postgres serializable says: no, you cannot ever observe such a state.
Computing SUM(amount_cents) FROM ledger_entry WHERE account_id = ? is slow once an account has a million entries. We materialize a periodic snapshot — every N minutes a job rolls up the ledger into a per-account balance row, and reads hit the snapshot first. If a snapshot is stale, we read the snapshot plus the small delta of entries since the snapshot timestamp. Truth is still the ledger; this is just a fast cached read.
Solves: read latency on hot accounts. Without snapshots, fetching the platform escrow balance — a single account that touches every transaction — would scan billions of rows. With snapshots, it's an O(1) lookup plus a tiny tail.
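A minimal sketch of the snapshot-plus-delta read, using an in-memory SQLite database as a stand-in for Postgres; the table and column names here are illustrative, not the system's actual DDL:

```python
import sqlite3

# Truth is the ledger; the snapshot row is just a cached roll-up
# with a high-water timestamp ("as_of").
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ledger_entry (
  account_id TEXT, amount_cents INTEGER, created_at INTEGER);
CREATE TABLE balance_snapshot (
  account_id TEXT PRIMARY KEY, balance_cents INTEGER, as_of INTEGER);
""")

db.executemany("INSERT INTO ledger_entry VALUES (?,?,?)",
               [("escrow", 8900, 1), ("escrow", 5000, 2), ("escrow", -2500, 3)])

# Periodic roll-up job: materialize the balance as of t=2.
db.execute("""INSERT INTO balance_snapshot
  SELECT account_id, SUM(amount_cents), 2 FROM ledger_entry
  WHERE account_id = 'escrow' AND created_at <= 2
  GROUP BY account_id""")

def read_balance(account_id):
    # Fast path: snapshot plus only the small delta since the snapshot.
    snap, as_of = db.execute(
        "SELECT balance_cents, as_of FROM balance_snapshot WHERE account_id=?",
        (account_id,)).fetchone()
    (delta,) = db.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger_entry "
        "WHERE account_id=? AND created_at > ?", (account_id, as_of)).fetchone()
    return snap + delta

print(read_balance("escrow"))  # 11400 == 8900 + 5000 - 2500
```

The read stays O(snapshot lookup + tail scan) no matter how many historical entries the account accumulates.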
Two-stage gate the orchestrator runs before authorizing the card. Stage 1: real-time deterministic rules — velocity checks (this card just tried 10 charges in 60 seconds), IP blacklists, BIN risk. Sub-100ms, blocks the obvious. Stage 2: an ML model (gradient-boosted trees scoring features like amount-vs-customer-history, geo mismatch, device fingerprint) runs in ~100ms and either approves, declines, or flags for manual review.
Solves: chargebacks. A chargeback costs us the disputed amount plus a $25 fee and counts against our gateway's risk score. Stripe will throttle or terminate accounts whose chargeback rate exceeds 1%. Catching even 50% of fraud before authorization pays for the fraud team many times over.
The fan-out for downstream effects. Sends Sarah's confirmation email, fires the merchant webhook (POST merchant.example.com/webhook with the payment payload), pushes a real-time event to the merchant dashboard. Critically, this is async — the orchestrator queues notifications and moves on; if Zappos's webhook endpoint is slow, Sarah's checkout doesn't wait. Webhooks retry with exponential backoff for 3 days.
Solves: coupling latency. Without async notifications, every checkout's p99 includes the slowest merchant webhook in the system. With them, the orchestrator commits the ledger and returns success in ~500ms regardless of how slow the merchant's server is.
A daily batch job. Pulls Stripe's settlement statement (every charge they processed for us yesterday) and joins it against our internal ledger. Every Stripe row should match exactly one of our LedgerEntry rows; every one of our captured payments should match exactly one Stripe row. Discrepancies get flagged for a human and must be resolved within 24 hours per regulator expectations.
Solves: the "did we actually move the money we think we moved" question. Bugs happen. Network glitches happen. Reconciliation is how we catch them within a day instead of discovering at year-end audit that we've been off by $50K for six months.
Every state transition in the system — workflow started, fraud checked, card authorized, ledger committed, notification sent — emits an immutable event record to S3 with object-lock turned on. The bucket is configured so even an admin with root credentials cannot delete or modify written objects. Mirrored to a separate DB for fast queries.
Solves: regulatory and forensic requirements. After any incident, we need to reconstruct exactly what happened. After any audit, regulators need proof that records cannot have been tampered with. S3 object-lock plus append-only is the cheapest, most defensible answer.
Two scenarios, mapped to the numbered components: the happy path, and a network drop mid-workflow that demonstrates the saga's compensating logic.
The happy path, step by step:

1. Sarah's browser generates idempotency key `XYZ-abc-123` and POSTs to our API.
2. The Redis idempotency set ④ succeeds (new key); the API server starts a Temporal workflow on the Orchestrator ⑤ with the request as input.
3. The orchestrator calls `authorize($89)` → success, `auth_id=auth_99`.
4. `capture(auth_99)` on Stripe → success, money is on the wire.
5. The Ledger Service ⑦ writes −$89 from Sarah's payment-method account and +$89 to Zappos's escrow account — in a single Postgres TX on Ledger DB ⑧. Snapshot ⑨ refreshed lazily.
6. The API returns `200 { id: pay_..., status: succeeded }` to Sarah's browser. Total elapsed: ~600ms. Every state transition logged to Audit Log ⑬.

The retry: Sarah's connection blinks and her browser re-sends the same request with the same key `XYZ-abc-123`. The API server sees the key in Redis and returns the in-flight workflow's eventual result. No second charge, no second ledger entry.

The compensating path: suppose the network drops mid-workflow and the ledger write fails permanently after the capture succeeded. The orchestrator runs `refund(capture_99)`, undoing the $89 capture. Sarah sees a "payment failed" error; her card is not charged. No money lost, no money duplicated.

If you take only one idea from this page, take this one. Every other guarantee in the system depends on idempotency working correctly.
An operation is idempotent if executing it twice has the same effect as executing it once. SET balance = 100 is idempotent (no matter how many times you run it, balance = 100). balance = balance + 89 is not idempotent — running it twice charges $178. The entire point of the idempotency-key contract is to make non-idempotent operations look idempotent to the caller, so retries are safe.
Two layers of defense: Redis is the fast path (1ms lookup), Postgres UNIQUE constraint on Transaction.idempotency_key is the bulletproof backup. Even if Redis is wiped, the database INSERT will fail with a unique-violation error on the duplicate, and our error handler reads the existing row and returns the original result. There is no race condition where a duplicate gets through both layers.
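Here is a runnable sketch of the two-layer defense, with a plain dict standing in for Redis and SQLite's UNIQUE constraint as the storage-layer backstop; the table and helper names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transactions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  idempotency_key TEXT UNIQUE NOT NULL,
  amount_cents INTEGER, status TEXT)""")

cache = {}  # layer 1: the fast path; may be wiped at any time

def create_payment(key, amount_cents):
    if key in cache:                       # cache hit: replay the original result
        return cache[key]
    try:                                   # layer 2: DB UNIQUE constraint
        cur = db.execute(
            "INSERT INTO transactions (idempotency_key, amount_cents, status) "
            "VALUES (?, ?, 'succeeded')", (key, amount_cents))
        result = {"id": cur.lastrowid, "status": "succeeded"}
    except sqlite3.IntegrityError:
        # Duplicate insert rejected by the constraint:
        # read the existing row and return the original result.
        row = db.execute("SELECT id, status FROM transactions "
                         "WHERE idempotency_key=?", (key,)).fetchone()
        result = {"id": row[0], "status": row[1]}
    cache[key] = result
    return result

first = create_payment("XYZ-abc-123", 8900)
cache.clear()                              # simulate the Redis cache being wiped
retry = create_payment("XYZ-abc-123", 8900)
assert retry == first                      # same result, no second charge
```

Even with the cache gone, the duplicate is caught at the database and the caller sees the original response, which is exactly the contract the article describes.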
Each layer speaks idempotency in its own dialect: HTTP mutations take Idempotency-Key headers, Postgres takes UNIQUE constraints, internal services take UUIDs in the request body.

The ledger is the single most important piece of the system, and the idea behind it is 700 years old. Italian merchants in the 1400s figured out that if you record every transaction as two equal-and-opposite entries, errors become detectable and money becomes traceable. Modern payment systems are doing the same thing, just with Postgres instead of leather-bound books.
Every transaction generates at least two LedgerEntry rows. One DEBITs an account, one CREDITs another, and the amounts sum to zero. Sarah's $89 sneaker purchase looks like this:
The two entries land in the same Postgres transaction, so either both commit or neither does. After the commit: Sarah's payment-method account is $89 lower, Zappos's escrow is $89 higher, and the system as a whole has the same total amount of money it had before.
**Full history, not a single `balance` column.** Every cent ever moved is a row. You can answer "what was Zappos's balance at 14:02:05" by summing entries up to that timestamp. With a single balance column, that history is gone the moment the next transaction overwrites it.
SUM(amount_cents) FROM ledger_entry GROUP BY account_id gives every account's balance. SUM(amount_cents) FROM ledger_entry across all accounts must equal zero. If it doesn't, money was created or destroyed — and you have a P0 incident regardless of which row caused it.
Updating Zappos's balance with UPDATE merchant SET balance = balance + 89 serializes every Zappos transaction on one row. Inserting a new ledger entry serializes nothing — Postgres can append in parallel. Black Friday goes from "spinner of death" to "actually responsive".
Refunding Sarah doesn't UPDATE or DELETE the original entries. It writes new entries: +$89 to her account, -$89 from Zappos's escrow, with a reference to the original transaction. The original history is preserved; the reversal is a separate audit event.
Sarah pays $100 for a service. The platform takes a 3% fee. Three ledger entries, sum still zero:
| Entry | Account | Type | Amount |
|---|---|---|---|
| 1 | sarah_card | DEBIT | −$100.00 |
| 2 | provider_escrow | CREDIT | +$97.00 |
| 3 | platform_revenue | CREDIT | +$3.00 |
| **Sum** | | | **$0.00 ✓** |
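The split above can be written through a sketch of the ledger service's validated path: refuse any batch of legs that does not sum to zero, then commit all legs in one transaction. SQLite stands in for Postgres and the names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ledger_entry (account_id TEXT, amount_cents INTEGER)")

def write_entries(entries):
    # Double-entry invariant enforced before anything touches the table.
    if sum(cents for _, cents in entries) != 0:
        raise ValueError("entries must sum to zero -- refusing to write")
    with db:  # single transaction: all legs commit or none do
        db.executemany("INSERT INTO ledger_entry VALUES (?, ?)", entries)

# Sarah's $100 payment with a 3% platform fee: three legs, zero sum.
write_entries([("sarah_card",       -10_000),
               ("provider_escrow",   +9_700),
               ("platform_revenue",    +300)])

total = db.execute("SELECT SUM(amount_cents) FROM ledger_entry").fetchone()[0]
assert total == 0          # the global invariant holds

try:
    write_entries([("sarah_card", -100)])   # an unbalanced leg is rejected
except ValueError as e:
    print(e)
```

A single validated write path is what makes the "sum of everything is zero" check meaningful: no caller can sneak in an unbalanced leg.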
`SELECT SUM(amount_cents) FROM ledger_entry` returns zero. We run that query as a Datadog metric every minute. If it ever drifts, an alarm fires before the next transaction even completes — because something is fundamentally broken and every additional transaction makes it worse.

A payment is not an atomic operation — it spans multiple systems we don't control. We can't take a global lock across Stripe, our DB, and the merchant's webhook endpoint. The saga pattern is how we get atomicity-like guarantees without distributed transactions.
```
workflow processPayment(req):
    // each step is an "activity" — auto-retried, persistent
    fraudResult = checkFraud(req)
    if fraudResult.declined:
        return failure("fraud")

    authId = authorize(req.amount, req.payment_method_id)    // step 1
    try:
        captureId = capture(authId)                          // step 2
        try:
            writeLedger(req.amount, req.source, req.dest)    // step 3
            try:
                notifyMerchant(req)                          // step 4 — async, best-effort
            catch:
                // step 4 failure does NOT roll back; webhooks retried separately
                log("webhook will retry")
        catch e:
            refund(captureId)                                // compensate step 2
            throw
    catch e:
        reverseAuth(authId)                                  // compensate step 1
        throw

    return success
```
Every step's input and output is persisted before the next step runs. If the orchestrator pod dies between step 2 and step 3, a different pod resumes at step 3 — never re-running step 2 (which already moved real money).
Transient failures (Stripe 502, network blip, DB timeout) are retried with exponential backoff up to a configured max. Permanent failures (declined card, validation error) escalate immediately to compensating actions.
Each step has an inverse. If we capture a charge but can't write the ledger, the orchestrator runs refund against Stripe to undo the capture. Sarah sees an error; her card is not charged. No half-completed state survives.
2PC is the classical answer to multi-system atomicity: a coordinator asks every participant "can you commit?", then if all say yes, says "commit". It does not work for us for two reasons:
Card networks have no "prepare-to-commit" stage. Once we tell Stripe to capture, the money moves. You cannot ask the world's payment networks to please pause their transaction until our other systems are ready.
If the 2PC coordinator dies after sending "prepare" but before "commit", participants hold locks indefinitely waiting for the verdict. In a payment context, that means the merchant's account row is locked for hours. Saga has no global lock, so failures degrade gracefully.
Fraud is the single biggest non-engineering risk to a payment platform. A 1% chargeback rate gets you throttled by Stripe; a 2% chargeback rate gets your account terminated and you lose your business overnight. Catching fraud before the charge is therefore worth a lot of latency budget.
Sub-100ms. Cheap, fast, blocks the obvious: velocity checks (10 charges on one card in 60 seconds), IP blacklists, BIN risk rules.
Implemented as a Redis-backed counter set + hot rule list, evaluated in <50ms.
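A sketch of one such velocity rule, using an in-process sliding window where production would use Redis counters; the threshold values are illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # "this card tried N charges in 60 seconds"
MAX_ATTEMPTS = 10

attempts = defaultdict(deque)  # card fingerprint -> recent attempt timestamps

def velocity_ok(card_fingerprint, now=None):
    now = time.time() if now is None else now
    window = attempts[card_fingerprint]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()              # evict attempts outside the window
    if len(window) >= MAX_ATTEMPTS:
        return False                  # decline before touching the card network
    window.append(now)
    return True

# 10 rapid attempts pass; the 11th inside the same minute is blocked.
results = [velocity_ok("card_4242", now=100.0 + i) for i in range(11)]
print(results.count(True), results[-1])  # 10 False
```

In production the deque becomes a Redis counter with a TTL so the check stays O(1) per request and shared across API pods.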
~100ms. Catches the subtle. A gradient-boosted-tree model trained on years of historical fraud data scores every transaction on a 0-1 risk scale based on ~200 features: amount vs. customer history, geo mismatch, device fingerprint, and the like.
Score > 0.9 → auto-decline. Score 0.5-0.9 → manual review queue. Score < 0.5 → approve.
"Trust but verify" is the entire job. Every day, a batch process compares our internal ledger against the source-of-truth statements from Stripe, the card networks, and the banks we settle with. The two views must match to the cent. If they don't, an engineer is paged.
Rows on each side are matched on the key (payment_id, amount, currency). Three buckets: (a) match on both sides — green; (b) in Stripe but not in our ledger — red, money received but unrecorded; (c) in our ledger but not in Stripe — red, we think we got paid but Stripe disagrees.

Timing mismatches are the common benign case: a transaction captured at 23:59 UTC might land in our ledger today and Stripe's statement tomorrow. Resolved by widening the window — match against today's and yesterday's Stripe statement.
Stripe webhooks for capture-success can arrive after the next-day statement. We use the workflow's own state, not just webhooks, as the truth — webhooks are a hint, not the source.
Rare but real: an idempotency-key collision, a saga that didn't compensate properly, a manual UPDATE someone ran in production. Hands-on-deck investigation; ledger gets corrective entries (never modified) once root cause is found.
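The daily three-bucket match itself reduces to set arithmetic on the matching key; a sketch with made-up rows:

```python
# Each side reduced to its matching key (payment_id, amount_cents, currency).
# The data here is invented for illustration.
ours = {("pay_001", 8900, "USD"), ("pay_002", 5000, "USD"),
        ("pay_003", 1200, "EUR")}
stripe = {("pay_001", 8900, "USD"), ("pay_002", 5000, "USD"),
          ("pay_004", 700, "USD")}

matched           = ours & stripe   # bucket (a): green, both sides agree
missing_in_ours   = stripe - ours   # bucket (b): red, money received but unrecorded
missing_in_stripe = ours - stripe   # bucket (c): red, we think we got paid, Stripe disagrees

print(len(matched), missing_in_ours, missing_in_stripe)
```

At real scale the sets become a SQL full outer join over the day's partitions, but the three buckets are exactly these set differences.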
PCI-DSS is the payment-card industry's data-security standard. It is non-optional for anyone handling card data, and the cost of compliance scales sharply with how much of our infrastructure touches card numbers. The goal is therefore to never see a real card number anywhere on our servers.
The Stripe.js iframe is hosted on Stripe's domain, so it doesn't even share the same browsing context as our checkout page. Card data is captured by Stripe, vaulted by Stripe, and we receive an opaque token like pm_abc123 that we can use to charge but cannot reverse-engineer back into a real card number. This single design decision moves our PCI scope from "we are a processor" (PCI Level 1, hundreds of pages of audit) to "we are a tokenized merchant" (PCI SAQ-A, a checklist).
Servers that handle payment tokens live in a tightly-firewalled VPC with no inbound internet access. Egress is whitelisted to Stripe's IPs and our other services only. Audit logs all egress for review.
Application logs run through a redaction pipeline that strips anything matching a card-number pattern (Luhn-checkable digit sequences) before writing to the log store. Even if a developer accidentally logs req.body, the PAN never lands on disk.
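A sketch of such a redaction pass: mask any 13–19 digit run that passes the Luhn check before the line is written. The regex and mask format are illustrative:

```python
import re

def luhn_valid(digits: str) -> bool:
    # Standard Luhn: double every second digit from the right, subtract 9
    # from two-digit results, and the total must be divisible by 10.
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(line: str) -> str:
    def mask(m):
        run = m.group(0)
        return "[REDACTED-PAN]" if luhn_valid(run) else run
    # Only 13-19 digit runs can be PANs; everything else passes through.
    return re.sub(r"\b\d{13,19}\b", mask, line)

# The Luhn-valid test card number is masked; the short order id survives.
print(redact("req.body={'card': '4242424242424242', 'order': '5512'}"))
```

Filtering on Luhn validity keeps false positives low: most long numeric ids (timestamps, order numbers) fail the checksum and are left intact.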
Stripe API keys, signing secrets, and DB credentials are held in HashiCorp Vault or AWS Secrets Manager. Apps fetch at boot, no secrets in env files or git. Rotated every 90 days minimum.
An external QSA (Qualified Security Assessor) audits us yearly. Quarterly ASV (Approved Scanning Vendor) scans probe our public surface for vulnerabilities. Findings are tracked to closure within 30-90 days depending on severity.
Sarah pays $89 USD; the merchant settles in EUR. Naïvely you might think "convert the amount, save the converted number". That breaks the audit invariant — the FX spread has to live somewhere too, and the conversion rate at transaction time has to be locked or you can't reconcile.
An account has a fixed currency. sarah_card is USD. zappos_escrow_eur is EUR. A ledger entry's amount is always denominated in that account's currency. FX conversion is itself a transaction with multiple legs — and the spread is credited to a platform revenue account.
| # | Account | Currency | Amount | Note |
|---|---|---|---|---|
| 1 | sarah_card | USD | −$89.00 | Card debited |
| 2 | platform_fx_pool_usd | USD | +$89.00 | USD enters platform |
| 3 | platform_fx_pool_eur | EUR | −€81.88 | EUR leaves platform pool |
| 4 | zappos_escrow_eur | EUR | +€81.88 | Merchant credited at 1 USD = 0.92 EUR |
| 5 | platform_fx_revenue_eur | EUR | +€0.00 | Spread / margin (if applicable) |
Each currency's entries sum to zero on their own — USD entries (1, 2) sum to zero, EUR entries (3, 4, 5) sum to zero. The platform takes on the FX risk; we hedge by holding currency pools and rebalancing them periodically with our banking partners.
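The per-currency invariant is easy to check mechanically; here is the five-leg example above, verified leg by leg (amounts in cents):

```python
from collections import defaultdict

# The five legs of the USD -> EUR transaction from the table above.
legs = [("sarah_card",              "USD", -8900),
        ("platform_fx_pool_usd",    "USD", +8900),
        ("platform_fx_pool_eur",    "EUR", -8188),   # EUR 81.88 at 1 USD = 0.92 EUR
        ("zappos_escrow_eur",       "EUR", +8188),
        ("platform_fx_revenue_eur", "EUR",     0)]   # spread, zero in this example

# The multi-currency invariant: entries sum to zero per currency,
# never across currencies.
per_currency = defaultdict(int)
for _account, currency, cents in legs:
    per_currency[currency] += cents

assert all(total == 0 for total in per_currency.values())
print(dict(per_currency))  # {'USD': 0, 'EUR': 0}
```

Summing across currencies would be meaningless (cents of different denominations); the invariant is one zero-sum per currency.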
1 billion ledger rows per year does not fit on one box once we factor in indices, replicas, and operational headroom. We partition the LedgerEntry table by account_id.
Most queries are "give me all entries for account X" — balance lookup, statement generation, audit. Sharding by account_id keeps an account's full history co-located on one shard, so SUM queries are local and fast.
Sharding by transaction_id instead would spread each transaction's debit and credit entries across different shards — meaning every transaction commit becomes a distributed write. Atomic ledger writes become hard.
Sharding by time makes today's shard hot and yesterday's cold; the hot shard becomes the bottleneck. Time is better used as a secondary partition (sub-partition by month within each account-shard) for archival.
Most accounts are tiny — a customer might have a few transactions per year. But the platform escrow account touches every single transaction we process. At 1M tx/day, escrow has 2M+ entries/day on a single shard. That shard becomes a write bottleneck.
Solution: sub-shard hot accounts. The platform escrow is virtually represented as N "shards" (escrow_001, escrow_002, …, escrow_032). Writes are randomly assigned to one of the N. Reads roll up across all N. The balance is SUM over the sub-shards. Net effect: we trade a tiny amount of read complexity for 32× write throughput on the hot account.
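A sketch of the sub-shard write and read paths, with in-memory counters standing in for the N escrow sub-accounts; the helper names are hypothetical:

```python
import random

N = 32  # number of virtual sub-accounts for the hot escrow account
balances = {f"escrow_{i:03d}": 0 for i in range(1, N + 1)}

def credit_escrow(amount_cents):
    # Random assignment spreads row contention across the N sub-shards,
    # so no single row serializes every transaction.
    shard = f"escrow_{random.randint(1, N):03d}"
    balances[shard] += amount_cents

def escrow_balance():
    # Read path: roll up (SUM) over all sub-shards.
    return sum(balances.values())

for _ in range(10_000):
    credit_escrow(8900)  # 10k concurrent-ish $89.00 credits

print(escrow_balance())  # 10_000 * 8900 == 89_000_000, regardless of shard assignment
```

The balance is exact no matter how writes were scattered; the only cost is a 32-row roll-up on read, which is the trade the article describes.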
Every component in the system can fail. The interesting question is: does that failure cause money to be lost, money to be duplicated, or just a temporary outage? The first two are unacceptable; the third is recoverable. The architecture is built so that every plausible failure mode lands in bucket three.
| What fails | What happens | How we recover |
|---|---|---|
| Payment API pod crashes mid-request | Client times out, retries with same idempotency key | Retry hits Redis or DB unique-constraint, returns original result |
| Orchestrator pod dies between steps | Workflow paused | Temporal reschedules workflow on a healthy pod, resumes at next step |
| Stripe gateway has 30-min outage | New payments fail on authorize step | Circuit breaker fails fast; client sees error; no money moved; saga not started |
| Stripe drops mid-capture | Workflow doesn't know if capture succeeded | Idempotent retry to Stripe; if still uncertain, query Stripe API for capture status |
| Ledger DB primary loses a disk | Sync replica promoted; ~30s of write blockage | Postgres synchronous replication; orchestrator retries blocked writes |
| Redis idempotency cache wiped | First retry would re-execute | DB UNIQUE constraint on Transaction.idempotency_key catches it; original row read and returned |
| Notification service down | Webhooks not delivered | Async retry with exponential backoff for 3 days; payment itself unaffected |
| Whole AZ goes down | ~30% of capacity lost | Multi-AZ deployment: traffic shifted to remaining AZs; sync replicas promoted; degraded for <5min |
**How do we prevent double charges?** An atomic set-if-absent on `idempotency:<key>` in Redis on the fast path, plus a UNIQUE constraint on `Transaction.idempotency_key` in Postgres as the bulletproof backup. Even if Redis is wiped, the DB rejects the duplicate insert and we return the original cached response. The client can mash "Buy" 100 times — only the first request runs, the rest return the same result.

**What happens when a step fails after the capture?** The saga runs `refund(captureId)` against Stripe — undoing the capture. Sarah sees a payment-failed error; her card is not charged. The audit log records all five state transitions: authorized → captured → ledger-failed → refund-issued → refund-confirmed. Money is never lost or duplicated; only briefly in flight.

**Why a double-entry ledger?** `SUM(amount_cents) FROM ledger_entry = 0` globally is a free correctness check; if it ever drifts from zero, money has been created or destroyed and we can detect it within a minute. And appending parallel rows scales; updating one balance row serializes everything on Postgres row locks.

**How do we stay out of PCI scope?** The card number is tokenized client-side into a `pm_...` token. Our backend only ever stores tokens. We don't store, process, or transmit primary account numbers — which moves us from PCI Level 1 (full audit) to PCI SAQ-A (a checklist). Backed up by network isolation, secret rotation, log redaction, quarterly ASV scans, and an annual QSA audit.

**The whole system rests on three properties:** retries are safe (idempotency), partial failures are compensated (sagas), and `SUM(all entries) = 0` always holds (the double-entry ledger). Every other component exists to support those three properties.