High-Level Design

Ad Click Aggregator

Counting every ad click exactly once at 1M/sec — the streaming pipeline, idempotency keys, and exactly-once aggregation that turn billions of clicks into trustworthy billing dollars

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is an Ad Click Aggregator?

It is 9:14 AM on a Monday. Sarah is a marketing manager running a $10,000/day Google-style campaign for a sneaker launch. She has the dashboard open in one tab and her CFO on Slack in another. Every minute she refreshes — clicks tick up from 12,402 to 12,418 to 12,447 — and the spend column rises with them. At $0.30 per click, every single click matters: count one too many and the advertiser is overcharged; count one too few and the publisher is underpaid; count the same click twice and lawyers get involved. This is the system that has to count those clicks. It must count every real click, exactly once, even when networks drop packets and servers crash, and it must surface those counts to Sarah's dashboard within seconds.

An Ad Click Aggregator is the analytics backbone of every ad network — Google Ads, Meta Ads, TikTok Ads, Criteo. It does two things: (1) ingest a firehose of click events from billions of users and ad placements, and (2) produce trustworthy aggregate counts (per ad, per minute / hour / day / lifetime) that drive both real-time dashboards and end-of-day invoicing. The "exactly once" requirement makes this radically different from, say, a like-button counter — here, mistakes are billed in dollars.

The two questions that drive every design decision below: (1) How do we guarantee a click is counted exactly once, even when the ingest network retries, app servers crash mid-write, and Kafka rewinds during recovery? (2) How do we keep dashboard latency under one minute while ingesting 1,000,000 clicks per second at peak?
Step 2

Requirements & Goals

Money is on the line, so we lock down expectations before drawing any boxes. The functional list is short — counts, queries, idempotency. The non-functional list is the harder one — exactness, throughput, durability, freshness.

✅ Functional Requirements

  • Ingest click events: (ad_id, user_id, timestamp, idempotency_key)
  • Query click count and spend per ad over time windows — 1m, 1h, 1d, all-time
  • Support multiple granularities at the same time (per-minute counters, rolled up to hour and day)
  • Idempotent ingest — same idempotency_key arriving twice is counted once
  • Surface flagged-as-fraud clicks separately so they don't pollute billing numbers

⚙️ Non-Functional Requirements

  • Exact counts — billing depends on it. Approximate is unacceptable
  • High throughput — 1M clicks/sec at peak, growing 30%/year
  • Real-time aggregation — under 1 minute lag from click to dashboard
  • Durable — no click ever lost, even during a regional outage
  • Click endpoint p99 under 50ms — the user is waiting for a redirect to the landing page
The non-functional requirements are the entire design. Counting clicks is trivial — counter++ in any language. Counting them exactly while the world is on fire — networks drop, retries fire, replicas lag, fraudsters spam — is what forces every architectural choice we are about to make.
Step 3

Capacity Estimation & Constraints

Numbers drive sizing of every tier — Kafka partitions, Flink parallelism, Redis cluster, S3 buckets. Run them out loud, even if rough.

Traffic estimates

Assume 10 million active ads on the platform. At peak hour we see 1,000,000 clicks/sec (think Black Friday, World Cup final). Average sustained load is closer to 200K/sec. Each click event is roughly 100 bytes on the wire (ad_id, user_id, timestamp, idempotency_key, IP, user_agent stub).

| Metric | Value | Note |
|---|---|---|
| Peak ingest | 1,000,000 clicks/sec | Black Friday / World Cup spike |
| Sustained | ~200,000 clicks/sec | Daily average |
| Peak bandwidth | ~100 MB/sec | 1M × 100 bytes |
| Daily volume | ~8.6 TB/day | 1M × 86,400 × 100 bytes — provisioned at peak rate for headroom |

Storage estimate

Raw events are kept 30 days for billing reconciliation and audits, then aged out to cheaper cold storage or deleted. Per-day aggregate counters are kept forever — they are tiny; per-minute and per-hour rows are aged out alongside the raw events.

| Tier | Size | Note |
|---|---|---|
| Raw events (30d) | ~250 TB | 8.6 TB × 30 days, stored as Parquet on S3 with snappy compression — actually closer to 80 TB compressed |
| Aggregates | ~60 GB/year | 10M ads × 365 days × ~16 bytes per per-day counter — tiny next to raw events |
| Hot cache | ~20 GB | Top 1% of ads × recent windows in Redis ZSETs for sub-second dashboard reads |

| Metric | Value | Why it matters |
|---|---|---|
| Peak ingest | 1M clicks/sec | Drives Kafka partition count and Flink parallelism |
| Daily volume | ~8.6 TB/day | Sizes the S3 cold tier and reconciliation Spark cluster |
| 30-day raw retention | ~80 TB compressed | Cost driver — Parquet + snappy keeps it manageable |
| Click endpoint p99 | <50 ms | Forces async ingest — Kafka publish, no DB write inline |
| Dashboard freshness | <1 min | Forces stream processing, not batch ETL |
Step 4

System APIs

Two endpoints carry the load: a write endpoint that records each click (called millions of times per second) and a read endpoint that powers Sarah's dashboard (called when she refreshes).

REST API surface
// Write — 1M req/sec at peak, must respond <50ms with 302 redirect
GET /click?ad_id=A123&user=u9&ts=1714742400&k=550e8400-e29b-41d4-a716-446655440000
// k = idempotency_key, generated client-side as a UUID v4
→ 302 Found  Location: https://advertiser.com/landing-page

// Read — low QPS, served from Redis aggregate store
GET /api/v1/metrics?ad_id=A123&window=1h&granularity=1m
→ 200 OK
{
  "ad_id":   "A123",
  "window":  "1h",
  "series":  [
    { "t": "2026-05-07T14:00:00Z", "clicks": 1240, "spend_usd": 372.00 },
    { "t": "2026-05-07T14:01:00Z", "clicks": 1318, "spend_usd": 395.40 },
    ...
  ],
  "as_of":   "2026-05-07T15:00:42Z",
  "label":   "estimated"   // real-time path; canonical billing comes from S3 reconciliation
}

// Internal — billing reconciliation produces canonical numbers
GET /api/v1/billing?ad_id=A123&day=2026-05-07
→ 200 OK  { "clicks": 28412, "spend_usd": 8523.60, "label": "canonical" }
Why a GET, not a POST, on the click endpoint? The click is a browser following an <a href> link — the browser fires a GET, then a 302 redirects it to the landing page. We never hold a JSON body or wait for a response — the user is mid-click and any latency feels like the ad is broken. The endpoint must be the lightest thing in the system: validate, publish to Kafka, return 302. No DB touched.
The idempotency_key is the most important parameter on the page. The browser generates a UUID v4 the moment the click happens (or the ad SDK does). If the network drops the request and the browser retries, the second request carries the same key. Downstream — both at the API server and inside Flink — we see the duplicate and drop it. Without this, retries silently double-charge advertisers. With it, we are safe.
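A minimal sketch of the client-side key mint, in Python for readability (in production this runs in browser JS or the mobile SDK; the host `ads.example.com` is a placeholder). The crucial property is that the key is generated once per physical tap — a network retry reuses the same URL, so downstream layers can recognize the duplicate:

```python
import uuid
from urllib.parse import urlencode

def build_click_url(ad_id: str, user_id: str, ts: int) -> str:
    """Mint a fresh idempotency key and build the click URL.
    One physical tap = one key; retries of the request reuse the same URL."""
    key = str(uuid.uuid4())  # UUID v4, minted at the moment of the click
    qs = urlencode({"ad_id": ad_id, "user": user_id, "ts": ts, "k": key})
    return f"https://ads.example.com/click?{qs}"
```

Two separate taps produce two different keys — which is exactly the distinction the server cannot make on its own.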
Step 5 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart under real money, and the production shape where every box justifies itself with a concrete failure it prevents.

Pass 1 — The naive design (and why it breaks)

The simplest design any junior engineer would draw: the click endpoint takes the request, opens a Postgres transaction, runs UPDATE clicks SET count = count + 1 WHERE ad_id = ?, commits, returns the 302. One server, one database, one row per ad.

flowchart LR
    C["Browser"] --> APP["App Server"]
    APP --> DB[("Postgres — counters table")]
    DASH["Dashboard"] --> DB
    style C fill:#e8743b,stroke:#e8743b,color:#fff
    style APP fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style DB fill:#171d27,stroke:#38b265,color:#d4dae5
    style DASH fill:#171d27,stroke:#9b72cf,color:#d4dae5

Three concrete failures emerge the moment real traffic shows up — and each one maps directly to a component we will introduce in Pass 3:

💥 Lock storm on hot rows

1M clicks/sec means 1M UPDATEs/sec. If 10K of those hit the same viral ad's row, Postgres serializes them via row locks — throughput on that row collapses to a few thousand per second. The dashboard query is also blocked behind the write queue. Within a minute, the entire pipeline is melted.

💥 Retries cause double-counting

A flaky cellular network retries the click request. The first one succeeded but the response was lost. The second one increments the counter again. The advertiser is overcharged $0.30 per duplicate, multiplied by millions of clicks, multiplied by every ad in the system. Auditors find the discrepancy at month-end. Lawsuit.

💥 Dashboard reads collide with writes

Sarah refreshes her dashboard. Postgres scans the counters table while millions of writes are in flight. Either reads block writes (latency spike, click endpoint times out) or writes block reads (her dashboard hangs). Either way, the experience is broken when it matters most — during a campaign launch.

Pass 2 — The mental model: stream-based exactly-once aggregation with idempotency keys

Three insights, drilled in until they are obvious. They form the spine of the entire production design.

🎟️ Idempotency keys

Every click event carries a UUID generated at the source — the browser or the ad SDK. Think of it as a receipt number: the same physical purchase, even if the receipt is photocopied a hundred times, is still one purchase. Anywhere downstream, if we see the same key twice, we drop the duplicate. This is what makes "at-least-once delivery" (which is all the network can guarantee) safe to use.

🚦 Decouple write path from aggregation

The click endpoint must be fast and durable — but it does not need to be aggregated yet. We use Kafka as a durable buffer between ingest and aggregation. The endpoint publishes to Kafka and returns; a separate stream processor consumes Kafka at its own pace and computes counters. If aggregation lags by 30 seconds during a spike, no clicks are lost — they pile up in Kafka until the consumer catches up.

🏦 Two output paths — fast vs canonical

Real-time aggregates flow from Flink to Redis for dashboards (estimated, sub-second freshness). Raw events flow from Kafka to S3 in Parquet, where a daily Spark job re-aggregates and produces the canonical billing number (the one the advertiser is invoiced from). Real-time is "what's happening right now". Canonical is "what we charge you". Like a credit card pending charge versus the posted transaction.

This three-way split lets the click endpoint stay sub-50ms, the dashboard stay sub-minute fresh, and the billing system stay forensically auditable — without any one component compromising for the others.

Pass 3 — The production shape

Now the full picture. Every box is numbered ①–⑫ — find its matching card below to see what it does, why it earned its place, and what would break if we removed it tomorrow.

flowchart TB
    AD["① Ad Render — browser JS / mobile SDK"]
    subgraph INGEST["Ingest Plane"]
        CE["② Click Endpoint — 302 redirect"]
        LB["③ Load Balancer — L7"]
        API["④ Click Ingestion API"]
        IDC["⑦ Idempotency Cache — Redis SETNX"]
    end
    subgraph STREAM["Stream Plane"]
        KAF[("⑤ Kafka — topic clicks, 10K partitions by ad_id")]
        FL["⑥ Flink — exactly-once aggregator"]
        FRAUD["⑪ Fraud Detection — separate Flink job"]
    end
    subgraph STORAGE["Storage Plane"]
        AGG[("⑧ Aggregate Store — Redis ZSET / DynamoDB")]
        S3[("⑨ Cold Storage — S3 Parquet, 30d raw")]
        DASH["⑩ Dashboard API"]
    end
    subgraph BILL["Billing Plane"]
        SPARK["⑫ Billing Reconciliation — daily Spark job"]
    end
    AD -->|"GET /click + idempotency_key"| CE
    CE --> LB
    LB --> API
    API -->|"SETNX dedup"| IDC
    API -->|"publish"| KAF
    KAF --> FL
    KAF --> FRAUD
    KAF -->|"sink"| S3
    FL -->|"checkpointed counters"| AGG
    AGG --> DASH
    S3 --> SPARK
    FRAUD -.flagged keys.-> SPARK
    style AD fill:#e8743b,stroke:#e8743b,color:#fff
    style CE fill:#171d27,stroke:#e8743b,color:#d4dae5
    style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style API fill:#171d27,stroke:#e8743b,color:#d4dae5
    style IDC fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style KAF fill:#171d27,stroke:#d4a838,color:#d4dae5
    style FL fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style FRAUD fill:#171d27,stroke:#e05252,color:#d4dae5
    style AGG fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style S3 fill:#171d27,stroke:#38b265,color:#d4dae5
    style DASH fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style SPARK fill:#171d27,stroke:#38b265,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.

Ad Render — Browser / Mobile SDK

The thing that actually shows the ad and detects the click. On the web, it is a snippet of JavaScript that wraps the <a> tag; on mobile, it is the publisher's SDK (Google AdMob, Facebook Audience Network style). The moment the user taps, this layer mints a fresh UUID v4 idempotency_key, builds the click URL, and sends the browser to it.

Solves: the source-of-truth for "this is one click". Without a client-generated idempotency key, the server has no way to distinguish "the user clicked twice in frustration" from "the network retried the same click twice" — the first must count as two, the second as one. Only the client knows.

Click Endpoint

A tiny stateless HTTP service whose only purpose is to be fast. Per request: parse query params, return a 302 redirect to the landing page, and asynchronously hand the click to the ingestion API. Latency budget: under 50ms p99, including TLS termination. Built in Go or Rust, deployed in dozens of regions.

Solves: the user-perceived latency problem. The user is mid-click and the browser is waiting for our redirect. If we did anything heavy here (DB write, RPC, fraud scoring), they would experience the ad as broken. Splitting this from the ingestion API lets the redirect path stay laser-focused.

Load Balancer (L7)

Public-facing application load balancer (AWS ALB, GCP Cloud Load Balancer, or nginx). Distributes click traffic across the click endpoints and ingestion API pods. Terminates TLS, runs health checks every 5s, evicts unhealthy pods automatically.

Solves: single-point-of-failure on the ingest tier. Without an LB, one app crash takes down a slice of the world's ad clicks. With it, traffic shifts in seconds and we lose 1/N of capacity rather than 100%.

Click Ingestion API

Stateless service downstream of the click endpoint. Per request: validate the click (does ad_id exist? is the request well-formed?), enrich it (look up campaign metadata, attach server timestamp, derive country from IP), check the Idempotency Cache ⑦ to drop in-flight retries, then publish the event to Kafka ⑤. Returns to the click endpoint as soon as the publish is acknowledged — Kafka durability is what makes the click "saved".

Solves: isolating heavy validation and enrichment from the latency-critical 302 path. The redirect goes back to the user immediately; ingestion happens on a parallel goroutine. Without this split, every click would carry the full validation cost in user-visible latency.

Kafka — topic clicks

The durable buffer at the heart of the system. The clicks topic is partitioned by ad_id into ~10,000 partitions, replicated 3x across availability zones. At 1M clicks/sec across 10K partitions that's 100 events per partition per second — each broker comfortably handles it. Retention is 7 days (long enough to replay a Flink restart from any reasonable failure).

Solves: three things at once. (1) Durability — once Kafka acks the publish, the click is safe even if every downstream consumer dies. (2) Decoupling — Flink can lag during a spike and Kafka holds the backlog without dropping events. (3) Replay — exactly-once semantics rely on being able to rewind to a known offset, which only a log-structured store like Kafka cleanly supports. Without Kafka, we would need to reinvent all three of these in custom code, badly.
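To make the partition math concrete — a hedged sketch of key-based partition assignment (Kafka's default partitioner actually uses murmur2; the md5 stand-in here is purely illustrative of the property that matters: every event for one ad_id lands on the same partition, which is also what makes the hot-ad problem in Step 9 possible):

```python
import hashlib

NUM_PARTITIONS = 10_000  # from the capacity estimate above

def partition_for(ad_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the ad_id → partition index (illustrative, not murmur2)."""
    h = int.from_bytes(hashlib.md5(ad_id.encode()).digest()[:8], "big")
    return h % num_partitions
```

At 1M clicks/sec over 10K partitions, the average partition sees 100 events/sec — trivial per broker.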

Stream Processor — Apache Flink

The brain of the aggregator. Flink consumes Kafka with exactly-once semantics (transactional consumer + checkpointing every 10 seconds), maintains per-ad counters in RocksDB-backed state, and emits aggregated rows to the Aggregate Store ⑧. It dedupes on idempotency_key within a sliding window, runs tumbling windows for 1m / 1h / 1d granularities in parallel, and rewinds gracefully on a task-manager failure.

Solves: the exactly-once-aggregation problem. A naive consumer might count an event twice if it crashes between updating the counter and committing the Kafka offset. Flink's transactional consumer atomically commits both the offset and the state change — guaranteeing that on recovery, every input event has been counted exactly once. Without Flink (or a Spark Structured Streaming equivalent), we would either lose clicks on crashes or double-count them on retries.

Idempotency Cache (Redis)

A Redis cluster running SET idempotency_key 1 NX EX 300 on every incoming click — SETNX semantics plus a TTL in one atomic command (plain SETNX cannot attach an expiry). If the SET is rejected (key already exists), the click is a duplicate and the API drops it before publishing to Kafka. A 5-minute TTL is enough to catch any plausible retry storm without holding state forever.

Solves: dropping in-flight retries early. Flink also dedupes downstream — but the further upstream we catch a duplicate, the cheaper it is. A retry caught at the API costs one Redis lookup; a retry that reaches Flink costs Kafka bandwidth, broker storage, and consumer CPU. Without this cache, the system still produces correct counts, but Kafka throughput would carry 5–10% noise that Flink has to filter.

Aggregate Store (Redis ZSET / DynamoDB)

The serving layer for dashboards. Per-ad, per-window counters keyed by (ad_id, granularity, time_bucket). Flink writes here every checkpoint; the dashboard reads from here. Redis sorted sets are great for "give me the last 60 minutes of ad A123" range queries; DynamoDB is the alternative when we need infinite retention and don't mind millisecond reads.

Solves: sub-second dashboard reads. If the dashboard queried Flink state directly, it would compete with the streaming pipeline for resources. Materializing aggregates into a dedicated read store gives Sarah's dashboard sub-100ms responses without putting any pressure on the streaming hot path.

Cold Storage (S3 Parquet)

Every raw click event flows from Kafka to S3 via a Kafka Connect sink, written as Parquet files partitioned by date=YYYY-MM-DD/hour=HH/. With snappy compression, our 8.6 TB/day of raw events shrinks to roughly 2–3 TB/day on disk. 30-day retention before transition to Glacier for compliance archiving.

Solves: two things. (1) The canonical record-of-truth for billing — Spark reads these files end-of-day to produce the invoiced number. (2) Auditability — when an advertiser disputes a charge, we can point to the exact raw event and the timestamp it landed in our system. Without S3, the only history of a click would be a number in Redis, which is no defense in court.

Dashboard API

The HTTP service that powers Sarah's UI. Reads from the Aggregate Store ⑧, formats the response, slaps an "as_of" timestamp and an "estimated" label on it. Range queries like "last 24 hours of ad A123" become a ZRANGEBYSCORE in Redis — sub-50ms.

Solves: giving the UI a stable contract. The aggregate store schema may evolve (new windows added, storage swapped from Redis to DynamoDB), but the dashboard API surface stays the same. Without this layer, every UI rev would have to know the internal state-store details.

Fraud Detection — separate Flink job

A second Flink job consuming the same Kafka topic. It does not aggregate counts — it watches for suspicious patterns: same IP firing 1,000 clicks in 10 seconds, same user_id clicking the same ad 50 times an hour, click timestamps that are too perfectly spaced (bot signature). Flagged idempotency_keys are written to a "fraud" topic which the billing job ⑫ subtracts at reconciliation time.

Solves: protecting advertisers from being billed for fraudulent traffic without slowing the main aggregation path. Running fraud detection inline with counting would make the dashboard wait on rule evaluation; running it as a side-branch from Kafka means the main path stays fast and fraud is corrected in the canonical billing number.

Billing Reconciliation — daily Spark job

A nightly Spark job that reads the day's raw events from S3, joins against the fraud-flagged keys from ⑪, dedupes on idempotency_key, and produces the canonical per-ad clicks-and-spend number. This is the file that drives the invoice. Slow but exact — runs overnight on a transient cluster, takes 2 hours.

Solves: the exactness contract. Real-time aggregates can drift slightly (Flink restarts, Redis evictions, late-arriving events). Spark over the immutable S3 record produces a number that is guaranteed reproducible — run it twice and you get the same answer to the cent. This is the only number an auditor will trust. Without it, "exact billing" is a hope, not a proof.
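The nightly job in miniature. Pure Python stands in for Spark here (an assumption for readability — the real job runs the same three steps as DataFrame operations over Parquet on S3): drop fraud-flagged keys, dedupe on (ad_id, idempotency_key), count, price:

```python
def reconcile(events: list[dict], fraud_keys: set[str],
              price_usd: float = 0.30) -> dict[str, dict]:
    """Canonical per-ad clicks and spend for one day of raw events."""
    seen: set[tuple[str, str]] = set()
    counts: dict[str, int] = {}
    for e in events:
        if e["key"] in fraud_keys:          # subtract flagged clicks
            continue
        pair = (e["ad_id"], e["key"])
        if pair in seen:                    # dedupe on idempotency_key
            continue
        seen.add(pair)
        counts[e["ad_id"]] = counts.get(e["ad_id"], 0) + 1
    return {ad: {"clicks": n, "spend_usd": round(n * price_usd, 2)}
            for ad, n in counts.items()}
```

Because the input (S3) is immutable and the logic is deterministic, running it twice yields the identical invoice — the reproducibility property the section is about.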

Concrete walkthrough — Sarah clicks an ad at 14:02:06

Two real flows, mapped to the numbered components above. Watch how the same click flows down two paths in parallel — the fast path to dashboards, and the durable path to canonical billing.

⚡ Real-time path — Sarah's click reaches her own dashboard in 8 seconds

  1. Sarah taps the sneaker ad on her phone at 14:02:06. The Ad SDK ① mints UUID 550e8400-..., builds the click URL.
  2. Browser GETs the Click Endpoint ② via Load Balancer ③. Endpoint returns 302 to the sneaker landing page within 22ms — Sarah is already loading the product page.
  3. In parallel, the Click Ingestion API ④ validates, runs SETNX on the Idempotency Cache ⑦ (returns 1, first time), publishes the event to Kafka ⑤ partition ad_id=A123. Acked in 4ms.
  4. Flink ⑥ consumes the event ~3s later, dedupes again on the key (still first time), increments the per-minute counter for ad A123 in its RocksDB state, and on the next 10s checkpoint flushes the updated counter to the Aggregate Store ⑧.
  5. Sarah refreshes her dashboard at 14:02:14. The Dashboard API ⑩ reads the latest counter from Redis. The "14:02" bar in her chart now shows 12,448 clicks — one of which is hers. Total elapsed: 8 seconds.
  6. Same event also lands in S3 ⑨ via Kafka Connect, batched into the 14:00–14:59 Parquet file.

🏦 Billing path — the same click is invoiced at 23:00

  1. At 23:00 the daily Spark job ⑫ spins up a transient EMR cluster and reads s3://clicks/date=2026-05-07/ — every Parquet file from the day, ~80 GB compressed.
  2. Spark groups by (ad_id, idempotency_key) first to dedupe, then groups by ad_id to count.
  3. It joins against the Fraud topic ⑪ to subtract any keys flagged during the day. Sarah's click is real, so it survives.
  4. Output: a canonical row (ad_id=A123, day=2026-05-07, clicks=28412, spend_usd=8523.60) written to a billing table. This is the number on the advertiser's invoice.
  5. If the dashboard's real-time number for the day was 28,415, the 3-click discrepancy is logged. Below 0.1% drift is normal (late-arriving events, fraud subtractions). Above 1% triggers a paging alert.
So what: the system is built around three insights — (1) idempotency keys turn unreliable retries into safe at-least-once delivery; (2) Kafka + Flink turn at-least-once into exactly-once aggregation; (3) two output paths (Redis for fast, S3+Spark for canonical) let dashboards be quick and billing be exact, without compromising either. Every box in the diagram exists to remove one of the failure modes from Pass 1.
Step 6

Idempotency — The Core Design Decision

Of every choice in this design, the idempotency key is the one that makes everything else possible. It is also the choice that junior designs miss most often. Here is why it matters and exactly how it flows through the system.

The fundamental problem: networks are unreliable. The browser sends the click; the server receives it; the response is dropped on the way back. The browser, having no way to know whether the request succeeded, retries. Now the server has received the same click twice. Without an idempotency key, every retry doubles the count. With one, the second arrival is recognized and dropped.

The dedup chain — same key, four checkpoints

sequenceDiagram
    actor U as User Click
    participant SDK as Ad SDK / Browser
    participant API as Click Ingestion API
    participant RC as Redis dedup
    participant K as Kafka
    participant F as Flink
    U->>SDK: tap
    SDK->>SDK: mint UUID v4 = "550e..."
    SDK->>API: GET /click?k=550e...
    API->>RC: SET 550e... 1 NX EX 300
    RC-->>API: OK (first time)
    API->>K: produce {ad, key=550e...}
    K-->>API: ack
    API-->>SDK: 302 redirect
    Note over SDK,API: Network drops response · browser retries
    SDK->>API: GET /click?k=550e... (retry)
    API->>RC: SET 550e... 1 NX EX 300
    RC-->>API: nil (already exists)
    API-->>SDK: 302 redirect (no Kafka publish)
    K->>F: consume event 550e...
    F->>F: dedup key in window state
    F->>F: increment ad counter

The key is checked at four checkpoints, each with a different role:

🎟️ Checkpoint 1 — Client mints the key

The instant the user taps, the SDK generates a UUID v4. Same physical tap, same key, no matter how many times the request is retried. This is what makes "the same click" a well-defined concept across the network.

🚪 Checkpoint 2 — API hits Redis SET NX

Redis accepts the SET NX on first arrival (we proceed) and rejects it on a duplicate (we silently drop). A 5-minute TTL covers any plausible retry window. Catches 99% of duplicates before they cost any downstream resources.

📜 Checkpoint 3 — Flink dedups in window state

Even if a duplicate slips past Redis (cache eviction, late retry past TTL), Flink keeps a per-window Set<idempotency_key> in RocksDB state and skips counting any key it has seen before. This is the true guarantee — Redis is just a fast pre-filter.

🏦 Checkpoint 4 — Spark dedups at billing

The nightly Spark job groups raw events by idempotency_key before counting. This is the final, canonical dedup — even if Flink double-counted by some bug, the billing number is still correct because Spark sees the truth in S3.
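Checkpoint 3 can be sketched as a per-window seen-key set — a dict-backed stand-in (an assumption; real Flink keeps this in RocksDB-backed keyed state and clears it when the window expires):

```python
from collections import defaultdict

class WindowDeduper:
    """Per-minute-window dedup plus counting, as the Flink job would do."""
    def __init__(self):
        self.seen = defaultdict(set)      # minute_bucket -> {idempotency_key}
        self.counts = defaultdict(int)    # (ad_id, minute_bucket) -> clicks

    def process(self, ad_id: str, key: str, ts: int) -> bool:
        bucket = ts // 60                 # tumbling 1-minute window
        if key in self.seen[bucket]:
            return False                  # duplicate slipped past Redis; drop
        self.seen[bucket].add(key)
        self.counts[(ad_id, bucket)] += 1
        return True
```

This is why Redis eviction is harmless to correctness: any duplicate Redis misses is still caught here.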

The interview move: when asked "how do you handle a flaky network where requests retry?" — answer "client-side idempotency keys, deduped at four layers". Then walk through the four layers. This shows you understand defense in depth rather than relying on any single component being perfect.
Step 7

Exactly-Once with Flink

"Exactly-once" is one of the most over-promised phrases in distributed systems. What Flink + Kafka actually give you is exactly-once effect on aggregated state — meaning, no matter how many times an input event flows through the pipeline due to retries or crashes, its contribution to the output counter is recorded once. Here is how.

The three pieces that compose exactly-once

1️⃣ Transactional Kafka consumer

Flink consumes Kafka inside a transaction. The Kafka offset commit and the state update are part of the same atomic operation — either both happen or neither does.

2️⃣ Periodic checkpoints

Every 10 seconds, Flink snapshots its state (the per-ad counters, the seen-idempotency-key set) plus the corresponding Kafka offsets to durable storage (S3 or HDFS).

3️⃣ Replay-from-checkpoint on failure

If a Flink task crashes, the job restarts, loads the latest checkpoint, and resumes consuming Kafka from the saved offset. Any events between the crash and the last checkpoint are re-read and re-counted, but because the state was rolled back too, no double-counting occurs.

The mental model: exactly-once is a bank deposit. The teller stamps the deposit slip and updates the ledger in one atomic step. If they faint mid-deposit, you can replay the transaction safely — the slip and the ledger are either both updated or both not. Flink's checkpoint is the stamp; Kafka's offset is the slip; RocksDB state is the ledger.
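The bank-deposit model as a toy simulation, with in-memory stand-ins throughout (an illustrative sketch, not Flink's actual API): state and Kafka offset are snapshotted as one atomic pair, so recovery restores both together and replays only the tail — every event's contribution lands exactly once:

```python
def run(log: list[str], checkpoint_every: int = 3):
    """Consume a log of ad_ids, checkpointing (state, offset) atomically."""
    state: dict[str, int] = {}
    snap = ({}, 0)                        # last checkpoint: (state copy, offset)
    offset = 0
    while offset < len(log):
        state[log[offset]] = state.get(log[offset], 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            snap = (dict(state), offset)  # slip and ledger stamped together
    return state, snap

def recover_and_finish(log: list[str], snap):
    """After a crash: restore the checkpoint, replay only from its offset."""
    state, offset = dict(snap[0]), snap[1]
    for ad in log[offset:]:
        state[ad] = state.get(ad, 0) + 1
    return state
```

Events between the checkpoint and the crash are re-read, but because the counters were rolled back with the offset, the recovered totals match a crash-free run.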

Why not Spark Structured Streaming? Spark also offers exactly-once, but its micro-batch model has higher minimum latency (30s+ end-to-end vs Flink's sub-second). For a billing system where dashboards need under-1-minute lag, Flink's true streaming wins. For pure batch reconciliation (the nightly job), Spark wins on cost and ecosystem maturity.
Step 8

Sliding Windows & Granularity

The dashboard needs counts for multiple time windows simultaneously — "last minute", "last hour", "today", "all-time". Two strategies, with different cost and freshness profiles.

🪟 Strategy A — Multiple parallel tumbling windows in Flink

Flink runs several tumbling windows in parallel — 1m, 1h, 1d. Each window emits a counter row to the Aggregate Store when it closes. Pros: the dashboard reads pre-aggregated rows directly, sub-millisecond latency. Cons: 3× the state size and 3× the write volume to the aggregate store.

📊 Strategy B — Per-minute counters, roll up at query time

Flink only emits per-minute counters. The dashboard API computes "last hour" by summing the last 60 minute-buckets, "today" by summing 1440. Pros: 1/3 the state, simpler windowing. Cons: every dashboard read does a fan-out sum — fine in Redis (60 ZSCOREs are sub-ms), problematic at higher granularity.

The production choice is hybrid: Flink emits per-minute and per-day counters; the dashboard rolls up per-hour from per-minute (60 cells) and "all-time" from per-day. Per-minute is what dashboards refresh against; per-day is what survives forever for historical reporting. Per-hour is just a sum over 60 cells — cheap.
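The Strategy B half of that hybrid — rolling up an hour at query time — is a 60-bucket sum. A sketch with a plain dict standing in for the Redis minute-counters (an assumption; in Redis this is 60 ZSCOREs or one ZRANGEBYSCORE):

```python
def rollup_hour(minute_counters: dict[int, int], end_minute: int) -> int:
    """Sum the 60 per-minute buckets ending at end_minute (inclusive).
    Minutes with no clicks simply have no bucket and count as zero."""
    return sum(minute_counters.get(m, 0)
               for m in range(end_minute - 59, end_minute + 1))
```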

Step 9

The Hot Ad Problem

Picture a Super Bowl ad that gets 100,000 clicks per second for 60 seconds. Our Kafka topic is partitioned by ad_id, so all 100K clicks land on the same partition. One Flink task handles that partition. That single task is now doing roughly 1,000× its share of work (100K events/sec against a ~100/sec per-partition average), while the other 9,999 partitions are nearly idle.

This is the classic hot key problem in stream processing. Two solutions, both used in production at different scales.

🧂 Solution A — Salting (key sub-sharding)

Instead of partitioning by ad_id, partition by (ad_id, random_salt 0-99). Now a single ad's clicks fan out across 100 partitions. Flink aggregates per (ad, salt) in parallel, then a second step merges the 100 sub-counts into the final per-ad counter. Pros: scales linearly. Cons: requires a two-stage pipeline (sub-aggregate, then re-aggregate).
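The two-stage shape of Solution A, sketched with in-memory stand-ins (the salt count of 100 and random salting are illustrative assumptions): stage 1 counts per (ad_id, salt) in parallel, stage 2 merges the sub-counts back to per-ad totals:

```python
import random
from collections import Counter

NUM_SALTS = 100

def salted_key(ad_id: str) -> tuple[str, int]:
    """Spread one hot ad's events across NUM_SALTS partition keys."""
    return (ad_id, random.randrange(NUM_SALTS))

def aggregate(events: list[str]) -> Counter:
    sub = Counter(salted_key(ad) for ad in events)   # stage 1: per (ad, salt)
    merged = Counter()
    for (ad, _salt), n in sub.items():               # stage 2: merge the salts
        merged[ad] += n
    return merged
```

The merge is exact — salting changes where the work happens, never the totals.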

🌳 Solution B — Hierarchical aggregation

Local pre-aggregation at the ingestion API layer — buffer clicks for 1 second, send {ad_id, count: N} instead of N individual events. Reduces Kafka volume 100x for hot ads at the cost of 1s extra dashboard lag. Effectively the same idea as a database write-back cache.

Detect-and-react: in the steady state, plain ad_id partitioning is fine. Auto-scale by monitoring per-partition lag — when one partition's lag exceeds a threshold, dynamically rekey it with salt. Most days, no salting is needed; on Super Bowl day, the system shifts gear automatically.
Step 10

Billing Reconciliation

Why have two aggregation paths? Because the real-time dashboard and the invoice serve different masters.

⚡ Real-time aggregates — for humans, labeled "estimated"

Dashboards refresh every minute. Drift of ±0.1% is fine — Sarah is monitoring trend, not balancing books. We trade exactness for freshness. Counts are eventually correct as Flink catches up.

🏦 Canonical billing — for invoices, exact to the cent

Daily Spark job over immutable S3 events. Reproducible — run it again next year and you get the same number. This is what the advertiser is charged. Slow (2 hours) but provably exact.

Drift monitoring

At end of day, compare the real-time number against the canonical number per ad. Acceptable drift is <0.5%. If drift exceeds 1%, page on-call — something in the streaming pipeline is misbehaving (Flink restart that lost late events, Redis eviction during a memory pressure spike, etc.). The drift dashboard is itself a critical piece of operational tooling.
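The per-ad drift check is a one-liner worth writing down — a sketch using the thresholds stated above (below 0.5% normal, above 1% pages on-call; the in-between "warn" band is an assumed convention):

```python
def drift_status(realtime: int, canonical: int) -> str:
    """Compare the dashboard's number against the Spark canonical number."""
    if canonical == 0:
        return "page" if realtime else "ok"
    drift = abs(realtime - canonical) / canonical
    if drift > 0.01:
        return "page"        # >1% — something in the pipeline is misbehaving
    if drift > 0.005:
        return "warn"        # 0.5-1% — log it and watch the trend
    return "ok"              # <0.5% — normal late-event / fraud-subtraction noise
```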

Transparency in the UI: every dashboard number carries an "as_of" timestamp and an "estimated" label. Yesterday's row, after the nightly reconciliation, flips its label to "final". Advertisers have learned to trust the contrast — "today's number is approximate, yesterday's is invoice-grade".
Step 11

Fraud Detection

Click fraud is real and expensive. Bot farms, click rings, competitors burning a rival's daily budget — every ad network deals with it. Our fraud detection runs as a side-branch off Kafka, not inline with counting, so the main pipeline stays fast and the fraud rules can be updated without redeploying the aggregator.

Signals the Flink fraud job watches for

🌐 IP velocity

Same IP fires more than N clicks per minute across all ads. Threshold tuned by ASN — a corporate NAT looks different from a residential ISP.

👤 User-ad repetition

Same user_id clicks the same ad_id more than N times per hour. A real user clicks an ad once or twice; a bot clicks it 200 times.

⏱️ Timestamp regularity

Click timestamps too perfectly spaced (every 4.2 seconds, ±0.1s). Real human clicks are jittery; bots are not.

🤖 User-agent patterns

Headless browsers, outdated mobile UAs, missing referrer headers. Each is weak alone; correlated across an IP, they become a strong signal.

🌍 Geo-IP mismatch

Click claims to be from a US user but IP is a known data-center range in Eastern Europe. Auto-flag for review.

🎯 Conversion rate cliff

A campaign suddenly gets 10x more clicks than baseline but zero conversions. Either incredibly bad ad copy or a click-fraud attack. Alert the campaign owner.
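As one concrete example, the timestamp-regularity signal can be implemented as a coefficient-of-variation check on inter-click gaps (the 0.05 threshold is an illustrative assumption, not a production-tuned value):

```python
from statistics import mean, stdev

def too_regular(timestamps: list[float], cv_threshold: float = 0.05) -> bool:
    """Flag click streams whose inter-arrival gaps are suspiciously uniform.
    A bot firing every ~4.2s has a near-zero coefficient of variation;
    human click streams are jittery (CV well above the threshold)."""
    if len(timestamps) < 3:
        return False  # not enough gaps to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    cv = stdev(gaps) / mean(gaps)  # relative spread of the gaps
    return cv < cv_threshold

bot   = [i * 4.2 for i in range(50)]            # metronome-regular
human = [0, 3.1, 9.8, 11.2, 31.0, 34.4, 60.9]   # bursty, jittery
print(too_regular(bot), too_regular(human))     # True False
```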

Flagged idempotency_keys are written to a fraud Kafka topic. The billing reconciliation job ⑫ reads this topic and subtracts flagged clicks from the canonical invoice. Real-time dashboards can also surface a "suspected fraud" overlay so Sarah sees gross vs net counts.

Step 12

Data Partitioning

Each storage layer has its own partitioning scheme — chosen to match the access pattern of that layer, not the layer above or below.

| Layer | Partition key | Why |
| --- | --- | --- |
| Kafka clicks topic | ad_id | All clicks for one ad land on one Flink task — trivial in-task aggregation. Hot ads handled via salting (§9). |
| Flink state | (ad_id, window_start) | Counters are partitioned the same way Kafka is, so Flink's keyed state lines up with input partitions — no shuffle. |
| Aggregate Store (Redis) | ad_id (consistent hash) | Dashboard reads a single ad's series — that ad's data lives on one Redis shard. ZRANGE is local. |
| S3 Parquet | date=YYYY-MM-DD/hour=HH/ | Spark scans a day or hour at a time — date-prefix partitioning lets it skip everything else (partition pruning). |
| Idempotency Cache | idempotency_key hash | Even spread across Redis cluster nodes; SETNX is single-node, no cross-shard transactions needed. |
The recurring theme: partition by the field you query by. Kafka and Flink partition by ad_id because aggregation is per-ad. S3 partitions by date because reconciliation is per-day. Redis partitions by ad_id again because dashboards are per-ad. Copying one layer's scheme onto another (e.g., partitioning S3 by ad_id just because Kafka is) would be a disaster — Spark would fan out across millions of tiny files.
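To make the per-layer keys concrete, here is an illustrative mapping of one click onto its Kafka partition and S3 prefix (the MD5 hash is a stand-in for demonstration — Kafka's default partitioner actually uses murmur2 on the key bytes):

```python
import hashlib
from datetime import datetime, timezone

def kafka_partition(ad_id: str, n_partitions: int = 256) -> int:
    """Keyed partitioning: the same ad_id always maps to the same partition."""
    h = int(hashlib.md5(ad_id.encode()).hexdigest(), 16)
    return h % n_partitions

def s3_prefix(ts: float) -> str:
    """Date-prefix layout: Spark prunes to one day/hour instead of a full scan."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"date={dt:%Y-%m-%d}/hour={dt:%H}/"

click = {"ad_id": "ad_42", "ts": 1700000000.0}
print(kafka_partition(click["ad_id"]))  # stable: same ad -> same partition
print(s3_prefix(click["ts"]))           # date=2023-11-14/hour=22/
```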
Step 13

Fault Tolerance

What survives what kind of failure? Walk through the failure scenarios that matter.

💥 An app server crashes mid-request

Browser sees a connection reset, retries with the same idempotency key. Redis SETNX on the second arrival returns 0, the click is dropped (correctly — it was already published before the crash). If the crash happened before Kafka publish, the key was never set in Redis either, so the retry succeeds normally.
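A minimal sketch of that publish-then-mark flow, with in-memory stand-ins for Redis and Kafka (`MockRedis` and `handle_click` are illustrative names, not real client APIs):

```python
class MockRedis:
    """In-memory stand-in; setnx mirrors Redis `SET key 1 NX` semantics."""
    def __init__(self):
        self.store = set()

    def setnx(self, key: str) -> bool:
        if key in self.store:
            return False
        self.store.add(key)
        return True

def handle_click(redis, publish, idempotency_key: str, event: dict) -> str:
    """Publish first, mark the key second — matching the crash analysis above:
    a crash before publish leaves the key unset, so the retry succeeds; a
    crash between publish and setnx leaves a rare duplicate in Kafka, which
    Flink's window-state dedup absorbs downstream."""
    if idempotency_key in redis.store:
        return "duplicate"        # fast path: retry of an already-published click
    publish(event)                # durable once Kafka acks
    redis.setnx(idempotency_key)  # now future retries short-circuit
    return "accepted"

kafka = []
r = MockRedis()
first = handle_click(r, kafka.append, "uuid-1", {"ad_id": "ad_42"})
retry = handle_click(r, kafka.append, "uuid-1", {"ad_id": "ad_42"})
print(first, retry, len(kafka))  # accepted duplicate 1
```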

💥 A Kafka broker dies

3-replica replication: each partition has a leader and 2 followers across AZs. On leader loss, a follower is elected leader within seconds. In-flight produces retry against the new leader. Acked writes are durable because acks=all requires every in-sync replica (bounded below by min.insync.replicas) to confirm before the ack.

💥 A Flink task manager crashes

Job manager detects the lost heartbeat, restarts the task on another node, restores state from the latest checkpoint, resumes Kafka consumption from the saved offset. 10s checkpoint interval means at most 10s of replay. State and offset roll back together — no double-counting.

💥 Redis aggregate store loses a node

Redis Cluster with replicas — failover in seconds. If the replica was lagging, the dashboard briefly shows slightly stale numbers; the next Flink checkpoint catches it up. The canonical billing number, sourced from S3, is unaffected.

💥 An entire region goes dark

Active-passive multi-region. Click endpoints in other regions stay up; Kafka MirrorMaker replicates the topic cross-region. The passive region's Flink job is hot-warm — promoted to active within minutes. Dashboard sees a brief gap; billing reconciliation backfills from the replicated S3.

💥 S3 has a glitch (rare but real)

Kafka holds 7 days of retention. If the S3 sink falls behind, it catches up when S3 recovers — no data lost. Billing is delayed by hours, not lost. We'd page on-call, but the system self-heals.

The pattern: every layer has at least one redundancy mechanism, and the layers compose. A click that has been Kafka-acked is durable to Flink crashes. A click in S3 is durable to Kafka outages. A canonical billing number from Spark is durable to Redis bugs. The only thing that can lose a click is a failure before Kafka acks — that window is single-digit milliseconds and is mitigated by client-side retries.
Step 14

Interview Q&A

How do you guarantee exactly-once when the network can drop messages?
Idempotency keys plus exactly-once stream processing. Client mints a UUID per physical click; server dedupes via Redis SETNX (catches 99% of duplicates cheaply); Flink dedupes again in window state (the actual guarantee); Spark dedupes a third time at billing reconciliation (the canonical answer). At-least-once delivery on the network combined with deduplication at every layer becomes effectively exactly-once at the output. No single component is trusted — defense in depth.
What's the trade-off between approximate and exact aggregation?
We do both, on purpose. Real-time aggregates are approximate (eventually consistent, ±0.1% drift) so dashboards can be sub-minute fresh — Sarah needs to know "is the campaign trending up?" not "exactly how much do I owe by 14:02:06". Canonical billing is exact (Spark over immutable S3) but takes hours — that's fine because invoices are sent monthly, not by the second. Mixing the two paths gives us speed where speed matters and exactness where money matters.
How do you handle a hot ad with 100K clicks/sec on one Kafka partition?
Salt the partition key. Switch from ad_id to (ad_id, salt 0–99) for that hot ad — the clicks now fan across 100 partitions instead of one. A second Flink stage merges the 100 sub-counters into the final per-ad number. Auto-detect via per-partition lag monitoring: when a partition's lag exceeds a threshold, dynamically rekey. In steady state no salting is needed; on Super Bowl day the system reshapes itself.
Why Flink and not Spark Structured Streaming?
Latency. Flink is a true streaming engine — events flow record-by-record with sub-second end-to-end latency. Spark Structured Streaming is micro-batch — end-to-end latency is measured in seconds at best, and grows under load. For a dashboard that must refresh in under a minute, Flink wins. Spark still wins for the nightly batch billing reconciliation — that's a different workload (a large scan of immutable files) where Flink is unnecessarily complex.
How do you bill the advertiser fairly when network failures cause retries?
The idempotency key is the contract. One physical click → one UUID → one billable event, no matter how many retries reach our servers. We bill from the canonical Spark output, which dedupes on idempotency_key as its first step. If a retry reached us 50 times, the advertiser is billed once. If somehow none of the 50 reached us, the click was never recorded — but that requires every retry to fail, which is vanishingly rare with a sub-second retry window.
What if the click happens but Kafka is briefly down?
The click endpoint buffers locally. If the Kafka publish fails after retries, the API server writes the event to a local disk-backed dead-letter queue (think: append-only log on the pod's local SSD). A background drainer flushes the DLQ to Kafka the moment it recovers. The user still got their 302 redirect within 50ms — they never know there was a hiccup. Alternative: write to S3 directly as a fallback, then Spark picks it up at reconciliation. Either way, the click is not lost.
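A sketch of the disk-backed DLQ and its drainer (all names are illustrative; a production version would rotate files, cap size, and run the drainer in a background thread):

```python
import json
import os
import tempfile

class DiskDLQ:
    """Fallback buffer: if the Kafka publish fails, append the event to a
    local append-only file; a drainer replays it when Kafka recovers."""
    def __init__(self, path: str):
        self.path = path

    def append(self, event: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # survive a pod crash, not just a process exit

    def drain(self, publish) -> int:
        """Replay buffered events; idempotency keys make re-publishes safe."""
        if not os.path.exists(self.path):
            return 0
        count = 0
        with open(self.path) as f:
            for line in f:
                publish(json.loads(line))
                count += 1
        os.remove(self.path)
        return count

def send(dlq: DiskDLQ, publish, event: dict) -> None:
    try:
        publish(event)        # normal path: straight to Kafka
    except ConnectionError:
        dlq.append(event)     # Kafka down: park the click on local SSD

dlq = DiskDLQ(os.path.join(tempfile.mkdtemp(), "clicks.dlq"))
def broken(_event):  # simulate a Kafka outage
    raise ConnectionError
send(dlq, broken, {"idempotency_key": "uuid-1", "ad_id": "ad_42"})
kafka = []
replayed = dlq.drain(kafka.append)
print(replayed, kafka)  # the buffered click reached Kafka after all
```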
How would you add A/B test analytics on top of this?
Add a variant_id column to the click event and partition Flink's aggregation by (ad_id, variant_id) instead of just ad_id. Now every counter is per-variant. The dashboard shows side-by-side conversion rates. For statistical significance, the same Flink job emits per-variant impression counts and the dashboard runs a chi-square test client-side. The reconciliation job also splits by variant, so canonical billing reflects which variant earned the spend. Zero new infrastructure — just a wider key.
What's the storage cost story for 30 days of raw events?
8.6 TB/day uncompressed × 30 days = 258 TB. With Parquet + snappy on S3, that compresses to roughly 80 TB. At S3 Standard pricing (~$23/TB/month), that's ~$1,840/month — utterly trivial next to the revenue an ad network generates. After 30 days we transition to Glacier Deep Archive (~$1/TB/month) for compliance retention, dropping the marginal cost by 95%. The cost story is dominated not by storage but by Kafka and Flink compute.
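The arithmetic, spelled out (the compression ratio and per-TB prices are the same approximations used above):

```python
raw_tb_per_day = 8.6
raw_tb = raw_tb_per_day * 30      # ~258 TB uncompressed over 30 days
parquet_tb = 80                   # ~3.2x compression with Parquet + snappy (estimate)
s3_standard = parquet_tb * 23     # ~$23/TB/month on S3 Standard -> ~$1,840/month
glacier = parquet_tb * 1          # Glacier Deep Archive ~$1/TB/month -> ~$80/month
print(raw_tb, s3_standard, glacier)  # ~258 TB raw; $1,840/mo hot; $80/mo archived
```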
The one-line summary the interviewer remembers: "Client-minted idempotency keys plus Kafka-buffered exactly-once Flink aggregation, with two output paths — Redis for sub-minute dashboards (estimated) and Spark over S3 for canonical billing (exact). Three checkpoints of dedup and a side-branch fraud job make every billed click defensible to a court."