Counting every ad click exactly once at 1M/sec — the streaming pipeline, idempotency keys, and exactly-once aggregation that turn billions of clicks into trustworthy billing dollars
This deep-dive applies the 5-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It is 9:14 AM on a Monday. Sarah is a marketing manager running a $10,000/day Google-style campaign for a sneaker launch. She has the dashboard open in one tab and her CFO on Slack in another. Every minute she refreshes — clicks tick up from 12,402 to 12,418 to 12,447 — and the spend column rises with them. At $0.30 per click, every single click matters: count one too many and the advertiser is overcharged; count one too few and the publisher is underpaid; count the same click twice and lawyers get involved. This is the system that has to count those clicks. It must count every real click, exactly once, even when networks drop packets and servers crash, and it must surface those counts to Sarah's dashboard within seconds.
An Ad Click Aggregator is the analytics backbone of every ad network — Google Ads, Meta Ads, TikTok Ads, Criteo. It does two things: (1) ingest a firehose of click events from billions of users and ad placements, and (2) produce trustworthy aggregate counts (per ad, per minute / hour / day / lifetime) that drive both real-time dashboards and end-of-day invoicing. The "exactly once" requirement makes this radically different from, say, a like-button counter — here, mistakes are billed in dollars.
Money is on the line, so we lock down expectations before drawing any boxes. The functional list is short — counts, queries, idempotency. The non-functional list is the harder one — exactness, throughput, durability, freshness.
A click event is a tiny tuple: `(ad_id, user_id, timestamp, idempotency_key)`. The functional contract is equally small: every real click is counted, a given `idempotency_key` arriving twice is counted once, and advertisers can query the result. Counting clicks approximately is `counter++` in any language. Counting them exactly while the world is on fire — networks drop, retries fire, replicas lag, fraudsters spam — is what forces every architectural choice we are about to make.

Numbers drive sizing of every tier — Kafka partitions, Flink parallelism, Redis cluster, S3 buckets. Run them out loud, even if rough.
Assume 10 million active ads on the platform. At peak hour we see 1,000,000 clicks/sec (think Black Friday, World Cup final). Average sustained load is closer to 200K/sec. Each click event is roughly 100 bytes on the wire (ad_id, user_id, timestamp, idempotency_key, IP, user_agent stub).
- **Peak ingest:** 1,000,000 clicks/sec (the Black Friday spike)
- **Daily average:** ~200,000 clicks/sec sustained
- **Peak bandwidth:** ~100 MB/sec (1M clicks/sec × 100 bytes)
- **Raw volume:** ~8.6 TB/day (1M × 86,400 s × 100 bytes, sized at the peak rate to stay conservative; the 200K/sec average alone would be ~1.7 TB/day)
Raw events are kept 30 days for billing reconciliation and audits, then aged out to cheaper cold storage or deleted. Aggregates are tiny by comparison: per-day counters are kept forever, while the fine-grained per-minute rows only need to live as long as dashboards query them.
- **30-day raw retention:** ~250 TB uncompressed (8.6 TB × 30 days); stored as Parquet on S3 with snappy compression it lands closer to 80 TB
- **Aggregates:** ~60 GB/year (10M ads × 365 days × ~16 bytes per daily counter), tiny next to raw events
- **Hot serving set:** ~20 GB (top 1% of ads × recent windows) in Redis ZSETs for sub-second dashboard reads
| Metric | Value | Why it matters |
|---|---|---|
| Peak ingest | 1M clicks/sec | Drives Kafka partition count and Flink parallelism |
| Daily volume | ~8.6 TB/day | Sizes the S3 cold tier and reconciliation Spark cluster |
| 30-day raw retention | ~80 TB compressed | Cost driver — Parquet + snappy keeps it manageable |
| Click endpoint p99 | <50ms | Forces async ingest — Kafka publish, no DB write inline |
| Dashboard freshness | <1 min | Forces stream processing, not batch ETL |
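The arithmetic behind the table, spelled out. A quick sketch; the rates, the 100-byte event size, and the ~3× snappy+Parquet compression ratio are the assumptions stated above:

```python
# Back-of-envelope sizing from the assumptions above.
PEAK_CPS = 1_000_000        # clicks/sec at peak (Black Friday)
EVENT_BYTES = 100           # ad_id, user_id, timestamp, idempotency_key, IP, UA stub
SECONDS_PER_DAY = 86_400

peak_bandwidth = PEAK_CPS * EVENT_BYTES                   # bytes/sec, ~100 MB/s
raw_per_day = peak_bandwidth * SECONDS_PER_DAY            # sized at peak, ~8.6 TB/day
raw_30d = raw_per_day * 30                                # ~259 TB uncompressed
raw_30d_parquet = raw_30d / 3                             # snappy + Parquet, ~86 TB
agg_per_year = 10_000_000 * 365 * 16                      # daily counters, ~58 GB

print(f"{peak_bandwidth / 1e6:.0f} MB/s peak, "
      f"{raw_per_day / 1e12:.1f} TB/day raw, "
      f"{raw_30d_parquet / 1e12:.0f} TB compressed over 30 days, "
      f"{agg_per_year / 1e9:.0f} GB/yr of aggregates")
```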
Two endpoints carry the load: a write endpoint that records each click (called millions of times per second) and a read endpoint that powers Sarah's dashboard (called when she refreshes).
REST API surface:

```text
// Write — 1M req/sec at peak, must respond <50ms with a 302 redirect
GET /click?ad_id=A123&user=u9&ts=1714742400&k=550e8400-e29b-41d4-a716-446655440000
// k = idempotency_key, generated client-side as a UUID v4
→ 302 Found
Location: https://advertiser.com/landing-page

// Read — low QPS, served from the Redis aggregate store
GET /api/v1/metrics?ad_id=A123&window=1h&granularity=1m
→ 200 OK
{
  "ad_id": "A123",
  "window": "1h",
  "series": [
    { "t": "2026-05-07T14:00:00Z", "clicks": 1240, "spend_usd": 372.00 },
    { "t": "2026-05-07T14:01:00Z", "clicks": 1318, "spend_usd": 395.40 },
    ...
  ],
  "as_of": "2026-05-07T15:00:42Z",
  "label": "estimated"   // real-time path; canonical billing comes from S3 reconciliation
}

// Internal — billing reconciliation produces canonical numbers
GET /api/v1/billing?ad_id=A123&day=2026-05-07
→ 200 OK
{ "clicks": 28412, "spend_usd": 8523.60, "label": "canonical" }
```
The `idempotency_key` is the most important parameter on the page. The browser generates a UUID v4 the moment the click happens (or the ad SDK does). If the network drops the request and the browser retries, the second request carries the same key. Downstream — both at the API server and inside Flink — we see the duplicate and drop it. Without this, retries silently double-charge advertisers. With it, we are safe.

This is the section that wins or loses the interview. We build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart under real money, and the production shape where every box justifies itself with a concrete failure it prevents.
The simplest design any junior engineer would draw: the click endpoint takes the request, opens a Postgres transaction, runs `UPDATE clicks SET count = count + 1 WHERE ad_id = ?`, commits, returns the 302. One server, one database, one row per ad.
Three concrete failures emerge the moment real traffic shows up — and each one maps directly to a component we will introduce in Pass 3:
**1. Hot-row contention.** 1M clicks/sec means 1M UPDATEs/sec. If 10K of those hit the same viral ad's row, Postgres serializes them via row locks — throughput on that row collapses to a few thousand per second. The dashboard query is also blocked behind the write queue. Within a minute, the entire pipeline has melted.

**2. Duplicate counting.** A flaky cellular network retries the click request. The first one succeeded but the response was lost. The second one increments the counter again. The advertiser is overcharged $0.30 per duplicate, multiplied by millions of clicks, multiplied by every ad in the system. Auditors find the discrepancy at month-end. Lawsuit.

**3. Read/write interference.** Sarah refreshes her dashboard. Postgres scans the counters table while millions of writes are in flight. Either reads block writes (latency spike, click endpoint times out) or writes block reads (her dashboard hangs). Either way, the experience is broken when it matters most — during a campaign launch.
Three insights, drilled in until they are obvious. They form the spine of the entire production design.
**Insight 1: idempotency keys make retries safe.** Every click event carries a UUID generated at the source — the browser or the ad SDK. Think of it as a receipt number: the same physical purchase, even if the receipt is photocopied a hundred times, is still one purchase. Anywhere downstream, if we see the same key twice, we drop the duplicate. This is what makes "at-least-once delivery" (which is all the network can guarantee) safe to use.

**Insight 2: decouple ingest from aggregation.** The click endpoint must be fast and durable — but it does not need to be aggregated yet. We use Kafka as a durable buffer between ingest and aggregation. The endpoint publishes to Kafka and returns; a separate stream processor consumes Kafka at its own pace and computes counters. If aggregation lags by 30 seconds during a spike, no clicks are lost — they pile up in Kafka until the consumer catches up.

**Insight 3: two read paths, two truths.** Real-time aggregates flow from Flink to Redis for dashboards (estimated, sub-second freshness). Raw events flow from Kafka to S3 in Parquet, where a daily Spark job re-aggregates and produces the canonical billing number (the one the advertiser is invoiced from). Real-time is "what's happening right now". Canonical is "what we charge you". Like a credit card pending charge versus the posted transaction.
This three-way split lets the click endpoint stay sub-50ms, the dashboard stay sub-minute fresh, and the billing system stay forensically auditable — without any one component compromising for the others.
Now the full picture. Every box is numbered ①–⑫ — find its matching card below to see what it does, why it earned its place, and what would break if we removed it tomorrow.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
**① Ad client & SDK.** The thing that actually shows the ad and detects the click. On the web, it is a snippet of JavaScript that wraps the `<a>` tag; on mobile, it is the publisher's SDK (Google AdMob, Facebook Audience Network style). The moment the user taps, this layer mints a fresh UUID v4 idempotency_key, builds the click URL, and sends the browser to it.
Solves: the source-of-truth for "this is one click". Without a client-generated idempotency key, the server has no way to distinguish "the user clicked twice in frustration" from "the network retried the same click twice" — the first must count as two, the second as one. Only the client knows.
**② Click Endpoint.** A tiny stateless HTTP service whose only purpose is to be fast. Per request: parse query params, return a 302 redirect to the landing page, and asynchronously hand the click to the ingestion API. Latency budget: under 50ms p99, including TLS termination. Built in Go or Rust, deployed in dozens of regions.
Solves: the user-perceived latency problem. The user is mid-click and the browser is waiting for our redirect. If we did anything heavy here (DB write, RPC, fraud scoring), they would experience the ad as broken. Splitting this from the ingestion API lets the redirect path stay laser-focused.
**③ Load Balancer.** Public-facing application load balancer (AWS ALB, GCP Cloud Load Balancer, or nginx). Distributes click traffic across the click endpoints and ingestion API pods. Terminates TLS, runs health checks every 5s, evicts unhealthy pods automatically.
Solves: single-point-of-failure on the ingest tier. Without an LB, one app crash takes down a slice of the world's ad clicks. With it, traffic shifts in seconds and we lose 1/N of capacity rather than 100%.
**④ Ingestion API.** Stateless service downstream of the click endpoint. Per request: validate the click (does ad_id exist? is the request well-formed?), enrich it (look up campaign metadata, attach server timestamp, derive country from IP), check the Idempotency Cache ⑦ to drop in-flight retries, then publish the event to Kafka ⑤. Returns to the click endpoint as soon as the publish is acknowledged — Kafka durability is what makes the click "saved".
Solves: isolating heavy validation and enrichment from the latency-critical 302 path. The redirect goes back to the user immediately; ingestion happens on a parallel goroutine. Without this split, every click would carry the full validation cost in user-visible latency.
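A minimal sketch of that hot path, assuming the redis-py and confluent-kafka client libraries. The `clicks` topic and the 300s TTL come from the design above; the hostnames and key prefix are illustrative:

```python
import json

import redis
from confluent_kafka import Producer

r = redis.Redis(host="idem-cache")  # hypothetical host
producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def ingest_click(event: dict) -> bool:
    """Dedupe on the idempotency key, then publish. False = duplicate dropped."""
    # Atomic set-if-absent with a 5-minute TTL (SET ... NX EX 300).
    if not r.set(f"idem:{event['idempotency_key']}", 1, nx=True, ex=300):
        return False  # in-flight retry: drop before it costs Kafka anything
    # Keying by ad_id keeps one ad's clicks on one partition,
    # lining up with Flink's keyed state downstream.
    producer.produce("clicks",
                     key=event["ad_id"].encode(),
                     value=json.dumps(event).encode())
    producer.flush()  # wait for the broker ack: "saved" means Kafka-durable
    return True
```

In production you would batch acks via delivery callbacks rather than flush per event; and because a crash in the tiny window between the Redis set and the Kafka ack is possible, the downstream Flink and Spark dedup is what makes this pre-filter safe to keep approximate.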
**⑤ Kafka (`clicks` topic).** The durable buffer at the heart of the system. The `clicks` topic is partitioned by `ad_id` into ~10,000 partitions, replicated 3× across availability zones. At 1M clicks/sec across 10K partitions that's 100 events per partition per second — each broker handles it comfortably. Retention is 7 days (long enough to replay a Flink restart from any reasonable failure).
Solves: three things at once. (1) Durability — once Kafka acks the publish, the click is safe even if every downstream consumer dies. (2) Decoupling — Flink can lag during a spike and Kafka holds the backlog without dropping events. (3) Replay — exactly-once semantics rely on being able to rewind to a known offset, which only a log-structured store like Kafka cleanly supports. Without Kafka, we would need to reinvent all three of these in custom code, badly.
**⑥ Flink Stream Aggregator.** The brain of the aggregator. Flink consumes Kafka with exactly-once semantics (transactional consumer + checkpointing every 10 seconds), maintains per-ad counters in RocksDB-backed state, and emits aggregated rows to the Aggregate Store ⑧. It dedupes on idempotency_key within a sliding window, runs tumbling windows for 1m / 1h / 1d granularities in parallel, and rewinds gracefully on a task-manager failure.
Solves: the exactly-once-aggregation problem. A naive consumer might count an event twice if it crashes between updating the counter and committing the Kafka offset. Flink's transactional consumer atomically commits both the offset and the state change — guaranteeing that on recovery, every input event has been counted exactly once. Without Flink (or a Spark Structured Streaming equivalent), we would either lose clicks on crashes or double-count them on retries.
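The aggregation logic itself is small — the hard part is the fault-tolerant state around it. A toy model of the per-(ad, window) state Flink keeps in RocksDB, with hypothetical field names, ignoring checkpointing (covered in the deep dive below):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

# Per (ad_id, window_start) state, as Flink would keep it in RocksDB:
# a seen-key set for dedup plus a running counter.
seen: dict[tuple, set] = defaultdict(set)
counts: dict[tuple, int] = defaultdict(int)

def process(event: dict) -> None:
    window_start = event["ts_ms"] - event["ts_ms"] % WINDOW_MS
    key = (event["ad_id"], window_start)
    # Dedup on idempotency_key: a duplicate contributes nothing.
    if event["idempotency_key"] in seen[key]:
        return
    seen[key].add(event["idempotency_key"])
    counts[key] += 1

def close_window(ad_id: str, window_start: int) -> int:
    """On window close, emit the counter to the aggregate store, free the dedup set."""
    key = (ad_id, window_start)
    seen.pop(key, None)
    return counts.pop(key, 0)
```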
**⑦ Idempotency Cache (Redis).** A Redis cluster running an atomic set-if-absent — `SET idempotency_key 1 NX EX 300`, the modern form of `SETNX` with a TTL — on every incoming click. If the set fails (the key already exists), the click is a duplicate and the API drops it before publishing to Kafka. A 5-minute TTL is enough to catch any plausible retry storm without holding state forever.
Solves: dropping in-flight retries early. Flink also dedupes downstream — but the further upstream we catch a duplicate, the cheaper it is. A retry caught at the API costs one Redis lookup; a retry that reaches Flink costs Kafka bandwidth, broker storage, and consumer CPU. Without this cache, the system still produces correct counts, but Kafka throughput would carry 5–10% noise that Flink has to filter.
**⑧ Aggregate Store (Redis or DynamoDB).** The serving layer for dashboards. Per-ad, per-window counters keyed by (ad_id, granularity, time_bucket). Flink writes here every checkpoint; the dashboard reads from here. Redis sorted sets are great for "give me the last 60 minutes of ad A123" range queries; DynamoDB is the alternative when we need infinite retention and don't mind millisecond reads.
Solves: sub-second dashboard reads. If the dashboard queried Flink state directly, it would compete with the streaming pipeline for resources. Materializing aggregates into a dedicated read store gives Sarah's dashboard sub-100ms responses without putting any pressure on the streaming hot path.
**⑨ Raw Event Archive (S3 + Parquet).** Every raw click event flows from Kafka to S3 via a Kafka Connect sink, written as Parquet files partitioned by date=YYYY-MM-DD/hour=HH/. With snappy compression, our 8.6 TB/day of raw events shrinks to roughly 2–3 TB/day on disk. 30-day retention before transition to Glacier for compliance archiving.
Solves: two things. (1) The canonical record-of-truth for billing — Spark reads these files end-of-day to produce the invoiced number. (2) Auditability — when an advertiser disputes a charge, we can point to the exact raw event and the timestamp it landed in our system. Without S3, the only history of a click would be a number in Redis, which is no defense in court.
**⑩ Dashboard API.** The HTTP service that powers Sarah's UI. Reads from the Aggregate Store ⑧, formats the response, slaps an "as_of" timestamp and an "estimated" label on it. Range queries like "last 24 hours of ad A123" become a ZRANGEBYSCORE in Redis — sub-50ms.
Solves: giving the UI a stable contract. The aggregate store schema may evolve (new windows added, storage swapped from Redis to DynamoDB), but the dashboard API surface stays the same. Without this layer, every UI rev would have to know the internal state-store details.
**⑪ Fraud Detection Job (Flink).** A second Flink job consuming the same Kafka topic. It does not aggregate counts — it watches for suspicious patterns: same IP firing 1,000 clicks in 10 seconds, same user_id clicking the same ad 50 times an hour, click timestamps that are too perfectly spaced (bot signature). Flagged idempotency_keys are written to a "fraud" topic which the billing job ⑫ subtracts at reconciliation time.
Solves: protecting advertisers from being billed for fraudulent traffic without slowing the main aggregation path. Running fraud detection inline with counting would make the dashboard wait on rule evaluation; running it as a side-branch from Kafka means the main path stays fast and fraud is corrected in the canonical billing number.
**⑫ Billing Reconciliation Job (Spark).** A nightly Spark job that reads the day's raw events from S3, joins against the fraud-flagged keys from ⑪, dedupes on idempotency_key, and produces the canonical per-ad clicks-and-spend number. This is the file that drives the invoice. Slow but exact — runs overnight on a transient cluster, takes 2 hours.
Solves: the exactness contract. Real-time aggregates can drift slightly (Flink restarts, Redis evictions, late-arriving events). Spark over the immutable S3 record produces a number that is guaranteed reproducible — run it twice and you get the same answer to the cent. This is the only number an auditor will trust. Without it, "exact billing" is a hope, not a proof.
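A sketch of that job in PySpark. The S3 layout and the $0.30 CPC follow earlier sections; the fraud-flags path, table locations, and app name are assumed:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("billing-reconciliation").getOrCreate()
day = "2026-05-07"

clicks = spark.read.parquet(f"s3://clicks/date={day}/")
# Assumed layout: one idempotency_key column per flagged event, from job ⑪.
fraud = spark.read.parquet(f"s3://fraud-flags/date={day}/")

canonical = (
    clicks
    # Remove fraud-flagged events before counting.
    .join(fraud, on="idempotency_key", how="left_anti")
    # Canonical dedup: each idempotency_key counts once, no matter what.
    .dropDuplicates(["idempotency_key"])
    .groupBy("ad_id")
    .agg(F.count(F.lit(1)).alias("clicks"))
    .withColumn("spend_usd", F.round(F.col("clicks") * 0.30, 2))
    .withColumn("day", F.lit(day))
)
canonical.write.mode("overwrite").parquet(f"s3://billing/date={day}/")
```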
Two real flows, mapped to the numbered components above. Watch how the same click flows down two paths in parallel — the fast path to dashboards, and the durable path to canonical billing.
**Flow 1 — the click write path (hot).**
1. A user taps Sarah's sneaker ad. The client ① mints `idempotency_key` `550e8400-...`, builds the click URL.
2. The click endpoint ② answers with the 302 redirect; the user is on the landing page in under 50ms.
3. In parallel, the ingestion API ④ runs the set-if-absent on the Idempotency Cache ⑦ (succeeds — first arrival) and publishes the event to Kafka ⑤ on the partition for `ad_id=A123`. Acked in 4ms.
4. Flink ⑥ consumes the event, dedupes it against window state, bumps the 1-minute counter, and flushes to the Aggregate Store ⑧ at the next checkpoint. Sarah's next refresh shows the click.

**Flow 2 — the canonical billing path (cold).**
1. The nightly Spark job ⑫ reads `s3://clicks/date=2026-05-07/` — every Parquet file from the day, roughly 2–3 TB compressed.
2. It groups on `(ad_id, idempotency_key)` first to dedupe, subtracts the fraud-flagged keys from ⑪, then groups by `ad_id` to count.
3. The result — `(ad_id=A123, day=2026-05-07, clicks=28412, spend_usd=8523.60)` — is written to a billing table. This is the number on the advertiser's invoice.

Of every choice in this design, the idempotency key is the one that makes everything else possible. It is also the choice that junior designs miss most often. Here is why it matters and exactly how it flows through the system.
The fundamental problem: networks are unreliable. The browser sends the click; the server receives it; the response is dropped on the way back. The browser, having no way to know whether the request succeeded, retries. Now the server has received the same click twice. Without an idempotency key, every retry doubles the count. With one, the second arrival is recognized and dropped.
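In miniature, the client-side contract looks like this (a sketch using the `requests` library; the real code lives in the ad SDK, and the URL is hypothetical):

```python
import uuid

import requests

def send_click(ad_id: str, user_id: str, ts: int) -> requests.Response:
    # Mint the key ONCE, at the moment of the tap, not once per attempt.
    key = str(uuid.uuid4())
    url = (f"https://ads.example.com/click"
           f"?ad_id={ad_id}&user={user_id}&ts={ts}&k={key}")
    for _ in range(3):
        try:
            # Every retry carries the SAME key, so the server can tell
            # "the network retried" apart from "the user clicked again".
            return requests.get(url, timeout=2, allow_redirects=False)
        except requests.exceptions.RequestException:
            continue  # response lost? retry is safe: the key dedupes us
    raise ConnectionError("click not confirmed after 3 attempts")
```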
The key is checked at four checkpoints, each with a different role:
**1. Client — key generation.** The instant the user taps, the SDK generates a UUID v4. Same physical tap, same key, no matter how many times the request is retried. This is what makes "the same click" a well-defined concept across the network.

**2. Ingestion API — Redis pre-filter.** The atomic set-if-absent succeeds on first arrival (we proceed) and fails on a duplicate (we silently drop). A 5-minute TTL covers any plausible retry window. Catches 99% of duplicates before they cost any downstream resources.

**3. Flink — stateful dedup.** Even if a duplicate slips past Redis (cache eviction, late retry past TTL), Flink keeps a per-window Set<idempotency_key> in RocksDB state and skips counting any key it has seen before. This is the true guarantee — Redis is just a fast pre-filter.

**4. Spark — canonical dedup.** The nightly Spark job groups raw events by idempotency_key before counting. This is the final, canonical dedup — even if Flink double-counted because of some bug, the billing number is still correct because Spark sees the truth in S3.
"Exactly-once" is one of the most over-promised phrases in distributed systems. What Flink + Kafka actually give you is exactly-once effect on aggregated state — meaning, no matter how many times an input event flows through the pipeline due to retries or crashes, its contribution to the output counter is recorded once. Here is how.
**Step 1: transactional consume.** Flink consumes Kafka inside a transaction. The Kafka offset commit and the state update are part of the same atomic operation — either both happen or neither does.

**Step 2: periodic checkpoints.** Every 10 seconds, Flink snapshots its state (the per-ad counters, the seen-idempotency-key set) plus the corresponding Kafka offsets to durable storage (S3 or HDFS).

**Step 3: recovery by replay.** If a Flink task crashes, the job restarts, loads the latest checkpoint, and resumes consuming Kafka from the saved offset. Any events between the last checkpoint and the crash are re-read and re-counted, but because the state was rolled back too, no double-counting occurs.
The mental model: exactly-once is a bank deposit. The teller stamps the deposit slip and updates the ledger in one atomic step. If they faint mid-deposit, you can replay the transaction safely — the slip and the ledger are either both updated or both not. Flink's checkpoint is the stamp; Kafka's offset is the slip; RocksDB state is the ledger.
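The same analogy as a toy model — pure Python, nothing Flink-specific, just demonstrating why the counters and the offset must snapshot and roll back together:

```python
import copy

log = [{"ad_id": "A123"}] * 10           # stand-in for one Kafka partition
state = {"counts": {}, "offset": 0}      # counters + consumer offset, one unit
checkpoint = copy.deepcopy(state)

def process(n_events: int) -> None:
    for _ in range(n_events):
        event = log[state["offset"]]
        state["counts"][event["ad_id"]] = state["counts"].get(event["ad_id"], 0) + 1
        state["offset"] += 1

def take_checkpoint() -> None:
    global checkpoint
    checkpoint = copy.deepcopy(state)    # snapshot counters AND offset atomically

def crash_and_recover() -> None:
    global state
    state = copy.deepcopy(checkpoint)    # roll BOTH back together

process(5)
take_checkpoint()
process(3)
crash_and_recover()                      # the 3 uncheckpointed events roll back...
process(5)                               # ...get re-read on replay, counted once
assert state["counts"]["A123"] == state["offset"] == 10
```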
The dashboard needs counts for multiple time windows simultaneously — "last minute", "last hour", "today", "all-time". Two strategies, with different cost and freshness profiles.
**Strategy A: parallel windows.** Flink runs several tumbling windows in parallel — 1m, 1h, 1d. Each window emits a counter row to the Aggregate Store when it closes. Pros: the dashboard reads pre-aggregated rows directly, sub-millisecond latency. Cons: 3× the state size and 3× the write volume to the aggregate store.

**Strategy B: minute-only plus roll-up.** Flink only emits per-minute counters. The dashboard API computes "last hour" by summing the last 60 minute-buckets, "today" by summing 1,440. Pros: 1/3 the state, simpler windowing. Cons: every dashboard read does a fan-out sum — fine in Redis (60 ZSCOREs are sub-ms), problematic for longer spans where thousands of buckets must be summed per read.
The production choice is hybrid: Flink emits per-minute and per-day counters; the dashboard rolls up per-hour from per-minute (60 cells) and "all-time" from per-day. Per-minute is what dashboards refresh against; per-day is what survives forever for historical reporting. Per-hour is just a sum over 60 cells — cheap.
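A sketch of the hybrid read path, assuming per-minute counts live in a per-ad Redis ZSET with the minute bucket as the member and the count as the score (one plausible layout; the key format is an assumption):

```python
import time

import redis

r = redis.Redis(host="agg-store")  # hypothetical host

def last_hour_clicks(ad_id: str, now: int | None = None) -> int:
    """Roll up 'last hour' by summing the 60 most recent per-minute buckets."""
    now = now or int(time.time())
    now -= now % 60  # align to the minute boundary
    buckets = [str(now - 60 * i) for i in range(60)]
    # One pipelined round-trip: 60 ZSCOREs against the ad's per-minute ZSET.
    pipe = r.pipeline()
    for bucket in buckets:
        pipe.zscore(f"clicks:1m:{ad_id}", bucket)
    return int(sum(score for score in pipe.execute() if score is not None))
```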
Picture a Super Bowl ad that gets 100,000 clicks per second for 60 seconds. Our Kafka topic is partitioned by ad_id, so all 100K clicks land on the same partition. One Flink task handles that partition. That single task is now doing 100x its share of work, while the other 9,999 partitions are nearly idle.
This is the classic hot key problem in stream processing. Two solutions, both used in production at different scales.
**Solution 1: key salting** (sketched below). Instead of partitioning by ad_id, partition by (ad_id, random_salt 0–99). Now a single ad's clicks fan out across 100 partitions. Flink aggregates per (ad, salt) in parallel, then a second step merges the 100 sub-counts into the final per-ad counter. Pros: scales linearly. Cons: requires a two-stage pipeline (sub-aggregate, then re-aggregate).

**Solution 2: client-side pre-aggregation.** Local pre-aggregation at the ingestion API layer — buffer clicks for 1 second, send {ad_id, count: N} instead of N individual events. Reduces Kafka volume 100x for hot ads at the cost of 1s extra dashboard lag. Effectively the same idea as a database write-back cache.
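Solution 1 in sketch form — the producer swaps in a salted key for ads the lag monitor has flagged hot (`HOT_ADS` is a stand-in for that feed):

```python
import random

HOT_ADS = {"A123"}   # populated by the per-partition lag monitor
NUM_SALTS = 100

def partition_key(ad_id: str) -> str:
    # Cold ads keep plain ad_id keys; hot ads fan out across 100 partitions.
    if ad_id in HOT_ADS:
        return f"{ad_id}#{random.randrange(NUM_SALTS)}"
    return ad_id

# Stage 2, in the stream processor: strip the salt and re-sum per ad.
def merge_key(salted_key: str) -> str:
    return salted_key.split("#", 1)[0]
```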
The production recommendation: for the vast majority of ads, plain `ad_id` partitioning is fine. Auto-scale by monitoring per-partition lag — when one partition's lag exceeds a threshold, dynamically rekey it with salt. Most days, no salting is needed; on Super Bowl day, the system shifts gear automatically.

Why have two aggregation paths? Because the real-time dashboard and the invoice serve different masters.
**The real-time path (estimated).** Dashboards refresh every minute. Drift of ±0.1% is fine — Sarah is monitoring trend, not balancing books. We trade exactness for freshness. Counts are eventually correct as Flink catches up.

**The canonical path (billing-grade).** Daily Spark job over immutable S3 events. Reproducible — run it again next year and you get the same number. This is what the advertiser is charged. Slow (2 hours) but provably exact.
At end of day, compare the real-time number against the canonical number per ad. Acceptable drift is <0.5%. If drift exceeds 1%, page on-call — something in the streaming pipeline is misbehaving (Flink restart that lost late events, Redis eviction during a memory pressure spike, etc.). The drift dashboard is itself a critical piece of operational tooling.
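The per-ad drift check itself is tiny (a sketch; the 0.5% and 1% thresholds come from the text above):

```python
def drift_status(realtime: int, canonical: int) -> str:
    """Compare the real-time count against the canonical billing count."""
    if canonical == 0:
        return "ok" if realtime == 0 else "page"
    drift = abs(realtime - canonical) / canonical
    if drift > 0.01:    # >1%: the streaming pipeline is misbehaving
        return "page"
    if drift > 0.005:   # >0.5%: outside acceptable drift, watch it
        return "warn"
    return "ok"
```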
"as_of" timestamp and an "estimated" label. Yesterday's row, after the nightly reconciliation, flips its label to "final". Advertisers have learned to trust the contrast — "today's number is approximate, yesterday's is invoice-grade".Click fraud is real and expensive. Bot farms, click rings, competitors burning a rival's daily budget — every ad network deals with it. Our fraud detection runs as a side-branch off Kafka, not inline with counting, so the main pipeline stays fast and the fraud rules can be updated without redeploying the aggregator.
- **IP velocity** (sketched below): same IP fires more than N clicks per minute across any ads. Threshold tuned by ASN — a corporate NAT looks different from a residential ISP.
- **User-ad frequency:** same user_id clicks the same ad_id more than N times per hour. A real user clicks an ad once or twice; a bot clicks it 200 times.
- **Timing regularity:** click timestamps too perfectly spaced (every 4.2 seconds, ±0.1s). Real human clicks are jittery; bots are not.
- **Fingerprint anomalies:** headless browsers, outdated mobile UAs, missing referrer headers. Each is weak alone; correlated across an IP they're a strong signal.
- **Geo mismatch:** a click claims to be from a US user but the IP is a known data-center range in Eastern Europe. Auto-flag for review.
- **Conversion anomaly:** a campaign suddenly gets 10x more clicks than baseline but zero conversions. Either incredibly bad ad copy or a click-fraud attack. Alert the campaign owner.
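The first rule as a sketch — a per-IP sliding window over the click stream, with the threshold and window size as stand-in values:

```python
import time
from collections import defaultdict, deque

WINDOW_SEC = 60
MAX_CLICKS_PER_IP = 100  # tuned per ASN in practice

recent: dict[str, deque] = defaultdict(deque)

def is_suspicious(ip: str, ts: float | None = None) -> bool:
    """Flag an IP that fires more than N clicks inside the sliding window."""
    ts = ts or time.time()
    q = recent[ip]
    q.append(ts)
    # Evict clicks that have slid out of the window.
    while q and q[0] < ts - WINDOW_SEC:
        q.popleft()
    return len(q) > MAX_CLICKS_PER_IP
```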
Flagged idempotency_keys are written to a fraud Kafka topic. The billing reconciliation job ⑫ reads this topic and subtracts flagged clicks from the canonical invoice. Real-time dashboards can also surface a "suspected fraud" overlay so Sarah sees gross vs net counts.
Each storage layer has its own partitioning scheme — chosen to match the access pattern of that layer, not the layer above or below.
| Layer | Partition key | Why |
|---|---|---|
| Kafka clicks topic | ad_id | All clicks for one ad land on one Flink task — trivial in-task aggregation. Hot ads handled via salting (see the hot ad deep dive above). |
| Flink state | (ad_id, window_start) | Counters are partitioned the same way Kafka is, so Flink's keyed state lines up with input partitions — no shuffle. |
| Aggregate Store (Redis) | ad_id (consistent hash) | Dashboard reads a single ad's series — that ad's data lives on one Redis shard. ZRANGE is local. |
| S3 Parquet | date=YYYY-MM-DD/hour=HH/ | Spark scans a day or hour at a time — date-prefix partitioning lets it skip everything else (partition pruning). |
| Idempotency Cache | idempotency_key hash | Even spread across Redis cluster nodes; SETNX is single-node, no cross-shard transactions needed. |
What survives what kind of failure? Walk through the failure scenarios that matter.
**Ingest crash mid-request.** Browser sees a connection reset, retries with the same idempotency key. The set-if-absent on the second arrival fails, so the click is dropped (correctly — it was already published before the crash). If the crash happened before the Kafka publish, the key was never set in Redis either, so the retry succeeds normally.

**Kafka broker loss.** 3-replica replication: each partition has a leader and 2 followers across AZs. On leader loss, a follower is elected leader within seconds. In-flight produces retry against the new leader. Acked writes are durable because acks=all requires majority replication before ack.

**Flink task-manager crash.** The job manager detects the lost heartbeat, restarts the task on another node, restores state from the latest checkpoint, and resumes Kafka consumption from the saved offset. The 10s checkpoint interval means at most 10s of replay. State and offset roll back together — no double-counting.

**Redis aggregate node loss.** Redis Cluster with replicas — failover in seconds. If the replica was lagging, the dashboard briefly shows slightly stale numbers; the next Flink checkpoint catches it up. The canonical billing number, sourced from S3, is unaffected.

**Full region outage.** Active-passive multi-region. Click endpoints in other regions stay up; Kafka MirrorMaker replicates the topic cross-region. The passive region's Flink job is hot-warm — promoted to active within minutes. The dashboard sees a brief gap; billing reconciliation backfills from the replicated S3.

**S3 sink outage.** Kafka holds 7 days of retention. If the S3 sink falls behind, it catches up when S3 recovers — no data lost. Billing is delayed by hours, not lost. We'd page on-call, but the system self-heals.
**Follow-up: a hot ad melts one partition?** Switch the partition key from `ad_id` to `(ad_id, salt 0–99)` for that hot ad — the clicks now fan across 100 partitions instead of one. A second Flink stage merges the 100 sub-counters into the final per-ad number. Auto-detect via per-partition lag monitoring: when a partition's lag exceeds a threshold, dynamically rekey. In steady state no salting is needed; on Super Bowl day the system reshapes itself.

**Follow-up: A/B testing ad variants?** Add a `variant_id` column to the click event and partition Flink's aggregation by `(ad_id, variant_id)` instead of just `ad_id`. Now every counter is per-variant. The dashboard shows side-by-side conversion rates. For statistical significance, the same Flink job emits per-variant impression counts and the dashboard runs a chi-square test client-side. The reconciliation job also splits by variant, so canonical billing reflects which variant earned the spend. Zero new infrastructure — just a wider key.