From "one cron daemon polling Postgres" to a sharded, time-bucketed, lease-protected scheduler that fires 100K jobs/second on time, every time
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine 100 million users have all set the same calendar reminder — "ping me at 7:00 AM every weekday with my morning briefing." At exactly 07:00:00 local time, on a Monday, in 24 different timezones, those reminders must fire. Some are emails, some are push notifications, some kick off a workflow that fans out to ten downstream services. Miss the moment by even a minute and the user opens the app to dead silence — "what was that thing I was supposed to do today?" That's the job scheduler problem: a system that triggers code at scheduled times (cron-style: every Monday at 9, every hour on the hour) and at delayed offsets (one-shot: "send Sarah a reminder email in 3 days") at massive scale, reliably, on time.
You've used job schedulers without thinking about them. Quartz fires off Java tasks. Linux cron fires shell scripts. Airflow runs data pipelines. Celery beat schedules background workers. They're all the same shape underneath — a clock, a queue, and a worker — but they break in interesting ways the moment you ask them to do it for a billion jobs across hundreds of machines.
Two questions shape the whole design: (1) When 100 million jobs are all due at exactly 07:00:00, how do we fire them within seconds of their target time without melting the database? (2) When a worker crashes mid-execution at 07:00:30 with 50,000 jobs in-flight, how do we make sure those jobs still fire — exactly once — and not zero times or twice?

Before drawing a single box, pin down what the system must do. Naming the asks early prevents the classic interview trap of designing for a workload nobody asked for.
Numbers force decisions. Three knobs matter: how many jobs are pending in the system right now, how fast they fire at peak, and how big each job's metadata is.
Assume 1 billion active scheduled jobs at any instant (a mix of one-shot timers and the next-fire entries of cron jobs). Peak firing rate during global rush hours: 100,000 jobs/second.
- Pending jobs: ~1B active across all users / cron entries
- Peak fire rate: ~100K jobs/sec at 07:00 sharp across regions
- Job record size: ~500 bytes (id + run_at + target + payload)
- Hot storage: ~500 GB (1B × 500 B for current jobs)
The hot store (DynamoDB / Cassandra) holds all 1B currently-pending jobs at 500 bytes each — about 500 GB. The history archive (S3 / Parquet) holds every job that has ever fired, for audit and analytics; at the sustained average fire rate (a few thousand fires/sec — the 100K/s figure is a short-lived peak), it grows roughly 3 TB / month. Cold storage is cheap; we keep it indefinitely.
| Metric | Value | Why it matters |
|---|---|---|
| Pending jobs | 1B | Forces sharding of the cold DB — too big for one node |
| Peak fire rate | 100K/s | Forces N partitions, each handling a slice; no single worker can keep up |
| Storage hot | 500 GB | DynamoDB scales linearly; partition by job_id hash |
| Storage history | 3 TB/mo | S3 / cold tier; fire-and-forget audit log |
| Hot ZSET window | ~1 hour | Holds at most 100K/s × 3600 = 360M jobs in the worst case; sharded to fit |
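A quick back-of-envelope in Python, using the same assumptions as the table above (nothing here is a new measurement):

```python
# Sanity-check the capacity estimates (illustrative only).
pending_jobs   = 1_000_000_000   # ~1B active jobs
job_size_bytes = 500             # id + run_at + target + payload
peak_fire_rate = 100_000         # jobs/sec during the 07:00 spike

hot_store_gb = pending_jobs * job_size_bytes / 1e9
print(f"hot store: ~{hot_store_gb:.0f} GB")              # ~500 GB

# Worst case for the hot window: a full hour of peak firing staged in Redis.
hot_window_jobs = peak_fire_rate * 3600
print(f"hot ZSET window: ~{hot_window_jobs/1e6:.0f}M jobs")  # ~360M jobs
```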
Four endpoints carry everything. Notice that POST /jobs is overloaded — same endpoint, two flavors of payload (one-shot vs cron). The schema discriminates.
    // Create a one-shot delayed job
    POST /api/v1/jobs
    {
      "type": "one_shot",
      "run_at": "2027-05-10T14:02:00Z",        // exact UTC instant
      "target": {
        "kind": "http_webhook",
        "url": "https://api.example.com/reminder",
        "method": "POST",
        "headers": { "Authorization": "Bearer ..." },
        "body": { "user_id": 42, "msg": "stand-up in 5" }
      },
      "idempotency_key": "remind-user-42-standup-monday",
      "max_retries": 5
    }
    → 201 Created { "job_id": "job_01HXY...", "status": "scheduled" }

    // Create a recurring cron job
    POST /api/v1/jobs
    {
      "type": "cron",
      "cron_expr": "0 9 * * MON",              // every Monday 9 AM
      "timezone": "America/Los_Angeles",
      "target": { ... },
      "idempotency_key_template": "weekly-standup-{run_at}"
    }
    → 201 Created { "job_id": "job_01HXZ...", "next_run_at": "2027-05-12T16:00:00Z" }

    // Cancel a pending job
    POST /api/v1/jobs/:id/cancel
    → 200 OK { "status": "cancelled" }

    // Query a job's state
    GET /api/v1/jobs/:id
    → 200 OK { "job_id": "...", "status": "scheduled|fired|failed|cancelled",
               "next_run_at": "...", "last_fired_at": "...", "attempts": 0 }

    // Hard delete (rare; usually cancel is enough)
    DELETE /api/v1/jobs/:id
    → 204 No Content
If the response to POST /jobs is lost, the client retries — and a naive system creates a duplicate job. The idempotency_key prevents that: same key → server returns the existing job_id instead of inserting again. The same idea protects against duplicate execution downstream: the target service uses the key to dedup if our scheduler fires the job twice (which can happen in the lease-expiration path, see §9).

We expose cancel instead of delete as the default: a job that's about to fire in 200ms is already in the hot store, possibly already popped by a worker. A delete that races the worker is hard to reason about. Cancel just sets a flag; the worker checks the flag right before invoking the target and skips. Strictly safer.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies its existence. Numbers from §3 drive every decision.
Sketch the simplest system: one cron daemon polling a Postgres table every minute. The table has a run_at column. Every 60 seconds the daemon runs SELECT * FROM jobs WHERE run_at < NOW() AND status = 'scheduled', fires each row, and marks it done.
Four concrete failures emerge the moment real traffic shows up:
One daemon firing webhooks serially tops out around 100 fires/sec on a single core. We need 100,000/sec. Even with a thread pool of 1000, you're now hammering Postgres with 100K concurrent updates per second to mark jobs done — the WAL fsync becomes the bottleneck and the daemon falls behind real-time.
100 million users have a 7am reminder. At 07:00:00 the SELECT returns 100M rows. The daemon tries to process them all at once. Postgres pegs at 100% CPU returning the result set, the daemon's heap explodes, downstream HTTP targets get slammed by 100M synchronous calls in seconds. Everything queues, everything backs up, jobs that were supposed to fire at 07:00 actually fire at 07:14.
The daemon process dies (OOM, hardware failure, deploy gone wrong). Until someone notices and restarts it, zero jobs fire. Your billing reminders, your "expire trial" jobs, your nightly batch — all silent. Discovering this is a 2 AM page.
SELECT WHERE run_at < NOW() ORDER BY run_at LIMIT 10000 over 1 billion rows requires an index on run_at, but with billions of rows the index itself is gigabytes. Index scans take seconds. Each poll cycle eats the full poll interval just doing the SELECT. We can't poll faster than the SELECT takes — so we can't drive timing precision below ~minutes.
Every problem in Pass 1 maps to one of four insights. Internalize them and the production shape draws itself.
Instead of one giant jobs table sorted by run_at, partition jobs by their run minute. Finding "what's due in the next minute" reads one bucket — O(1) regardless of total job count. Think of pigeonhole mailboxes labeled by date: the postman doesn't search a pile, they walk to today's slot. Billions of pending jobs become irrelevant; only the current minute's bucket matters for the scheduler loop.
One process can't fire 100K/sec. Hash each job_id into one of N partitions; each scheduler worker owns a subset of partitions and fires only its slice. With N=100 workers, each handles 1K jobs/sec — easy. Adding capacity = adding workers. The shard count and worker count scale independently, which is the whole point of horizontal scaling.
When a worker pops a job from its slice, it sets a lease in Redis: "I own job_X for the next 5 minutes." If the worker crashes mid-execution, the lease expires and another worker can pick up the same job. Combined with an idempotency key on the target, this gives effectively-exactly-once semantics: jobs never get lost (the lease ensures retry), and double-fires are absorbed by the target's idempotency check.
500 GB of pending jobs cannot fit in RAM, but the scheduler only needs the next hour's worth at any moment — that's millions, not billions. Keep everything in a cheap durable store (DynamoDB / Cassandra), and stage just the next 1–2 hours into a Redis sorted set. Schedulers query Redis at 100ms intervals (sub-ms response); the cold DB is only touched by background loaders. Result: precise timing on a small hot working set.
Together, these four insights kill all four Pass-1 failures: time-bucketing kills the slow SELECT, sharding kills the throughput ceiling, leased execution kills the single-process-failure problem, and two-tier storage kills the index-size problem.
Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Anything that wants something to happen later — a calendar app saying "remind me at 7am", a billing service saying "expire this trial in 14 days", a workflow engine saying "retry this step in 5 minutes". Clients call our REST API; from their view, the entire scheduler is two endpoints and a webhook that fires on time.
Solves: nothing on its own — but every design decision flows backward from "what does the client expect?" Specifically: precise timing, durability, and the ability to cancel.
Stateless service that handles POST /jobs, GET /jobs/:id, POST /jobs/:id/cancel. Validates input (cron expression syntax, run_at in the future, target URL well-formed), computes the canonical next_run_at, and persists to the Job Store. Returns 201 with the job_id. Total budget: under 50ms.
Solves: a clean stateless write tier that scales horizontally behind a load balancer. Without an isolated API tier, the scheduler workers would have to handle creation traffic too — coupling write spikes to firing capacity, exactly the wrong dependency.
The durable source of truth. Holds every pending and recently-fired job. Sharded by hash(job_id) across many nodes, replicated 3× per shard. Schema includes a secondary index on (partition_id, run_at_minute) so the Bucket Loader can ask "give me all jobs in partition 5 due in the next hour" with a fast range query.
Solves: durability and scale. 1 billion jobs cannot live in RAM and cannot live on one disk. NoSQL gives us automatic sharding and 3× replication. Without it, accepting a job and then losing it on a node failure would violate the system's core promise — once accepted, it must fire.
An in-memory sorted set per partition, with members = job_ids and scores = unix timestamp of run_at. Holds only jobs due in the next 1–2 hours. Scheduler workers query ZRANGEBYSCORE bucket_p5 -inf NOW in sub-millisecond time to find what's due. Replicated for failover; sharded by partition so no single Redis node holds the whole working set.
Solves: precise timing. A SELECT against DynamoDB takes 5–20ms — fine for occasional queries, fatal at 10 polls/second per worker. The Redis ZSET serves the same query in 0.5ms, letting workers poll every 100ms without overloading anything. Think of it as the kitchen's "next 30 orders" board, while the Job Store is the back-office customer database.
A background service that runs every 10 minutes. For each partition, it queries the Job Store for jobs whose run_at falls in the next 1 hour and copies them into the corresponding Hot Bucket ZSET. Idempotent — re-running a load just refreshes existing entries. Acts as the bridge between the cold and hot tiers.
Solves: staging the right working set into RAM without overloading the cold DB. Without the loader, scheduler workers would have to query DynamoDB directly on every tick — at 100K queries/sec the read cost alone would dominate. With the loader, the cold DB is read in calm batches every 10 min; workers only touch the cheap, fast hot tier.
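A minimal sketch of one loader sweep for one partition, assuming a redis-py client and a hypothetical fetch_due_jobs() that wraps the Job Store's (partition_id, run_at_minute) range query; the real loader would run this every ~10 minutes per partition:

```python
import time
import redis

r = redis.Redis()        # hot tier (a cluster client in production)
LOAD_HORIZON_S = 3600    # stage the next hour of jobs

def load_partition(partition_id: int, fetch_due_jobs) -> None:
    """One Bucket Loader sweep for a single partition.

    fetch_due_jobs(partition_id, start_minute, end_minute) stands in for the
    Job Store range query over the (partition_id, run_at_minute) index.
    """
    now = int(time.time())
    start_minute = now // 60
    end_minute = (now + LOAD_HORIZON_S) // 60

    bucket = f"bucket_p{partition_id}"
    for job in fetch_due_jobs(partition_id, start_minute, end_minute):
        # Idempotent: re-adding the same member just refreshes its score.
        r.zadd(bucket, {job["job_id"]: job["run_at_unix"]})
```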
The clock-watchers. Each worker owns a slice of partitions. Every 100ms it does ZRANGEBYSCORE on its hot buckets to find jobs whose run_at <= now. For each due job it: sets a lease in Redis (TTL 5 min), publishes the job to the Execution Queue, and ZREMs it from the hot bucket. The worker does not invoke the target itself — that's the executor's job. Keeping fire-detection separate from fire-execution prevents slow targets from blocking the clock loop.
Solves: the throughput ceiling and the precision problem. Without sharded workers, one process tries to fire 100K/sec and falls behind. Without the fire-detection / fire-execution split, one slow webhook would block the worker from noticing the next 1000 due jobs.
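A sketch of the 100ms tick for one partition, assuming redis-py for the hot bucket and lease, and a hypothetical publish_fire_event() standing in for the Kafka producer; a real worker would also page through larger batches and check the cancelled flag before publishing:

```python
import time
import redis

r = redis.Redis()
LEASE_TTL_S = 300  # 5-minute lease

def tick(partition_id: int, publish_fire_event) -> None:
    """One scheduler-worker tick over a single owned partition."""
    bucket = f"bucket_p{partition_id}"
    now = time.time()

    # Everything whose run_at score is <= now (bounded to keep reads small).
    due = r.zrangebyscore(bucket, "-inf", now, start=0, num=1000)
    for raw_id in due:
        job_id = raw_id.decode()
        # Take a lease: only the first worker to SET NX wins this fire.
        if not r.set(f"lease:{job_id}", "scheduler-worker", nx=True, ex=LEASE_TTL_S):
            continue                     # someone else already owns it
        publish_fire_event(job_id)       # hand off to the Execution Queue
        r.zrem(bucket, job_id)           # remove from the hot bucket

# Main loop (per worker, over the partitions it owns):
# while True:
#     for p in owned_partitions:
#         tick(p, publish_fire_event)
#     time.sleep(0.1)
```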
A consensus service that maintains the partition→worker assignment. When a worker joins or dies, the leader rebalances partitions among the survivors within seconds. Workers heartbeat to ZooKeeper; missed heartbeats trigger reassignment. Without this, two workers could both think they own partition 5 and double-fire every job in it.
Solves: ownership clarity in a dynamic cluster. Workers come and go (deploys, crashes, scale-outs) — partitions must always have exactly one owner. The librarian model: the head librarian decides which clerk is responsible for which shelf; if a clerk goes home sick, the head librarian reassigns their shelves to others before customers notice.
A durable buffer between scheduler workers (which detect "this job is due now") and executor workers (which actually invoke the target). Each "due now" event becomes a message; partitioned by job_id so a single job's events land on the same Kafka partition (preserves ordering for retries). Buffers absorb spikes — at 07:00 sharp, scheduler workers may dump millions of "fire now" events into Kafka in seconds, which executors then drain at their own pace.
Solves: isolating the clock from the network. If executors talked to scheduler workers synchronously, a slow executor would back-pressure into the scheduler and the clock would drift. Kafka decouples them and lets each scale independently.
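A sketch of the producer side, assuming the kafka-python client and a hypothetical topic named job-fire-events; keying by job_id is what pins all of a job's events to one partition:

```python
import json
from kafka import KafkaProducer  # kafka-python; any client works the same way

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_fire_event(job_id: str, scheduled_run_at: int) -> None:
    # Same key → same Kafka partition → retries for this job stay ordered.
    producer.send(
        "job-fire-events",
        key=job_id,
        value={"job_id": job_id, "scheduled_run_at": scheduled_run_at},
    )
```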
A separate worker pool that consumes from the Execution Queue. For each message: check the Idempotency Store to make sure this job hasn't already been fired; invoke the target (HTTP POST or enqueue to a downstream Kafka/SQS topic); on success, write to the Audit Log and release the lease; on failure, hand the job to the Retry Manager. Pool sized for peak concurrent invocations — typically thousands of pods.
Solves: the "fire" half of fire-detection-and-fire-execution. Separating it from scheduler workers means we can scale executor capacity (thousands of pods making slow HTTP calls) independently from scheduler capacity (a few hundred fast pollers).
When an executor fails to invoke a target (HTTP 5xx, timeout, downstream queue full), it hands the job here. Retry Manager computes the next retry time using exponential backoff with jitter — now + base × 2^attempt + random_jitter — and re-inserts the job into the Job Store with the new run_at. After max_retries, the job goes to the dead-letter queue and an alert fires.
Solves: transient failure resilience. Networks are flaky; downstream services have hiccups. Without retries, a single transient blip means the user's reminder is gone forever. With backoff + jitter, we ride out short outages without melting the recovering target with a stampede.
A short-TTL key-value store keyed by execution_id = (job_id, scheduled_run_at). Before invoking a target, the executor does SETNX execution_id "in_progress". If the key already exists with status "done", skip — this is a duplicate fire from a lease-expiration retry. The TTL is generous (24 hours) so even slow retries are deduped.
Solves: the "duplicate fire" half of effectively-exactly-once. The lease mechanism ensures jobs always fire at least once, even if a worker crashes; the idempotency store ensures they fire at most once observable to the target. Together they approximate exactly-once.
Every fire, retry, success, failure, and cancellation appends a row to a durable log (DynamoDB time-series table or S3 + Parquet). Used for the GET /jobs/:id history view, post-mortem analysis ("why did Sarah's reminder fire 4 minutes late?"), and compliance audits ("prove this billing event fired").
Solves: observability and compliance. Without an audit trail, debugging a stuck job is a nightmare and there's no defense in a regulator's "show me when this billing notification fired" demand.
Two real scenarios, mapped to the numbered components above:
{run_at: "2027-05-10T14:02:00Z" (+3 days), target: webhook-url} to the Job API Server ②.job_id = job_01HXY..., hashes to partition_5, INSERTs into Job Store ③ with run_at = T+3d. Returns 201. ~40ms.bucket_p5 with score = unix(run_at).ZRANGEBYSCORE bucket_p5 -inf NOW, sees Sarah's job is due, sets a 5-minute lease in Redis, publishes a "fire" message to Execution Queue ⑧, ZREMs the entry. ~2ms total.execution_id as done in idempotency store, writes "fired ok" to Audit Log ⑫, releases the lease. Total wall-clock from 14:02:00 to fire: ~150ms."in_progress", starts the HTTP call to Sarah's webhook…"in_progress"."in_progress" with TTL still alive, but the original holder is gone. Per policy, treat as safe to retry — overwrite the key and invoke. Sarah's webhook is idempotent (it checks her message-id), so the second call is a no-op if the first actually went through, or fires fresh if it didn't.The single most important storage idea in the entire design. Instead of one giant table with a run_at index, partition jobs by their fire-minute. Querying "what's due in the next minute" reads one bucket, regardless of how many billions of total jobs exist.
Every job's primary partition key in the Job Store is (partition_id, run_at_minute). partition_id = hash(job_id) % N spreads load across shards; run_at_minute = the unix-minute timestamp of the scheduled fire time, truncated. A query like "all jobs in partition 5 due between 14:00 and 15:00" scans 60 contiguous bucket entries — extremely fast, regardless of whether the table holds 1 million or 1 billion total jobs.
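A sketch of how that composite key might be computed; NUM_PARTITIONS and the hash choice are illustrative, not prescribed by the design:

```python
import hashlib
from datetime import datetime, timezone

NUM_PARTITIONS = 1024  # logical partitions (M from the scaling section)

def bucket_key(job_id: str, run_at: datetime) -> tuple[int, int]:
    """(partition_id, run_at_minute) — the composite key a job is stored under."""
    digest = hashlib.sha256(job_id.encode()).digest()
    partition_id = int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
    run_at_minute = int(run_at.astimezone(timezone.utc).timestamp()) // 60
    return partition_id, run_at_minute

# Jobs due in the same minute of the same partition share one bucket.
print(bucket_key("job_01HXY", datetime(2027, 5, 10, 14, 2, tzinfo=timezone.utc)))
```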
Per-minute buckets are the sweet spot for most workloads. Per-second buckets create 60× more buckets — an hour-window scan touches 3,600 buckets instead of 60 — too many small reads. Per-hour buckets dump 6M+ jobs into one read at peak — too coarse, defeats the point. Some systems use per-second buckets for the "next 5 minutes" hot window and per-minute buckets for further out — a hybrid.
Imagine a wall of mailboxes labeled by date. The postman doesn't search a giant pile of letters every morning — they walk to today's box, pull out exactly the letters in there, deliver them. Time-bucketing is the same: the scheduler doesn't search "all jobs ever", it reads "today's box" (or this minute's). The size of the rest of the wall is irrelevant.
SELECT * FROM jobs WHERE run_at < NOW() over 1B rows takes ~2 seconds even with an index — too slow to poll every 100ms. With per-minute bucketing, fetching the current minute's bucket is ~5ms — fast enough to poll every 100ms with capacity to spare.

500 GB of pending jobs cannot fit in RAM, but the scheduler doesn't need them in RAM — it only needs the next hour's worth. The two-tier pattern keeps everything cheap and durable, while making the part that matters at any given moment fast.
The complete pending-jobs table. Sharded by hash(job_id), replicated 3×. Holds all 1B jobs no matter how far in the future. Optimized for durability and write throughput; query latency 5–20ms is fine because we only read it during background loading or rare individual-job lookups.
An in-memory sorted set per partition, scored by unix-second of run_at. Holds only jobs due in the next 1–2 hours. ZRANGEBYSCORE in 0.5ms gives us "what's due now". Sharded across Redis nodes so no single node holds all 100K+ jobs/sec worth of working set.
Every 10 minutes per partition, the Bucket Loader scans the cold DB for jobs due in the next 1 hour and inserts them into the corresponding hot ZSET. Idempotent — re-runs are safe. The loader uses the partition's secondary index (partition_id, run_at_minute) for a fast range query; it never does a full table scan.
One worker can't fire 100K jobs/sec. We fan out to N workers, each owning a slice. The trick is making sure each partition has exactly one owner at any moment — two owners means double-firing, zero owners means missed firings.
Choose M partitions where M > expected worker count (say M=1024 logical partitions, with N=100 actual workers). Each worker owns ~10 partitions. Adding more workers means redistributing partitions — never reshuffling jobs. ZooKeeper or etcd holds the canonical partition→worker map.
Workers heartbeat every 5s. Two missed heartbeats (10s) → ZooKeeper marks the worker dead and reassigns its partitions to surviving workers. The dead worker's leases expire naturally over the next 5 minutes; surviving workers re-fire any in-flight jobs whose leases lapse. Total recovery time: ~10 seconds of "no firing" for that partition's slice, then back to normal.
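The rebalancing algorithm itself isn't specified above; one common choice that moves only the dead worker's partitions is rendezvous (highest-random-weight) hashing, sketched here purely as an illustration of the assignment math the coordinator would run:

```python
import hashlib

def owner(partition_id: int, live_workers: list[str]) -> str:
    """Rendezvous hashing: each partition picks the live worker with the
    highest hash score. When a worker dies, only its partitions are reassigned;
    everyone else's ownership is unchanged."""
    def score(worker: str) -> int:
        h = hashlib.sha256(f"{worker}:{partition_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(live_workers, key=score)

workers = [f"worker-{i}" for i in range(100)]
print(owner(5, workers))                                   # current owner of p5
print(owner(5, [w for w in workers if w != owner(5, workers)]))  # owner after it dies
```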
Each partition has its own Redis key (bucket_p0, bucket_p1, …, bucket_p1023). A worker only touches the keys for partitions it owns — no cross-talk, no cross-shard transactions. Redis cluster mode automatically distributes these keys across Redis nodes by hash slot, so the working set scales horizontally without manual sharding.
Truly exactly-once execution across a network is provably impossible (the Two Generals problem). What we can build is "at-least-once + idempotent target = effectively exactly once observable to the user". Leases are the mechanism that gets us the at-least-once half safely.
A lease is like checking a book out from the library with a 5-minute due date. While you have it, nobody else can take it. If you bring it back, fine. If you wander off and never return — after 5 minutes the librarian declares it lost and lets someone else have it. The lease in Redis with a TTL is exactly this: a finite-duration claim that auto-releases on timeout.
The lease guarantees at-least-once: a job will fire at least once even if executors crash. It can't guarantee at-most-once because the executor may have completed the HTTP call and crashed before deleting the lease — the next executor will re-fire. That's why the target service must use the idempotency_key (or the execution_id) to dedup. Without an idempotent target, you get duplicate emails, duplicate billings, duplicate everything — and there's no fix at our layer.
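A sketch of the lease primitive, assuming redis-py: SET NX EX to claim, and a compare-and-delete Lua script so a slow original holder can't release a lease that has already expired and been re-acquired by someone else:

```python
import redis

r = redis.Redis()
LEASE_TTL_S = 300  # the 5-minute "library due date"

def acquire_lease(job_id: str, worker_id: str) -> bool:
    # A finite-duration claim that auto-releases if the holder disappears.
    return bool(r.set(f"lease:{job_id}", worker_id, nx=True, ex=LEASE_TTL_S))

# Delete only if we are still the holder (atomic compare-and-delete).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""
release_lease = r.register_script(RELEASE_SCRIPT)
# usage: release_lease(keys=[f"lease:{job_id}"], args=[worker_id])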
Recurring jobs reuse the entire one-shot machinery. The trick: store the cron expression as a template, and convert each occurrence into a one-shot job.
When a client creates a cron job, we store: {cron_expr, timezone, target, template_id}. We compute next_run_at from the cron expression and create one one-shot job pointing at that time. When that one-shot fires successfully, the executor computes the next run_at and creates a fresh one-shot. The cron job is always represented as "exactly one upcoming one-shot".
"Every minute forever" would create infinite rows. Lazy generation keeps the table bounded — at most one pending occurrence per cron template at any time. Trade-off: if a cron job is paused for 3 days, when it resumes you decide policy: fire once for the missed slot, or skip back to "now".
Use a battle-tested cron parser (cron-utils in Java, croniter in Python). It accepts standard cron syntax plus extensions (0 9 * * MON, @daily, 0 0 1 * *). Always store the original expression and the user's timezone — compute next_run by converting "now in user TZ" forward, then converting back to UTC for storage. This handles DST transitions correctly.
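A sketch of the next-run computation, assuming croniter and the standard-library zoneinfo; the key point is advancing the schedule in the user's timezone and storing the result in UTC:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from croniter import croniter

def next_run_utc(cron_expr: str, tz_name: str, now_utc: datetime) -> datetime:
    """Advance the cron in the user's local timezone, then convert back to UTC
    for storage — this is what keeps DST transitions correct."""
    local_now = now_utc.astimezone(ZoneInfo(tz_name))
    local_next = croniter(cron_expr, local_now).get_next(datetime)
    return local_next.astimezone(timezone.utc)

# "Every Monday 9 AM" in Los Angeles, evaluated from a UTC clock:
print(next_run_utc("0 9 * * MON", "America/Los_Angeles",
                   datetime(2027, 5, 10, 14, 2, tzinfo=timezone.utc)))
```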
Targets fail. Networks blip. Downstream services have hiccups. Retry policy must absorb transient failures without melting a recovering target with a stampede.
delay = base × 2^attempt. With base = 30s: attempts at 30s, 1m, 2m, 4m, 8m. Bounded by max_attempts (typically 5–10) before dead-letter. Spreads retries over many minutes so a struggling target gets time to recover.
Add ±25% random noise: delay × (0.75 + random() × 0.5). Without jitter, all jobs that failed at 14:02 retry simultaneously at 14:02:30 — same thundering-herd problem we just solved. Jitter spreads the retries across a window.
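A sketch of the combined policy, assuming base = 30s and ±25% jitter as described; the function name and the "None means dead-letter" convention are illustrative:

```python
import random

BASE_S = 30        # first retry after ~30 seconds
MAX_ATTEMPTS = 5   # then the job goes to the DLQ

def retry_delay(attempt: int) -> float | None:
    """Exponential backoff with ±25% jitter; None signals dead-letter."""
    if attempt >= MAX_ATTEMPTS:
        return None
    delay = BASE_S * (2 ** attempt)                 # 30s, 1m, 2m, 4m, 8m
    return delay * (0.75 + random.random() * 0.5)   # spread the herd
```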
After max_retries, the job moves to a DLQ topic with the failure history. An on-call alert fires. Operators inspect, fix the target, and either replay the DLQ or accept the loss. Critical for "this billing event simply must fire" semantics.
The Retry Manager doesn't keep retries in memory. On failure, it computes the next run_at and writes a fresh row to the Job Store via the same POST /jobs path internally. This way retries flow through the same hot/cold/lease machinery as fresh jobs — no special code path. The retry count and original_job_id ride along as metadata.
"I changed my mind — don't fire that reminder I scheduled for 7am." Easy when the fire is days away; tricky when it's 2 seconds away and the job is already in the hot bucket, possibly already popped by a worker.
POST /jobs/:id/cancel sets status = "cancelled" in the Job Store. We do NOT race-delete from the hot ZSET — that's flaky. Instead, when the Scheduler Worker pops the job from its bucket, it reads the job row and checks the cancelled flag. If cancelled, it skips firing and writes "cancelled" to the audit log. Worst case, the job is "popped and ignored" instead of "popped and fired".
Cancel arrives at 14:01:58 for a job firing at 14:02:00. At 14:02:00 the worker pops it; reads the cancelled flag; skips. Total latency from cancel-API to worker-skip: typically <200ms. If the cancel arrives during executor invocation (after worker pops, before HTTP call returns), it's too late — the job already fired. The contract: cancel is best-effort up to the moment of fire; once fired, you must compensate downstream.
"Fire within a few seconds" works for most use cases — emails, push notifications, batch jobs. Some workloads demand tighter precision: financial markets opening at 09:30:00.000, gameplay events tied to a global clock, ad auction windows. The architecture supports tiered precision.
Default. Scheduler workers poll the hot ZSET every 1 second. Fire time = scheduled_run_at + 0–1s polling jitter + 50–200ms execution. End-to-end variance: under 2 seconds. Costs almost nothing.
Same architecture, faster polling: scheduler workers poll every 100ms. End-to-end variance: under 250ms. Costs 10× more Redis CPU on the polling, but otherwise unchanged. Used for paid tier customers.
Different machinery: dedicated low-jitter executor pods using Linux timerfd (kernel-level timer interrupt) for the final hop. The scheduler hands off the job 1 second before fire time; the timerfd executor then waits for the exact instant and fires. Jitter: under 1ms. Used for financial / market-open workloads. Costs significantly more.
Fire-and-forget systems are easy. Fire-on-time-or-explain-why is hard. Every component must survive its own failure modes; together they must survive the failures we can't predict.
DynamoDB / Cassandra replicate every write to 3 nodes across 3 AZs. Quorum reads/writes (R=2, W=2) survive any single-node or single-AZ outage. The cold tier is the source of truth; even if every Redis node and every worker dies, the cold tier still holds every pending job.
Redis runs in cluster mode with master + 1 replica per shard. On master failure, replica is promoted in seconds. If the entire hot tier is lost, the Bucket Loader rebuilds it from cold storage on its next sweep — bounded data loss, no jobs missed (cold tier is intact).
Both Scheduler Workers and Executor Workers carry no state across restarts. State lives in Redis (lease, idempotency) and Job Store (canonical job rows). Killing and restarting a worker pod is a non-event — within seconds, ZooKeeper reassigns its partitions and another pod picks up the slack.
3 or 5 ZooKeeper nodes form a quorum. As long as a majority survive, partition assignment continues to work. If the entire ZK cluster dies, no new partition assignments happen; existing workers keep running on their last-known assignments. Recovery: bring ZK back, workers re-register, normal flow resumes.
If AZ-A goes dark — every worker, every Redis master, every API server in AZ-A vanishes — what happens? (1) ALB stops routing to AZ-A. (2) ZK promotes ZK followers from B/C, partitions reassigned to surviving workers. (3) Redis replicas in B/C promote to masters. (4) DynamoDB serves reads from B/C replicas. (5) Within 30–60 seconds the system is fully operational from B+C. Pending jobs scheduled during the outage may fire 30–60s late but none are lost.
For latency, compliance, and disaster tolerance, run independent scheduler stacks per region. Most jobs stick to their home region; a small number of "global cron" jobs need cross-region coordination.
Jobs are pinned to a region at creation time, usually based on the user's home region or a region tag in the API call. Each region has its own Job Store, hot tier, scheduler workers, and executors — completely independent. No cross-region traffic on the fire path → low latency, no cross-region failure coupling.
"Run this job once globally at 09:00 UTC" — only one region should fire it. Designate a primary region for the cron template (us-east-1). The primary region creates the next one-shot occurrence; if that region is down, a standby region detects it via cross-region heartbeat and takes over. Costs occasional cross-region replication of the template registry (small, low-traffic), keeps the fire path in one region.
Common follow-up questions, answered with the machinery above:

- Worker crashes after popping a job: its lease expires after 5 minutes; the hot-bucket entry may or may not have been ZREMed (depends on where exactly the crash happened). Another worker fires it. The target dedups via execution_id = (job_id, scheduled_run_at); if the original call actually went through, the second is a no-op. Net result: at-least-once delivery + idempotent target = the user observes exactly one fire.
- Scaling to 1B jobs: (1) Cold tier sharded by hash(job_id) across DynamoDB partitions — 1B jobs at 500B each = 500GB, trivial for DynamoDB to spread across nodes. (2) Hot tier sharded across Redis cluster nodes by partition_id — at most 100K/s × 3600s = 360M jobs in the 1-hour window, spread across ~50 Redis shards. (3) Workers scaled horizontally — N workers each owning M/N of the 1024 logical partitions; add more workers to grow throughput. Nothing in the architecture is a vertical bottleneck; every layer scales by adding capacity.
- Cancelling a job that's about to fire: cancel sets status = "cancelled" in the Job Store. We do not race-delete from Redis — that's flaky and creates inconsistency. Instead, when the scheduler worker pops the job for firing, it reads the canonical row from the Job Store (one extra ~5ms read) and checks the flag; if cancelled, it skips and writes "cancelled" to the audit log. The window of "too late to cancel" is tiny — only between worker-pops-and-publishes-to-Kafka and executor-actually-fires, typically <200ms. Cancel before that window and you're guaranteed safe; inside it, best-effort.
- Exactly-once semantics: leases give at-least-once; the execution_id in the idempotency store + an idempotent target dedup on the consumer side. The literature calls this "effectively exactly-once" — every production scheduler (Quartz cluster, Airflow, AWS EventBridge, GCP Scheduler) takes this trade.
- Everyone scheduling at 09:00: because jobs are partitioned by hash(job_id) % N, the 09:00 firings are spread evenly across all N partitions — each shard sees only 1/N of the load, no matter how time-skewed the scheduling distribution is. The time bucket within each partition is still per-minute, so reads are O(1), but the total fire load at 09:00 is amortized across the whole cluster.