From "one cron daemon polling Postgres" to a sharded, time-bucketed, lease-protected scheduler that fires 100K jobs/second on time, every time
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Imagine 100 million users have all set the same calendar reminder — "ping me at 7:00 AM every weekday with my morning briefing." At exactly 07:00:00 local time, on a Monday, in 24 different timezones, those reminders must fire. Some are emails, some are push notifications, some kick off a workflow that fans out to ten downstream services. Miss the moment by even a minute and the user opens the app to dead silence — "what was that thing I was supposed to do today?" That's the job scheduler problem: a system that triggers code at scheduled times (cron-style: every Monday at 9, every hour on the hour) and at delayed offsets (one-shot: "send Sarah a reminder email in 3 days") at massive scale, reliably, on time.
You've used job schedulers without thinking about them. Quartz fires off Java tasks. Linux cron fires shell scripts. Airflow runs data pipelines. Celery beat schedules background workers. They're all the same shape underneath — a clock, a queue, and a worker — but they break in interesting ways the moment you ask them to do it for a billion jobs across hundreds of machines.
Two questions shape the whole design: (1) When 100 million jobs are all due at exactly 07:00:00, how do we fire them within seconds of their target time without melting the database? (2) When a worker crashes mid-execution at 07:00:30 with 50,000 jobs in-flight, how do we make sure those jobs still fire — exactly once — and not zero times or twice?

Before drawing a single box, pin down what the system must do. Naming the asks early prevents the classic interview trap of designing for a workload nobody asked for.
Numbers force decisions. Three knobs matter: how many jobs are pending in the system right now, how fast they fire at peak, and how big each job's metadata is.
Assume 1 billion active scheduled jobs at any instant (a mix of one-shot timers and the next-fire entries of cron jobs). Peak firing rate during global rush hours: 100,000 jobs/second.
- Pending jobs: ~1B active across all users / cron entries
- Peak fire rate: ~100K jobs/sec at 07:00 sharp across regions
- Job record size: ~500 bytes (id + run_at + target + payload)
- Hot storage: ~500 GB (1B × 500 B for current jobs)
The hot store (DynamoDB / Cassandra) holds all 1B currently-pending jobs at 500 bytes each — about 500 GB. The history archive (S3 / Parquet) holds every job that has ever fired, for audit and analytics; at the sustained average fire rate (a few thousand fires/sec — the 100K/s figure is a short-lived peak), it grows roughly 3 TB / month. Cold storage is cheap; we keep it indefinitely.
| Metric | Value | Why it matters |
|---|---|---|
| Pending jobs | 1B | Forces sharding of the cold DB — too big for one node |
| Peak fire rate | 100K/s | Forces N partitions, each handling a slice; no single worker can keep up |
| Storage hot | 500 GB | DynamoDB scales linearly; partition by job_id hash |
| Storage history | 3 TB/mo | S3 / cold tier; fire-and-forget audit log |
| Hot ZSET window | ~1 hour | Holds at most 100K/s × 3600 = 360M jobs in the worst case; sharded to fit |
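A quick back-of-envelope in Python, using the same assumptions as the table above (nothing here is a new measurement):

```python
# Sanity-check the capacity estimates (illustrative only).
pending_jobs   = 1_000_000_000   # ~1B active jobs
job_size_bytes = 500             # id + run_at + target + payload
peak_fire_rate = 100_000         # jobs/sec during the 07:00 spike

hot_store_gb = pending_jobs * job_size_bytes / 1e9
print(f"hot store: ~{hot_store_gb:.0f} GB")              # ~500 GB

# Worst case for the hot window: a full hour of peak firing staged in Redis.
hot_window_jobs = peak_fire_rate * 3600
print(f"hot ZSET window: ~{hot_window_jobs/1e6:.0f}M jobs")  # ~360M jobs
```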
Four endpoints carry everything. Notice that POST /jobs is overloaded — same endpoint, two flavors of payload (one-shot vs cron). The schema discriminates.
    // Create a one-shot delayed job
    POST /api/v1/jobs
    {
      "type": "one_shot",
      "run_at": "2027-05-10T14:02:00Z",        // exact UTC instant
      "target": {
        "kind": "http_webhook",
        "url": "https://api.example.com/reminder",
        "method": "POST",
        "headers": { "Authorization": "Bearer ..." },
        "body": { "user_id": 42, "msg": "stand-up in 5" }
      },
      "idempotency_key": "remind-user-42-standup-monday",
      "max_retries": 5
    }
    → 201 Created { "job_id": "job_01HXY...", "status": "scheduled" }

    // Create a recurring cron job
    POST /api/v1/jobs
    {
      "type": "cron",
      "cron_expr": "0 9 * * MON",              // every Monday 9 AM
      "timezone": "America/Los_Angeles",
      "target": { ... },
      "idempotency_key_template": "weekly-standup-{run_at}"
    }
    → 201 Created { "job_id": "job_01HXZ...", "next_run_at": "2027-05-12T16:00:00Z" }

    // Cancel a pending job
    POST /api/v1/jobs/:id/cancel
    → 200 OK { "status": "cancelled" }

    // Query a job's state
    GET /api/v1/jobs/:id
    → 200 OK { "job_id": "...", "status": "scheduled|fired|failed|cancelled",
               "next_run_at": "...", "last_fired_at": "...", "attempts": 0 }

    // Hard delete (rare; usually cancel is enough)
    DELETE /api/v1/jobs/:id
    → 204 No Content
If the response to POST /jobs is lost, the client retries — and a naive system creates a duplicate job. The idempotency_key prevents that: same key → server returns the existing job_id instead of inserting again. The same idea protects against duplicate execution downstream: the target service uses the key to dedup if our scheduler fires the job twice (which can happen in the lease-expiration path, see §9).

We expose cancel instead of delete as the default: a job that's about to fire in 200ms is already in the hot store, possibly already popped by a worker. A delete that races the worker is hard to reason about. Cancel just sets a flag; the worker checks the flag right before invoking the target and skips. Strictly safer.

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies its existence. Numbers from §3 drive every decision.
Sketch the simplest system: one cron daemon polling a Postgres table every minute. The table has a run_at column. Every 60 seconds the daemon runs SELECT * FROM jobs WHERE run_at < NOW() AND status = 'scheduled', fires each row, and marks it done.
Four concrete failures emerge the moment real traffic shows up:
One daemon firing webhooks serially tops out around 100 fires/sec on a single core. We need 100,000/sec. Even with a thread pool of 1000, you're now hammering Postgres with 100K concurrent updates per second to mark jobs done — the WAL fsync becomes the bottleneck and the daemon falls behind real-time.
100 million users have a 7am reminder. At 07:00:00 the SELECT returns 100M rows. The daemon tries to process them all at once. Postgres pegs at 100% CPU returning the result set, the daemon's heap explodes, downstream HTTP targets get slammed by 100M synchronous calls in seconds. Everything queues, everything backs up, jobs that were supposed to fire at 07:00 actually fire at 07:14.
The daemon process dies (OOM, hardware failure, deploy gone wrong). Until someone notices and restarts it, zero jobs fire. Your billing reminders, your "expire trial" jobs, your nightly batch — all silent. Discovering this is a 2 AM page.
SELECT WHERE run_at < NOW() ORDER BY run_at LIMIT 10000 over 1 billion rows requires an index on run_at, but with billions of rows the index itself is gigabytes. Index scans take seconds. Each poll cycle eats the full poll interval just doing the SELECT. We can't poll faster than the SELECT takes — so we can't drive timing precision below ~minutes.
Every problem in Pass 1 maps to one of four insights. Internalize them and the production shape draws itself.
Instead of one giant jobs table sorted by run_at, partition jobs by their run minute. Finding "what's due in the next minute" reads one bucket — O(1) regardless of total job count. Think of pigeonhole mailboxes labeled by date: the postman doesn't search a pile, they walk to today's slot. Billions of pending jobs become irrelevant; only the current minute's bucket matters for the scheduler loop.
One process can't fire 100K/sec. Hash each job_id into one of N partitions; each scheduler worker owns a subset of partitions and fires only its slice. With N=100 workers, each handles 1K jobs/sec — easy. Adding capacity = adding workers. The shard count and worker count scale independently, which is the whole point of horizontal scaling.
When a worker pops a job from its slice, it sets a lease in Redis: "I own job_X for the next 5 minutes." If the worker crashes mid-execution, the lease expires and another worker can pick up the same job. Combined with an idempotency key on the target, this gives effectively-exactly-once semantics: jobs never get lost (the lease ensures retry), and double-fires are absorbed by the target's idempotency check.
500 GB of pending jobs cannot fit in RAM, but the scheduler only needs the next hour's worth at any moment — that's millions, not billions. Keep everything in a cheap durable store (DynamoDB / Cassandra), and stage just the next 1–2 hours into a Redis sorted set. Schedulers query Redis at 100ms intervals (sub-ms response); the cold DB is only touched by background loaders. Result: precise timing on a small hot working set.
Together, these four insights kill all four Pass-1 failures: time-bucketing kills the slow SELECT, sharding kills the throughput ceiling, leased execution kills the single-process-failure problem, and two-tier storage kills the index-size problem.
Now the full picture. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Anything that wants something to happen later — a calendar app saying "remind me at 7am", a billing service saying "expire this trial in 14 days", a workflow engine saying "retry this step in 5 minutes". Clients call our REST API; from their view, the entire scheduler is two endpoints and a webhook that fires on time.
Solves: nothing on its own — but every design decision flows backward from "what does the client expect?" Specifically: precise timing, durability, and the ability to cancel.
Stateless service that handles POST /jobs, GET /jobs/:id, POST /jobs/:id/cancel. Validates input (cron expression syntax, run_at in the future, target URL well-formed), computes the canonical next_run_at, and persists to the Job Store. Returns 201 with the job_id. Total budget: under 50ms.
Solves: a clean stateless write tier that scales horizontally behind a load balancer. Without an isolated API tier, the scheduler workers would have to handle creation traffic too — coupling write spikes to firing capacity, exactly the wrong dependency.
The durable source of truth. Holds every pending and recently-fired job. Sharded by hash(job_id) across many nodes, replicated 3× per shard. Schema includes a secondary index on (partition_id, run_at_minute) so the Bucket Loader can ask "give me all jobs in partition 5 due in the next hour" with a fast range query.
Solves: durability and scale. 1 billion jobs cannot live in RAM and cannot live on one disk. NoSQL gives us automatic sharding and 3× replication. Without it, accepting a job and then losing it on a node failure would violate the system's core promise — once accepted, it must fire.
An in-memory sorted set per partition, with members = job_ids and scores = unix timestamp of run_at. Holds only jobs due in the next 1–2 hours. Scheduler workers query ZRANGEBYSCORE bucket_p5 -inf NOW in sub-millisecond time to find what's due. Replicated for failover; sharded by partition so no single Redis node holds the whole working set.
Solves: precise timing. A SELECT against DynamoDB takes 5–20ms — fine for occasional queries, fatal at 10 polls/second per worker. The Redis ZSET serves the same query in 0.5ms, letting workers poll every 100ms without overloading anything. Think of it as the kitchen's "next 30 orders" board, while the Job Store is the back-office customer database.
A background service that runs every 10 minutes. For each partition, it queries the Job Store for jobs whose run_at falls in the next 1 hour and copies them into the corresponding Hot Bucket ZSET. Idempotent — re-running a load just refreshes existing entries. Acts as the bridge between the cold and hot tiers.
Solves: staging the right working set into RAM without overloading the cold DB. Without the loader, scheduler workers would have to query DynamoDB directly on every tick — at 100K queries/sec the read cost alone would dominate. With the loader, the cold DB is read in calm batches every 10 min; workers only touch the cheap, fast hot tier.
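A minimal sketch of one loader sweep for one partition, assuming a redis-py client and a hypothetical fetch_due_jobs() that wraps the Job Store's (partition_id, run_at_minute) range query; the real loader would run this every ~10 minutes per partition:

```python
import time
import redis

r = redis.Redis()        # hot tier (a cluster client in production)
LOAD_HORIZON_S = 3600    # stage the next hour of jobs

def load_partition(partition_id: int, fetch_due_jobs) -> None:
    """One Bucket Loader sweep for a single partition.

    fetch_due_jobs(partition_id, start_minute, end_minute) stands in for the
    Job Store range query over the (partition_id, run_at_minute) index.
    """
    now = int(time.time())
    start_minute = now // 60
    end_minute = (now + LOAD_HORIZON_S) // 60

    bucket = f"bucket_p{partition_id}"
    for job in fetch_due_jobs(partition_id, start_minute, end_minute):
        # Idempotent: re-adding the same member just refreshes its score.
        r.zadd(bucket, {job["job_id"]: job["run_at_unix"]})
```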
The clock-watchers. Each worker owns a slice of partitions. Every 100ms it does ZRANGEBYSCORE on its hot buckets to find jobs whose run_at <= now. For each due job it: sets a lease in Redis (TTL 5 min), publishes the job to the Execution Queue, and ZREMs it from the hot bucket. The worker does not invoke the target itself — that's the executor's job. Keeping fire-detection separate from fire-execution prevents slow targets from blocking the clock loop.
Solves: the throughput ceiling and the precision problem. Without sharded workers, one process tries to fire 100K/sec and falls behind. Without the fire-detection / fire-execution split, one slow webhook would block the worker from noticing the next 1000 due jobs.
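A sketch of the 100ms tick for one partition, assuming redis-py for the hot bucket and lease, and a hypothetical publish_fire_event() standing in for the Kafka producer; a real worker would also page through larger batches and check the cancelled flag before publishing:

```python
import time
import redis

r = redis.Redis()
LEASE_TTL_S = 300  # 5-minute lease

def tick(partition_id: int, publish_fire_event) -> None:
    """One scheduler-worker tick over a single owned partition."""
    bucket = f"bucket_p{partition_id}"
    now = time.time()

    # Everything whose run_at score is <= now (bounded to keep reads small).
    due = r.zrangebyscore(bucket, "-inf", now, start=0, num=1000)
    for raw_id in due:
        job_id = raw_id.decode()
        # Take a lease: only the first worker to SET NX wins this fire.
        if not r.set(f"lease:{job_id}", "scheduler-worker", nx=True, ex=LEASE_TTL_S):
            continue                     # someone else already owns it
        publish_fire_event(job_id)       # hand off to the Execution Queue
        r.zrem(bucket, job_id)           # remove from the hot bucket

# Main loop (per worker, over the partitions it owns):
# while True:
#     for p in owned_partitions:
#         tick(p, publish_fire_event)
#     time.sleep(0.1)
```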
A consensus service that maintains the partition→worker assignment. When a worker joins or dies, the leader rebalances partitions among the survivors within seconds. Workers heartbeat to ZooKeeper; missed heartbeats trigger reassignment. Without this, two workers could both think they own partition 5 and double-fire every job in it.
Solves: ownership clarity in a dynamic cluster. Workers come and go (deploys, crashes, scale-outs) — partitions must always have exactly one owner. The librarian model: the head librarian decides which clerk is responsible for which shelf; if a clerk goes home sick, the head librarian reassigns their shelves to others before customers notice.
A durable buffer between scheduler workers (which detect "this job is due now") and executor workers (which actually invoke the target). Each "due now" event becomes a message; partitioned by job_id so a single job's events land on the same Kafka partition (preserves ordering for retries). Buffers absorb spikes — at 07:00 sharp, scheduler workers may dump millions of "fire now" events into Kafka in seconds, which executors then drain at their own pace.
Solves: isolating the clock from the network. If executors talked to scheduler workers synchronously, a slow executor would back-pressure into the scheduler and the clock would drift. Kafka decouples them and lets each scale independently.
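A sketch of the producer side, assuming the kafka-python client and a hypothetical topic named job-fire-events; keying by job_id is what pins all of a job's events to one partition:

```python
import json
from kafka import KafkaProducer  # kafka-python; any client works the same way

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_fire_event(job_id: str, scheduled_run_at: int) -> None:
    # Same key → same Kafka partition → retries for this job stay ordered.
    producer.send(
        "job-fire-events",
        key=job_id,
        value={"job_id": job_id, "scheduled_run_at": scheduled_run_at},
    )
```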
A separate worker pool that consumes from the Execution Queue. For each message: check the Idempotency Store to make sure this job hasn't already been fired; invoke the target (HTTP POST or enqueue to a downstream Kafka/SQS topic); on success, write to the Audit Log and release the lease; on failure, hand the job to the Retry Manager. Pool sized for peak concurrent invocations — typically thousands of pods.
Solves: the "fire" half of fire-detection-and-fire-execution. Separating it from scheduler workers means we can scale executor capacity (thousands of pods making slow HTTP calls) independently from scheduler capacity (a few hundred fast pollers).
When an executor fails to invoke a target (HTTP 5xx, timeout, downstream queue full), it hands the job here. Retry Manager computes the next retry time using exponential backoff with jitter — now + base × 2^attempt + random_jitter — and re-inserts the job into the Job Store with the new run_at. After max_retries, the job goes to the dead-letter queue and an alert fires.
Solves: transient failure resilience. Networks are flaky; downstream services have hiccups. Without retries, a single transient blip means the user's reminder is gone forever. With backoff + jitter, we ride out short outages without melting the recovering target with a stampede.
A short-TTL key-value store keyed by execution_id = (job_id, scheduled_run_at). Before invoking a target, the executor does SETNX execution_id "in_progress". If the key already exists with status "done", skip — this is a duplicate fire from a lease-expiration retry. The TTL is generous (24 hours) so even slow retries are deduped.
Solves: the "duplicate fire" half of effectively-exactly-once. The lease mechanism ensures jobs always fire at least once, even if a worker crashes; the idempotency store ensures they fire at most once observable to the target. Together they approximate exactly-once.
Every fire, retry, success, failure, and cancellation appends a row to a durable log (DynamoDB time-series table or S3 + Parquet). Used for the GET /jobs/:id history view, post-mortem analysis ("why did Sarah's reminder fire 4 minutes late?"), and compliance audits ("prove this billing event fired").
Solves: observability and compliance. Without an audit trail, debugging a stuck job is a nightmare and there's no defense in a regulator's "show me when this billing notification fired" demand.
Two real scenarios, mapped to the numbered components above:
{run_at: "2027-05-10T14:02:00Z" (+3 days), target: webhook-url} to the Job API Server ②.job_id = job_01HXY..., hashes to partition_5, INSERTs into Job Store ③ with run_at = T+3d. Returns 201. ~40ms.bucket_p5 with score = unix(run_at).ZRANGEBYSCORE bucket_p5 -inf NOW, sees Sarah's job is due, sets a 5-minute lease in Redis, publishes a "fire" message to Execution Queue ⑧, ZREMs the entry. ~2ms total.execution_id as done in idempotency store, writes "fired ok" to Audit Log ⑫, releases the lease. Total wall-clock from 14:02:00 to fire: ~150ms."in_progress", starts the HTTP call to Sarah's webhook…"in_progress"."in_progress" with TTL still alive, but the original holder is gone. Per policy, treat as safe to retry — overwrite the key and invoke. Sarah's webhook is idempotent (it checks her message-id), so the second call is a no-op if the first actually went through, or fires fresh if it didn't.The single most important storage idea in the entire design. Instead of one giant table with a run_at index, partition jobs by their fire-minute. Querying "what's due in the next minute" reads one bucket, regardless of how many billions of total jobs exist.
Every job's primary partition key in the Job Store is (partition_id, run_at_minute). partition_id = hash(job_id) % N spreads load across shards; run_at_minute = the unix-minute timestamp of the scheduled fire time, truncated. A query like "all jobs in partition 5 due between 14:00 and 15:00" scans 60 contiguous bucket entries — extremely fast, regardless of whether the table holds 1 million or 1 billion total jobs.
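A sketch of how that composite key might be computed; NUM_PARTITIONS and the hash choice are illustrative, not prescribed by the design:

```python
import hashlib
from datetime import datetime, timezone

NUM_PARTITIONS = 1024  # logical partitions (M from the scaling section)

def bucket_key(job_id: str, run_at: datetime) -> tuple[int, int]:
    """(partition_id, run_at_minute) — the composite key a job is stored under."""
    digest = hashlib.sha256(job_id.encode()).digest()
    partition_id = int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
    run_at_minute = int(run_at.astimezone(timezone.utc).timestamp()) // 60
    return partition_id, run_at_minute

# Jobs due in the same minute of the same partition share one bucket.
print(bucket_key("job_01HXY", datetime(2027, 5, 10, 14, 2, tzinfo=timezone.utc)))
```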
Per-minute buckets are the sweet spot for most workloads. Per-second buckets create 60× more buckets — an hour-window scan touches 3,600 buckets instead of 60 — too many small reads. Per-hour buckets dump 6M+ jobs into one read at peak — too coarse, defeats the point. Some systems use per-second buckets for the "next 5 minutes" hot window and per-minute buckets for further out — a hybrid.
Imagine a wall of mailboxes labeled by date. The postman doesn't search a giant pile of letters every morning — they walk to today's box, pull out exactly the letters in there, deliver them. Time-bucketing is the same: the scheduler doesn't search "all jobs ever", it reads "today's box" (or this minute's). The size of the rest of the wall is irrelevant.
SELECT * FROM jobs WHERE run_at < NOW() over 1B rows takes ~2 seconds even with an index — too slow to poll every 100ms. With per-minute bucketing, fetching the current minute's bucket is ~5ms — fast enough to poll every 100ms with capacity to spare.

500 GB of pending jobs cannot fit in RAM, but the scheduler doesn't need them in RAM — it only needs the next hour's worth. The two-tier pattern keeps everything cheap and durable, while making the part that matters at any given moment fast.
The complete pending-jobs table. Sharded by hash(job_id), replicated 3×. Holds all 1B jobs no matter how far in the future. Optimized for durability and write throughput; query latency 5–20ms is fine because we only read it during background loading or rare individual-job lookups.
An in-memory sorted set per partition, scored by unix-second of run_at. Holds only jobs due in the next 1–2 hours. ZRANGEBYSCORE in 0.5ms gives us "what's due now". Sharded across Redis nodes so no single node holds all 100K+ jobs/sec worth of working set.
Every 10 minutes per partition, the Bucket Loader scans the cold DB for jobs due in the next 1 hour and inserts them into the corresponding hot ZSET. Idempotent — re-runs are safe. The loader uses the partition's secondary index (partition_id, run_at_minute) for a fast range query; it never does a full table scan.
One worker can't fire 100K jobs/sec. We fan out to N workers, each owning a slice. The trick is making sure each partition has exactly one owner at any moment — two owners means double-firing, zero owners means missed firings.
Choose M partitions where M > expected worker count (say M=1024 logical partitions, with N=100 actual workers). Each worker owns ~10 partitions. Adding more workers means redistributing partitions — never reshuffling jobs. ZooKeeper or etcd holds the canonical partition→worker map.
Workers heartbeat every 5s. Two missed heartbeats (10s) → ZooKeeper marks the worker dead and reassigns its partitions to surviving workers. The dead worker's leases expire naturally over the next 5 minutes; surviving workers re-fire any in-flight jobs whose leases lapse. Total recovery time: ~10 seconds of "no firing" for that partition's slice, then back to normal.
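The rebalancing algorithm itself isn't specified above; one common choice that moves only the dead worker's partitions is rendezvous (highest-random-weight) hashing, sketched here purely as an illustration of the assignment math the coordinator would run:

```python
import hashlib

def owner(partition_id: int, live_workers: list[str]) -> str:
    """Rendezvous hashing: each partition picks the live worker with the
    highest hash score. When a worker dies, only its partitions are reassigned;
    everyone else's ownership is unchanged."""
    def score(worker: str) -> int:
        h = hashlib.sha256(f"{worker}:{partition_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(live_workers, key=score)

workers = [f"worker-{i}" for i in range(100)]
print(owner(5, workers))                                   # current owner of p5
print(owner(5, [w for w in workers if w != owner(5, workers)]))  # owner after it dies
```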
Each partition has its own Redis key (bucket_p0, bucket_p1, …, bucket_p1023). A worker only touches the keys for partitions it owns — no cross-talk, no cross-shard transactions. Redis cluster mode automatically distributes these keys across Redis nodes by hash slot, so the working set scales horizontally without manual sharding.
Truly exactly-once execution across a network is provably impossible (the Two Generals problem). What we can build is "at-least-once + idempotent target = effectively exactly once observable to the user". Leases are the mechanism that gets us the at-least-once half safely.
A lease is like checking a book out from the library with a 5-minute due date. While you have it, nobody else can take it. If you bring it back, fine. If you wander off and never return — after 5 minutes the librarian declares it lost and lets someone else have it. The lease in Redis with a TTL is exactly this: a finite-duration claim that auto-releases on timeout.
The lease guarantees at-least-once: a job will fire at least once even if executors crash. It can't guarantee at-most-once because the executor may have completed the HTTP call and crashed before deleting the lease — the next executor will re-fire. That's why the target service must use the idempotency_key (or the execution_id) to dedup. Without an idempotent target, you get duplicate emails, duplicate billings, duplicate everything — and there's no fix at our layer.
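A sketch of the lease primitive, assuming redis-py: SET NX EX to claim, and a compare-and-delete Lua script so a slow original holder can't release a lease that has already expired and been re-acquired by someone else:

```python
import redis

r = redis.Redis()
LEASE_TTL_S = 300  # the 5-minute "library due date"

def acquire_lease(job_id: str, worker_id: str) -> bool:
    # A finite-duration claim that auto-releases if the holder disappears.
    return bool(r.set(f"lease:{job_id}", worker_id, nx=True, ex=LEASE_TTL_S))

# Delete only if we are still the holder (atomic compare-and-delete).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""
release_lease = r.register_script(RELEASE_SCRIPT)
# usage: release_lease(keys=[f"lease:{job_id}"], args=[worker_id])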
Recurring jobs reuse the entire one-shot machinery. The trick: store the cron expression as a template, and convert each occurrence into a one-shot job.
When a client creates a cron job, we store: {cron_expr, timezone, target, template_id}. We compute next_run_at from the cron expression and create one one-shot job pointing at that time. When that one-shot fires successfully, the executor computes the next run_at and creates a fresh one-shot. The cron job is always represented as "exactly one upcoming one-shot".
"Every minute forever" would create infinite rows. Lazy generation keeps the table bounded — at most one pending occurrence per cron template at any time. Trade-off: if a cron job is paused for 3 days, when it resumes you decide policy: fire once for the missed slot, or skip back to "now".
Use a battle-tested cron parser (cron-utils in Java, croniter in Python). It accepts standard cron syntax plus extensions (0 9 * * MON, @daily, 0 0 1 * *). Always store the original expression and the user's timezone — compute next_run by converting "now in user TZ" forward, then converting back to UTC for storage. This handles DST transitions correctly.
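A sketch of the next-run computation, assuming croniter and the standard-library zoneinfo; the key point is advancing the schedule in the user's timezone and storing the result in UTC:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from croniter import croniter

def next_run_utc(cron_expr: str, tz_name: str, now_utc: datetime) -> datetime:
    """Advance the cron in the user's local timezone, then convert back to UTC
    for storage — this is what keeps DST transitions correct."""
    local_now = now_utc.astimezone(ZoneInfo(tz_name))
    local_next = croniter(cron_expr, local_now).get_next(datetime)
    return local_next.astimezone(timezone.utc)

# "Every Monday 9 AM" in Los Angeles, evaluated from a UTC clock:
print(next_run_utc("0 9 * * MON", "America/Los_Angeles",
                   datetime(2027, 5, 10, 14, 2, tzinfo=timezone.utc)))
```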
Targets fail. Networks blip. Downstream services have hiccups. Retry policy must absorb transient failures without melting a recovering target with a stampede.
delay = base × 2^attempt. With base = 30s: attempts at 30s, 1m, 2m, 4m, 8m. Bounded by max_attempts (typically 5–10) before dead-letter. Spreads retries over many minutes so a struggling target gets time to recover.
Add ±25% random noise: delay × (0.75 + random() × 0.5). Without jitter, all jobs that failed at 14:02 retry simultaneously at 14:02:30 — same thundering-herd problem we just solved. Jitter spreads the retries across a window.
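A sketch of the combined policy, assuming base = 30s and ±25% jitter as described; the function name and the "None means dead-letter" convention are illustrative:

```python
import random

BASE_S = 30        # first retry after ~30 seconds
MAX_ATTEMPTS = 5   # then the job goes to the DLQ

def retry_delay(attempt: int) -> float | None:
    """Exponential backoff with ±25% jitter; None signals dead-letter."""
    if attempt >= MAX_ATTEMPTS:
        return None
    delay = BASE_S * (2 ** attempt)                 # 30s, 1m, 2m, 4m, 8m
    return delay * (0.75 + random.random() * 0.5)   # spread the herd
```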
After max_retries, the job moves to a DLQ topic with the failure history. An on-call alert fires. Operators inspect, fix the target, and either replay the DLQ or accept the loss. Critical for "this billing event simply must fire" semantics.
The Retry Manager doesn't keep retries in memory. On failure, it computes the next run_at and writes a fresh row to the Job Store via the same POST /jobs path internally. This way retries flow through the same hot/cold/lease machinery as fresh jobs — no special code path. The retry count and original_job_id ride along as metadata.
"I changed my mind — don't fire that reminder I scheduled for 7am." Easy when the fire is days away; tricky when it's 2 seconds away and the job is already in the hot bucket, possibly already popped by a worker.
POST /jobs/:id/cancel sets status = "cancelled" in the Job Store. We do NOT race-delete from the hot ZSET — that's flaky. Instead, when the Scheduler Worker pops the job from its bucket, it reads the job row and checks the cancelled flag. If cancelled, it skips firing and writes "cancelled" to the audit log. Worst case, the job is "popped and ignored" instead of "popped and fired".
Cancel arrives at 14:01:58 for a job firing at 14:02:00. At 14:02:00 the worker pops it; reads the cancelled flag; skips. Total latency from cancel-API to worker-skip: typically <200ms. If the cancel arrives during executor invocation (after worker pops, before HTTP call returns), it's too late — the job already fired. The contract: cancel is best-effort up to the moment of fire; once fired, you must compensate downstream.
"Fire within a few seconds" works for most use cases — emails, push notifications, batch jobs. Some workloads demand tighter precision: financial markets opening at 09:30:00.000, gameplay events tied to a global clock, ad auction windows. The architecture supports tiered precision.
Default. Scheduler workers poll the hot ZSET every 1 second. Fire time = scheduled_run_at + 0–1s polling jitter + 50–200ms execution. End-to-end variance: under 2 seconds. Costs almost nothing.
Same architecture, faster polling: scheduler workers poll every 100ms. End-to-end variance: under 250ms. Costs 10× more Redis CPU on the polling, but otherwise unchanged. Used for paid tier customers.
Different machinery: dedicated low-jitter executor pods using Linux timerfd (kernel-level timer interrupt) for the final hop. The scheduler hands off the job 1 second before fire time; the timerfd executor then waits for the exact instant and fires. Jitter: under 1ms. Used for financial / market-open workloads. Costs significantly more.
Fire-and-forget systems are easy. Fire-on-time-or-explain-why is hard. Every component must survive its own failure modes; together they must survive the failures we can't predict.
DynamoDB / Cassandra replicate every write to 3 nodes across 3 AZs. Quorum reads/writes (R=2, W=2) survive any single-node or single-AZ outage. The cold tier is the source of truth; even if every Redis node and every worker dies, the cold tier still holds every pending job.
Redis runs in cluster mode with master + 1 replica per shard. On master failure, replica is promoted in seconds. If the entire hot tier is lost, the Bucket Loader rebuilds it from cold storage on its next sweep — bounded data loss, no jobs missed (cold tier is intact).
Both Scheduler Workers and Executor Workers carry no state across restarts. State lives in Redis (lease, idempotency) and Job Store (canonical job rows). Killing and restarting a worker pod is a non-event — within seconds, ZooKeeper reassigns its partitions and another pod picks up the slack.
3 or 5 ZooKeeper nodes form a quorum. As long as a majority survive, partition assignment continues to work. If the entire ZK cluster dies, no new partition assignments happen; existing workers keep running on their last-known assignments. Recovery: bring ZK back, workers re-register, normal flow resumes.
If AZ-A goes dark — every worker, every Redis master, every API server in AZ-A vanishes — what happens? (1) ALB stops routing to AZ-A. (2) ZK promotes ZK followers from B/C, partitions reassigned to surviving workers. (3) Redis replicas in B/C promote to masters. (4) DynamoDB serves reads from B/C replicas. (5) Within 30–60 seconds the system is fully operational from B+C. Pending jobs scheduled during the outage may fire 30–60s late but none are lost.
For latency, compliance, and disaster tolerance, run independent scheduler stacks per region. Most jobs stick to their home region; a small number of "global cron" jobs need cross-region coordination.
Jobs are pinned to a region at creation time, usually based on the user's home region or a region tag in the API call. Each region has its own Job Store, hot tier, scheduler workers, and executors — completely independent. No cross-region traffic on the fire path → low latency, no cross-region failure coupling.
"Run this job once globally at 09:00 UTC" — only one region should fire it. Designate a primary region for the cron template (us-east-1). The primary region creates the next one-shot occurrence; if that region is down, a standby region detects it via cross-region heartbeat and takes over. Costs occasional cross-region replication of the template registry (small, low-traffic), keeps the fire path in one region.
Common follow-up questions, answered with the machinery above:

- Worker crashes after popping a job: its lease expires after 5 minutes; the hot-bucket entry may or may not have been ZREMed (depends on where exactly the crash happened). Another worker fires it. The target dedups via execution_id = (job_id, scheduled_run_at); if the original call actually went through, the second is a no-op. Net result: at-least-once delivery + idempotent target = the user observes exactly one fire.
- Scaling to 1B jobs: (1) Cold tier sharded by hash(job_id) across DynamoDB partitions — 1B jobs at 500B each = 500GB, trivial for DynamoDB to spread across nodes. (2) Hot tier sharded across Redis cluster nodes by partition_id — at most 100K/s × 3600s = 360M jobs in the 1-hour window, spread across ~50 Redis shards. (3) Workers scaled horizontally — N workers each owning M/N of the 1024 logical partitions; add more workers to grow throughput. Nothing in the architecture is a vertical bottleneck; every layer scales by adding capacity.
- Cancelling a job that's about to fire: cancel sets status = "cancelled" in the Job Store. We do not race-delete from Redis — that's flaky and creates inconsistency. Instead, when the scheduler worker pops the job for firing, it reads the canonical row from the Job Store (one extra ~5ms read) and checks the flag; if cancelled, it skips and writes "cancelled" to the audit log. The window of "too late to cancel" is tiny — only between worker-pops-and-publishes-to-Kafka and executor-actually-fires, typically <200ms. Cancel before that window and you're guaranteed safe; inside it, best-effort.
- Exactly-once semantics: leases give at-least-once; the execution_id in the idempotency store + an idempotent target dedup on the consumer side. The literature calls this "effectively exactly-once" — every production scheduler (Quartz cluster, Airflow, AWS EventBridge, GCP Scheduler) takes this trade.
- Everyone scheduling at 09:00: because jobs are partitioned by hash(job_id) % N, the 09:00 firings are spread evenly across all N partitions — each shard sees only 1/N of the load, no matter how time-skewed the scheduling distribution is. The time bucket within each partition is still per-minute, so reads are O(1), but the total fire load at 09:00 is amortized across the whole cluster.