Low-Level Design

Notification Service

Multi-channel, retry-safe, template-driven — Strategy, Observer, Decorator, Chain of Responsibility & async workers, end-to-end

Step 1

Clarify Requirements

Imagine Sarah just placed an order on a food-delivery app. Within seconds, she expects a push notification on her phone, an email receipt in her inbox, and an SMS when her food is out for delivery. The Notification Service is the single backend system that fans every "something happened" event out to the right channel, in the right format, in the right language, while respecting Sarah's preferences and never spamming her if a downstream provider melts down. Let's pin down what it must do — and what we're explicitly leaving out.

✅ Functional Requirements

  • Send notifications across multiple channels — Email, SMS, Push, In-app, WhatsApp, Slack
  • Support templates with variable substitution (Hi {{name}}, your order #{{id}} is on the way)
  • Respect per-user preferences (channel opt-out, do-not-disturb hours, language)
  • Schedule notifications for a future time and support recurring sends
  • Retry failed sends with exponential backoff; route to dead-letter after N attempts
  • Apply rate-limiting per user / per channel to prevent spam & respect provider quotas
  • Audit log: every send attempt with status, provider, latency & cost
  • Read-receipt callbacks (delivered / opened / clicked) from providers

⚙️ Non-Functional Requirements

  • Throughput: 10K notifications/sec at peak (Black Friday-class load)
  • Latency: p99 < 2s from API call to provider hand-off for high-priority traffic
  • Reliability: at-least-once delivery; never lose a message that was accepted
  • Extensibility: adding a new channel (e.g., Discord) should be one new class, no edits elsewhere

❌ Out of Scope

  • Drafting notification copy (that's a content-team job)
  • Building the email/SMS providers themselves — we integrate with SendGrid, Twilio, FCM, etc.
  • A/B testing of templates — assume an upstream experimentation system
  • End-user UI for managing preferences (we expose APIs only)
Why these matter: retries + rate-limits + preferences are the three things that separate a 100-line "send an email" script from a real notification service. If you skip them in an interview, the next 30 minutes are spent retrofitting them awkwardly.
Step 2

Actors & Use Cases

Four kinds of clients touch this system. Upstream microservices (Order, Payment, Auth) are the ones generating events — they're 99% of the traffic. End users never call the service directly; they only see the notifications and update their preferences via a settings UI. Admins manage templates and watch dashboards. External providers (Twilio, SendGrid) call back into us with delivery receipts.

flowchart LR
    MS(["Microservice — Order, Payment, Auth"])
    U(["End User"])
    A(["Admin"])
    P(["External Provider — Twilio, SendGrid, FCM"])
    MS --> SN["Send Notification"]
    MS --> SCH["Schedule Notification"]
    MS --> CAN["Cancel Scheduled"]
    U --> PREF["Update Preferences"]
    U --> UNSUB["Unsubscribe"]
    A --> TPL["Manage Templates"]
    A --> DASH["View Audit Dashboard"]
    A --> RP["Replay Failed Notifications"]
    P --> CB["Delivery Callback"]
    style MS fill:#e8743b,stroke:#e8743b,color:#fff
    style U fill:#4a90d9,stroke:#4a90d9,color:#fff
    style A fill:#9b72cf,stroke:#9b72cf,color:#fff
    style P fill:#38b265,stroke:#38b265,color:#fff

🟢 Happy Path

  • Order Service POSTs /notifications
  • System renders template + checks preferences
  • Pushes to channel queue
  • Worker calls provider → success
  • Audit log updated to DELIVERED

🟡 Retry Path

  • Provider returns 503 / timeout
  • Worker marks attempt as FAILED
  • Re-enqueues with backoff (1s, 4s, 16s…)
  • After N attempts → dead-letter queue
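The backoff schedule above (1s, 4s, 16s) is a geometric progression: delay = base × 4^(attemptNo − 1). A minimal sketch, with illustrative names (`BackoffSchedule`, `delayFor` are not from the design itself):

```java
import java.time.Duration;

// Sketch of the retry schedule above: delay = base * 4^(attemptNo - 1).
// Class and method names are illustrative.
class BackoffSchedule {
    private final Duration base;

    BackoffSchedule(Duration base) { this.base = base; }

    Duration delayFor(int attemptNo) {
        // attemptNo is 1-based: attempt 1 -> base, attempt 2 -> 4*base, attempt 3 -> 16*base
        long factor = (long) Math.pow(4, attemptNo - 1);
        return base.multipliedBy(factor);
    }

    public static void main(String[] args) {
        BackoffSchedule s = new BackoffSchedule(Duration.ofSeconds(1));
        for (int i = 1; i <= 3; i++) System.out.println(s.delayFor(i)); // PT1S, PT4S, PT16S
    }
}
```

In practice you would also add jitter so thousands of failed sends don't all retry at the same instant.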

🔴 Edge Cases

  • User has opted out of SMS
  • Quiet hours active for user's timezone
  • Rate limit exceeded for the channel
  • Template missing a required variable
  • Same event fired twice (idempotency)
Step 3

Architecture — From Naive to Production

Before any class diagram, let's build up the architecture in three passes: the simplest thing that could work, why it falls over, and the production shape that earns every box.

Pass 1 — The naive design (and why it breaks)

Your first instinct: a single REST endpoint POST /send. The Order Service calls it, you connect to SendGrid synchronously, return 200, done. Three lines of code.

flowchart LR
    OS["Order Service"] --> NS["Notification API"]
    NS --> SG[("SendGrid")]
    NS --> TW[("Twilio")]
    NS --> FCM[("FCM")]

Now fast-forward to Black Friday. Three concrete failures:

💥 Provider down = Order down

SendGrid takes 30 seconds to time out. The Order Service is now blocked for 30 seconds per checkout. Customers see spinners and bounce. Synchronous coupling propagates failure upstream.

💥 Burst traffic kills providers

10K orders in one minute means 10K parallel SMS calls. Twilio rate-limits us to 100/sec and starts returning 429s. Every retry is a new call, amplifying the storm. No queue, no smoothing.

💥 Preferences scattered everywhere

Sarah unsubscribed from marketing emails. Did the Order Service check that? The Loyalty Service? The Promo Service? Without a central preference layer, every caller re-implements the rule and someone gets it wrong. Compliance lawsuit waiting to happen.

Pass 2 — The mental model: API plane vs Worker plane

The single most important split in this system is between accepting a notification request (must be fast, synchronous, returns an ID) and delivering it (slow, async, talks to flaky third parties). They have wildly different SLAs and scale independently. Once you internalize that split, every other component falls out naturally.

📥 API Plane (Synchronous)

Light, stateless, fast. Validates the request, enriches it with user preferences & template, writes to DB, drops a job on a queue, returns 202 Accepted with a tracking ID. Latency budget: 50ms. Scales horizontally with load balancer.

What lives here: REST controllers, request validators, the Preference Service, the Template Service, the Idempotency check.

⚙️ Worker Plane (Asynchronous)

Heavy, stateful per-attempt, slow. Pulls jobs off the queue, picks the right channel handler, calls the provider, handles retries with backoff, writes the audit row. Latency budget: seconds, not ms. Scales independently per channel.

What lives here: channel workers (Email/SMS/Push), retry coordinator, rate limiter, dead-letter handler.

Pass 3 — The production shape

Now we draw the whole thing. Every box has a number — find its matching card below to see what it does, and crucially, what would break without it.

flowchart TB
    OS["① Order / Payment / Auth Service"]
    subgraph APIPLANE["API Plane"]
        LB["② Load Balancer"]
        NAPI["③ Notification API"]
        IDP["④ Idempotency Store — Redis"]
        PREF["⑤ Preference Service"]
        TPL["⑥ Template Service"]
        DB[("⑦ Notification DB — Postgres")]
        Q["⑧ Channel Queues — Kafka"]
    end
    subgraph WORKERPLANE["Worker Plane"]
        RL["⑨ Rate Limiter"]
        CH{"⑩ Channel Dispatcher"}
        EW["⑪ Email Worker"]
        SW["⑪ SMS Worker"]
        PW["⑪ Push Worker"]
        EP[("⑫ SendGrid")]
        SP[("⑫ Twilio")]
        FP[("⑫ FCM / APNS")]
        DLQ["⑬ Dead Letter Queue"]
    end
    CB["⑭ Provider Callback API"]
    AUD["⑮ Audit / Metrics"]
    OS --> LB
    LB --> NAPI
    NAPI --> IDP
    NAPI --> PREF
    NAPI --> TPL
    NAPI --> DB
    NAPI --> Q
    Q --> RL
    RL --> CH
    CH -->|Email| EW
    CH -->|SMS| SW
    CH -->|Push| PW
    EW --> EP
    SW --> SP
    PW --> FP
    EW -.failed.-> DLQ
    SW -.failed.-> DLQ
    PW -.failed.-> DLQ
    EP -.delivery receipt.-> CB
    SP -.delivery receipt.-> CB
    FP -.delivery receipt.-> CB
    CB --> DB
    Q -.events.-> AUD
    EW -.events.-> AUD
    style OS fill:#e8743b,stroke:#e8743b,color:#fff
    style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style NAPI fill:#171d27,stroke:#e8743b,color:#d4dae5
    style IDP fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style PREF fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style TPL fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style DB fill:#171d27,stroke:#38b265,color:#d4dae5
    style Q fill:#171d27,stroke:#d4a838,color:#d4dae5
    style RL fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style CH fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style EW fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style SW fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style PW fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style EP fill:#171d27,stroke:#38b265,color:#d4dae5
    style SP fill:#171d27,stroke:#38b265,color:#d4dae5
    style FP fill:#171d27,stroke:#38b265,color:#d4dae5
    style DLQ fill:#171d27,stroke:#e05252,color:#d4dae5
    style CB fill:#171d27,stroke:#e8743b,color:#d4dae5
    style AUD fill:#171d27,stroke:#d4a838,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. Each one answers: what is this?, why is it here?, and what breaks without it?

Calling Microservice

Any internal service that wants to notify a user — Order, Payment, Auth, Promotions. They speak to us via a thin SDK that wraps the REST API, so they don't need to know about queues, retries, or providers. They send one HTTP call and forget.

Solves: without a single notification service, every team would re-integrate Twilio/SendGrid themselves. We'd have 12 different retry policies and 12 different bugs.

Load Balancer

Spreads incoming HTTP across all API instances. Standard nginx / AWS ALB. Health-checks the API pods every few seconds and yanks any that go unhealthy.

Solves: a single API instance dies under any real traffic. The LB lets us scale the API plane horizontally without clients knowing.

Notification API

The stateless front door. Validates the request schema (channel valid? template exists? user_id present?), pulls the user's preferences, renders the template with the supplied variables, writes a row to the DB with status PENDING, and drops a job onto the right channel queue. Returns 202 Accepted with a notification_id the caller can use to track status. Total time: under 50ms.

Solves: if the API tried to call providers directly, every API instance would be tied up waiting on Twilio for seconds at a time. Decoupling write-to-queue from actually-send is what lets us absorb spikes.

Idempotency Store (Redis)

Every request carries a client-supplied idempotency_key (e.g., order-9876-confirmation). Before doing any work, the API runs SET key 1 NX EX 86400 — claim the key only if it doesn't already exist, with a 24-hour expiry. If the key existed, it returns the original notification's ID without doing anything.

Solves: microservices retry. If the Order Service calls us, times out, and retries, we'd send Sarah two emails — and she'd churn. Idempotency makes "retry" safe by default.
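A minimal in-memory stand-in for the Redis check, to show the claim-or-return-original contract (the `IdempotencyStore` name and `claim` method are illustrative; production uses Redis so all API pods share state and keys expire):

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the Redis idempotency check (SET key value NX EX 86400).
// Names are illustrative; a real store would be Redis-backed with a TTL.
class IdempotencyStore {
    private final ConcurrentMap<String, String> claimed = new ConcurrentHashMap<>();

    /** Empty if the key was newly claimed; otherwise the original notification id. */
    Optional<String> claim(String idempotencyKey, String notificationId) {
        // putIfAbsent is atomic, mirroring Redis's NX semantics
        String existing = claimed.putIfAbsent(idempotencyKey, notificationId);
        return Optional.ofNullable(existing);
    }
}
```

The first call claims the key; a retried call with the same key gets back the original ID, so the caller can retry blindly.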

Preference Service

Holds Sarah's truth: she opted out of marketing SMS, prefers English, has quiet hours from 10pm to 7am IST. The API queries this for every send and either accepts, downgrades the channel (e.g., SMS → in-app), or drops it entirely with reason SUPPRESSED_BY_PREFERENCE.

Solves: compliance. GDPR, TCPA, CAN-SPAM all require honoring opt-outs. Centralizing this rule means one place to audit, not twelve places to debug.
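The quiet-hours rule is subtler than it looks because the window usually crosses midnight (10pm to 7am). A sketch of that check, assuming a hypothetical helper (not the article's actual PreferenceService API):

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;

// Sketch of a quiet-hours check in the user's own timezone.
// Hypothetical helper; the real Preference Service would wrap this per user.
class QuietHours {
    static boolean inQuietHours(Instant now, ZoneId tz, LocalTime start, LocalTime end) {
        LocalTime local = LocalTime.ofInstant(now, tz);
        if (start.isBefore(end)) {
            // same-day window, e.g. 13:00-15:00
            return !local.isBefore(start) && local.isBefore(end);
        }
        // overnight window, e.g. 22:00-07:00: quiet if after start OR before end
        return !local.isBefore(start) || local.isBefore(end);
    }
}
```

Note the check runs against the user's timezone, not the server's; storing `quiet_start`/`quiet_end` as local times plus a timezone column keeps this correct across DST changes.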

Template Service

Stores parameterized message templates by ID and version, e.g., ORDER_CONFIRMED_v3. The API passes in {name: "Sarah", order_id: "9876"} and gets back "Hi Sarah, your order #9876 is on the way!". Supports localization — picks the template variant matching the user's language.

Solves: without templates, the Order Service would hardcode message strings, and changing copy would require a deploy. With templates, the marketing team edits a row in a table.
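The core of the render step is {{variable}} substitution plus a hard failure on missing required vars (one of the edge cases from Step 2). A minimal sketch, assuming regex-based substitution; a real Template Service adds versioning and locale selection:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal {{variable}} substitution, as in "Hi {{name}}, your order #{{id}} is on the way".
// Illustrative sketch; fails fast when a required variable is missing.
class TemplateRenderer {
    private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    static String render(String template, Map<String, Object> vars) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Object v = vars.get(m.group(1));
            if (v == null)
                throw new IllegalArgumentException("Missing required var: " + m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(v.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```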

Notification DB (Postgres)

The source of truth for every notification. Stores the rendered payload, recipient, channel, status (PENDING / SENT / RETRY / DELIVERED / DEAD), attempts, provider response, and timestamps. Every state change writes here. The API and the Worker both read & write it.

Solves: support tickets ("did Sarah get her receipt?"), audit trails, and the ability to replay any notification from history. Without a DB, a queue-only design loses everything the moment a message is consumed.

Channel Queues

Kafka topics (or RabbitMQ exchanges) — one per channel: notifications.email, notifications.sms, notifications.push. The API publishes here; workers consume. Each topic is partitioned by user_id so all of Sarah's notifications land on the same partition (preserving order per user).

Solves: back-pressure. If SendGrid is slow today, the email topic fills up but the SMS workers keep humming. One slow channel can't block the others.
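Per-user ordering falls out of keying each record by user_id: the same key always hashes to the same partition. Kafka's default partitioner does this when you set the record key; a simplified sketch of the idea (illustrative class name):

```java
// Sketch of per-user partitioning: the same userId always lands on the same
// partition, so Sarah's notifications are consumed in order. Kafka's default
// partitioner applies the same principle to the record key.
class PartitionPicker {
    static int partitionFor(long userId, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Long.hashCode(userId), numPartitions);
    }
}
```

The caveat: repartitioning the topic reshuffles keys, so in-flight ordering is only guaranteed while the partition count is stable.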

Rate Limiter

Token-bucket sitting between the queue and the workers. Two limits: per-user (Sarah gets max 5 SMS/hour to prevent harassment) and per-provider (we never exceed Twilio's 100/sec quota for our account). Backed by Redis.

Solves: two pains at once. Without per-user limits, a runaway loop in the Promo Service spams Sarah 1000 times. Without per-provider limits, we get our Twilio account temporarily banned.
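A simplified, single-process token bucket to make the mechanics concrete (illustrative names; production backs the bucket state with Redis so all workers share one view of each user's and provider's budget):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified in-memory token bucket, one bucket per key ("user:42", "provider:twilio").
// Illustrative sketch; a shared Redis-backed version is needed across worker pods.
class TokenBucket {
    private final long capacity;
    private final double refillPerSec;
    // per-key state: [0] = tokens remaining, [1] = last refill timestamp (nanos)
    private final Map<String, double[]> state = new ConcurrentHashMap<>();

    TokenBucket(long capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
    }

    synchronized boolean tryAcquire(String key) {
        long now = System.nanoTime();
        double[] s = state.computeIfAbsent(key, k -> new double[]{capacity, now});
        // refill proportionally to elapsed time, capped at capacity
        double tokens = Math.min(capacity, s[0] + (now - s[1]) / 1e9 * refillPerSec);
        s[1] = now;
        if (tokens < 1) { s[0] = tokens; return false; }
        s[0] = tokens - 1;
        return true;
    }
}
```

A bucket of capacity 5 with refill 5/3600 per second implements "max 5 SMS/hour" per user.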

Channel Dispatcher

A small in-worker router that looks at the message's channel field and hands it to the correct strategy class (EmailSender, SMSSender, PushSender). This is where the Strategy pattern lives — adding WhatsApp later means adding one new sender class and registering it here.

Solves: if every worker had giant if (channel == EMAIL) ... else if ... blocks, adding a channel would mean editing ten files. The dispatcher localizes that branching to one place.

Channel Workers

One worker pool per channel. Pulls a message, calls its sender strategy, handles the response. On success: updates DB to SENT, records latency. On failure: increments attempt count, schedules a retry with exponential backoff, or routes to dead-letter if attempts exceeded. Each pool autoscales independently — push workers might be 10 pods, email workers might be 50.

Solves: independent scaling. Push notifications are 10× cheaper to send than emails, so we want different pool sizes. Lumping them into one pool means we can't tune them separately.

External Providers

The actual delivery infrastructure: SendGrid for email, Twilio for SMS, FCM/APNS for mobile push. We treat them as black boxes behind our sender strategies. Each one has its own API quirks, error codes, and rate limits — those quirks live inside the corresponding sender class and never leak out.

Solves: we don't build email infra. SendGrid handles SPF, DKIM, deliverability, ISP relations — billions of dollars of work we'd never replicate.

Dead Letter Queue (DLQ)

Where messages go to die — but not silently. After N retry attempts (typically 5), the message is moved here with the full error history. An on-call engineer is paged if the DLQ depth crosses a threshold. From here, messages can be inspected, fixed, and replayed via the admin tool.

Solves: visibility into systemic failure. Without a DLQ, a bad template or an expired API key would silently fail forever and nobody would notice until customers complained.

Provider Callback API

A separate webhook endpoint that providers call back asynchronously to tell us "this message was actually delivered to the device" or "the user opened it" or "the user clicked the link". Updates the DB row from SENT to DELIVERED / OPENED / CLICKED.

Solves: "did SendGrid take it" ≠ "did it land in Sarah's inbox". Read receipts give product teams real engagement data and tell us when a phone is unreachable so we can stop trying.

Audit & Metrics

Every state change emits an event to a metrics pipeline (Prometheus + Grafana for live dashboards, ClickHouse for historical queries). Tracks success rate per channel, per provider, per template — surfacing things like "SMS to India dropped from 99% to 84% delivery this hour".

Solves: operating blind. You only know your provider switched their behavior when graphs change shape. Without dashboards, the first signal is angry customers.

Concrete walkthrough — Sarah's order confirmation

Sarah hits "Place Order" at 14:02:06. Here's exactly what happens, mapped to the numbered components above:

  1. Order Service ① POSTs {user_id: 42, template: "ORDER_CONFIRMED_v3", channel: "EMAIL", vars: {...}, idempotency_key: "order-9876-confirm"} to the Load Balancer ②.
  2. The LB routes it to one of three Notification API ③ pods. The API checks Redis ④ — the key doesn't exist, so it claims it.
  3. API calls Preference Service ⑤ → "Sarah accepts transactional email" ✓. Calls Template Service ⑥ with vars → renders subject & body.
  4. API writes a row to Postgres ⑦ with status PENDING, then publishes the message to the notifications.email topic in Kafka ⑧, partitioned by user_id=42. Returns 202 {notification_id: "n_abc123"} to the Order Service. Elapsed: 38ms.
  5. An Email Worker ⑪ pulls the message. The Rate Limiter ⑨ checks: Sarah's email rate ✓, SendGrid's quota ✓.
  6. The Channel Dispatcher ⑩ routes to the EmailSender strategy. EmailSender calls SendGrid ⑫. Returns 202 Accepted in 740ms.
  7. Worker updates Postgres row to SENT, emits a notification.sent event to Audit ⑮.
  8. Six seconds later, SendGrid POSTs to our Callback API ⑭ → delivered. We update the row to DELIVERED.
  9. Two minutes later, Sarah clicks the tracking link in the email. SendGrid POSTs again → clicked. Row updates to CLICKED. Marketing now knows the email worked.
So what: the magic isn't any single component — it's the split between the API plane (fast, synchronous, returns immediately) and the Worker plane (slow, async, retries on its own). That split is what lets the Order Service finish in 50ms even when SendGrid is having a bad day.
Step 4

Core Entities & Enums

Now we map the architecture to objects. The trick is to find the nouns ("notification", "template", "channel", "user preference") and the things that vary ("which channel?", "which retry policy?"). Variations become enums or strategy classes; nouns become entities.

| Entity | Responsibility | Key Fields |
| --- | --- | --- |
| Notification | One send attempt for one user on one channel — the unit of work | id, userId, channel, templateId, payload, status, attempts, createdAt |
| NotificationRequest | The inbound API DTO — what callers actually post | userId, templateId, channel, vars, idempotencyKey, priority, scheduleAt |
| Template | A reusable, parameterized message body per channel + locale | id, version, channel, locale, subject, body, requiredVars[] |
| UserPreference | Per-user channel opt-ins, language, quiet-hours window | userId, channelOptIns Map, language, quietStart, quietEnd, timezone |
| Recipient | Resolved contact info for a user — email, phone, deviceTokens | userId, email, phone, deviceTokens List, slackId |
| DeliveryAttempt | One try at sending — captures provider response & latency | id, notificationId, attemptNo, providerCode, status, errorMsg, latencyMs, attemptedAt |
| ChannelSender | Strategy interface — one impl per channel (Email/SMS/Push/...) | + send(notif): Result, + getChannel(): Channel |
| RetryPolicy | Decides if & when to retry after a failure | + shouldRetry(attemptNo, error): boolean, + nextDelay(attemptNo): Duration |
| RateLimiter | Token-bucket gate — per user & per provider | + tryAcquire(key): boolean |

📦 Channel (Enum)

EMAIL, SMS, PUSH, IN_APP, WHATSAPP, SLACK

Each maps to one ChannelSender strategy. Adding a new channel = adding one enum constant + one sender class.

📦 Status (Enum)

PENDING → SENT → DELIVERED → OPENED / CLICKED

Sad path: PENDING → RETRY → ... → DEAD, or SUPPRESSED when preferences block.

📦 Priority (Enum)

CRITICAL, HIGH, NORMAL, LOW

Critical (OTP, payment receipts) skips quiet-hours and rate limits. Low (marketing) gets dropped first when overloaded.

Step 5

ER / Schema Diagram

Before classes, lock down the data model. Notice how delivery_attempt is a separate table — every retry gets a new row, so we have a complete forensic trail when things go wrong. The idempotency_key is unique per client, preventing duplicate sends.

erDiagram
    USER {
        bigint id PK
        string name
        string email
        string phone
        string timezone
        string language
    }
    USER_PREFERENCE {
        bigint user_id PK
        string channel PK
        boolean opt_in
        time quiet_start
        time quiet_end
    }
    DEVICE {
        bigint id PK
        bigint user_id FK
        string platform
        string token
        timestamp last_seen
    }
    TEMPLATE {
        string id PK
        int version PK
        string channel
        string locale
        text subject
        text body
        json required_vars
    }
    NOTIFICATION {
        string id PK
        bigint user_id FK
        string template_id FK
        int template_version FK
        string channel
        string priority
        string status
        json payload
        string idempotency_key
        int attempts
        timestamp scheduled_at
        timestamp created_at
    }
    DELIVERY_ATTEMPT {
        bigint id PK
        string notification_id FK
        int attempt_no
        string provider
        string status
        string error_msg
        int latency_ms
        timestamp attempted_at
    }
    USER ||--o{ USER_PREFERENCE : "configures"
    USER ||--o{ DEVICE : "owns"
    USER ||--o{ NOTIFICATION : "receives"
    TEMPLATE ||--o{ NOTIFICATION : "renders"
    NOTIFICATION ||--o{ DELIVERY_ATTEMPT : "tracks"
Indexing notes: (user_id, created_at DESC) for "show me Sarah's recent notifications". (status, scheduled_at) for the worker poll query. idempotency_key needs a UNIQUE index. (template_id, locale) for template lookup.
Step 6

Class Diagram

The NotificationService is the façade. It composes a Preference checker, a Template renderer, and a SenderFactory that resolves the right strategy per channel. ChannelSender is the strategy interface — each concrete sender is a swappable implementation.

classDiagram
    class NotificationService {
        -PreferenceService prefs
        -TemplateService templates
        -NotificationRepository repo
        -QueuePublisher publisher
        -IdempotencyStore idempotency
        +send(req) String
        +schedule(req, when) String
        +cancel(notifId) boolean
    }
    class NotificationWorker {
        -SenderFactory senderFactory
        -RetryPolicy retryPolicy
        -RateLimiter limiter
        -NotificationRepository repo
        +consume(notification) void
    }
    class ChannelSender {
        <<interface>>
        +send(notification) DeliveryResult
        +getChannel() Channel
    }
    class EmailSender {
        +send(notification) DeliveryResult
        +getChannel() Channel
    }
    class SMSSender {
        +send(notification) DeliveryResult
        +getChannel() Channel
    }
    class PushSender {
        +send(notification) DeliveryResult
        +getChannel() Channel
    }
    class WhatsAppSender {
        +send(notification) DeliveryResult
        +getChannel() Channel
    }
    class SenderFactory {
        -Map registry
        +get(channel) ChannelSender
        +register(sender) void
    }
    class RetryPolicy {
        <<interface>>
        +shouldRetry(attemptNo, cause) boolean
        +nextDelay(attemptNo) Duration
    }
    class ExponentialBackoffPolicy {
        -int maxAttempts
        -Duration baseDelay
        +shouldRetry(attemptNo, cause) boolean
        +nextDelay(attemptNo) Duration
    }
    class RateLimiter {
        <<interface>>
        +tryAcquire(key) boolean
    }
    class TokenBucketLimiter {
        -long capacity
        -long refillPerSec
        +tryAcquire(key) boolean
    }
    class Notification {
        -String id
        -long userId
        -Channel channel
        -Status status
        -int attempts
        -String payload
        +markSent() void
        +markRetry() void
        +markDead() void
    }
    class PreferenceService {
        +isAllowed(userId, channel, priority) Decision
    }
    class TemplateService {
        +render(templateId, vars, locale) RenderedMessage
    }
    NotificationService --> PreferenceService
    NotificationService --> TemplateService
    NotificationService --> NotificationRepository
    NotificationService --> QueuePublisher
    NotificationWorker --> SenderFactory
    NotificationWorker --> RetryPolicy
    NotificationWorker --> RateLimiter
    NotificationWorker --> NotificationRepository
    SenderFactory o-- ChannelSender
    ChannelSender <|.. EmailSender
    ChannelSender <|.. SMSSender
    ChannelSender <|.. PushSender
    ChannelSender <|.. WhatsAppSender
    RetryPolicy <|.. ExponentialBackoffPolicy
    RateLimiter <|.. TokenBucketLimiter
Step 7

Design Patterns Used

This problem is a pattern playground because the variation points are loud and clear. Each pattern below is paired with the exact question it answers.

🎯 Strategy — ChannelSender

Question it answers: "How do I send via email vs SMS vs push without a giant if-else?"

Each channel implements the same ChannelSender interface. The worker holds a reference to the interface and never knows which concrete sender it's calling. Adding WhatsApp = one new class, zero edits to existing code (Open/Closed).

🏭 Factory — SenderFactory

Question it answers: "Given a channel enum, how do I get the right sender?"

Maintains a Map<Channel, ChannelSender> populated at startup. The worker calls factory.get(channel) instead of newing senders. Makes mocking in tests trivial.

👀 Observer — Audit / Metrics

Question it answers: "How do I emit metrics & audit logs without coupling them to send logic?"

The worker publishes lifecycle events (SENT, FAILED, DELIVERED) to a topic. Audit, Metrics, and Webhook subscribers each consume independently. Adding a new subscriber doesn't touch send code.

🎨 Decorator — Cross-cutting concerns

Question it answers: "Where do I put rate-limiting, logging, and circuit-breaking without polluting senders?"

Wrap a ChannelSender in RateLimitedSender, then in LoggingSender, then in CircuitBreakerSender. Each layer adds one concern. Composition order is configurable.
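A self-contained sketch of the wrapping idea, with simplified stand-in interfaces (the article's real ChannelSender signature appears in Step 10; the `Sender`/`RateLimitedSender` names here are illustrative):

```java
// Decorator sketch: a rate-limiting wrapper adds one concern around any sender.
// Simplified stand-ins for the article's ChannelSender / RateLimiter interfaces.
interface RateLimiterGate { boolean tryAcquire(String key); }
interface Sender { String send(long userId, String payload); }

class RateLimitedSender implements Sender {
    private final Sender inner;
    private final RateLimiterGate limiter;

    RateLimitedSender(Sender inner, RateLimiterGate limiter) {
        this.inner = inner;
        this.limiter = limiter;
    }

    @Override
    public String send(long userId, String payload) {
        if (!limiter.tryAcquire("user:" + userId))
            throw new IllegalStateException("rate limit exceeded"); // caller treats as retryable
        return inner.send(userId, payload); // delegate to the wrapped sender
    }
}
```

Stacking is just nesting: `new RateLimitedSender(new EmailSenderStub(), limiter)` can itself be wrapped by a logging or circuit-breaking decorator, and the worker still sees a plain `Sender`.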

⛓️ Chain of Responsibility — Pre-send pipeline

Question it answers: "How do I run a sequence of validations (preferences → quiet hours → rate limit → template) cleanly?"

Each gate is a handler that either passes the notification along or short-circuits with SUPPRESSED. Easy to reorder, easy to add a new gate.
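A minimal sketch of that pipeline, assuming illustrative names (`Gate`, `PreSendPipeline`): each gate returns a suppression reason to short-circuit, or empty to pass the notification along.

```java
import java.util.List;
import java.util.Optional;

// Chain-of-responsibility sketch for the pre-send pipeline. Names are illustrative.
interface Gate {
    /** Empty = pass to the next gate; present = suppression reason, stop the chain. */
    Optional<String> check(long userId, String channel);
}

class PreSendPipeline {
    private final List<Gate> gates;

    PreSendPipeline(List<Gate> gates) { this.gates = gates; }

    /** Empty if every gate passes, else the first suppression reason. */
    Optional<String> run(long userId, String channel) {
        for (Gate g : gates) {
            Optional<String> verdict = g.check(userId, channel);
            if (verdict.isPresent()) return verdict; // short-circuit
        }
        return Optional.empty();
    }
}
```

Reordering the pipeline (e.g., cheap preference checks before the Redis rate-limit call) is just reordering the list.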

🔨 Builder — NotificationRequest

Question it answers: "How do I construct a request with 8 optional fields without a 12-arg constructor?"

Fluent builder: NotificationRequest.builder().to(42).template("X").channel(EMAIL).var("name", "Sarah").priority(HIGH).build().

🪞 Singleton — Service registry & factory

Question it answers: "How many SenderFactory instances should exist?"

One. The factory holds the registry of all channels. In production this is a Spring bean; in pure Java it's an enum-based singleton.

🧊 Template Method — AbstractChannelSender

Question it answers: "Every sender does the same setup/teardown — how do I avoid duplicating it?"

Abstract base class implements send() as: validate → call doSend() hook → record metrics. Concrete senders only override doSend().
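A compact sketch of that skeleton (simplified signatures; the article's real senders return a DeliveryResult rather than a String):

```java
// Template-method sketch: the base class fixes the send skeleton
// (validate -> doSend hook -> record metrics); subclasses override only doSend().
abstract class AbstractSender {
    public final String send(String payload) {
        if (payload == null || payload.isBlank())
            throw new IllegalArgumentException("empty payload"); // shared validation
        long t0 = System.nanoTime();
        String providerMsgId = doSend(payload);                  // channel-specific hook
        recordLatency(System.nanoTime() - t0);                   // shared metrics
        return providerMsgId;
    }

    /** The only method a concrete channel sender must implement. */
    protected abstract String doSend(String payload);

    protected void recordLatency(long nanos) { /* emit metric */ }
}

// Illustrative subclass: a stub standing in for a real provider call.
class FakeEmailSender extends AbstractSender {
    @Override protected String doSend(String payload) { return "sg-msg-001"; }
}
```

Declaring `send()` final is the point of the pattern: subclasses can change *what* a send does, never *when* validation and metrics happen.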

Step 8

Notification Lifecycle — State Machine

A notification's life is more nuanced than "sent / not sent". Modeling every state explicitly lets us answer support questions like "did Sarah open the email?" and lets the worker safely resume after a crash by inspecting the state.

stateDiagram-v2
    [*] --> PENDING : created
    PENDING --> SUPPRESSED : preference or quiet hours
    PENDING --> SENT : provider accepted
    PENDING --> RETRY : transient failure
    RETRY --> SENT : retry succeeded
    RETRY --> RETRY : still failing
    RETRY --> DEAD : max attempts hit
    SENT --> DELIVERED : provider callback
    SENT --> BOUNCED : provider callback hard fail
    DELIVERED --> OPENED : user opened email
    OPENED --> CLICKED : user clicked link
    SUPPRESSED --> [*]
    DEAD --> [*]
    CLICKED --> [*]
    BOUNCED --> [*]
Why split SENT and DELIVERED? "SendGrid took the message" and "the message landed in Sarah's inbox" are different events that arrive seconds-to-minutes apart. Conflating them hides bouncing addresses and makes "delivery rate" metrics meaningless.
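One way to make the state machine enforceable in code is an explicit transition table, so an illegal write like DEAD → SENT fails loudly instead of silently corrupting the audit trail. A sketch (the `NotifStatus` name and `canTransitionTo` method are illustrative, not from the article's entity code):

```java
import java.util.Map;
import java.util.Set;

// The state machine above as an explicit allowed-transition table. Illustrative names.
enum NotifStatus {
    PENDING, SUPPRESSED, SENT, RETRY, DEAD, DELIVERED, BOUNCED, OPENED, CLICKED;

    private static final Map<NotifStatus, Set<NotifStatus>> ALLOWED = Map.of(
        PENDING,   Set.of(SUPPRESSED, SENT, RETRY),
        RETRY,     Set.of(SENT, RETRY, DEAD),
        SENT,      Set.of(DELIVERED, BOUNCED),
        DELIVERED, Set.of(OPENED),
        OPENED,    Set.of(CLICKED)
    );

    boolean canTransitionTo(NotifStatus next) {
        // terminal states (DEAD, BOUNCED, CLICKED, SUPPRESSED) have no outgoing edges
        return ALLOWED.getOrDefault(this, Set.of()).contains(next);
    }
}
```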
Step 9

Send Flow — Sequence Diagram

The happy-path send, end-to-end. Notice how the API returns to the caller before the provider is even called — that's the API/Worker split paying off.

sequenceDiagram
    actor OS as Order Service
    participant API as Notification API
    participant IDP as Idempotency Redis
    participant PREF as Preference Svc
    participant TPL as Template Svc
    participant DB as Postgres
    participant Q as Kafka
    participant W as Worker
    participant RL as Rate Limiter
    participant SG as SendGrid
    OS->>API: POST /notifications with idemKey
    API->>IDP: SETNX idemKey
    alt key already exists
        IDP-->>API: existed
        API-->>OS: 200 existing notif_id
    else new request
        IDP-->>API: claimed
        API->>PREF: isAllowed user EMAIL HIGH
        PREF-->>API: ALLOWED
        API->>TPL: render template vars locale
        TPL-->>API: subject and body
        API->>DB: INSERT notification PENDING
        API->>Q: publish to notifications.email
        API-->>OS: 202 with notif_id
    end
    Q-->>W: consume message
    W->>RL: tryAcquire user and provider
    RL-->>W: OK
    W->>SG: POST mail send
    alt success
        SG-->>W: 202 Accepted
        W->>DB: UPDATE status SENT
    else transient failure
        SG-->>W: 503 Service Unavailable
        W->>DB: UPDATE status RETRY
        W->>Q: publish with backoff delay
    end
    Note over SG,DB: Async callback later
    SG->>API: POST webhook delivered
    API->>DB: UPDATE status DELIVERED
Step 10

Java Implementation

Pure Java 17, no frameworks. Order: enums → records → strategy interface → concrete senders → factory → service → worker → demo.

Channel.java — Enum
public enum Channel { EMAIL, SMS, PUSH, IN_APP, WHATSAPP, SLACK }
Priority.java & Status.java
public enum Priority { CRITICAL, HIGH, NORMAL, LOW }

public enum Status {
    PENDING, SUPPRESSED, SENT, RETRY,
    DEAD, DELIVERED, BOUNCED, OPENED, CLICKED
}
NotificationRequest.java — Record + Builder
import java.time.Instant;
import java.util.*;

public record NotificationRequest(
    long userId,
    String templateId,
    Channel channel,
    Map<String, Object> vars,
    String idempotencyKey,
    Priority priority,
    Instant scheduledAt
) {
    public static Builder builder() { return new Builder(); }

    public static class Builder {
        private long userId;
        private String templateId;
        private Channel channel;
        private Map<String, Object> vars = new HashMap<>();
        private String idempotencyKey = UUID.randomUUID().toString();
        private Priority priority = Priority.NORMAL;
        private Instant scheduledAt = Instant.now();

        public Builder to(long u)            { this.userId = u;        return this; }
        public Builder template(String t)     { this.templateId = t;    return this; }
        public Builder channel(Channel c)     { this.channel = c;       return this; }
        public Builder var(String k, Object v) { vars.put(k, v);          return this; }
        public Builder priority(Priority p)   { this.priority = p;      return this; }
        public Builder idempotencyKey(String k){ this.idempotencyKey = k; return this; }
        public Builder scheduleAt(Instant at){ this.scheduledAt = at;   return this; }

        public NotificationRequest build() {
            if (templateId == null || channel == null)
                throw new IllegalStateException("template & channel required");
            return new NotificationRequest(userId, templateId, channel,
                    Map.copyOf(vars), idempotencyKey, priority, scheduledAt);
        }
    }
}
Notification.java — Domain entity
import java.util.concurrent.atomic.AtomicInteger;

public class Notification {
    private final String id;
    private final long userId;
    private final Channel channel;
    private final Priority priority;
    private final String renderedPayload;
    private volatile Status status;
    private final AtomicInteger attempts = new AtomicInteger(0);

    public Notification(String id, long u, Channel c, Priority p, String body) {
        this.id = id; this.userId = u; this.channel = c;
        this.priority = p; this.renderedPayload = body;
        this.status = Status.PENDING;
    }
    public int    incrementAttempts() { return attempts.incrementAndGet(); }
    public void   markSent()      { this.status = Status.SENT; }
    public void   markRetry()     { this.status = Status.RETRY; }
    public void   markDead()      { this.status = Status.DEAD; }
    public void   markSuppressed(){ this.status = Status.SUPPRESSED; }
    // getters omitted for brevity
    public String  getId()       { return id; }
    public long    getUserId()   { return userId; }
    public Channel getChannel()  { return channel; }
    public Priority getPriority(){ return priority; }
    public String  getPayload()  { return renderedPayload; }
    public Status  getStatus()   { return status; }
    public int     getAttempts() { return attempts.get(); }
}
ChannelSender.java — Strategy interface
public interface ChannelSender {
    DeliveryResult send(Notification n) throws SendException;
    Channel getChannel();
}

public record DeliveryResult(boolean success, String providerMsgId, long latencyMs) {}

public class SendException extends Exception {
    private final boolean retryable;
    public SendException(String msg, boolean retryable) {
        super(msg); this.retryable = retryable;
    }
    public boolean isRetryable() { return retryable; }
}
EmailSender.java — Concrete strategy
public class EmailSender implements ChannelSender {
    private final SendGridClient client;
    public EmailSender(SendGridClient c) { this.client = c; }

    @Override
    public DeliveryResult send(Notification n) throws SendException {
        long t0 = System.currentTimeMillis();
        try {
            String msgId = client.sendEmail(n.getUserId(), n.getPayload());
            return new DeliveryResult(true, msgId, System.currentTimeMillis() - t0);
        } catch (RateLimitedException | TimeoutException e) {
            throw new SendException(e.getMessage(), true);   // retryable
        } catch (InvalidAddressException e) {
            throw new SendException(e.getMessage(), false);  // permanent
        }
    }
    @Override public Channel getChannel() { return Channel.EMAIL; }
}
SenderFactory.java — Strategy registry
public class SenderFactory {
    private final Map<Channel, ChannelSender> registry = new EnumMap<>(Channel.class);

    public void register(ChannelSender s) { registry.put(s.getChannel(), s); }

    public ChannelSender get(Channel c) {
        ChannelSender s = registry.get(c);
        if (s == null) throw new IllegalArgumentException("No sender for " + c);
        return s;
    }
}
RetryPolicy.java — Exponential backoff
public interface RetryPolicy {
    boolean shouldRetry(int attemptNo, Throwable cause);
    Duration nextDelay(int attemptNo);
}

public class ExponentialBackoffPolicy implements RetryPolicy {
    private final int maxAttempts;
    private final Duration baseDelay;

    public ExponentialBackoffPolicy(int max, Duration base) {
        this.maxAttempts = max; this.baseDelay = base;
    }

    @Override
    public boolean shouldRetry(int attemptNo, Throwable cause) {
        if (attemptNo >= maxAttempts) return false;
        if (cause instanceof SendException se) return se.isRetryable();
        return true;
    }

    @Override
    public Duration nextDelay(int attemptNo) {
        // attemptNo is 1-based (from incrementAttempts), so delays run 1s, 2s, 4s, 8s...
        long ms = baseDelay.toMillis() << Math.min(attemptNo - 1, 20);        // cap shift to avoid overflow
        long jitter = ms >= 4 ? ThreadLocalRandom.current().nextLong(ms / 4) : 0;  // avoid thundering herd
        return Duration.ofMillis(ms + jitter);
    }
}
TokenBucketLimiter.java — Rate limiting
public class TokenBucketLimiter implements RateLimiter {
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final long capacity;
    private final long refillPerSec;

    public TokenBucketLimiter(long capacity, long refillPerSec) {
        this.capacity = capacity; this.refillPerSec = refillPerSec;
    }

    @Override
    public boolean tryAcquire(String key) {
        return buckets
            .computeIfAbsent(key, k -> new Bucket(capacity, refillPerSec))
            .tryConsume();
    }

    private static class Bucket {
        private final long capacity, refillPerSec;
        private double tokens;
        private long lastRefillNs;

        Bucket(long cap, long rps) {
            this.capacity = cap; this.refillPerSec = rps;
            this.tokens = cap; this.lastRefillNs = System.nanoTime();
        }

        synchronized boolean tryConsume() {
            long now = System.nanoTime();
            double elapsedSec = (now - lastRefillNs) / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
            lastRefillNs = now;
            if (tokens >= 1) { tokens -= 1; return true; }
            return false;
        }
    }
}
NotificationService.java — API plane façade
public class NotificationService {
    private final PreferenceService prefs;
    private final TemplateService   templates;
    private final NotificationRepository repo;
    private final QueuePublisher    publisher;
    private final IdempotencyStore  idempotency;

    public NotificationService(PreferenceService p, TemplateService t,
                               NotificationRepository r, QueuePublisher q,
                               IdempotencyStore i) {
        this.prefs = p; this.templates = t; this.repo = r;
        this.publisher = q; this.idempotency = i;
    }

    public String send(NotificationRequest req) {
        // 1. Idempotency — return existing if duplicate
        Optional<String> existing = idempotency.claim(req.idempotencyKey());
        if (existing.isPresent()) return existing.get();

        // 2. Preference + quiet-hours check (CRITICAL bypasses)
        Decision d = prefs.isAllowed(req.userId(), req.channel(), req.priority());
        if (d == Decision.SUPPRESS) {
            String id = repo.save(Notification.suppressed(req));
            idempotency.record(req.idempotencyKey(), id);   // map key → ID so duplicates return it
            return id;
        }

        // 3. Render template
        RenderedMessage body = templates.render(req.templateId(),
                                                  req.vars(),
                                                  prefs.getLocale(req.userId()));

        // 4. Persist + enqueue (in same transaction would be ideal — outbox pattern)
        Notification n = new Notification(UUID.randomUUID().toString(),
                                              req.userId(), req.channel(),
                                              req.priority(), body.payload());
        repo.save(n);
        publisher.publish(n);
        idempotency.record(req.idempotencyKey(), n.getId());   // map key → ID so duplicates return it
        return n.getId();
    }
}
NotificationWorker.java — Worker plane
public class NotificationWorker {
    private final SenderFactory factory;
    private final RetryPolicy   retry;
    private final RateLimiter   limiter;
    private final NotificationRepository repo;
    private final QueuePublisher  publisher;
    private final DeadLetterQueue dlq;
    private final EventBus        events;

    public void consume(Notification n) {
        String userKey     = "user:" + n.getUserId();
        String providerKey = "provider:" + n.getChannel();

        // NB: short-circuit means a user token may be consumed even when the
        // provider bucket rejects; one wasted token per shed is acceptable here
        if (!limiter.tryAcquire(userKey) || !limiter.tryAcquire(providerKey)) {
            publisher.publishDelayed(n, Duration.ofSeconds(1));   // shed back to queue
            return;
        }

        int attemptNo = n.incrementAttempts();
        try {
            DeliveryResult r = factory.get(n.getChannel()).send(n);
            n.markSent();
            repo.update(n);
            events.publish(new SentEvent(n.getId(), r.latencyMs()));
        } catch (SendException e) {
            if (retry.shouldRetry(attemptNo, e)) {
                n.markRetry();
                repo.update(n);
                publisher.publishDelayed(n, retry.nextDelay(attemptNo));
            } else {
                n.markDead();
                repo.update(n);
                dlq.add(n, e);
                events.publish(new DeadEvent(n.getId(), e.getMessage()));
            }
        }
    }
}
Demo.java — Wire it all up
public class Demo {
    public static void main(String[] args) {
        // Wire dependencies
        SenderFactory factory = new SenderFactory();
        factory.register(new EmailSender(new SendGridClient()));
        factory.register(new SMSSender(new TwilioClient()));
        factory.register(new PushSender(new FCMClient()));

        NotificationService svc = new NotificationService(
            new PreferenceService(),
            new TemplateService(),
            new NotificationRepository(),
            new QueuePublisher(),
            new IdempotencyStore()
        );

        // Order Service sends a confirmation
        String notifId = svc.send(
            NotificationRequest.builder()
                .to(42L)
                .template("ORDER_CONFIRMED_v3")
                .channel(Channel.EMAIL)
                .var("name", "Sarah")
                .var("orderId", "9876")
                .priority(Priority.HIGH)
                .idempotencyKey("order-9876-confirm")
                .build()
        );

        System.out.println("Tracking ID: " + notifId);

        // Duplicate request — returns same ID, no second send
        String dupe = svc.send(
            NotificationRequest.builder()
                .to(42L).template("ORDER_CONFIRMED_v3").channel(Channel.EMAIL)
                .idempotencyKey("order-9876-confirm")
                .build()
        );
        assert notifId.equals(dupe);   // idempotent ✓
    }
}
Step 11

Concurrency & Reliability

These failure modes will bite you in production. Address them in the design conversation, not after the interview ends.

🔒 Outbox Pattern

"INSERT to DB then publish to queue" is two operations. If the process crashes between them, you've lost a message. Fix: write both the notification row AND a "to_publish" outbox row in the same DB transaction. A separate publisher thread polls the outbox and pushes to Kafka, then deletes. Guarantees at-least-once delivery to the queue.

🔁 Idempotent Workers

At-least-once means the worker may see the same message twice (e.g., Kafka rebalance). Each consume() begins by checking SELECT status FROM notifications WHERE id = ?. If already SENT, skip. Otherwise, the first attempt that reaches the provider wins via a per-notification advisory lock.
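The claim step can be sketched as a compare-and-set — `SendClaim` is an illustrative in-memory stand-in for the conditional `UPDATE ... WHERE id=? AND status IN ('PENDING','RETRY')` described above:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the conditional-update claim: only the worker whose
// compare-and-set succeeds ("1 row affected") proceeds to call the provider.
public class SendClaim {
    private final Map<String, String> statusById = new ConcurrentHashMap<>();

    public void accept(String id) { statusById.put(id, "PENDING"); }

    // Atomic "0 or 1 rows affected": true means this worker owns the send.
    public boolean tryClaim(String id) {
        return statusById.replace(id, "PENDING", "SENDING")
            || statusById.replace(id, "RETRY", "SENDING");
    }

    public void markSent(String id) { statusById.put(id, "SENT"); }
}
```

A duplicate delivery of the same message sees `false` from tryClaim and simply acks without sending.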

🛡️ Circuit Breaker per Provider

If SendGrid starts returning 5xx for > 50% of calls in 30s, open the circuit — every email send fast-fails for the next 60s and the message stays queued. Without this, all worker threads pile up waiting on a dead provider, and the queue grows unbounded.
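A minimal count-based breaker sketch (consecutive-failure threshold rather than the 50%-in-30s window described above; production systems typically reach for a library like Resilience4j):

```java
// Minimal sketch: opens after N consecutive failures, fast-fails for openMillis,
// then closes again (a real breaker adds a half-open probe state).
public class CircuitBreaker {
    enum State { CLOSED, OPEN }
    private State state = State.CLOSED;
    private int failures = 0;
    private final int threshold;        // consecutive failures before opening
    private final long openMillis;      // how long to fast-fail once open
    private long openedAt;

    public CircuitBreaker(int threshold, long openMillis) {
        this.threshold = threshold; this.openMillis = openMillis;
    }

    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) return false;
            state = State.CLOSED; failures = 0;   // half-open simplified to closed
        }
        return true;
    }

    public synchronized void recordSuccess() { failures = 0; }

    public synchronized void recordFailure() {
        if (++failures >= threshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```

The worker checks `allowRequest()` before `factory.get(channel).send(n)`; on `false` it re-queues with delay instead of blocking a thread on a dead provider.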

🧮 Per-User Ordering

Sarah expects her "order shipped" email AFTER her "order confirmed" email. Kafka partitioning by user_id ensures all of Sarah's notifications land on the same partition and are processed in order by a single worker thread.
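The invariant that matters is determinism of the key→partition mapping. A simplified stand-in for Kafka's keyed partitioner (Kafka actually uses murmur2; `UserPartitioner` is illustrative):

```java
// Simplified stand-in for Kafka's default keyed partitioner: the same user_id
// always maps to the same partition, so a single consumer thread sees all of
// that user's notifications in publish order.
public final class UserPartitioner {
    private UserPartitioner() {}

    public static int partitionFor(long userId, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes
        return Math.floorMod(Long.hashCode(userId), numPartitions);
    }
}
```

In real Kafka code this is just `new ProducerRecord<>(topic, String.valueOf(userId), payload)` — supplying the key is what triggers deterministic partitioning.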

⏱️ Scheduled Sends

For scheduled_at > now(), write to DB but don't queue immediately. A scheduler thread polls WHERE status = PENDING AND scheduled_at <= now() every 30s and enqueues due ones. For long delays, a delayed-message queue (Kafka with delay topic, or RabbitMQ TTL) avoids polling.
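The polling loop can be sketched in-memory — `ScheduledSendPoller` and `Due` are illustrative names standing in for the SQL poll above:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// In-memory stand-in for: SELECT ... WHERE status='PENDING' AND scheduled_at <= now()
// In production this runs on a scheduler thread every ~30s.
public class ScheduledSendPoller {
    public record Due(String id, Instant scheduledAt) {}

    private final List<Due> pending = new ArrayList<>();

    public synchronized void schedule(String id, Instant at) {
        pending.add(new Due(id, at));
    }

    // Returns and removes everything whose time has come; the caller enqueues them.
    public synchronized List<Due> pollDue(Instant now) {
        List<Due> due = new ArrayList<>();
        for (Iterator<Due> it = pending.iterator(); it.hasNext(); ) {
            Due d = it.next();
            if (!d.scheduledAt().isAfter(now)) { due.add(d); it.remove(); }
        }
        return due;
    }
}
```

Removing on poll mirrors the DB version's status flip to QUEUED, so a due row is enqueued exactly once per poll cycle.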

📊 Backpressure

Worker plane signals queue depth back to API plane. When email queue depth > threshold, API returns 429 Retry-After for low-priority email requests but still accepts critical ones. Prevents the queue from becoming an unbounded buffer of stale messages.
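The admission check itself is tiny — `BackpressureGate` and `maxDepth` are illustrative names, not part of the classes above:

```java
// API-plane admission control: consult queue depth before accepting
// low-priority work; critical traffic is always admitted.
public class BackpressureGate {
    private final long maxDepth;

    public BackpressureGate(long maxDepth) { this.maxDepth = maxDepth; }

    /** true = accept the request; false = respond 429 with Retry-After. */
    public boolean admit(boolean critical, long currentQueueDepth) {
        return critical || currentQueueDepth <= maxDepth;
    }
}
```

The queue-depth signal would come from a periodically refreshed metric (e.g. consumer lag), not a synchronous call per request.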

Step 12

Extension Points (Open/Closed)

The shape of the design is justified by how easily it absorbs new requirements without editing existing code.

➕ Add a new channel

Want Discord? Add DISCORD to the Channel enum, create DiscordSender implements ChannelSender, and register it with the SenderFactory at startup. Zero changes to NotificationService, Worker, or any other sender.
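A sketch of that extension — the interfaces are restated minimally here so the snippet stands alone, and `DiscordClient.postWebhook` is a hypothetical provider wrapper, not a real API:

```java
// Minimal restatement of the strategy seam, plus the one new class Discord needs.
public class DiscordExtensionSketch {
    enum Channel { EMAIL, SMS, PUSH, DISCORD }       // one new enum constant

    interface ChannelSender {
        String send(String payload) throws Exception;
        Channel getChannel();
    }

    // Hypothetical webhook wrapper standing in for a real Discord client.
    static class DiscordClient {
        String postWebhook(String body) { return "discord-msg-1"; }
    }

    // The single new class: no edits to the service, worker, or other senders.
    static class DiscordSender implements ChannelSender {
        private final DiscordClient client;
        DiscordSender(DiscordClient c) { this.client = c; }
        @Override public String send(String payload) { return client.postWebhook(payload); }
        @Override public Channel getChannel() { return Channel.DISCORD; }
    }
}
```

At startup, `factory.register(new DiscordSender(client))` is the only wiring change.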

➕ Add a new retry strategy

Linear backoff for low-priority? Implement RetryPolicy as LinearBackoffPolicy, inject per-channel via config. Different channels can have different policies — push gets aggressive retries, email gets gentle ones.
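A sketch of the linear variant — the `RetryPolicy` interface is restated so the snippet compiles on its own:

```java
import java.time.Duration;

public class LinearBackoffDemo {
    interface RetryPolicy {
        boolean shouldRetry(int attemptNo, Throwable cause);
        Duration nextDelay(int attemptNo);
    }

    // base, 2×base, 3×base... instead of doubling — gentler on low-priority traffic.
    static class LinearBackoffPolicy implements RetryPolicy {
        private final int maxAttempts;
        private final Duration step;

        LinearBackoffPolicy(int maxAttempts, Duration step) {
            this.maxAttempts = maxAttempts; this.step = step;
        }

        @Override public boolean shouldRetry(int attemptNo, Throwable cause) {
            return attemptNo < maxAttempts;
        }

        @Override public Duration nextDelay(int attemptNo) {
            return step.multipliedBy(attemptNo);   // attemptNo is 1-based
        }
    }
}
```

Because workers depend only on the interface, swapping policies is a config change, not a code change.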

➕ Add a new pre-send gate

Need GDPR consent verification before EU users get notifications? Add ConsentGate to the chain-of-responsibility pipeline between Preference and Template. No edits to existing gates.
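The chain-of-responsibility shape of that pipeline can be sketched like this — `Gate`, `Verdict`, and `GatePipeline` are illustrative names, not classes from the listings above:

```java
import java.util.List;

// Each pre-send check is one Gate; the pipeline walks them in order and the
// first SUPPRESS wins. Adding a ConsentGate = one new entry in the list.
public class GateChainSketch {
    enum Verdict { PASS, SUPPRESS }

    interface Gate {
        Verdict check(long userId);
    }

    static class GatePipeline implements Gate {
        private final List<Gate> gates;
        GatePipeline(List<Gate> gates) { this.gates = gates; }

        @Override public Verdict check(long userId) {
            for (Gate g : gates)
                if (g.check(userId) == Verdict.SUPPRESS) return Verdict.SUPPRESS;
            return Verdict.PASS;
        }
    }
}
```

Existing gates never change; ordering is expressed by list position (Preference → Consent → Template).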

➕ Add bulk / batch sends

Marketing wants to email 10M users at once. Add a BulkNotificationService that fans out to send() calls, throttled. The single-send path is unchanged.

➕ Add A/B template testing

TemplateService accepts an experiment context and returns variant A or B based on user bucketing. The render call interface stays the same; the experimentation logic is internal to TemplateService.

➕ Add multi-tenancy

For SaaS use, add tenantId to NotificationRequest, partition queues by tenant, and have per-tenant rate limits. The strategy/factory shape doesn't change — only the rate-limiter key includes tenant.

Step 13

Interview Q&A & Trade-offs

Real follow-ups you'll get after presenting this design. Each answer states the trade-off, not just the solution.

Why a queue between API and worker — why not call the provider directly?
Decoupling lifecycles. The API plane has a 50ms SLA to the calling microservice; SendGrid takes 200ms-30s. A queue lets the API return in 50ms regardless of provider latency, absorbs traffic spikes (10K orders/min becomes a queue, not 10K parallel HTTP calls), and gives us a natural retry buffer. Trade-off: we lose synchronous feedback ("did it actually send?"), so we add a status-tracking API and provider callbacks instead.
Why not store template rendering on the worker side?
Fail fast, fail at the edge. Rendering on the API side means a missing variable becomes a 400 to the caller (who can fix it immediately). Rendering on the worker means it becomes a DLQ entry an on-call sees three hours later. The API is also where we have full request context (locale, A/B bucket); the worker shouldn't care about that.
How do you handle the same notification being processed by two workers?
Idempotency at two layers. (1) Client-supplied idempotency_key at the API blocks duplicate requests. (2) At the worker, before calling the provider, we do a conditional update: UPDATE notifications SET status='SENDING', attempt=attempt+1 WHERE id=? AND status IN ('PENDING','RETRY'). Only the row that returned 1 row affected proceeds; the duplicate sees 0 rows and bails. Combined with Kafka's per-partition single-consumer guarantee, double-sends are vanishingly rare.
Provider charges per send. How do you avoid spending $10K when a bug fires 1M notifications?
Three layers of defense. (1) Per-user rate limit (Sarah max 5 SMS/hour) catches user-targeted floods. (2) Per-template rate limit (template X max 100K/hour) catches template-level bugs. (3) A circuit-breaker tied to "spend rate" — if our SendGrid spend exceeds $X/min, we pause low-priority sends and page on-call. Trade-off: occasional false positives during legitimate spikes, but cost incidents are unrecoverable.
SendGrid is down. What happens?
Three things in order. (1) Circuit breaker opens after error threshold — workers fast-fail and re-queue with backoff. (2) Email queue starts growing; metrics page on-call. (3) Optional: a fallback provider wired as a secondary EmailSender — if primary circuit is open, route to AWS SES. Trade-off: maintaining two integrations doubles ops cost; usually worth it only for transactional email, not marketing.
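The fallback wiring is a decorator over the sender seam — a sketch with a minimally restated interface and illustrative names (`FailoverSender` is not a class from the listings above):

```java
// Decorator: wraps a primary sender and falls back to a secondary when the
// primary throws (e.g. route to SES when the SendGrid circuit trips).
public class FailoverSketch {
    interface Sender {
        String send(String payload) throws Exception;
    }

    static class FailoverSender implements Sender {
        private final Sender primary, secondary;

        FailoverSender(Sender primary, Sender secondary) {
            this.primary = primary; this.secondary = secondary;
        }

        @Override public String send(String payload) throws Exception {
            try {
                return primary.send(payload);
            } catch (Exception e) {
                return secondary.send(payload);   // a real version would only fall back on retryable errors
            }
        }
    }
}
```

Because the decorator implements the same interface, it registers with the factory like any other sender — callers can't tell failover exists.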
How do you handle priorities — should CRITICAL skip the queue?
Same queue infra, different topics. notifications.email.critical and notifications.email.normal as separate Kafka topics. Critical workers have higher concurrency and a dedicated provider account (so a marketing storm in normal doesn't starve OTPs). Critical also bypasses quiet-hours and per-user rate limits. Anti-pattern: using the same topic with a "priority" field — Kafka has no priority semantics, so consumers just process FIFO regardless.
How do you test this without hitting real providers?
The strategy interface is the seam. Inject a FakeEmailSender in tests that records sends to an in-memory list. The factory registry makes swap trivial. For integration tests against staging, use SendGrid's sandbox mode (returns success without actually emailing). For load tests, a NoOpSender returns immediately so we measure our own throughput, not the provider's.
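A sketch of such a recording fake — the interface is restated minimally so the snippet stands alone:

```java
import java.util.ArrayList;
import java.util.List;

// Test seam: the fake satisfies the same interface as a real sender but just
// records payloads, so tests assert on what *would* have been sent.
public class FakeSenderSketch {
    interface Sender {
        void send(String payload);
    }

    static class FakeEmailSender implements Sender {
        final List<String> recorded = new ArrayList<>();
        @Override public void send(String payload) { recorded.add(payload); }
    }
}
```

Registering the fake in the factory in place of `EmailSender` requires no production-code changes — the registry is the swap point.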
What's the data retention story?
Two-tier storage. Hot: last 30 days in Postgres for support queries. Cold: archive older rows to S3 in Parquet for analytics and compliance. delivery_attempts rolls over faster (7 days) since it's high-volume and only useful for recent debugging. PII (email body, recipient address) is encrypted at rest; for GDPR deletion requests, we delete payload but keep the metadata row for audit.
A user wants their browser-push notifications to be quiet 10pm-7am. How is that enforced?
At the API plane, before queuing. The Preference Service computes "is right now in user 42's quiet hours, in user 42's timezone?". If yes and priority is < CRITICAL, the notification is marked SUPPRESSED immediately and never enters the worker pipeline. Why not delay-then-send? Because by morning, "your friend liked your photo from yesterday" is stale and unwelcome. Trade-off: time-sensitive messages near the boundary may get suppressed; we accept that to honor user intent.
The one-line summary the interviewer remembers: "It's a stateless API plane that validates and enqueues, plus an asynchronous worker plane with strategy-per-channel, retries with exponential backoff, idempotency at two layers, and a circuit breaker per provider. Adding a new channel is a single new class."