Multi-channel, retry-safe, template-driven — Strategy, Observer, Decorator, Chain of Responsibility & async workers, end-to-end
Imagine Sarah just placed an order on a food-delivery app. Within seconds, she expects a push notification on her phone, an email receipt in her inbox, and an SMS when her food is out for delivery. The Notification Service is the single backend system that fans every "something happened" event out to the right channel, in the right format, in the right language, while respecting Sarah's preferences and never spamming her if a downstream provider melts down. Let's pin down what it must do — and what we're explicitly leaving out.
Four kinds of clients touch this system. Upstream microservices (Order, Payment, Auth) are the ones generating events — they're 99% of the traffic. End users never call the service directly; they only see the notifications and update their preferences via a settings UI. Admins manage templates and watch dashboards. External providers (Twilio, SendGrid) call back into us with delivery receipts.
Before any class diagram, let's build up the architecture in three passes: the simplest thing that could work, why it falls over, and the production shape that earns every box.
Your first instinct: a single REST endpoint POST /send. The Order Service calls it, you connect to SendGrid synchronously, return 200, done. Three lines of code.
Now fast-forward to Black Friday. Three concrete failures:
SendGrid takes 30 seconds to time out. The Order Service is now blocked for 30 seconds per checkout. Customers see spinners and bounce. Synchronous coupling propagates failure upstream.
10K orders in one minute means 10K parallel SMS calls. Twilio rate-limits us to 100/sec and starts returning 429s. Every retry is a new call, amplifying the storm. No queue, no smoothing.
Sarah unsubscribed from marketing emails. Did the Order Service check that? The Loyalty Service? The Promo Service? Without a central preference layer, every caller re-implements the rule and someone gets it wrong. Compliance lawsuit waiting to happen.
The single most important split in this system is between accepting a notification request (must be fast, synchronous, returns an ID) and delivering it (slow, async, talks to flaky third parties). They have wildly different SLAs and scale independently. Once you internalize that split, every other component falls out naturally.
Light, stateless, fast. Validates the request, enriches it with user preferences & template, writes to DB, drops a job on a queue, returns 202 Accepted with a tracking ID. Latency budget: 50ms. Scales horizontally with load balancer.
What lives here: REST controllers, request validators, the Preference Service, the Template Service, the Idempotency check.
Heavy, stateful per-attempt, slow. Pulls jobs off the queue, picks the right channel handler, calls the provider, handles retries with backoff, writes the audit row. Latency budget: seconds, not ms. Scales independently per channel.
What lives here: channel workers (Email/SMS/Push), retry coordinator, rate limiter, dead-letter handler.
Now we draw the whole thing. Every box has a number — find its matching card below to see what it does, and crucially, what would break without it.
Use the numbers in the diagram above to find the matching card below. Each one answers: what is this?, why is it here?, and what breaks without it?
Any internal service that wants to notify a user — Order, Payment, Auth, Promotions. They speak to us via a thin SDK that wraps the REST API, so they don't need to know about queues, retries, or providers. They send one HTTP call and forget.
Solves: without a single notification service, every team would re-integrate Twilio/SendGrid themselves. We'd have 12 different retry policies and 12 different bugs.
Spreads incoming HTTP across all API instances. Standard nginx / AWS ALB. Health-checks the API pods every few seconds and yanks any that go unhealthy.
Solves: a single API instance dies under any real traffic. The LB lets us scale the API plane horizontally without clients knowing.
The stateless front door. Validates the request schema (channel valid? template exists? user_id present?), pulls the user's preferences, renders the template with the supplied variables, writes a row to the DB with status PENDING, and drops a job onto the right channel queue. Returns 202 Accepted with a notification_id the caller can use to track status. Total time: under 50ms.
Solves: if the API tried to call providers directly, every API instance would be tied up waiting on Twilio for seconds at a time. Decoupling write-to-queue from actually-send is what lets us absorb spikes.
Every request carries a client-supplied idempotency_key (e.g., order-9876-confirmation). Before doing any work, the API runs SET key <notification_id> NX EX 86400 in Redis: an atomic set-if-absent with a 24-hour TTL. If the key already existed, it returns the original notification's ID without doing anything.
Solves: microservices retry. If the Order Service calls us, times out, and retries, we'd send Sarah two emails — and she'd churn. Idempotency makes "retry" safe by default.
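In code, that claim is one atomic put-if-absent. A minimal in-memory sketch (the class and method names are illustrative; production backs this with the Redis command above, and the map below doesn't model the 24-hour TTL):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the Redis set-if-absent check.
class InMemoryIdempotencyStore {
    private final ConcurrentMap<String, String> seen = new ConcurrentHashMap<>();

    /**
     * Atomically claims the key for candidateId. Returns the winner's
     * notification ID: the caller's own on first use, the original one
     * on any retry with the same key.
     */
    String claimOrGet(String key, String candidateId) {
        String prior = seen.putIfAbsent(key, candidateId);
        return prior != null ? prior : candidateId;
    }
}
```

The retrying caller gets back the same tracking ID as the first call, so from its point of view the retry "succeeded" without a second send.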
Holds Sarah's truth: she opted out of marketing SMS, prefers English, has quiet hours from 10pm to 7am IST. The API queries this for every send and either accepts, downgrades the channel (e.g., SMS → in-app), or drops it entirely with reason SUPPRESSED_BY_PREFERENCE.
Solves: compliance. GDPR, TCPA, CAN-SPAM all require honoring opt-outs. Centralizing this rule means one place to audit, not twelve places to debug.
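The quiet-hours part of that check is easy to get wrong because the window usually wraps midnight (10pm to 7am). A self-contained sketch, assuming times are already converted to the user's timezone:

```java
import java.time.LocalTime;

class QuietHours {
    /** True if `now` falls in [start, end); the window may wrap midnight. */
    static boolean contains(LocalTime now, LocalTime start, LocalTime end) {
        if (start.equals(end)) return false;                  // no window configured
        if (start.isBefore(end))                              // same-day window
            return !now.isBefore(start) && now.isBefore(end);
        return !now.isBefore(start) || now.isBefore(end);     // wraps midnight, e.g. 22:00 -> 07:00
    }
}
```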
Stores parameterized message templates by ID and version, e.g., ORDER_CONFIRMED_v3. The API passes in {name: "Sarah", order_id: "9876"} and gets back "Hi Sarah, your order #9876 is on the way!". Supports localization — picks the template variant matching the user's language.
Solves: without templates, the Order Service would hardcode message strings, and changing copy would require a deploy. With templates, the marketing team edits a row in a table.
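A minimal sketch of the variable-substitution step (real template engines add escaping, conditionals, and partials; the TemplateRenderer name and the fail-fast-on-missing-variable policy are assumptions here):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateRenderer {
    private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    /** Replaces each {{name}} placeholder with its value from vars. */
    static String render(String body, Map<String, Object> vars) {
        Matcher m = VAR.matcher(body);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Object v = vars.get(m.group(1));
            if (v == null)  // fail fast: a half-rendered message is worse than none
                throw new IllegalArgumentException("Missing var: " + m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(v.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```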
The source of truth for every notification. Stores the rendered payload, recipient, channel, status (PENDING / SENT / DELIVERED / FAILED / DEAD), attempts, provider response, and timestamps. Every state change writes here. The API and the Worker both read & write it.
Solves: support tickets ("did Sarah get her receipt?"), audit trails, and the ability to replay any notification from history. Without a DB, a queue-only design loses everything the moment a message is consumed.
Kafka topics (or RabbitMQ exchanges) — one per channel: notifications.email, notifications.sms, notifications.push. The API publishes here; workers consume. Each topic is partitioned by user_id so all of Sarah's notifications land on the same partition (preserving order per user).
Solves: back-pressure. If SendGrid is slow today, the email topic fills up but the SMS workers keep humming. One slow channel can't block the others.
Token-bucket sitting between the queue and the workers. Two limits: per-user (Sarah gets max 5 SMS/hour to prevent harassment) and per-provider (we never exceed Twilio's 100/sec quota for our account). Backed by Redis.
Solves: two pains at once. Without per-user limits, a runaway loop in the Promo Service spams Sarah 1000 times. Without per-provider limits, we get our Twilio account temporarily banned.
A small in-worker router that looks at the message's channel field and hands it to the correct strategy class (EmailSender, SMSSender, PushSender). This is where the Strategy pattern lives — adding WhatsApp later means adding one new sender class and registering it here.
Solves: if every worker had giant if (channel == EMAIL) ... else if ... blocks, adding a channel would mean editing ten files. The dispatcher localizes that branching to one place.
One worker pool per channel. Pulls a message, calls its sender strategy, handles the response. On success: updates DB to SENT, records latency. On failure: increments attempt count, schedules a retry with exponential backoff, or routes to dead-letter if attempts exceeded. Each pool autoscales independently — push workers might be 10 pods, email workers might be 50.
Solves: independent scaling. Push notifications are 10× cheaper to send than emails, so we want different pool sizes. Lumping them into one pool means we can't tune them separately.
The actual delivery infrastructure: SendGrid for email, Twilio for SMS, FCM/APNS for mobile push. We treat them as black boxes behind our sender strategies. Each one has its own API quirks, error codes, and rate limits — those quirks live inside the corresponding sender class and never leak out.
Solves: we don't build email infra. SendGrid handles SPF, DKIM, deliverability, ISP relations — billions of dollars of work we'd never replicate.
Where messages go to die — but not silently. After N retry attempts (typically 5), the message is moved here with the full error history. An on-call engineer is paged if the DLQ depth crosses a threshold. From here, messages can be inspected, fixed, and replayed via the admin tool.
Solves: visibility into systemic failure. Without a DLQ, a bad template or an expired API key would silently fail forever and nobody would notice until customers complained.
A separate webhook endpoint that providers call back asynchronously to tell us "this message was actually delivered to the device" or "the user opened it" or "the user clicked the link". Updates the DB row from SENT to DELIVERED / OPENED / CLICKED.
Solves: "did SendGrid take it" ≠ "did it land in Sarah's inbox". Read receipts give product teams real engagement data and tell us when a phone is unreachable so we can stop trying.
Every state change emits an event to a metrics pipeline (Prometheus + Grafana for live dashboards, ClickHouse for historical queries). Tracks success rate per channel, per provider, per template — surfacing things like "SMS to India dropped from 99% to 84% delivery this hour".
Solves: operating blind. You only know your provider switched their behavior when graphs change shape. Without dashboards, the first signal is angry customers.
Sarah hits "Place Order" at 14:02:06. Here's exactly what happens, mapped to the numbered components above:
1. The Order Service ① POSTs {user_id: 42, template: "ORDER_CONFIRMED_v3", channel: "EMAIL", vars: {...}, idempotency_key: "order-9876-confirm"} to the Load Balancer ②.
2. An API instance ③ claims the idempotency key ④ (fresh, so proceed), checks Sarah's preferences ⑤ (email allowed), renders the template ⑥, writes a row to the DB ⑦ with status PENDING, then publishes the message to the notifications.email topic in Kafka ⑧, partitioned by user_id=42. Returns 202 {notification_id: "n_abc123"} to the Order Service. Elapsed: 38ms.
3. An email worker ⑪ pulls the message, clears the rate limiter ⑨, and the channel dispatcher ⑩ routes it to the EmailSender, which calls SendGrid ⑫. SendGrid returns 202 Accepted in 740ms.
4. The worker updates the row to SENT, emits a notification.sent event to Audit ⑮.
5. Later, SendGrid's webhook ⑭ calls back: delivered. We update the row to DELIVERED.
6. Sarah opens the email and clicks through; the webhook reports clicked. Row updates to CLICKED. Marketing now knows the email worked.

Now we map the architecture to objects. The trick is to find the nouns ("notification", "template", "channel", "user preference") and the things that vary ("which channel?", "which retry policy?"). Variations become enums or strategy classes; nouns become entities.
| Entity | Responsibility | Key Fields |
|---|---|---|
| Notification | One send attempt for one user on one channel — the unit of work | id, userId, channel, templateId, payload, status, attempts, createdAt |
| NotificationRequest | The inbound API DTO — what callers actually post | userId, templateId, channel, vars, idempotencyKey, priority, scheduleAt |
| Template | A reusable, parameterized message body per channel + locale | id, version, channel, locale, subject, body, requiredVars[] |
| UserPreference | Per-user channel opt-ins, language, quiet-hours window | userId, channelOptIns Map, language, quietStart, quietEnd, timezone |
| Recipient | Resolved contact info for a user — email, phone, deviceTokens | userId, email, phone, deviceTokens List, slackId |
| DeliveryAttempt | One try at sending — captures provider response & latency | id, notificationId, attemptNo, providerCode, status, errorMsg, latencyMs, attemptedAt |
| ChannelSender | Strategy interface — one impl per channel (Email/SMS/Push/...) | + send(notif): Result, + getChannel(): Channel |
| RetryPolicy | Decides if & when to retry after a failure | + shouldRetry(attemptNo, error): boolean, + nextDelay(attemptNo): Duration |
| RateLimiter | Token-bucket gate — per user & per provider | + tryAcquire(key): boolean |
EMAIL, SMS, PUSH, IN_APP, WHATSAPP, SLACK
Each maps to one ChannelSender strategy. Adding a new channel = adding one enum constant + one sender class.
PENDING → SENT → DELIVERED → OPENED / CLICKED
Sad path: PENDING → FAILED → RETRY → ... → DEAD or SUPPRESSED when preferences block.
CRITICAL, HIGH, NORMAL, LOW
Critical (OTP, payment receipts) skips quiet-hours and rate limits. Low (marketing) gets dropped first when overloaded.
Before classes, lock down the data model. Notice how delivery_attempt is a separate table — every retry gets a new row, so we have a complete forensic trail when things go wrong. The idempotency_key is unique per client, preventing duplicate sends.
Key indexes:

- (user_id, created_at DESC) for "show me Sarah's recent notifications".
- (status, scheduled_at) for the worker poll query.
- idempotency_key needs a UNIQUE index.
- (template_id, locale) for template lookup.

The NotificationService is the façade. It composes a Preference checker, a Template renderer, and a SenderFactory that resolves the right strategy per channel. ChannelSender is the strategy interface — each concrete sender is a swappable implementation.
This problem is a pattern playground because the variation points are loud and clear. Each pattern below is paired with the exact question it answers.
Question it answers: "How do I send via email vs SMS vs push without a giant if-else?"
Each channel implements the same ChannelSender interface. The worker holds a reference to the interface and never knows which concrete sender it's calling. Adding WhatsApp = one new class, zero edits to existing code (Open/Closed).
Question it answers: "Given a channel enum, how do I get the right sender?"
Maintains a Map<Channel, ChannelSender> populated at startup. The worker calls factory.get(channel) instead of newing senders. Makes mocking in tests trivial.
Question it answers: "How do I emit metrics & audit logs without coupling them to send logic?"
The worker publishes lifecycle events (SENT, FAILED, DELIVERED) to a topic. Audit, Metrics, and Webhook subscribers each consume independently. Adding a new subscriber doesn't touch send code.
Question it answers: "Where do I put rate-limiting, logging, and circuit-breaking without polluting senders?"
Wrap a ChannelSender in RateLimitedSender, then in LoggingSender, then in CircuitBreakerSender. Each layer adds one concern. Composition order is configurable.
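A compressed, self-contained sketch of that layering: Sender stands in for the full ChannelSender interface and a Predicate<String> stands in for RateLimiter.tryAcquire, so the snippet runs on its own. Each wrapper adds exactly one concern; composition order is just constructor nesting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified stand-in for ChannelSender.
interface Sender { boolean send(String payload); }

// Decorator 1: consult the rate limiter before delegating.
class RateLimitedSender implements Sender {
    private final Sender delegate;
    private final Predicate<String> tryAcquire;
    RateLimitedSender(Sender d, Predicate<String> limiter) { delegate = d; tryAcquire = limiter; }
    public boolean send(String payload) {
        if (!tryAcquire.test("provider:email")) return false; // shed; caller requeues
        return delegate.send(payload);
    }
}

// Decorator 2: record every send that reaches this layer.
class LoggingSender implements Sender {
    private final Sender delegate;
    final List<String> log = new ArrayList<>();
    LoggingSender(Sender d) { delegate = d; }
    public boolean send(String payload) {
        boolean ok = delegate.send(payload);
        log.add(payload + " -> " + ok);
        return ok;
    }
}
```

With rate-limiting outermost, a shed request never even reaches the logging layer; swap the nesting and every shed attempt gets logged too.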
Question it answers: "How do I run a sequence of validations (preferences → quiet hours → rate limit → template) cleanly?"
Each gate is a handler that either passes the notification along or short-circuits with SUPPRESSED. Easy to reorder, easy to add a new gate.
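A self-contained sketch of that pipeline; the gate lambdas and the string verdicts below are illustrative, not part of the full code listing:

```java
import java.util.List;
import java.util.Optional;

// A gate either passes (empty) or short-circuits with a suppression reason.
interface Gate {
    Optional<String> check(long userId, String channel, String priority);
}

class Pipeline {
    private final List<Gate> gates;
    Pipeline(List<Gate> gates) { this.gates = gates; }

    /** Returns the first gate's suppression reason, or empty if all pass. */
    Optional<String> run(long userId, String channel, String priority) {
        for (Gate g : gates) {
            Optional<String> verdict = g.check(userId, channel, priority);
            if (verdict.isPresent()) return verdict; // short-circuit
        }
        return Optional.empty();
    }
}
```

Reordering gates or inserting a new one (e.g., a consent check) is a change to the list passed into Pipeline, not to any existing gate.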
Question it answers: "How do I construct a request with 8 optional fields without a 12-arg constructor?"
Fluent builder: NotificationRequest.builder().to(42).template("X").channel(EMAIL).var("name", "Sarah").priority(HIGH).build().
Question it answers: "How many SenderFactory instances should exist?"
One. The factory holds the registry of all channels. In production this is a Spring bean; in pure Java it's an enum-based singleton.
Question it answers: "Every sender does the same setup/teardown — how do I avoid duplicating it?"
Abstract base class implements send() as: validate → call doSend() hook → record metrics. Concrete senders only override doSend().
A notification's life is more nuanced than "sent / not sent". Modeling every state explicitly lets us answer support questions like "did Sarah open the email?" and lets the worker safely resume after a crash by inspecting the state.
The happy-path send, end-to-end. Notice how the API returns to the caller before the provider is even called — that's the API/Worker split paying off.
Pure Java 17, no frameworks. Order: enums → records → strategy interface → concrete senders → factory → service → worker → demo.
**Channel.java — Enum**

```java
public enum Channel { EMAIL, SMS, PUSH, IN_APP, WHATSAPP, SLACK }
```

**Priority.java & Status.java**
```java
public enum Priority { CRITICAL, HIGH, NORMAL, LOW }

public enum Status { PENDING, SUPPRESSED, SENT, RETRY, DEAD, DELIVERED, BOUNCED, OPENED, CLICKED }
```

**NotificationRequest.java — Record + Builder**
```java
import java.time.Instant;
import java.util.*;

public record NotificationRequest(
        long userId,
        String templateId,
        Channel channel,
        Map<String, Object> vars,
        String idempotencyKey,
        Priority priority,
        Instant scheduledAt
) {
    public static Builder builder() { return new Builder(); }

    public static class Builder {
        private long userId;
        private String templateId;
        private Channel channel;
        private Map<String, Object> vars = new HashMap<>();
        private String idempotencyKey = UUID.randomUUID().toString();
        private Priority priority = Priority.NORMAL;
        private Instant scheduledAt = Instant.now();

        public Builder to(long u)               { this.userId = u; return this; }
        public Builder template(String t)       { this.templateId = t; return this; }
        public Builder channel(Channel c)       { this.channel = c; return this; }
        public Builder var(String k, Object v)  { vars.put(k, v); return this; }
        public Builder priority(Priority p)     { this.priority = p; return this; }
        public Builder idempotencyKey(String k) { this.idempotencyKey = k; return this; }
        public Builder scheduleAt(Instant at)   { this.scheduledAt = at; return this; }

        public NotificationRequest build() {
            if (templateId == null || channel == null)
                throw new IllegalStateException("template & channel required");
            return new NotificationRequest(userId, templateId, channel, Map.copyOf(vars),
                    idempotencyKey, priority, scheduledAt);
        }
    }
}
```

**Notification.java — Domain entity**
```java
import java.util.concurrent.atomic.AtomicInteger;

public class Notification {
    private final String id;
    private final long userId;
    private final Channel channel;
    private final Priority priority;
    private final String renderedPayload;
    private volatile Status status;
    private final AtomicInteger attempts = new AtomicInteger(0);

    public Notification(String id, long u, Channel c, Priority p, String body) {
        this.id = id; this.userId = u; this.channel = c;
        this.priority = p; this.renderedPayload = body;
        this.status = Status.PENDING;
    }

    public int incrementAttempts()  { return attempts.incrementAndGet(); }
    public void markSent()          { this.status = Status.SENT; }
    public void markRetry()         { this.status = Status.RETRY; }
    public void markDead()          { this.status = Status.DEAD; }
    public void markSuppressed()    { this.status = Status.SUPPRESSED; }

    public String getId()           { return id; }
    public long getUserId()         { return userId; }
    public Channel getChannel()     { return channel; }
    public Priority getPriority()   { return priority; }
    public String getPayload()      { return renderedPayload; }
    public Status getStatus()       { return status; }
    public int getAttempts()        { return attempts.get(); }
}
```

**ChannelSender.java — Strategy interface**
```java
public interface ChannelSender {
    DeliveryResult send(Notification n) throws SendException;
    Channel getChannel();
}

public record DeliveryResult(boolean success, String providerMsgId, long latencyMs) {}

public class SendException extends Exception {
    private final boolean retryable;

    public SendException(String msg, boolean retryable) {
        super(msg);
        this.retryable = retryable;
    }

    public boolean isRetryable() { return retryable; }
}
```

**EmailSender.java — Concrete strategy**
```java
public class EmailSender implements ChannelSender {
    private final SendGridClient client;

    public EmailSender(SendGridClient c) { this.client = c; }

    @Override
    public DeliveryResult send(Notification n) throws SendException {
        long t0 = System.currentTimeMillis();
        try {
            String msgId = client.sendEmail(n.getUserId(), n.getPayload());
            return new DeliveryResult(true, msgId, System.currentTimeMillis() - t0);
        } catch (RateLimitedException | TimeoutException e) {
            throw new SendException(e.getMessage(), true);   // retryable
        } catch (InvalidAddressException e) {
            throw new SendException(e.getMessage(), false);  // permanent
        }
    }

    @Override
    public Channel getChannel() { return Channel.EMAIL; }
}
```

**SenderFactory.java — Strategy registry**
```java
public class SenderFactory {
    private final Map<Channel, ChannelSender> registry = new EnumMap<>(Channel.class);

    public void register(ChannelSender s) { registry.put(s.getChannel(), s); }

    public ChannelSender get(Channel c) {
        ChannelSender s = registry.get(c);
        if (s == null) throw new IllegalArgumentException("No sender for " + c);
        return s;
    }
}
```

**RetryPolicy.java — Exponential backoff**
```java
public interface RetryPolicy {
    boolean shouldRetry(int attemptNo, Throwable cause);
    Duration nextDelay(int attemptNo);
}

public class ExponentialBackoffPolicy implements RetryPolicy {
    private final int maxAttempts;
    private final Duration baseDelay;

    public ExponentialBackoffPolicy(int max, Duration base) {
        this.maxAttempts = max;
        this.baseDelay = base;
    }

    @Override
    public boolean shouldRetry(int attemptNo, Throwable cause) {
        if (attemptNo >= maxAttempts) return false;
        if (cause instanceof SendException se) return se.isRetryable();
        return true;
    }

    @Override
    public Duration nextDelay(int attemptNo) {
        long ms = baseDelay.toMillis() * (1L << attemptNo);          // 1s, 2s, 4s, 8s...
        long jitter = ThreadLocalRandom.current().nextLong(ms / 4);  // avoid thundering herd
        return Duration.ofMillis(ms + jitter);
    }
}
```

**TokenBucketLimiter.java — Rate limiting**
```java
public interface RateLimiter {
    boolean tryAcquire(String key);
}

public class TokenBucketLimiter implements RateLimiter {
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final long capacity;
    private final long refillPerSec;

    public TokenBucketLimiter(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
    }

    @Override
    public boolean tryAcquire(String key) {
        return buckets
                .computeIfAbsent(key, k -> new Bucket(capacity, refillPerSec))
                .tryConsume();
    }

    private static class Bucket {
        private final long capacity, refillPerSec;
        private double tokens;
        private long lastRefillNs;

        Bucket(long cap, long rps) {
            this.capacity = cap;
            this.refillPerSec = rps;
            this.tokens = cap;
            this.lastRefillNs = System.nanoTime();
        }

        synchronized boolean tryConsume() {
            long now = System.nanoTime();
            double elapsedSec = (now - lastRefillNs) / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
            lastRefillNs = now;
            if (tokens >= 1) { tokens -= 1; return true; }
            return false;
        }
    }
}
```

**NotificationService.java — API plane façade**
```java
public class NotificationService {
    private final PreferenceService prefs;
    private final TemplateService templates;
    private final NotificationRepository repo;
    private final QueuePublisher publisher;
    private final IdempotencyStore idempotency;

    public NotificationService(PreferenceService p, TemplateService t,
                               NotificationRepository r, QueuePublisher q,
                               IdempotencyStore i) {
        this.prefs = p; this.templates = t; this.repo = r;
        this.publisher = q; this.idempotency = i;
    }

    public String send(NotificationRequest req) {
        // 1. Idempotency — return existing if duplicate
        Optional<String> existing = idempotency.claim(req.idempotencyKey());
        if (existing.isPresent()) return existing.get();

        // 2. Preference + quiet-hours check (CRITICAL bypasses)
        Decision d = prefs.isAllowed(req.userId(), req.channel(), req.priority());
        if (d == Decision.SUPPRESS) {
            return repo.save(Notification.suppressed(req));
        }

        // 3. Render template
        RenderedMessage body = templates.render(req.templateId(), req.vars(),
                prefs.getLocale(req.userId()));

        // 4. Persist + enqueue (in same transaction would be ideal — outbox pattern)
        Notification n = new Notification(UUID.randomUUID().toString(), req.userId(),
                req.channel(), req.priority(), body.payload());
        repo.save(n);
        publisher.publish(n);
        return n.getId();
    }
}
```

**NotificationWorker.java — Worker plane**
```java
public class NotificationWorker {
    private final SenderFactory factory;
    private final RetryPolicy retry;
    private final RateLimiter limiter;
    private final NotificationRepository repo;
    private final QueuePublisher publisher;
    private final DeadLetterQueue dlq;
    private final EventBus events;

    public void consume(Notification n) {
        String userKey = "user:" + n.getUserId();
        String providerKey = "provider:" + n.getChannel();
        if (!limiter.tryAcquire(userKey) || !limiter.tryAcquire(providerKey)) {
            publisher.publishDelayed(n, Duration.ofSeconds(1)); // shed back to queue
            return;
        }

        int attemptNo = n.incrementAttempts();
        try {
            DeliveryResult r = factory.get(n.getChannel()).send(n);
            n.markSent();
            repo.update(n);
            events.publish(new SentEvent(n.getId(), r.latencyMs()));
        } catch (SendException e) {
            if (retry.shouldRetry(attemptNo, e)) {
                n.markRetry();
                repo.update(n);
                publisher.publishDelayed(n, retry.nextDelay(attemptNo));
            } else {
                n.markDead();
                repo.update(n);
                dlq.add(n, e);
                events.publish(new DeadEvent(n.getId(), e.getMessage()));
            }
        }
    }
}
```

**Demo.java — Wire it all up**
```java
public class Demo {
    public static void main(String[] args) {
        // Wire dependencies
        SenderFactory factory = new SenderFactory();
        factory.register(new EmailSender(new SendGridClient()));
        factory.register(new SMSSender(new TwilioClient()));
        factory.register(new PushSender(new FCMClient()));

        NotificationService svc = new NotificationService(
                new PreferenceService(), new TemplateService(),
                new NotificationRepository(), new QueuePublisher(),
                new IdempotencyStore()
        );

        // Order Service sends a confirmation
        String notifId = svc.send(
                NotificationRequest.builder()
                        .to(42L)
                        .template("ORDER_CONFIRMED_v3")
                        .channel(Channel.EMAIL)
                        .var("name", "Sarah")
                        .var("orderId", "9876")
                        .priority(Priority.HIGH)
                        .idempotencyKey("order-9876-confirm")
                        .build()
        );
        System.out.println("Tracking ID: " + notifId);

        // Duplicate request — returns same ID, no second send
        String dupe = svc.send(
                NotificationRequest.builder()
                        .to(42L).template("ORDER_CONFIRMED_v3").channel(Channel.EMAIL)
                        .idempotencyKey("order-9876-confirm")
                        .build()
        );
        assert notifId.equals(dupe); // idempotent ✓
    }
}
```
Three things will bite you in production. Address them in the design conversation, not after the interview ends.
"INSERT to DB then publish to queue" is two operations. If the process crashes between them, you've lost a message. Fix: write both the notification row AND a "to_publish" outbox row in the same DB transaction. A separate publisher thread polls the outbox and pushes to Kafka, then deletes. Guarantees at-least-once delivery to the queue.
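An in-memory sketch of the outbox shape (a real implementation uses one DB transaction for both inserts and a relay that deletes a row only after the broker acks; the class below just models the two tables and the drain):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class Outbox {
    final List<String> notifications = new ArrayList<>(); // stands in for the notifications table
    final Deque<String> pending = new ArrayDeque<>();     // stands in for the outbox table

    // In a real DB both writes share one transaction: commit both or neither.
    synchronized void saveWithOutbox(String notificationId) {
        notifications.add(notificationId);
        pending.addLast(notificationId);
    }

    // Relay loop: publish, then delete. A crash between the two just
    // republishes the row on restart; at-least-once, deduped by the worker.
    synchronized void drainTo(List<String> broker) {
        while (!pending.isEmpty()) broker.add(pending.removeFirst());
    }
}
```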
At-least-once means the worker may see the same message twice (e.g., Kafka rebalance). Each consume() begins by checking SELECT status FROM notifications WHERE id = ?. If already SENT, skip. Otherwise, the first attempt that reaches the provider wins via a per-notification advisory lock.
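That conditional update is effectively a compare-and-set on the status column. An in-memory analogue using ConcurrentMap.replace (the ClaimTable name is illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ClaimTable {
    private final ConcurrentMap<String, String> status = new ConcurrentHashMap<>();

    ClaimTable(String notificationId) { status.put(notificationId, "PENDING"); }

    /** Only the consumer that flips PENDING/RETRY -> SENDING proceeds. */
    boolean tryClaim(String id) {
        return status.replace(id, "PENDING", "SENDING")
            || status.replace(id, "RETRY", "SENDING");
    }
}
```

The second delivery of the same Kafka message fails both CAS attempts and bails, exactly like the SQL UPDATE reporting 0 rows affected.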
If SendGrid starts returning 5xx for > 50% of calls in 30s, open the circuit — every email send fast-fails for the next 60s and the message stays queued. Without this, all worker threads pile up waiting on a dead provider, and the queue grows unbounded.
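A minimal breaker sketch. Note the paragraph above describes a windowed error rate (over 50% of calls in 30s), which needs a sliding window in practice; the consecutive-failure counter and the thresholds below are a simplified stand-in:

```java
class CircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMs;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means closed

    CircuitBreaker(int failureThreshold, long coolDownMs) {
        this.failureThreshold = failureThreshold;
        this.coolDownMs = coolDownMs;
    }

    /** Closed or cool-down elapsed (half-open probe): allow. Open: fast-fail. */
    synchronized boolean allowRequest(long nowMs) {
        if (openedAt < 0) return true;
        return nowMs - openedAt >= coolDownMs;
    }

    synchronized void recordSuccess() { consecutiveFailures = 0; openedAt = -1; }

    synchronized void recordFailure(long nowMs) {
        if (++consecutiveFailures >= failureThreshold) openedAt = nowMs;
    }
}
```

While the circuit is open, the worker never calls the provider; the message stays queued and worker threads stay free.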
Sarah expects her "order shipped" email AFTER her "order confirmed" email. Kafka partitioning by user_id ensures all of Sarah's notifications land on the same partition and are processed in order by a single worker thread.
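The guarantee reduces to a deterministic key-to-partition mapping (Kafka's real default partitioner hashes the key bytes with murmur2; the modulo below just illustrates the property that matters, same user means same partition):

```java
class Partitioner {
    /** Same userId always maps to the same partition, preserving per-user order. */
    static int partitionFor(long userId, int partitionCount) {
        return Math.floorMod(Long.hashCode(userId), partitionCount);
    }
}
```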
For scheduled_at > now(), write to DB but don't queue immediately. A scheduler thread polls WHERE status = PENDING AND scheduled_at <= now() every 30s and enqueues due ones. For long delays, a delayed-message queue (Kafka with delay topic, or RabbitMQ TTL) avoids polling.
Worker plane signals queue depth back to API plane. When email queue depth > threshold, API returns 429 Retry-After for low-priority email requests but still accepts critical ones. Prevents the queue from becoming an unbounded buffer of stale messages.
The shape of the design is justified by how easily it absorbs new requirements without editing existing code.
Want Discord? Add DISCORD to the Channel enum, create DiscordSender implements ChannelSender, and register it with the SenderFactory at startup. Zero changes to NotificationService, Worker, or any other sender.
Linear backoff for low-priority? Implement RetryPolicy as LinearBackoffPolicy, inject per-channel via config. Different channels can have different policies — push gets aggressive retries, email gets gentle ones.
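A sketch of that alternative policy; the RetryPolicy interface is repeated from the code section so the snippet stands alone, and LinearBackoffPolicy's step value is illustrative:

```java
import java.time.Duration;

interface RetryPolicy {
    boolean shouldRetry(int attemptNo, Throwable cause);
    Duration nextDelay(int attemptNo);
}

class LinearBackoffPolicy implements RetryPolicy {
    private final int maxAttempts;
    private final Duration step;

    LinearBackoffPolicy(int maxAttempts, Duration step) {
        this.maxAttempts = maxAttempts;
        this.step = step;
    }

    public boolean shouldRetry(int attemptNo, Throwable cause) {
        return attemptNo < maxAttempts;
    }

    public Duration nextDelay(int attemptNo) {
        return step.multipliedBy(attemptNo); // with a 5s step: 5s, 10s, 15s...
    }
}
```

Because workers depend only on the RetryPolicy interface, swapping policies per channel is a configuration change, not a code change.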
Need GDPR consent verification before EU users get notifications? Add ConsentGate to the chain-of-responsibility pipeline between Preference and Template. No edits to existing gates.
Marketing wants to email 10M users at once. Add a BulkNotificationService that fans out to send() calls, throttled. The single-send path is unchanged.
TemplateService accepts an experiment context and returns variant A or B based on user bucketing. The render call interface stays the same; the experimentation logic is internal to TemplateService.
For SaaS use, add tenantId to NotificationRequest, partition queues by tenant, and have per-tenant rate limits. The strategy/factory shape doesn't change — only the rate-limiter key includes tenant.
Real follow-ups you'll get after presenting this design. Each answer states the trade-off, not just the solution.
**"How do you make sure a notification is sent only once?"** Two layers. (1) The idempotency_key at the API blocks duplicate requests. (2) At the worker, before calling the provider, we do a conditional update: UPDATE notifications SET status='SENDING', attempt=attempt+1 WHERE id=? AND status IN ('PENDING','RETRY'). Only the consumer whose update affects 1 row proceeds; the duplicate sees 0 rows and bails. Combined with Kafka's per-partition single-consumer guarantee, double-sends are vanishingly rare.

**"How do you stop a bug from burning your provider budget?"** Three guards. (1) Per-user rate limits catch runaway caller loops. (2) A per-template send cap (template X max 100K/hour) catches template-level bugs. (3) A circuit-breaker tied to "spend rate" — if our SendGrid spend exceeds $X/min, we pause low-priority sends and page on-call. Trade-off: occasional false positives during legitimate spikes, but cost incidents are unrecoverable.

**"How do priorities actually work at the queue level?"** notifications.email.critical and notifications.email.normal are separate Kafka topics. Critical workers have higher concurrency and a dedicated provider account (so a marketing storm in normal doesn't starve OTPs). Critical also bypasses quiet-hours and per-user rate limits. Anti-pattern: using the same topic with a "priority" field — Kafka has no priority semantics, so consumers just process FIFO regardless.

**"How do you test this without emailing real users?"** A FakeEmailSender in tests records sends to an in-memory list. The factory registry makes the swap trivial. For integration tests against staging, use SendGrid's sandbox mode (returns success without actually emailing). For load tests, a NoOpSender returns immediately so we measure our own throughput, not the provider's.

**"What's the data-retention story?"** Notification metadata is kept for audit; delivery_attempts rolls over faster (7 days) since it's high-volume and only useful for recent debugging. PII (email body, recipient address) is encrypted at rest; for GDPR deletion requests, we delete the payload but keep the metadata row for audit.

**"What happens to a marketing push during quiet hours?"** It's marked SUPPRESSED immediately and never enters the worker pipeline. Why not delay-then-send? Because by morning, "your friend liked your photo from yesterday" is stale and unwelcome. Trade-off: time-sensitive messages near the boundary may get suppressed; we accept that to honor user intent.