BACKEND INTERVIEW · SYSTEM DESIGN

Kafka Internals, Debugging
& Payment Idempotency

Three must-know interview topics — explained with diagrams, commands, and code
Q1

Discuss the basic working of Kafka and its various components

Overview

What is Apache Kafka?

Kafka is a distributed, fault-tolerant, high-throughput event streaming platform. Think of it as a durable commit log where producers write events and consumers read them — at their own pace, without losing data.

                            KAFKA CLUSTER
  ┌────────┐         ┌────────────────────────────────────┐         ┌────────┐
  │Producer│────────►│                                    │────────►│Consumer│
  │   A    │  write  │   Broker 1    Broker 2    Broker 3 │  read   │ Group 1│
  └────────┘         │   ┌──────┐   ┌──────┐   ┌──────┐   │         └────────┘
  ┌────────┐         │   │ P0   │   │ P1   │   │ P2   │   │         ┌────────┐
  │Producer│────────►│   │(Lead)│   │(Lead)│   │(Lead)│   │────────►│Consumer│
  │   B    │         │   │ P1   │   │ P2   │   │ P0   │   │         │ Group 2│
  └────────┘         │   │(Rep) │   │(Rep) │   │(Rep) │   │         └────────┘
                     │   └──────┘   └──────┘   └──────┘   │
                     │                                    │
                     │          ZooKeeper / KRaft         │
                     │       (metadata & coordination)    │
                     └────────────────────────────────────┘

  P0, P1, P2 = Partitions of a Topic
  (Lead) = Leader partition    (Rep) = Replica/Follower
Components

Core Components Explained

1. Producer

Publishes messages (events) to a Kafka topic. Decides which partition to send to — via round-robin, key-based hashing, or custom partitioner. Supports batching and compression.

2. Consumer

Reads messages from topic partitions. Tracks its position using offsets. Part of a Consumer Group for parallel processing. Can auto-commit or manually commit offsets.

3. Broker

A single Kafka server that stores data and serves clients. A cluster has multiple brokers for fault-tolerance. Each broker handles read/write for the partitions it leads.

4. Topic

A logical category/feed name to which messages are published. Think of it as a "table" in a database. Messages in a topic are immutable and append-only.

5. Partition

A topic is split into partitions for parallelism. Each partition is an ordered, immutable log. Messages within a partition have a sequential offset. Partitions are distributed across brokers.

6. Consumer Group

A set of consumers that cooperatively read from a topic. Each partition is assigned to exactly one consumer in the group. Enables horizontal scaling of consumption.
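The "each partition goes to exactly one consumer in the group" rule can be sketched as a toy round-robin assignment (class and method names are invented for illustration; real clients use pluggable assignors such as range, round-robin, or cooperative-sticky):

```java
import java.util.*;

public class GroupAssignment {
    // Assign each partition to exactly one consumer, round-robin.
    // Consumers beyond the partition count receive nothing (they idle).
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            String consumer = consumers.get(p % consumers.size());
            assignment.get(consumer).add(p);
        }
        return assignment;
    }
}
```

With 3 partitions and 4 consumers, the fourth consumer receives an empty assignment and sits idle, which is exactly the "consumers > partitions" warning in the diagram below.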

7. Offset

A unique sequential ID for each message within a partition. Consumers track their offset to know where they left off. Stored in an internal topic: __consumer_offsets.

8. ZooKeeper / KRaft

Manages cluster metadata: broker registration, topic configs, partition leaders. KRaft (Kafka Raft) is the new built-in replacement for ZooKeeper, removing the external dependency.

Message Flow

How a Message Travels Through Kafka

Producer serializes the message

Key + Value are serialized (JSON, Avro, Protobuf). If a key is present, it's hashed to determine the target partition. No key → round-robin across partitions.
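The routing rule can be sketched as follows (a simplification: the real Java client hashes keys with murmur2, and modern clients use sticky batching rather than strict round-robin for unkeyed messages; `String.hashCode` here is a stand-in):

```java
public class PartitionRouter {
    private int roundRobinCounter = 0;

    // Keyed messages: hash the key so the same key always lands on the
    // same partition (preserving per-key ordering).
    // Unkeyed messages: rotate across partitions.
    public int route(String key, int numPartitions) {
        if (key != null) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
        return roundRobinCounter++ % numPartitions;
    }
}
```

The key takeaway for interviews: a keyed message is deterministically routed, which is why all events for one user or order stay in order.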

Message sent to the Partition Leader broker

Producer's metadata cache knows which broker leads each partition. Message is sent directly to that broker. Batching and compression happen here for throughput.

Leader writes to its commit log

The message is appended to the partition's on-disk log. It gets assigned a unique offset (e.g., offset 42). The write is sequential I/O — this is why Kafka is fast.

Replicas (followers) pull from the leader

Follower brokers fetch the new message and write it to their local log. Once enough replicas confirm (based on acks config), the message is "committed."

Leader sends acknowledgment to producer

acks=0: fire-and-forget (no ack). acks=1: leader confirms. acks=all: all ISR replicas confirm. Tradeoff: durability vs latency.
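A durability-first producer configuration might look like this (the keys are standard Kafka producer configs; the specific values are illustrative, not prescriptive):

```properties
# producer.properties (illustrative values)
acks=all                  # wait for all in-sync replicas to confirm
enable.idempotence=true   # broker dedupes retried sends
retries=2147483647        # retry transient failures indefinitely
linger.ms=5               # small batching window for throughput
compression.type=lz4      # compress batches on the wire
```

The latency-first alternative is acks=1 (or acks=0 for metrics-style data you can afford to lose).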

Consumer polls the broker for new messages

Consumer sends a fetch request with its current offset. Broker returns messages from that offset onwards. Consumer processes them and commits the new offset.
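The fetch loop can be modelled with an in-memory log standing in for the broker (class and method names are hypothetical; this mirrors poll() followed by commitSync()):

```java
import java.util.*;

public class FetchLoop {
    private final List<String> log = new ArrayList<>(); // partition's append-only log
    private long committedOffset = 0;                   // consumer's committed position

    public void append(String msg) { log.add(msg); }    // "producer" writes

    // Return everything from the committed offset onward,
    // then commit the new position.
    public List<String> pollAndCommit() {
        List<String> batch = new ArrayList<>(
            log.subList((int) committedOffset, log.size()));
        committedOffset = log.size();
        return batch;
    }

    public long position() { return committedOffset; }
}
```

Note that the broker never pushes: the consumer pulls from its own offset, which is what lets different groups read the same log at different speeds.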

Key Concepts

Important Concepts Interviewers Expect You to Know

  ┌─────────────── PARTITION 0 ────────────────────────────────┐
  │ offset: 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ ...        │
  │           │   │   │   │   │   │   │   │   │              │
  │ ◄─── old messages ──── │ ──── new messages ───►           │
  │                         │                                  │
  │                    committed                               │
  │                     offset                                 │
  │                    (Consumer                               │
  │                     position)                              │
  └────────────────────────────────────────────────────────────┘

  CONSUMER GROUP ASSIGNMENT:
  ┌─────────────────────────────────────────────┐
  │ Topic: "orders" (3 partitions)              │
  │                                             │
  │   Consumer Group "payment-service":         │
  │     Consumer A  ← reads P0                  │
  │     Consumer B  ← reads P1                  │
  │     Consumer C  ← reads P2                  │
  │                                             │
  │   Consumer Group "analytics-service":       │
  │     Consumer X  ← reads P0, P1             │
  │     Consumer Y  ← reads P2                  │
  │                                             │
  │   ⚠ If consumers > partitions:              │
  │     Extra consumers sit IDLE                │
  └─────────────────────────────────────────────┘

  ISR (In-Sync Replicas):
  ┌─────────────────────────────────────────────┐
  │  Broker 1 (Leader)  → offset 100  ✓ ISR    │
  │  Broker 2 (Follower)→ offset 99   ✓ ISR    │
  │  Broker 3 (Follower)→ offset 85   ✗ OUT    │
  │                                             │
  │  replica.lag.time.max.ms controls how long  │
  │  a follower may lag before leaving the ISR  │
  └─────────────────────────────────────────────┘
  Config                 What it controls                         Trade-off
  ─────────────────────  ───────────────────────────────────────  ─────────────────────────────
  acks                   How many replicas must confirm a write   Durability vs latency
  replication.factor     Number of copies of each partition       Fault-tolerance vs storage
  min.insync.replicas    Minimum ISR count to accept a write      Availability vs consistency
  retention.ms           How long messages are kept               Storage vs replay ability
  num.partitions         Parallelism of a topic                   Throughput vs ordering
  enable.auto.commit     Auto vs manual offset commits            Convenience vs exactly-once
  max.poll.records       Max messages per consumer poll()         Throughput vs processing time
Q2

How to debug if a producer publishes messages but a consumer cannot consume them?

Approach

Systematic Debugging Checklist

Follow this order — from most common to least common causes. In an interview, walk through this methodically.

Verify the producer is actually writing successfully

Don't assume the producer is working. Check producer logs for errors. Verify with the console consumer or by checking topic offsets.

# Read the topic from the beginning to confirm messages exist
kafka-console-consumer --bootstrap-server localhost:9092 --topic orders --from-beginning

# Check the topic's end offsets (non-zero means the producer did write)
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic orders

Check if the consumer is in the correct Consumer Group

A very common issue: the consumer is subscribed under a different group ID than you expect, so it reads from that group's committed offsets, or its group has already consumed and committed past the messages.

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group payment-service

Check consumer group LAG

If LAG = 0 but consumer sees no messages, it means all messages were already consumed (offset is at the end). If LAG > 0 but consumer isn't reading, the consumer is likely stuck or disconnected.

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group payment-service

# Output columns: TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
# If LAG keeps growing                → consumer is slow or dead
# If CURRENT-OFFSET = LOG-END-OFFSET  → nothing new to read

Check topic & partition assignment

Is the consumer assigned to partitions? If the group has more consumers than partitions, some consumers will be idle. Also check if the consumer is subscribed to the correct topic name (typo!).

# In consumer logs, look for:
#   "Assigned partitions: [orders-0, orders-1]"
# If "Assigned partitions: []" → consumer is idle or rebalancing

Check auto.offset.reset config

If a consumer group is NEW (no committed offsets) and auto.offset.reset=latest, it will only see messages published AFTER it started. Old messages are invisible. Set to earliest to consume from the beginning.

# In consumer config:
# auto.offset.reset=earliest → start from the earliest retained offset
# auto.offset.reset=latest   → start from the log end (miss old messages)
# auto.offset.reset=none     → throw an exception if no committed offset exists

Check for consumer rebalancing loops

If processing takes too long, the consumer fails to call poll() within max.poll.interval.ms, gets kicked from the group, rejoins, gets reassigned → infinite loop. Look for "Member X has been removed" in logs.

# Fix: increase max.poll.interval.ms or decrease max.poll.records
# Or fix the slow processing in your consumer code

Check for deserialization errors

Producer writes Avro, consumer expects JSON → deserialization fails silently or throws. Consumer might be crashing on every message without making progress.

# Check consumer logs for:
#   "SerializationException" or "DeserializationException"
# Ensure the producer serializer matches the consumer deserializer

Check network / ACL / authentication

Consumer might not have permission to read from the topic (ACL issue), or there's a firewall between consumer and broker. Also check if SSL/SASL configs match.

# Check broker logs for:
#   "Authorization failed" or "Authentication failed"
kafka-acls --list --topic orders --bootstrap-server localhost:9092

Check if messages have expired (retention)

If retention.ms is short and consumer was offline for too long, messages were deleted before the consumer could read them. The consumer's committed offset now points to a deleted segment.

# Check topic retention:
kafka-configs --describe --entity-type topics --entity-name orders --bootstrap-server localhost:9092
Quick Reference

Diagnosis Matrix

  Symptom                            Likely Cause                                        Fix
  ─────────────────────────────────  ──────────────────────────────────────────────────  ─────────────────────────────────────────
  LAG = 0, no messages               Already consumed, or auto.offset.reset=latest       Reset offsets or use earliest
  LAG growing, consumer active       Slow processing or rebalancing                      Increase poll interval, reduce batch size
  LAG growing, no active consumer    Consumer crashed or disconnected                    Check consumer logs, restart
  Consumer assigned 0 partitions     More consumers than partitions, or wrong topic      Scale down consumers or add partitions
  Constant rebalancing               Processing exceeds max.poll.interval.ms             Increase timeout or optimize processing
  Deserialization errors in logs     Schema mismatch between producer and consumer       Align serializers, use Schema Registry
  Consumer starts but reads nothing  ACL denied, wrong bootstrap server, or wrong group  Check permissions, config, network
Interview Tip: Start by saying "I'd approach this systematically" and walk through producer verification → consumer group check → lag analysis → config check. This shows structured thinking.
Q3

In a payment system, how do you ensure retries don't deduct balance multiple times?

The Problem

Why Retries Cause Double Deductions

  THE PROBLEM SCENARIO:
  ═══════════════════════════════════════════════════════════════

  Client                    Payment Service               Database
    │                              │                          │
    │──── POST /pay $100 ────────►│                          │
    │                              │── BEGIN TX ──────────────►│
    │                              │   UPDATE balance -= 100   │
    │                              │── COMMIT ────────────────►│
    │                              │                          │
    │    ✗ TIMEOUT (network)       │◄── 200 OK ──────────────│
    │    (client never got         │    (but response lost     │
    │     the response)            │     in network)           │
    │                              │                          │
    │──── POST /pay $100 ────────►│                          │
    │     (RETRY! same request)    │── BEGIN TX ──────────────►│
    │                              │   UPDATE balance -= 100   │
    │                              │── COMMIT ────────────────►│  ⚠ DOUBLE
    │◄── 200 OK ──────────────────│◄── OK ───────────────────│    DEDUCTION!
    │                              │                          │

  Balance was 500.
  After 1st request: 400 ✓
  After retry:       300 ✗  ← User lost $100 extra!
Solution

Idempotency Key — The Industry Standard

The core idea: every payment request carries a unique idempotency_key. Before processing, we check if we've already processed this key. If yes, return the cached response. If no, process and store the result.

  THE SOLUTION:
  ═══════════════════════════════════════════════════════════════

  Client                    Payment Service               Database
    │                              │                          │
    │  POST /pay $100              │                          │
    │  Idempotency-Key: "abc-123" │                          │
    │─────────────────────────────►│                          │
    │                              │── Check: "abc-123"       │
    │                              │   exists in              │
    │                              │   idempotency_store? ───►│
    │                              │◄── NO ──────────────────│
    │                              │                          │
    │                              │── BEGIN TX ──────────────►│
    │                              │   1. INSERT into          │
    │                              │      idempotency_store    │
    │                              │      (key, status,result) │
    │                              │   2. UPDATE balance -=100 │
    │                              │── COMMIT ────────────────►│
    │                              │                          │
    │  ✗ TIMEOUT                   │◄── OK ──────────────────│
    │                              │                          │
    │  POST /pay $100              │                          │
    │  Idempotency-Key: "abc-123" │                          │
    │─────────────────────────────►│                          │
    │                              │── Check: "abc-123"       │
    │                              │   exists? ──────────────►│
    │                              │◄── YES, return cached ──│
    │                              │                          │
    │◄── 200 OK (cached) ─────────│    ✅ NO DOUBLE          │
    │    (same response as 1st)    │       DEDUCTION!         │

  Balance: 400 ✓ (deducted only once)
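The check-then-cache flow above can be sketched with an in-memory store (a stand-in for the database table described next; class and method names are invented for illustration, and real code must make the check, deduction, and store-write atomic):

```java
import java.util.*;

public class IdempotentPayments {
    private final Map<String, Long> accounts = new HashMap<>();
    private final Map<String, String> processed = new HashMap<>(); // key -> cached response

    public IdempotentPayments(String user, long balance) {
        accounts.put(user, balance);
    }

    // Deduct at most once per idempotency key; replay the cached
    // response on any retry carrying the same key.
    public String pay(String idempotencyKey, String user, long amount) {
        String cached = processed.get(idempotencyKey);
        if (cached != null) return cached;          // retry: no second deduction

        accounts.put(user, accounts.get(user) - amount);
        String response = "OK: balance=" + accounts.get(user);
        processed.put(idempotencyKey, response);    // store result with the deduction
        return response;
    }

    public long balance(String user) { return accounts.get(user); }
}
```

Calling pay("abc-123", ...) twice deducts once and returns the identical response both times, which is exactly what the timed-out client needs.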
Implementation

Database Schema & Code

idempotency_store table
-- Stores the result of each unique payment request
CREATE TABLE idempotency_store (
    idempotency_key   VARCHAR(255) PRIMARY KEY,
    status            VARCHAR(20),   -- PROCESSING, SUCCESS, FAILED
    response_code     INT,
    response_body     TEXT,           -- cached JSON response
    created_at        TIMESTAMP DEFAULT NOW(),
    expires_at        TIMESTAMP        -- TTL: auto-cleanup old keys
);

-- UNIQUE constraint on key ensures no duplicates even
-- under race conditions (DB-level safety net)
PaymentService.java — Idempotent Handler
public class PaymentService {

    @Transactional
    public PaymentResponse processPayment(PaymentRequest req) {
        String key = req.getIdempotencyKey();

        // ── Step 1: Check if already processed ──
        Optional<IdempotencyRecord> existing =
            idempotencyRepo.findByKey(key);

        if (existing.isPresent()) {
            IdempotencyRecord record = existing.get();

            if (record.getStatus() == Status.PROCESSING) {
                // Another thread is processing this right now
                throw new ConflictException(
                    "Payment is being processed. Retry later.");
            }

            if (record.getStatus() == Status.SUCCESS) {
                // Already completed — return cached response
                return record.getCachedResponse();
            }

            // Status.FAILED: remove the stale record so this retry
            // can reattempt the payment below
            idempotencyRepo.deleteByKey(key);
        }

        // ── Step 2: Claim this key (with PROCESSING status) ──
        try {
            idempotencyRepo.insert(new IdempotencyRecord(
                key, Status.PROCESSING, null
            ));
        } catch (DuplicateKeyException e) {
            // Race condition: another thread inserted first
            throw new ConflictException("Duplicate request");
        }

        // ── Step 3: Process the actual payment ──
        try {
            validateBalance(req.getUserId(), req.getAmount());
            deductBalance(req.getUserId(), req.getAmount());
            createTransaction(req);

            PaymentResponse response = new PaymentResponse(
                Status.SUCCESS, "Payment processed");

            // ── Step 4: Save result & mark as SUCCESS ──
            idempotencyRepo.updateStatus(
                key, Status.SUCCESS, response);

            return response;

        } catch (Exception e) {
            // Payment failed — mark as FAILED so retries can reattempt
            idempotencyRepo.updateStatus(
                key, Status.FAILED, null);
            throw e;
        }
    }
}
Critical: Step 2 (claim the key), Step 3 (deduct balance), and Step 4 (save the result) must happen in the SAME database transaction. This ensures atomicity: if any step fails, everything rolls back, including the idempotency record.
All Strategies

5 Approaches to Payment Idempotency

① Idempotency Key (Primary Strategy) ⭐

  • Client generates a UUID for each unique payment intent
  • Server stores key + result in the same DB transaction as the payment
  • Retries with the same key return the cached result
  • Used by: Stripe, Razorpay, PayPal

② Unique Transaction ID / Request ID

  • Similar to idempotency key, but server-generated
  • Add UNIQUE constraint on transaction table: (user_id, request_id)
  • DB rejects the duplicate INSERT → no double deduction
  • Simpler but relies on DB constraint enforcement

③ Optimistic Locking with Version Check

  • Each account has a version column
  • UPDATE accounts SET balance = balance - 100, version = version + 1 WHERE id = ? AND version = ?
  • If version changed (concurrent update), UPDATE affects 0 rows → retry with fresh read
  • Prevents lost updates but doesn't prevent retries inherently

④ State Machine for Transactions

  • Payment goes through states: INITIATED → PROCESSING → SUCCESS/FAILED
  • Only INITIATED → PROCESSING transition deducts money
  • Retries check current state — if already PROCESSING or SUCCESS, skip deduction
  • Used for complex multi-step payments (escrow, settlements)
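A minimal sketch of the state-machine guard (states and method names are illustrative; the invariant is that only the INITIATED to PROCESSING transition touches money):

```java
public class PaymentStateMachine {
    public enum State { INITIATED, PROCESSING, SUCCESS, FAILED }

    private State state = State.INITIATED;
    private int deductions = 0;

    // Only the INITIATED -> PROCESSING transition deducts money;
    // any other attempt (i.e., a retry) is rejected as a no-op.
    public synchronized boolean startProcessing() {
        if (state != State.INITIATED) return false;
        state = State.PROCESSING;
        deductions++;            // the single deduction happens here
        return true;
    }

    public synchronized void complete(boolean ok) {
        state = ok ? State.SUCCESS : State.FAILED;
    }

    public synchronized int deductions() { return deductions; }
}
```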

⑤ Kafka Producer Idempotency (for event-driven systems)

  • Enable enable.idempotence=true in Kafka producer config
  • Kafka assigns a Producer ID + Sequence Number to each message
  • Broker detects and deduplicates retried messages automatically
  • Guarantees exactly-once at the Kafka layer — still need app-level idempotency for DB writes
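The broker-side dedup can be modelled as tracking the last accepted sequence number per (producer ID, partition) pair (a simplification of the real protocol, which also handles batches and out-of-order sequences):

```java
import java.util.*;

public class BrokerDedup {
    // (producerId:partition) -> last accepted sequence number
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    // Accept a message only if its sequence advances past the last
    // accepted one; a retried duplicate is silently dropped.
    public boolean append(long producerId, int partition, int seq, String msg) {
        String slot = producerId + ":" + partition;
        Integer last = lastSeq.get(slot);
        if (last != null && seq <= last) return false; // duplicate retry, drop it
        lastSeq.put(slot, seq);
        log.add(msg);
        return true;
    }

    public int size() { return log.size(); }
}
```

A producer retrying (PID 7, seq 0) after a lost ack gets the duplicate discarded, so the log still holds the message exactly once.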
Edge Cases

Race Conditions & How to Handle Them

  RACE CONDITION: Two retries arrive simultaneously
  ═══════════════════════════════════════════════════════════

  Thread A                          Thread B
     │                                  │
     │── Check key "abc-123" ──►        │── Check key "abc-123" ──►
     │◄── NOT FOUND ───────────         │◄── NOT FOUND ───────────
     │                                  │
     │── INSERT key "abc-123" ──►       │── INSERT key "abc-123" ──►
     │◄── OK ──────────────────         │◄── DuplicateKeyException! ✓
     │                                  │    (DB UNIQUE constraint
     │── Deduct balance ──────►         │     saves us)
     │◄── OK ──────────────────         │
     │                                  │── Return "conflict" ────►
     │── Mark SUCCESS ────────►         │
     │◄── OK ──────────────────         │

  SOLUTION: The UNIQUE constraint on idempotency_key in the DB
  acts as the final safety net, even if the application-level
  check misses a race condition.


  ADDITIONAL SAFEGUARDS:
  ┌─────────────────────────────────────────────────────────┐
  │  1. SELECT ... FOR UPDATE (pessimistic locking)         │
  │     Lock the row when checking, preventing concurrent   │
  │     reads until the transaction completes.              │
  │                                                         │
  │  2. Redis SETNX (distributed lock for microservices)    │
  │     SETNX idempotency:abc-123 "processing" EX 30       │
  │     Only one thread wins. Others get "already locked."  │
  │                                                         │
  │  3. Database advisory locks                             │
  │     pg_advisory_xact_lock(hash(idempotency_key))        │
  │     Lightweight, auto-released on transaction end.      │
  └─────────────────────────────────────────────────────────┘
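Safeguard 2 can be sketched as a set-if-absent lock with a TTL (an in-memory stand-in for Redis SET NX EX; only one caller wins the key, and time is passed in explicitly to keep the sketch testable):

```java
import java.util.*;

public class SetNxLock {
    private final Map<String, Long> locks = new HashMap<>(); // key -> expiry (millis)

    // Acquire only if the key is absent or its TTL has lapsed,
    // mirroring the semantics of: SET key value NX EX ttl
    public synchronized boolean tryLock(String key, long nowMillis, long ttlMillis) {
        Long expiry = locks.get(key);
        if (expiry != null && expiry > nowMillis) return false; // someone holds it
        locks.put(key, nowMillis + ttlMillis);
        return true;
    }
}
```

The TTL matters: if the lock holder crashes, the key expires and a later retry can proceed instead of deadlocking forever.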
Combined: Kafka + Payment Idempotency

End-to-End Architecture

  FULL ARCHITECTURE: Event-Driven Payment with Idempotency
  ═══════════════════════════════════════════════════════════

  ┌──────────┐     ┌────────────┐     ┌─────────────────┐     ┌──────────┐
  │  Client  │────►│    API     │────►│  Kafka Topic    │────►│ Payment  │
  │  (App)   │     │  Gateway   │     │  "payment-      │     │ Consumer │
  │          │     │            │     │   requests"     │     │ Service  │
  └──────────┘     └────────────┘     └─────────────────┘     └────┬─────┘
       │                                                           │
       │  Idempotency-Key                                          │
       │  in HTTP header                                           │
       │                                  ┌───────────────────┐    │
       │                                  │   Database         │◄───┘
       │                                  │  ┌───────────────┐ │
       │                                  │  │idempotency_   │ │
       │                                  │  │store          │ │
       │                                  │  ├───────────────┤ │
       │                                  │  │accounts       │ │
       │                                  │  ├───────────────┤ │
       │                                  │  │transactions   │ │
       │                                  │  └───────────────┘ │
       │                                  └───────────────────┘
       │
       │  THREE LAYERS OF IDEMPOTENCY:
       │
       │  Layer 1: API Gateway deduplicates by Idempotency-Key
       │           (Redis cache, TTL = 24h)
       │
       │  Layer 2: Kafka producer idempotence (enable.idempotence=true)
       │           Prevents duplicate messages in topic
       │
       │  Layer 3: Payment Consumer checks idempotency_store in DB
       │           (UNIQUE constraint = final safety net)
Interview Q&A

Follow-up Questions They Might Ask

Who generates the idempotency key — client or server?
Client. The client generates a UUID before making the request. This way, if the client retries, it sends the same key. If the server generated it, the client wouldn't know the key to send on retry (because the response was lost). Stripe's API uses this exact approach — clients pass an Idempotency-Key header.
How long do you keep idempotency keys?
24–48 hours typically. After that, the same key can be reused. Use a TTL-based cleanup: expires_at = created_at + 24h. A background job or database TTL mechanism (Cassandra TTL, Redis EXPIRE, pg_cron) deletes expired records. Stripe keeps keys for 24 hours.
What if the server crashes AFTER deducting balance but BEFORE saving the idempotency record?
This is exactly why both operations must be in the SAME database transaction. If the process crashes mid-transaction, the DB rolls back everything — both the balance deduction AND the idempotency record insertion. On retry, the key won't exist, so it processes normally. Atomicity is non-negotiable here.
What about distributed systems with separate DBs?
Use the Outbox Pattern. Write the payment result + an "outbox event" in the same local transaction. A separate process reads the outbox and publishes to Kafka. The consumer on the other service also checks its own idempotency store. This gives you exactly-once semantics across services without distributed transactions (which are fragile and slow).
How does Kafka's built-in idempotency work?
Producer-side: With enable.idempotence=true, the broker assigns each producer a PID (Producer ID) and tracks sequence numbers per partition. If a retry sends the same (PID, sequence) pair, the broker silently discards the duplicate. Consumer-side: Use isolation.level=read_committed with transactions to get exactly-once delivery. But this only covers the Kafka layer — your application still needs its own idempotency logic for database writes.
Can you use Redis instead of a DB for idempotency checks?
Yes, but with caveats. Redis is faster for lookups, making it great for the "check" step. But Redis is not durable by default — if Redis crashes, you lose keys and might reprocess. Best practice: Use Redis as a fast first-level cache (SETNX with TTL), and the database UNIQUE constraint as the durable safety net. Two layers: speed + safety.
Difference between idempotency and exactly-once processing?
Idempotency = applying the same operation multiple times produces the same result (application-level design). Exactly-once = the system guarantees a message is processed exactly one time (infrastructure-level guarantee). In practice, true exactly-once is extremely hard. Most systems achieve "effectively once" by combining at-least-once delivery + idempotent consumers. Kafka Streams achieves exactly-once within Kafka using transactions, but cross-system exactly-once still needs application-level idempotency.
Summary

Quick Interview Cheat Sheet

┌──────────────────────────────────────────────────────────────────┐
│                    INTERVIEW ANSWER STRUCTURE                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Q1: KAFKA BASICS                                                 │
│  → "Kafka is a distributed commit log with producers, consumers,  │
│     brokers, topics, partitions, and consumer groups."            │
│  → Draw the architecture diagram                                  │
│  → Explain message flow (6 steps)                                │
│  → Mention: acks, ISR, replication, retention, offsets            │
│                                                                   │
│  Q2: DEBUG CONSUMER NOT CONSUMING                                │
│  → "I'd approach this systematically..."                         │
│  → Verify producer → check consumer group → check lag             │
│  → Check auto.offset.reset → rebalancing → deserialization       │
│  → Show you know the CLI commands                                │
│                                                                   │
│  Q3: PAYMENT IDEMPOTENCY                                         │
│  → "Every request gets a unique idempotency key"                 │
│  → Check-before-process + SAME transaction + UNIQUE constraint    │
│  → Draw the before/after diagram                                 │
│  → Mention: Redis cache layer, TTL cleanup, outbox pattern       │
│  → Handle race conditions: SETNX, SELECT FOR UPDATE, DB unique   │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘