High-Level Design

LeetCode / Online Judge

A coding platform that runs untrusted code from millions of strangers — sandboxed runners, an async submission queue, contest leaderboards in real time, and the data-plane / control-plane split that keeps it all from setting itself on fire.

Read this with the framework in mind

This deep-dive applies the standard HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

Clarify Requirements

Before any boxes-and-arrows, we pin down what this thing must do and — just as importantly — what it must not do. The interesting half of LeetCode is not the problem list; it's the willingness to run code submitted by people we have never met.

✅ Functional

  • Browse a catalog of problems with statement, examples, constraints
  • Submit code in 10–15 languages (Python, C++, Java, Go, JS, …)
  • Get a verdict: AC, WA, TLE, MLE, RE, CE
  • "Run" mode — execute code on a custom test input (debug)
  • Submission history per user, per problem
  • Contests with a real-time leaderboard
  • Discussions, editorials, hints

⚙️ Non-Functional

  • Verdict latency: p99 under ~5 seconds for normal submissions
  • Sandboxed execution — one bad submission cannot touch the host
  • Fairness — contest traffic must not be starved by practice traffic
  • Availability — 99.9% (contest start = hard deadline, not "soon")
  • Scale — 100K concurrent during a weekly contest

🚫 Out of Scope (today)

  • IDE features (intellisense, refactor) — use the browser editor
  • Pair programming / mock interviews
  • Payments, billing, subscription tiers
  • Mobile native apps — responsive web only

🎯 The one thing that matters

We are building a service that runs strangers' code on our infrastructure. Every other component — problem DB, leaderboard, discussions — is a normal CRUD app. The sandbox is where the architecture lives or dies.

Step 2

Capacity & Scale Estimation

Numbers tell you which boxes need to be plural. Without numbers, you end up over-engineering the easy parts and under-engineering the hard ones.

| Dimension | Number | Why it matters |
| --- | --- | --- |
| Total users | ~10M | Catalog reads — cacheable, easy |
| Daily active users | ~1M | Steady-state load on API + judge |
| Submissions / day | ~5M (avg) | ~60/sec average |
| Submissions / sec (peak — weekly contest) | ~1,000–5,000 | The number that sizes the runner pool |
| Avg test cases per problem | ~30–100 | Each "submission" = many runner executions |
| Avg CPU per submission | 1–3 seconds | Determines runner throughput per node |
| Concurrent contest users (global round) | ~100K | Huge spike, narrow window |
| Storage — submissions | ~1KB × 5M/day = 5GB/day | Append-only, cheap to keep forever |
| Storage — test cases | ~3K problems × ~5MB = ~15GB | Static, S3-friendly, CDN-cacheable |
The one bottleneck: at peak, we need ~5,000 submissions/sec × 2s CPU each = 10,000 CPU-seconds per second. That's 10,000 always-on cores — or, more realistically, an autoscaling pool that doubles when a contest starts and shrinks when it ends. This single number sizes 80% of the infrastructure cost.
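That arithmetic is worth being able to reproduce on a whiteboard; here it is as a few lines of Python. The 64-core host size and 25% headroom are illustrative assumptions, not numbers from the estimates above:

```python
# Back-of-envelope runner-pool sizing, using the numbers from the table.
peak_sps = 5_000            # peak submissions/sec during a weekly contest
cpu_per_submission = 2      # avg CPU-seconds per judged submission

cores_needed = peak_sps * cpu_per_submission   # CPU-seconds needed per second
assert cores_needed == 10_000

# Assumed for illustration: 64-core runner hosts, 25% headroom for
# retries and uneven load across pools.
headroom, cores_per_host = 1.25, 64
hosts_at_peak = round(cores_needed * headroom / cores_per_host)
print(hosts_at_peak)        # ≈195 hosts at contest peak
```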
Step 3

Actors & Use Cases

Three kinds of people touch this system, and they want very different things from it. Naming them up front prevents the design from drifting toward "average user" — there is no average.

flowchart LR P([Practitioner]) C([Contestant]) A([Admin / Setter]) P --> BR[Browse Problems] P --> RU[Run Code on Custom Input] P --> SU[Submit Solution] P --> HI[View Submission History] P --> DI[Read Discussions] C --> JC[Join Contest] C --> CS[Submit During Contest] C --> LB[View Live Leaderboard] A --> AP[Add / Edit Problem] A --> AT[Upload Test Cases] A --> AC[Configure Contest] style P fill:#e8743b,stroke:#e8743b,color:#fff style C fill:#4a90d9,stroke:#4a90d9,color:#fff style A fill:#9b72cf,stroke:#9b72cf,color:#fff

🟠 Practitioner

The 95% case — Priya, a backend engineer prepping for interviews after dinner. She wants the catalog snappy, the editor responsive, and her verdict back before she gets bored. She is patient enough to wait 5 seconds, but not 30.

🔵 Contestant

Same person, different mode. Once a week she joins a 90-minute timed contest with 100,000 strangers. Now fairness matters: her verdict can't be slower than someone else's just because the queue picked theirs first. Leaderboard rank must update in seconds, not minutes.

🟣 Admin / Setter

The problem author. Rarely on the platform, but when they're there they upload massive test files (a 50MB stress test is normal) and need a private "validation run" to make sure their solution actually passes their own tests.

Step 4

High-Level Architecture

We'll build this in three passes. First the naive design (and the four ways it falls over), then the central mental model, then the production shape with every component justified by something the previous design got wrong.

Pass 1 — The naive design (one server, one process)

Imagine you're building a hackathon prototype. The whole thing is one Flask app on one EC2 box. When Priya hits "Submit", the API handler shells out:

# naive_judge.py — DON'T do this in production
import subprocess

def judge(submission, testcases):
    with open("/tmp/sol.py", "w") as f:
        f.write(submission.code)
    for tc in testcases:
        out = subprocess.check_output(
            ["python", "/tmp/sol.py"],
            input=tc.input,
            timeout=2,  # raises TimeoutExpired (TLE) and CalledProcessError (RE) — both unhandled
        )
        if out != tc.expected:
            return "WA"
    return "AC"
flowchart LR CL([Client]) WS[Single Web Server
+ Python subprocess] DB[(SQLite)] CL -- POST /submit --> WS WS -- shell out --> WS WS --> DB style CL fill:#e8743b,stroke:#e8743b,color:#fff style WS fill:#e05252,stroke:#e05252,color:#fff style DB fill:#9b72cf,stroke:#9b72cf,color:#fff

Why this fails — four concrete deaths:

💀 Security

A malicious submission is just os.system("rm -rf /"). It runs as the same user as your API, on the same disk as your problems database. One curl and your business is gone.

💀 Isolation

An infinite loop or a fork-bomb pegs the only CPU. Now every other user's page load times out — including the homepage. One person broke the entire site.

💀 Throughput

2-second runs × 1 process = 0.5 submissions/sec. At a contest peak we need 5,000/sec. We are off by 10,000×.

💀 Latency coupling

The API request blocks for 5–30 seconds while testcases run. Half of those connections will time out at the load balancer before the verdict ever comes back. Users hit "submit" twice. Now you have duplicates.

So what? Each death names a component we'll need: sandbox (security + isolation), worker pool (throughput), async queue + push channel (latency coupling). The architecture is being written for us by the failures.

Pass 2 — The mental model: Web tier vs. Judge tier

The single biggest idea in this design is that the system has two completely different jobs, and they should run on completely different machines. Mixing them is the original sin of Pass 1.

🌐 Web Tier — trusted, fast, stateless

Standard web app stuff. Read problems, write submissions, list users, render pages. Predictable: every request is short, well-typed, and runs code we wrote. Scales horizontally with no surprises. CPU per request is tiny (microseconds).

Lives behind: a load balancer + autoscaler tuned for HTTP latency.

⚙️ Judge Tier — untrusted, slow, dangerous

Runs somebody else's code against test data. Unpredictable: 1ms or 30s, peaceful or hostile, leaks memory or fork-bombs. CPU per request is huge (seconds). Each runner is a disposable container that gets nuked after one submission.

Lives behind: a queue, never reachable directly from the internet.

The contract between them is a single message on a queue. Web → "judge this submission, here's the ID". Judge → "submission X got verdict Y". They never share a process, never share a disk, never share an OS user. That single boundary is what makes the whole design safe to operate.
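That contract can be made concrete as two tiny message shapes. The field names below are illustrative — the design doesn't pin down a schema — but each message's direction is exactly the Pass-2 boundary:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class JudgeRequest:
    """Web tier -> judge tier. The only thing the web tier ever sends."""
    submission_id: str
    problem_id: str
    language: str

@dataclass
class JudgeVerdict:
    """Judge tier -> everyone downstream, via the result stream."""
    submission_id: str
    verdict: str          # AC | WA | TLE | MLE | RE | CE | SE
    runtime_ms: int

# The tiers never share a process; everything crosses as serialized bytes.
msg = json.dumps(asdict(JudgeRequest("sub-1", "two-sum", "python")))
```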

Pass 3 — The production shape

Now the full picture. Every box below is here because Pass 1 broke without it. Numbers ①–⑫ on the diagram match the explanation grid below.

flowchart LR CL([① Client App
browser editor]) CDN[② CDN
statements · static assets] GW[③ API Gateway
auth · rate limit] API[④ API Service
problems · users · submissions] Q[/⑤ Submission Queue
Kafka/] DISP[⑥ Judge Dispatcher
routes by language] POOL[⑦ Runner Sandbox Pool
Firecracker/gVisor microVMs] TC[(⑧ Test Case Store
S3)] RS[/⑨ Result Stream
Kafka topic/] WS[⑩ WebSocket Push
verdict fan-out] LB[⑪ Leaderboard Service
Redis ZSET] DS[(⑫ Postgres + Redis
data + cache)] CL --> CDN CL --> GW GW --> API API --> DS API --> Q Q --> DISP DISP --> POOL POOL --> TC POOL --> RS RS --> API RS --> LB RS --> WS WS --> CL LB --> DS style CL fill:#e8743b,stroke:#e8743b,color:#fff style CDN fill:#4a90d9,stroke:#4a90d9,color:#fff style GW fill:#4a90d9,stroke:#4a90d9,color:#fff style API fill:#38b265,stroke:#38b265,color:#fff style Q fill:#9b72cf,stroke:#9b72cf,color:#fff style DISP fill:#e05252,stroke:#e05252,color:#fff style POOL fill:#e05252,stroke:#e05252,color:#fff style TC fill:#d4a838,stroke:#d4a838,color:#000 style RS fill:#9b72cf,stroke:#9b72cf,color:#fff style WS fill:#3cbfbf,stroke:#3cbfbf,color:#000 style LB fill:#38b265,stroke:#38b265,color:#fff style DS fill:#d4a838,stroke:#d4a838,color:#000

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. Each one answers: what is it, why is it here, and what would break if we ripped it out tomorrow.

Client App

The browser-side editor (Monaco), the problem renderer, the verdict UI. It's responsible for one tricky thing besides drawing: holding open a WebSocket after submit so it can paint the verdict the moment ⑩ sends it. Without that socket, the user would have to poll, and we'd burn 100K poll requests per minute during contests for no reason.

Solves: the "did my submission finish yet?" question, with zero polling.

CDN

Problem statements, examples, editorials, syntax-highlight CSS — all immutable for hours at a time. The CDN (CloudFront / Fastly) is just a bunch of edge servers that cache these and serve them from the city closest to Priya. A user in Bangalore opens "Two Sum" and the bytes come from a Mumbai POP, not from us-east-1.

Solves: 90% of read traffic never reaches our origin. One spike-protection layer for free.

API Gateway

The first thing every dynamic request hits. Terminates TLS, checks the auth cookie/JWT, applies per-user rate limits (e.g. 10 submits/min), and routes to the right backend service. It's the bouncer at the door — by the time a request reaches ④, it's already been authenticated and rate-checked.

Solves: if removed, every backend service would have to re-implement auth and rate limiting. Plus, no protection against a casual scraper hammering /submit 10K times/sec.

API Service

The "boring" part of the platform — Spring Boot or FastAPI, doesn't matter. Owns the CRUD endpoints: list problems, fetch a problem, post a submission, fetch submission history. When it accepts a submission, it does not run the code — it writes a row in Postgres marked PENDING and drops a message on ⑤.

Solves: decouples "I received your code" (fast, ~10ms) from "I have a verdict" (slow, ~5s). The HTTP request returns the moment the row is written.
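A sketch of that accept-don't-run path, with `db` and `queue` as hypothetical stand-ins for a Postgres client and a Kafka producer:

```python
import json
import uuid

def submit(db, queue, user_id, problem_id, code):
    """Accept a submission without running any code."""
    submission_id = str(uuid.uuid4())
    # 1. Durable record first: the PENDING row is the source of truth.
    db.execute(
        "INSERT INTO submission (submission_id, user_id, problem_id, status) "
        "VALUES (%s, %s, %s, 'PENDING')",
        (submission_id, user_id, problem_id),
    )
    # 2. Then the queue message; runners fetch code and testcases themselves.
    queue.publish("submissions", json.dumps(
        {"submission_id": submission_id, "problem_id": problem_id}))
    # 3. Return immediately — nothing has executed yet.
    return {"status": 202, "submission_id": submission_id}
```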

Submission Queue

A Kafka topic (or SQS queue). Every submission gets one message. Why a queue and not a direct call? Three reasons: buffering (a contest dumps 5K submissions in one second; the queue absorbs the spike while runners catch up), fairness (FIFO inside a partition prevents one heavy submission from blocking the next), and retries (if a runner crashes, the message is redelivered, no data lost).

Solves: the "contest start = traffic 100×" problem becomes a "queue depth grows for 90s and drains" problem — manageable instead of catastrophic.

Judge Dispatcher

A small service (or a Kubernetes operator) that consumes ⑤ and decides which runner pool to send each message to. Python submissions go to the Python runner pool, C++ to the C++ pool, etc. It's also where contest submissions get bumped to a high-priority partition so they jump the queue past practice traffic.

Solves: one-language-per-runner means each runner image is small and starts in <100ms. Without dispatch, every runner would need every compiler — 5GB+ images and slow cold starts.
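The routing decision itself is small; the pool and lane names below are illustrative:

```python
# One runner pool per language keeps images small and cold starts fast.
RUNNER_POOLS = {
    "python": "runners-python",
    "cpp": "runners-cpp",
    "java": "runners-java",
}

def route(message):
    pool = RUNNER_POOLS[message["language"]]
    # Contest submissions jump to the high-priority lane (see Step 9).
    lane = "contest" if message.get("contest_id") else "standard"
    return pool, lane

assert route({"language": "python"}) == ("runners-python", "standard")
```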

Runner Sandbox Pool

The dangerous part — the only place untrusted code actually executes. Each runner is a disposable microVM (Firecracker) or gVisor-wrapped container: 256MB RAM cap, 2s CPU cap, no network, read-only filesystem, separate kernel namespace. One submission = one runner = one execution = nuked. We never reuse a sandbox between users.

Solves: a malicious fork() bomb only kills its own VM. The host, the database, every other submission — untouched. Without this, the entire platform is one os.system away from being deleted.

Test Case Store

S3 (or any object store). Holds the actual input.txt / expected.txt files for every problem. Test cases can be huge (50MB stress tests are normal). Storing them in Postgres would bloat rows; embedding them in the runner image would make images gigantic. S3 is the right place: cheap, immutable, signed-URL access from the runner.

Solves: separating big binary blobs from relational data — the textbook reason object stores exist.

Result Stream

A second Kafka topic, this one carrying verdicts out of the judge tier. Multiple consumers care about each verdict: ④ updates the submissions row, ⑪ updates the leaderboard, ⑩ pushes to the user's browser. Putting this on a stream instead of point-to-point calls means we can add new consumers without changing the runner — analytics, fraud detection, achievement badges, all just subscribe.

Solves: fan-out without coupling. The runner doesn't know who cares about its result; it just publishes.

WebSocket Push

A long-lived connection service. When Priya opens a problem, her browser opens a socket here keyed on her user ID. When ⑨ emits her verdict, this service finds the socket and pushes the message in milliseconds. No polling, no "refresh to see verdict".

Solves: the latency-perception problem. The actual judge takes ~5s, but the user sees the verdict the instant it's ready, not 30s later when polling next fires.

Leaderboard Service

A small service whose entire life is consuming ⑨ and updating a Redis sorted set per contest: ZADD contest:1234 <score> user_id. The leaderboard page asks Redis for the top 100 with ZREVRANGE — single-digit milliseconds, no Postgres scan. During a contest, every contestant's browser long-polls or sockets this view and sees rank changes within a second of any verdict landing.

Solves: "show me the live top 100 of 100K" as a sub-millisecond Redis op instead of a 30-second SQL aggregate.

Postgres + Redis

Postgres is the source of truth — users, problems, submissions, contests. Redis is the hot cache (problem statement HTML, "have I solved this" lookups, leaderboards, rate-limit counters). Anything Redis loses can be rebuilt from Postgres; nothing in Postgres is rebuildable. That asymmetry is the whole reason we keep both.

Solves: Postgres = durability, Redis = speed. Picking one would mean either slow reads or losing submission history on a node failure.

So what — in one line: the API tier moves messages, the judge tier moves code, and they only ever speak through queues. That single split is what makes 5,000 hostile submissions per second a routine Sunday afternoon instead of a Sev-1.
Step 5

The Sandbox — Docker vs. gVisor vs. Firecracker

If you remember one thing from this page, make it this section. The choice of how to isolate untrusted code is the single biggest cost-vs-security trade-off in the whole design.

Imagine you're a casino dealer. Every player is potentially a card counter, a thief, or just careless. You can't ban them — they're your business. So you give each one their own table, their own deck, their own chips, and you nuke the table after one hand. That's a sandbox. The question is: how heavy is the table?

🐳 Plain Docker

What: a container with seccomp + cgroups. Shares the host kernel.

Pros: fast (50ms cold start), cheap, ubiquitous tooling.

Cons: a kernel exploit (CVE-2022-0185 style) breaks out to the host. Not sufficient on its own for hostile multi-tenant code.

When it's enough: internal tools, trusted CI runners, single-tenant.

🛡️ gVisor

What: Google's user-space kernel. Intercepts every syscall in a Go process before the real kernel sees it.

Pros: dramatically smaller kernel attack surface than Docker, cold start ~100ms, fits the container ecosystem.

Cons: 5–30% perf overhead (every syscall is intercepted), some syscalls unimplemented (rare languages may misbehave).

When it fits: the LeetCode sweet spot. Strong isolation, container-grade ergonomics.

🔥 Firecracker microVM

What: AWS's KVM-based microVM. Each "container" is actually a tiny real Linux VM — own kernel, own memory, own everything.

Pros: hardware-level isolation. Even a kernel zero-day stays inside the VM. ~125ms cold start, ~5MB overhead per VM.

Cons: heavier than gVisor on dense hosts; tooling is less mature.

When it fits: FaaS (this is what AWS Lambda runs on), or any platform where you'd lose the company if the host got rooted.

What we'd actually pick: gVisor for the default runner, Firecracker for "heavy" or contest submissions where the extra isolation is worth the extra start-up time. Plain Docker is never the answer for hostile code, even if 80% of online judges historically used it (and most of those got pwned at some point).

Inside one runner — the hard limits

Sandbox is necessary but not sufficient. Inside the sandbox we still need limits or a clever submission can DoS its own VM and starve the queue. Every runner enforces:

| Limit | Mechanism | What it prevents |
| --- | --- | --- |
| CPU time | setrlimit(RLIMIT_CPU) + wall-clock timer | Infinite loops |
| Memory | cgroup memory limit (256MB) | Memory bombs |
| File descriptors | RLIMIT_NOFILE = 64 | fd exhaustion |
| Process count | RLIMIT_NPROC = 32 | Fork bombs |
| Network | Network namespace, no interfaces | Exfiltration, scanning |
| Filesystem | Read-only root, tiny tmpfs at /tmp | Disk fills, persistence |
| Output size | Bytes written capped at 64KB | Log floods |
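The `setrlimit` rows of the table map directly onto Python's `resource` module. A minimal sketch of the per-process half — the cgroup and namespace rows live outside the process and aren't shown:

```python
import resource
import subprocess

def set_limits():
    # Runs in the child between fork and exec (preexec_fn).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))        # 2s CPU -> TLE
    resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))   # fd exhaustion
    resource.setrlimit(resource.RLIMIT_NPROC, (32, 32))    # fork bombs
    resource.setrlimit(resource.RLIMIT_FSIZE,              # file-write cap
                       (64 * 1024, 64 * 1024))

# Run a trivial "submission" under those limits.
proc = subprocess.run(
    ["python3", "-c", "print('ok')"],
    preexec_fn=set_limits,
    capture_output=True,
    timeout=5,          # wall-clock backstop, separate from RLIMIT_CPU
)
```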
Step 6

Submission Flow — Sequence

The happy path, end to end. Note where the request returns (fast), where the verdict actually lands (slow), and how the user finds out (push).

sequenceDiagram actor U as Priya participant GW as API Gateway participant API as API Service participant DB as Postgres participant Q as Submit Queue participant D as Dispatcher participant R as Runner participant TC as S3 Testcases participant RS as Result Stream participant WS as WS Push participant LB as Leaderboard U->>GW: POST /submit code + problem_id GW->>GW: auth and rate limit GW->>API: forward API->>DB: INSERT submission status=PENDING API->>Q: publish submission_id API-->>U: 202 Accepted Note over U,WS: Browser keeps WS open and waits for push Q->>D: consume D->>R: dispatch to Python pool R->>TC: GET testcases via signed URL R->>R: compile and run testcases R->>RS: publish verdict + runtime RS->>API: consume API->>DB: UPDATE status=AC runtime=47ms RS->>LB: consume LB->>LB: ZADD contest score user_id RS->>WS: consume WS->>U: push verdict AC runtime 47ms
Read this twice: the HTTP request returns (202 Accepted) before any code is run. From the API's perspective, "submit" is just "write a row + drop a message". The actual heavy lifting happens asynchronously, and the user finds out via push. This is the single most important shape difference from the naive design.
Step 7

Verdict State Machine

Every submission moves through a small, well-defined set of states. Drawing it explicitly catches edge cases — like "what happens if the runner crashes mid-execution?" — that are easy to miss in prose.

stateDiagram-v2 [*] --> PENDING: user hits submit PENDING --> COMPILING: runner picks up COMPILING --> RUNNING: compile ok COMPILING --> CE: compile error RUNNING --> AC: all testcases pass RUNNING --> WA: any output mismatch RUNNING --> TLE: CPU limit hit RUNNING --> MLE: memory limit hit RUNNING --> RE: non-zero exit / signal RUNNING --> SE: runner crash / OOM host SE --> PENDING: auto-retry (max 2) SE --> SYSTEM_FAIL: retries exhausted AC --> [*] WA --> [*] TLE --> [*] MLE --> [*] RE --> [*] CE --> [*] SYSTEM_FAIL --> [*]

🟢 Terminal "user error" states

AC Accepted, WA Wrong Answer, TLE Time Limit Exceeded, MLE Memory Limit Exceeded, RE Runtime Error, CE Compile Error. These are deterministic — re-running the same code on the same input gives the same answer. Show them once and move on.

🔴 The tricky one — SE System Error

The runner itself crashed (OOM-killed by the host, container died, network blip to S3). This is not the user's fault and we must not penalize them. The dispatcher auto-retries up to 2 times before bubbling up SYSTEM_FAIL and pinging on-call.
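The retry policy is simple enough to sketch; `run_on_runner` stands in for a hypothetical dispatch call that returns a verdict string:

```python
MAX_RETRIES = 2

def judge_with_retry(run_on_runner, submission_id):
    """SE is transient infrastructure failure; everything else is terminal."""
    for _ in range(1 + MAX_RETRIES):
        verdict = run_on_runner(submission_id)
        if verdict != "SE":          # user-error verdicts are never retried
            return verdict
    return "SYSTEM_FAIL"             # retries exhausted -> page on-call

# Two runner crashes followed by a clean run still yields the real verdict.
flaky = iter(["SE", "SE", "AC"])
assert judge_with_retry(lambda _id: next(flaky), "sub-1") == "AC"
```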

Step 8

Data Model

The relational core. Test cases live in S3 (referenced by URL), source code lives in object storage too once it's old (recent submissions stay in Postgres for fast history reads).

erDiagram USER ||--o{ SUBMISSION : submits PROBLEM ||--o{ SUBMISSION : "has many" PROBLEM ||--o{ TESTCASE : owns CONTEST ||--o{ CONTEST_PROBLEM : includes PROBLEM ||--o{ CONTEST_PROBLEM : appears_in USER ||--o{ CONTEST_REGISTRATION : joins CONTEST ||--o{ CONTEST_REGISTRATION : has USER { uuid user_id PK string handle string email int rating timestamp created_at } PROBLEM { uuid problem_id PK string slug string title text statement_md string difficulty int time_limit_ms int memory_limit_mb } TESTCASE { uuid testcase_id PK uuid problem_id FK string s3_input_url string s3_expected_url bool is_sample int weight } SUBMISSION { uuid submission_id PK uuid user_id FK uuid problem_id FK uuid contest_id "nullable FK" string language string s3_code_url string status "PENDING|AC|WA|TLE|MLE|RE|CE|SE" int runtime_ms int memory_kb timestamp submitted_at } CONTEST { uuid contest_id PK string title timestamp start_at timestamp end_at string scoring "ICPC|IOI" } CONTEST_PROBLEM { uuid contest_id FK uuid problem_id FK int points } CONTEST_REGISTRATION { uuid contest_id FK uuid user_id FK int rank int score }
Two intentional choices: (1) Source code is stored in S3, not in the SUBMISSION row — a row is ~200 bytes, code blobs would bloat the table 100×. (2) contest_id on submission is nullable — practice and contest submissions share the same table, so submission history is trivial to query.
Step 9

Contests & the Real-Time Leaderboard

The contest is the demo case. Everything that's lukewarm during normal traffic is on fire here: 100K people start at the same second, every submission must rank against all the others, and the leaderboard updates a board everyone is staring at.

🚦 Priority lane in the queue

Kafka topics are partitioned, and the dispatcher reads contest-tagged messages from a separate high-priority partition. Practice submissions can sit in queue for 30 seconds during a peak — contest submissions can't sit for more than 1 second. Different SLOs, different lanes.

⚡ Why Redis ZSET wins for ranking

A leaderboard is "give me the top 100 sorted by score". In Postgres that's a sort over 100K rows on every page render — tens of milliseconds each, multiplied by every contestant refreshing. In Redis: ZADD is O(log N), ZREVRANGE 0 99 is O(log N + 100). At our scale each op is microseconds, and it scales to millions of contestants without blinking.

Scoring quirks the design must support

🏁 ICPC scoring

1 point per accepted problem, plus a time penalty: the minutes from contest start to each accepted submission, plus 20 minutes per wrong attempt on problems eventually solved. The leaderboard service computes this on each verdict; the ZSET score is encoded as (problems_solved * 10^9) - total_penalty.
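That encoding works because the solve count dominates any realistic penalty — a quick sketch:

```python
def icpc_score(problems_solved, total_penalty_min):
    """Pack (solves, penalty) into one ZSET score: solves dominate,
    lower penalty breaks ties. 10^9 exceeds any realistic penalty."""
    return problems_solved * 10**9 - total_penalty_min

# More solves always outranks fewer, regardless of penalty...
assert icpc_score(4, 600) > icpc_score(3, 100)
# ...and at equal solves, the lower penalty ranks higher.
assert icpc_score(3, 100) > icpc_score(3, 212)
```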

🎯 IOI scoring

Each problem awards partial credit per testcase. The leaderboard takes the max score the user has ever achieved on each problem, summed. Updates only when a new submission scores higher than the previous best.

Step 10

Anti-Abuse & Fairness

Once the design works, the next problem is people gaming it. Some are malicious (try to break out of the sandbox, scrape problem statements). Some are just lazy (submit-paste, throwaway accounts). The platform has to push back on both.

🛑 Sandbox break-out attempts

Telemetry from each runner reports unusual syscall patterns (e.g. attempts to ptrace, raw socket creation, repeated mmap of large pages). Pattern hits → submission flagged, account silently shadow-banned from contests pending review. The user keeps practicing; the leaderboard never sees them.

📋 Plagiarism detection

For every accepted contest submission, run a structural similarity hash (token-stream MinHash) and compare against other submissions on the same problem. Pairs above a threshold are flagged for human review. This runs off the hot path, downstream of ⑨, so the verdict isn't delayed.
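A toy version of the idea — a bottom-k variant of MinHash over token 3-grams, with k, the shingle length, and the tokenization all illustrative:

```python
import hashlib

def minhash(tokens, k=64):
    """Bottom-k sketch: shingle the token stream, keep the k smallest hashes."""
    shingles = {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return set(hashes[:k])

def similarity(a, b):
    """Jaccard estimate; renaming variables barely moves it."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Same structure, renamed loop variables — the classic lazy-cheat pattern.
code1 = "for i in range n for j in range n if a i plus a j eq t".split()
code2 = "for x in range n for y in range n if a x plus a y eq t".split()
sim = similarity(minhash(code1), minhash(code2))
```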

⏱️ Rate limiting

API gateway enforces 10 submits/min/user, 1000/min/IP. Redis sliding-window counters back this. During contests the per-user limit is tighter (2/min) to discourage brute-force "submit and see if it passes".
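The sliding-window check, sketched in memory. Production keeps the window in Redis (e.g. a per-user ZSET trimmed with ZREMRANGEBYSCORE) so it survives restarts and is shared across gateway instances:

```python
import time
from collections import defaultdict, deque

WINDOW_S, LIMIT = 60, 10                     # 10 submits / minute / user

windows = defaultdict(deque)                 # user_id -> submit timestamps

def allow_submit(user_id, now=None):
    now = time.monotonic() if now is None else now
    win = windows[user_id]
    while win and win[0] <= now - WINDOW_S:  # evict events outside window
        win.popleft()
    if len(win) >= LIMIT:
        return False                         # over budget -> 429
    win.append(now)
    return True

assert all(allow_submit("priya", now=t) for t in range(10))  # first 10 pass
assert not allow_submit("priya", now=10.0)                   # 11th blocked
```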

🎭 Account-creation friction

Email verification + reCAPTCHA on signup. Contest registration requires an account that's at least 24 hours old. Cheap throwaways become expensive throwaways.

Step 11

Trade-offs & Interview Talking Points

The decisions that aren't obvious — and the ones an interviewer will probe.

| Decision | Alternative | Why this choice |
| --- | --- | --- |
| Async verdict via WebSocket | Sync HTTP "wait for verdict" | Sync ties up an LB connection for 5–30s. At 5K rps that's instant connection-pool exhaustion. Async + push scales linearly. |
| gVisor / Firecracker for runners | Plain Docker | Plain Docker shares the host kernel — one CVE breaks the box. Strong isolation is non-negotiable for hostile multi-tenant code. |
| S3 for test cases & source code | Postgres BLOBs | Big binary blobs in a relational row bloat the table, kill cache hit rates, and break replication. Object stores exist for exactly this. |
| Separate contest queue partition | One queue, FIFO | FIFO means a 30-second practice submission can delay a contest submission. Different SLOs need different lanes. |
| Redis ZSET for leaderboard | Postgres ORDER BY score | Postgres top-100 over 100K rows is a sort on every refresh. A ZSET read is O(log N + 100), and the data is rebuildable if Redis dies. |
| Kafka result topic with multiple consumers | API service calls leaderboard + WS directly | Tight coupling. New consumers (analytics, fraud) shouldn't require redeploying the runner. Streams give us fan-out for free. |

Common interviewer follow-ups

What if the WebSocket service is down when the verdict lands?
The verdict is in Postgres regardless. The WebSocket is a convenience for "live" updates. On reconnect, the client fetches its recent submissions via HTTP and reconciles. The push channel is best-effort; the durable record is the database row.
A submission against 100 testcases — do we run them sequentially or in parallel?
Sequentially within one runner, but with early termination on first failure: if testcase 7 fails, we don't run 8–100. This both saves compute (50%+ on average for failing submissions) and feels faster to the user. We could parallelize across runners, but the fan-out + result-merge complexity isn't worth it for typical 100-testcase problems.
How do you handle a malicious submission that tries to mine crypto for 2 seconds?
It mines crypto for 2 seconds. Then the cgroup CPU limit kills it, the runner reports TLE, the sandbox is destroyed. Total damage: 2 seconds of one core on one VM. We sized the runner pool to absorb this — it's not a special case, it's the default behavior.
What about non-deterministic problems (random output, timing-based)?
We expose a special judge ("checker") feature — instead of byte-comparing output, we run an admin-supplied checker binary that takes (input, user_output) and returns AC/WA. The checker itself runs in the same sandbox class, just with a different executable. This handles "any valid permutation" or floating-point tolerance problems.
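A hypothetical checker for the floating-point case, to make the interface concrete — it takes the expected and user output and returns a verdict instead of byte-comparing:

```python
def float_checker(expected, user_output, tol=1e-6):
    """Accept if every numeric token is within tol (relative/absolute);
    non-numeric tokens must match exactly."""
    exp, got = expected.split(), user_output.split()
    if len(exp) != len(got):
        return "WA"
    for e, g in zip(exp, got):
        try:
            if abs(float(e) - float(g)) > tol * max(1.0, abs(float(e))):
                return "WA"
        except ValueError:           # not a number on either side
            if e != g:
                return "WA"
    return "AC"

assert float_checker("3.1415926", "3.1415929") == "AC"   # within tolerance
assert float_checker("1.0", "2.0") == "WA"
```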
Step 12

Walkthrough — Priya solves Two Sum

One concrete request through every numbered component. If the design is right, this story should feel inevitable.

  1. Priya opens leetcode.com/problems/two-sum. The HTML and the problem statement come from the CDN ②. Latency: ~50ms from her ISP. Our origin sees nothing.
  2. Her browser opens a WebSocket to WS Push ⑩ as soon as the page loads. The socket sits idle, keyed on her user ID, ready for any verdict the backend wants to deliver.
  3. She types her Python solution in the Monaco editor (Client ①) and clicks Submit. The browser POSTs to /v1/submit.
  4. API Gateway ③ validates her JWT, checks her per-user rate limit in Redis (INCR submits:priya), passes the request to API Service ④.
  5. API Service writes a SUBMISSION row to Postgres ⑫ with status PENDING, uploads the source to S3 (linked by s3_code_url), publishes { submission_id, lang: "python", problem_id } to the Submission Queue ⑤, and returns 202 { submission_id } to Priya. Total time: ~30ms.
  6. The Judge Dispatcher ⑥ consumes the message, sees lang=python, and routes it to the Python runner pool. Since this is a practice submission (no contest_id), it goes on the standard-priority partition.
  7. A free Runner Sandbox ⑦ (gVisor) picks up the job. It downloads the test cases for two-sum from S3 ⑧ via signed URL, downloads Priya's source, compiles (Python: nothing to do), and runs it against the testcases. Each testcase is one fresh subprocess inside the sandbox with seccomp + cgroup limits. Total CPU spent: 47ms across 30 testcases.
  8. All 30 testcases pass. The runner publishes { submission_id, verdict: "AC", runtime_ms: 47, memory_kb: 14336 } to the Result Stream ⑨. The sandbox VM is destroyed.
  9. Three consumers process the message in parallel:
    • API Service ④ updates the submission row in Postgres: status=AC, runtime=47ms.
    • Leaderboard Service ⑪ sees this isn't a contest submission and ignores it.
    • WS Push ⑩ finds Priya's open socket and pushes { verdict: "AC", runtime: "47ms" } to her browser.
  10. Priya's browser flashes the green "Accepted" banner. Total wall-clock from clicking Submit to seeing AC: ~1.5 seconds. She didn't poll once. The HTTP request that started this returned at step 5; everything that followed happened on the back of one Kafka message.
Trace what just happened: steps 1–5 are the web tier (trusted, fast, predictable). Steps 6–8 are the judge tier (untrusted, slow, dangerous). Step 9 is the fan-out. The two tiers never share an OS user, never share a kernel, and never speak except through ⑤ and ⑨. That single discipline is what makes Priya's quiet Sunday solve and a hostile fork-bomb live in the same system without one ever noticing the other.
