A coding platform that runs untrusted code from millions of strangers — sandboxed runners, an async submission queue, contest leaderboards in real time, and the data-plane / control-plane split that keeps it all from setting itself on fire.
This deep-dive applies the 5-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
Before any boxes-and-arrows, we pin down what this thing must do and — just as importantly — what it must not do. The interesting half of LeetCode is not the problem list; it's the willingness to run code submitted by people we have never met.
We are building a service that runs strangers' code on our infrastructure. Every other component — problem DB, leaderboard, discussions — is a normal CRUD app. The sandbox is where the architecture lives or dies.
Numbers tell you which boxes need to be plural. Without numbers, you end up over-engineering the easy parts and under-engineering the hard ones.
| Dimension | Number | Why it matters |
|---|---|---|
| Total users | ~10M | Catalog reads — cacheable, easy |
| Daily active users | ~1M | Steady-state load on API + judge |
| Submissions / day | ~5M (avg) | ~60 / sec average |
| Submissions / sec (peak — weekly contest) | ~1,000–5,000 | The number that sizes the runner pool |
| Avg test cases per problem | ~30–100 | Each "submission" = many runner executions |
| Avg CPU per submission | 1–3 seconds | Determines runner throughput per node |
| Concurrent contest users (global round) | ~100K | Huge spike, narrow window |
| Storage — submissions | ~1KB × 5M/day = 5GB/day | Append-only, cheap to keep forever |
| Storage — test cases | ~3K problems × ~5MB = ~15GB | Static, S3-friendly, CDN-cacheable |
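A quick sanity check on the two numbers that matter most — peak rate and CPU per submission. A back-of-envelope sketch; the host sizing is an assumption, not a spec:

```python
# back-of-envelope sizing from the table above; host specs are assumptions
peak_rps = 5_000                # contest-peak submissions per second
cpu_s_per_submission = 2        # midpoint of the 1–3s row

# Little's law: concurrent work = arrival rate × service time
busy_sandboxes = peak_rps * cpu_s_per_submission       # 10,000 at peak

sandboxes_per_host = 8          # assumed: 8 cores, one pinned core per sandbox
hosts_at_peak = busy_sandboxes // sandboxes_per_host   # 1,250 hosts
print(busy_sandboxes, hosts_at_peak)
```

Ten thousand concurrently busy sandboxes is the number the rest of the design has to serve — and the number the naive design below misses by four orders of magnitude.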
Three kinds of people touch this system, and they want very different things from it. Naming them up front prevents the design from drifting toward "average user" — there is no average.
The 95% case — Priya, a backend engineer prepping for interviews after dinner. She wants the catalog snappy, the editor responsive, and her verdict back before she gets bored. She is patient enough to wait 5 seconds, but not 30.
Same person, different mode. Once a week she joins a 90-minute timed contest with 100,000 strangers. Now fairness matters: her verdict can't be slower than someone else's just because the queue picked theirs first. Leaderboard rank must update in seconds, not minutes.
The problem author. Rarely on the platform, but when they're there they upload massive test files (a 50MB stress test is normal) and need a private "validation run" to make sure their solution actually passes their own tests.
We'll build this in three passes. First the naive design (and the four ways it falls over), then the central mental model, then the production shape with every component justified by something the previous design got wrong.
Imagine you're building a hackathon prototype. The whole thing is one Flask app on one EC2 box. When Priya hits "Submit", the API handler shells out:
```python
# naive_judge.py — DON'T do this in production
import subprocess

def judge(submission, testcases):
    # write untrusted code straight onto the API box's own disk (!)
    with open("/tmp/sol.py", "w") as f:
        f.write(submission.code)
    for tc in testcases:
        # run it as the API's own user, same machine, same kernel (!!)
        out = subprocess.check_output(
            ["python", "/tmp/sol.py"],
            input=tc.input, text=True, timeout=2,
        )
        if out != tc.expected:
            return "WA"
    return "AC"
```
Why this fails — four concrete deaths:
A malicious submission is just os.system("rm -rf /"). It runs as the same user as your API, on the same disk as your problems database. One curl and your business is gone.
An infinite loop or a fork-bomb pegs the only CPU. Now every other user's page load times out — including the homepage. One person broke the entire site.
2-second runs × 1 process = 0.5 submissions/sec. At a contest peak we need 5,000/sec. We are off by 10,000×.
The API request blocks for 5–30 seconds while testcases run. Half of those connections will time out at the load balancer before the verdict ever comes back. Users hit "submit" twice. Now you have duplicates.
The single biggest idea in this design is that the system has two completely different jobs, and they should run on completely different machines. Mixing them is the original sin of Pass 1.
Standard web app stuff. Read problems, write submissions, list users, render pages. Predictable: every request is short, well-typed, and runs code we wrote. Scales horizontally with no surprises. CPU per request is tiny (microseconds).
Lives behind: a load balancer + autoscaler tuned for HTTP latency.
Runs somebody else's code against test data. Unpredictable: 1ms or 30s, peaceful or hostile, leaks memory or fork-bombs. CPU per request is huge (seconds). Each runner is a disposable container that gets nuked after one submission.
Lives behind: a queue, never reachable directly from the internet.
Now the full picture. Every box below is here because Pass 1 broke without it. Numbers ①–⑫ on the diagram match the explanation grid below.
Use the numbers in the diagram above to find the matching card below. Each one answers: what is it, why is it here, and what would break if we ripped it out tomorrow.
The browser-side editor (Monaco), the problem renderer, the verdict UI. It's responsible for one tricky thing besides drawing: holding open a WebSocket after submit so it can paint the verdict the moment ⑩ sends it. Without that socket, the user would have to poll, and we'd burn 100K poll requests per minute during contests for no reason.
Solves: the "did my submission finish yet?" question, with zero polling.
Problem statements, examples, editorials, syntax-highlight CSS — all immutable for hours at a time. The CDN (CloudFront / Fastly) is just a bunch of edge servers that cache these and serve them from the city closest to Priya. A user in Bangalore opens "Two Sum" and the bytes come from a Mumbai POP, not from us-east-1.
Solves: 90% of read traffic never reaches our origin. One spike-protection layer for free.
The first thing every dynamic request hits. Terminates TLS, checks the auth cookie/JWT, applies per-user rate limits (e.g. 10 submits/min), and routes to the right backend service. It's the bouncer at the door — by the time a request reaches ④, it's already been authenticated and rate-checked.
Solves: if removed, every backend service would have to re-implement auth and rate limiting. Plus, no protection against a casual scraper hammering /submit 10K times/sec.
The "boring" part of the platform — Spring Boot or FastAPI, doesn't matter. Owns the CRUD endpoints: list problems, fetch a problem, post a submission, fetch submission history. When it accepts a submission, it does not run the code — it writes a row in Postgres marked PENDING and drops a message on ⑤.
Solves: decouples "I received your code" (fast, ~10ms) from "I have a verdict" (slow, ~5s). The HTTP request returns the moment the row is written.
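Condensed into code, the submit path looks something like this — a sketch assuming FastAPI, psycopg, and kafka-python; table, topic, and column names are illustrative:

```python
# a sketch of the submit endpoint; the S3 source upload is omitted for brevity
import json
import uuid

import psycopg
from fastapi import FastAPI
from kafka import KafkaProducer

app = FastAPI()
producer = KafkaProducer(bootstrap_servers="kafka:9092")
db = psycopg.connect("dbname=leetcode")

@app.post("/v1/submit", status_code=202)
def submit(body: dict):
    submission_id = str(uuid.uuid4())
    # 1. durable PENDING row — "I received your code"
    with db.cursor() as cur:
        cur.execute(
            "INSERT INTO submission (id, user_id, problem_id, lang, status) "
            "VALUES (%s, %s, %s, %s, 'PENDING')",
            (submission_id, body["user_id"], body["problem_id"], body["lang"]),
        )
    db.commit()
    # 2. hand the slow part to the queue ⑤
    producer.send("submissions", json.dumps({
        "submission_id": submission_id,
        "lang": body["lang"],
        "problem_id": body["problem_id"],
    }).encode())
    # 3. return before any code runs — the verdict arrives over the WebSocket ⑩
    return {"submission_id": submission_id}
```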
A Kafka topic (or SQS queue). Every submission gets one message. Why a queue and not a direct call? Three reasons: buffering (a contest dumps 5K submissions in one second; the queue absorbs the spike while runners catch up), fairness (FIFO inside a partition means submissions are judged in arrival order — nobody's verdict jumps the line), and retries (if a runner crashes, the message is redelivered, no data lost).
Solves: the "contest start = traffic 100×" problem becomes a "queue depth grows for 90s and drains" problem — manageable instead of catastrophic.
A small service (or a Kubernetes operator) that consumes ⑤ and decides which runner pool to send each message to. Python submissions go to the Python runner pool, C++ to the C++ pool, etc. It's also where contest submissions get bumped to a high-priority partition so they jump the queue past practice traffic.
Solves: one-language-per-runner means each runner image is small and starts in <100ms. Without dispatch, every runner would need every compiler — 5GB+ images and slow cold starts.
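A minimal sketch of that dispatch loop, assuming kafka-python; the per-language runner topics are made-up names:

```python
# route each submission to its language-specific runner pool
import json

from kafka import KafkaConsumer, KafkaProducer

RUNNER_TOPIC = {          # one small, single-language image per pool
    "python": "run-python",
    "cpp": "run-cpp",
    "java": "run-java",
}

consumer = KafkaConsumer("submissions",
                         bootstrap_servers="kafka:9092",
                         group_id="dispatcher")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

for msg in consumer:
    sub = json.loads(msg.value)
    # an unknown language would go to a dead-letter topic in production
    producer.send(RUNNER_TOPIC[sub["lang"]], msg.value)
```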
The dangerous part — the only place untrusted code actually executes. Each runner is a disposable microVM (Firecracker) or gVisor-wrapped container: 256MB RAM cap, 2s CPU cap, no network, read-only filesystem, separate kernel namespace. One submission = one runner = one execution = nuked. We never reuse a sandbox between users.
Solves: a malicious fork() bomb only kills its own VM. The host, the database, every other submission — untouched. Without this, the entire platform is one os.system away from being deleted.
S3 (or any object store). Holds the actual input.txt / expected.txt files for every problem. Test cases can be huge (50MB stress tests are normal). Storing them in Postgres would bloat rows; embedding them in the runner image would make images gigantic. S3 is the right place: cheap, immutable, signed-URL access from the runner.
Solves: separating big binary blobs from relational data — the textbook reason object stores exist.
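Handing a runner access looks roughly like this — a sketch assuming boto3; bucket and key names are illustrative:

```python
# mint a short-lived read URL the sandbox can fetch without AWS credentials
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "judge-testcases", "Key": "two-sum/input.txt"},
    ExpiresIn=300,   # the runner lives for one submission; the URL shouldn't outlive it
)
```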
A second Kafka topic, this one carrying verdicts out of the judge tier. Multiple consumers care about each verdict: ④ updates the submissions row, ⑪ updates the leaderboard, ⑩ pushes to the user's browser. Putting this on a stream instead of point-to-point calls means we can add new consumers without changing the runner — analytics, fraud detection, achievement badges, all just subscribe.
Solves: fan-out without coupling. The runner doesn't know who cares about its result; it just publishes.
A long-lived connection service. When Priya opens a problem, her browser opens a socket here keyed on her user ID. When ⑨ emits her verdict, this service finds the socket and pushes the message in milliseconds. No polling, no "refresh to see verdict".
Solves: the latency-perception problem. The actual judge takes ~5s, but the user sees the verdict the instant it's ready, not 30s later when polling next fires.
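A sketch of the push half of this service, assuming the `websockets` package; `consume_results` is a stand-in for a Result Stream ⑨ consumer:

```python
# map user_id -> open socket, push each verdict the moment it lands
import asyncio
import json

import websockets

sockets = {}  # user_id -> open connection

async def handler(ws):
    user_id = await ws.recv()        # first frame: the (already-authed) user id
    sockets[user_id] = ws
    try:
        await ws.wait_closed()       # keep the socket registered until it dies
    finally:
        sockets.pop(user_id, None)

async def push_verdicts(consume_results):
    async for verdict in consume_results():
        ws = sockets.get(verdict["user_id"])
        if ws:                        # user still on the page — push in ms
            await ws.send(json.dumps(verdict))

async def main(consume_results):
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await push_verdicts(consume_results)

# entry point: asyncio.run(main(my_result_stream_consumer))
```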
A small service whose entire life is consuming ⑨ and updating a Redis sorted set per contest: ZADD contest:1234 <score> user_id. The leaderboard page asks Redis for the top 100 with ZREVRANGE — single-digit milliseconds, no Postgres scan. During a contest, every contestant's browser long-polls or sockets this view and sees rank changes within a second of any verdict landing.
Solves: "show me the live top 100 of 100K" as a sub-millisecond Redis op instead of a 30-second SQL aggregate.
Postgres is the source of truth — users, problems, submissions, contests. Redis is the hot cache (problem statement HTML, "have I solved this" lookups, leaderboards, rate-limit counters). Anything Redis loses can be rebuilt from Postgres; nothing in Postgres is rebuildable. That asymmetry is the whole reason we keep both.
Solves: Postgres = durability, Redis = speed. Picking one would mean either slow reads or losing submission history on a node failure.
If you remember one thing from this page, make it this section. The choice of how to isolate untrusted code is the single biggest cost-vs-security trade-off in the whole design.
Imagine you're a casino dealer. Every player is potentially a card counter, a thief, or just careless. You can't ban them — they're your business. So you give each one their own table, their own deck, their own chips, and you nuke the table after one hand. That's a sandbox. The question is: how heavy is the table?
What: a container with seccomp + cgroups. Shares the host kernel.
Pros: fast (50ms cold start), cheap, ubiquitous tooling.
Cons: a kernel exploit (CVE-2022-0185 style) breaks out to the host. Not sufficient on its own for hostile multi-tenant code.
When it's enough: internal tools, trusted CI runners, single-tenant.
What: Google's user-space kernel. Intercepts every syscall in a Go process before the real kernel sees it.
Pros: dramatically smaller kernel attack surface than Docker, cold start ~100ms, fits the container ecosystem.
Cons: 5–30% perf overhead (every syscall is intercepted), some syscalls unimplemented (rare languages may misbehave).
When it fits: the LeetCode sweet spot. Strong isolation, container-grade ergonomics.
What: AWS's KVM-based microVM. Each "container" is actually a tiny real Linux VM — own kernel, own memory, own everything.
Pros: hardware-level isolation. Even a kernel zero-day stays inside the VM. ~125ms cold start, ~5MB overhead per VM.
Cons: heavier than gVisor on dense hosts; tooling is less mature.
When it fits: FaaS (this is what AWS Lambda runs on), or any platform where you'd lose the company if the host got rooted.
Sandbox is necessary but not sufficient. Inside the sandbox we still need limits or a clever submission can DoS its own VM and starve the queue. Every runner enforces:
| Limit | Mechanism | What it prevents |
|---|---|---|
| CPU time | setrlimit(RLIMIT_CPU) + wallclock timer | Infinite loops |
| Memory | cgroup memory limit (256MB) | Memory bombs |
| File descriptors | RLIMIT_NOFILE = 64 | fd exhaustion |
| Process count | RLIMIT_NPROC = 32 | Fork bombs |
| Network | Network namespace, no interfaces | Exfiltration, scanning |
| Filesystem | Read-only root, tiny tmpfs at /tmp | Disk fills, persistence |
| Output size | Bytes written capped at 64KB | Log floods |
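In code, the per-process half of that table looks something like this — a sketch assuming the runner launches each testcase as a subprocess. The memory, network, and filesystem rows come from the cgroup and namespace setup, which isn't shown here:

```python
# per-process limits applied between fork() and exec() via preexec_fn
import resource
import subprocess

def apply_limits():
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))        # infinite loops -> TLE
    resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))   # fd exhaustion
    resource.setrlimit(resource.RLIMIT_NPROC, (32, 32))    # fork bombs
    resource.setrlimit(resource.RLIMIT_FSIZE,
                       (64 * 1024, 64 * 1024))             # log floods

def run_testcase(input_bytes: bytes) -> bytes:
    proc = subprocess.run(
        ["python", "/tmp/sol.py"],
        input=input_bytes,
        capture_output=True,
        timeout=5,                 # wallclock backstop on top of RLIMIT_CPU
        preexec_fn=apply_limits,
    )
    return proc.stdout
```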
The happy path, end to end. Note where the request returns (fast), where the verdict actually lands (slow), and how the user finds out (push).
The submit request returns immediately (HTTP 202 Accepted) before any code is run. From the API's perspective, "submit" is just "write a row + drop a message". The actual heavy lifting happens asynchronously, and the user finds out via push. This is the single most important shape difference from the naive design.

Every submission moves through a small, well-defined set of states. Drawing it explicitly catches edge cases — like "what happens if the runner crashes mid-execution?" — that are easy to miss in prose.
AC Accepted, WA Wrong Answer, TLE Time Limit Exceeded, MLE Memory Limit Exceeded, RE Runtime Error, CE Compile Error. These are deterministic — re-running the same code on the same input gives the same answer. Show them once and move on.
SE System Error — the runner itself crashed (OOM-killed by the host, container died, network blip to S3). This is not the user's fault and we must not penalize them. The dispatcher auto-retries up to 2 times before bubbling up SYSTEM_FAIL and pinging on-call.
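A sketch of that retry policy. `run_in_sandbox` and `page_oncall` are assumed hooks, passed in so the sketch stays self-contained:

```python
# retry only on infrastructure failure; never re-judge a deterministic verdict
class RunnerCrash(Exception):
    """The runner died — OOM-killed host, dead container, S3 blip."""

MAX_RETRIES = 2

def judge_with_retries(submission, run_in_sandbox, page_oncall) -> str:
    for _ in range(MAX_RETRIES + 1):
        try:
            # returns a deterministic verdict: AC / WA / TLE / MLE / RE / CE
            return run_in_sandbox(submission)
        except RunnerCrash:
            continue                  # not the user's fault — fresh VM, try again
    page_oncall(submission)           # retries exhausted: humans now
    return "SYSTEM_FAIL"
```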
The relational core. Test cases live in S3 (referenced by URL), and source code lives in object storage too once it's old (recent submissions stay in Postgres for fast history reads).

Two details worth noticing: (1) source code does not live in the SUBMISSION row — a row is ~200 bytes; code blobs would bloat the table 100×. (2) contest_id on submission is nullable — practice and contest submissions share the same table, so submission history is trivial to query.

The contest is the demo case. Everything that's lukewarm during normal traffic is on fire here: 100K people start at the same second, every submission must rank against all the others, and the leaderboard updates a board everyone is staring at.
Kafka topics are partitioned, and the dispatcher reads contest-tagged messages from a separate high-priority partition. Practice submissions can sit in queue for 30 seconds during a peak — contest submissions can't sit for more than 1 second. Different SLOs, different lanes.
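One way to implement the two lanes with kafka-python — the partition layout (0 = contest, 1–3 = practice) is an assumption:

```python
# poll all lanes, but always dispatch the contest lane first
from kafka import KafkaConsumer, TopicPartition

PRIORITY = TopicPartition("submissions", 0)
PRACTICE = [TopicPartition("submissions", p) for p in (1, 2, 3)]

consumer = KafkaConsumer(bootstrap_servers="kafka:9092")
consumer.assign([PRIORITY] + PRACTICE)

def next_batch():
    records = consumer.poll(timeout_ms=100)
    contest = records.get(PRIORITY, [])
    practice = [m for tp, msgs in records.items() if tp != PRIORITY for m in msgs]
    return contest + practice
```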
A leaderboard is "give me the top 100 sorted by score". In Postgres that's a full-table sort every time the page renders — 100K rows, milliseconds. In Redis: ZADD is O(log N), ZREVRANGE 0 99 is O(log N + 100). At our scale that's microseconds, and it scales to millions of contestants without blinking.
1 point per accepted problem, plus a "time penalty" (minutes since contest start) per accepted problem and 20 minutes per wrong attempt. The leaderboard service computes this on each verdict; the ZSET score is encoded as (problems_solved * 10^9) - total_penalty.
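That encoding, as code — a sketch assuming redis-py; key names are illustrative:

```python
# encode (solved DESC, penalty ASC) into one sortable ZSET score
import redis

r = redis.Redis()

def on_accept(contest_id: int, user_id: str, solved: int, penalty_min: int):
    # more problems always beats less penalty: solved lives in the high
    # digits, penalty is subtracted in the low digits
    score = solved * 10**9 - penalty_min
    r.zadd(f"contest:{contest_id}", {user_id: score})

def top_100(contest_id: int):
    return r.zrevrange(f"contest:{contest_id}", 0, 99, withscores=True)
```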
Each problem awards partial credit per testcase. The leaderboard takes the max score the user has ever achieved on each problem, summed. Updates only when a new submission scores higher than the previous best.
Once the design works, the next problem is people gaming it. Some are malicious (try to break out of the sandbox, scrape problem statements). Some are just lazy (submit-paste, throwaway accounts). The platform has to push back on both.
Telemetry from each runner reports unusual syscall patterns (e.g. attempts to ptrace, raw socket creation, repeated mmap of large pages). Pattern hits → submission flagged, account silently shadow-banned from contests pending review. The user keeps practicing; the leaderboard never sees them.
For every accepted contest submission, run a structural similarity hash (token-stream MinHash) and compare against other submissions on the same problem. Pairs above a threshold are flagged for human review. This runs off the hot path, downstream of ⑨, so the verdict isn't delayed.
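A toy version of that similarity check, using Python's own tokenize module for the token stream; k and the shingle width are illustrative knobs, and a real system would normalize far more aggressively:

```python
# MinHash over shingles of token *types*, so renaming variables changes nothing
import hashlib
import io
import tokenize

def token_types(code: str) -> list:
    return [t.type for t in tokenize.generate_tokens(io.StringIO(code).readline)]

def shingles(code: str, n: int = 5) -> set:
    ts = token_types(code)
    return {tuple(ts[i:i + n]) for i in range(len(ts) - n + 1)}

def minhash(sh: set, k: int = 64) -> list:
    if not sh:                          # trivial submissions hash to a sentinel
        return [0] * k
    # k independent hash functions, simulated by salting one hash
    return [
        min(int(hashlib.sha1(f"{salt}:{s}".encode()).hexdigest(), 16) for s in sh)
        for salt in range(k)
    ]

def similarity(sig_a: list, sig_b: list) -> float:
    # fraction of matching minima estimates Jaccard similarity of the shingle sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```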
API gateway enforces 10 submits/min/user, 1000/min/IP. Redis sliding-window counters back this. During contests the per-user limit is tighter (2/min) to discourage brute-force "submit and see if it passes".
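A sketch of the sliding-window counter with redis-py; key names are illustrative:

```python
# count submits in the trailing 60s using a ZSET of timestamped events
import time
import uuid

import redis

r = redis.Redis()

def allow_submit(user_id: str, limit: int = 10, window_s: int = 60) -> bool:
    key = f"submits:{user_id}"
    now = time.time()
    p = r.pipeline()
    p.zremrangebyscore(key, 0, now - window_s)   # evict events outside the window
    p.zadd(key, {uuid.uuid4().hex: now})         # record this attempt
    p.zcard(key)                                 # how many in the last 60s?
    p.expire(key, window_s)                      # GC idle users
    _, _, count, _ = p.execute()
    return count <= limit
```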
Email verification + reCAPTCHA on signup. Contest registration requires an account that's at least 24 hours old. Cheap throwaways become expensive throwaways.
The decisions that aren't obvious — and the ones an interviewer will probe.
| Decision | Alternative | Why this choice |
|---|---|---|
| Async verdict via WebSocket | Sync HTTP "wait for verdict" | Sync ties up an LB connection for 5–30s. At 5K rps that's instant connection-pool exhaustion. Async + push scales linearly. |
| gVisor / Firecracker for runners | Plain Docker | Plain Docker shares the host kernel — one CVE breaks the box. Strong isolation is non-negotiable for hostile multi-tenant code. |
| S3 for test cases & source code | Postgres BLOBs | Big binary blobs in a relational row bloat the table, kill cache hit rates, and break replication. Object stores exist for exactly this. |
| Separate contest queue partition | One queue, FIFO | FIFO means a 30-second practice submission can delay a contest submission. Different SLOs need different lanes. |
| Redis ZSET for leaderboard | Postgres ORDER BY score | Postgres top-100 over 100K rows is a sort every refresh. ZSET makes it a cheap O(log N) op. The data is rebuildable if Redis dies. |
| Kafka result topic with multiple consumers | API service calls leaderboard + WS directly | Tight coupling. New consumers (analytics, fraud) shouldn't require redeploying the runner. Streams give us fan-out for free. |
An infinite loop? The CPU cap fires, the verdict is TLE, the sandbox is destroyed. Total damage: 2 seconds of one core on one VM. We sized the runner pool to absorb this — it's not a special case, it's the default behavior.

One concrete request through every numbered component. If the design is right, this story should feel inevitable.

1. Priya opens leetcode.com/problems/two-sum. The HTML and the problem statement come from the CDN ②. Latency: ~50ms from her ISP. Our origin sees nothing.
2. She writes her solution in the editor ① and hits Submit. The browser POSTs to /v1/submit.
3. The API Gateway ③ terminates TLS, checks her JWT, bumps her rate-limit counter (INCR submits:priya), and passes the request to the API Service ④.
4. The API service writes a SUBMISSION row to Postgres ⑫ with status PENDING, uploads the source to S3 (linked by s3_code_url), publishes { submission_id, lang: "python", problem_id } to the Submission Queue ⑤, and returns 202 { submission_id } to Priya. Total time: ~30ms.
5. The Dispatcher ⑥ reads the message, sees lang=python, and routes it to the Python runner pool. Since this is a practice submission (no contest_id), it goes on the standard-priority partition.
6. A fresh runner ⑦ fetches the test cases for two-sum from S3 ⑧ via signed URL, downloads Priya's source, compiles (Python: nothing to do), and runs it against testcases. Each testcase is one fresh subprocess inside the sandbox with seccomp + cgroup limits. Total CPU spent: 47ms across 30 testcases.
7. The runner publishes { submission_id, verdict: "AC", runtime_ms: 47, memory_kb: 14336 } to the Result Stream ⑨. The sandbox VM is destroyed.
8. The API service ④ consumes the verdict and updates the Postgres row: status=AC, runtime=47ms.
9. The WebSocket service ⑩ finds her open socket and pushes { verdict: "AC", runtime: "47ms" } to her browser.

Did this rewire how you think about online judges? If it clicked, tap the ❤️ — that's how I know it hit.