Design an Online Judge — System Design Interview Guide
Design an Online Judge is a system-design interview that asks you to build a LeetCode-style platform: users submit code, the system compiles and runs it against test cases inside a sandbox, and returns a verdict (accepted, wrong-answer, time-limit-exceeded, runtime-error) within seconds. The hard part is the sandboxing — running untrusted code safely without letting it escape or starve the host — plus queue back-pressure during contest spikes.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Amazon
- Microsoft
- Meta
- Bloomberg
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- User submits code in one of ~20 supported languages against a problem with a hidden test suite
- System compiles (where needed), runs the submission against each test case, and returns a verdict (accepted, wrong-answer, time-limit-exceeded, memory-limit-exceeded, runtime-error, compile-error)
- Per-submission resource limits: CPU time, wall-clock time, memory, disk, network
- Support contests: thousands of users submitting simultaneously, ranked by problems solved and submission time
- Plagiarism detection: flag submissions that are near-duplicates of other submissions or known sources
- Submission history: user can view all past submissions and verdicts per problem
Non-functional requirements
- Verdict latency: <10s p95 from submit to verdict for non-contest traffic; <30s p95 during contest peaks
- Throughput: ~100 submissions/sec baseline, ~5000/sec at contest peaks
- Sandbox isolation: no submission can read another submission's code, the test data, or the host filesystem
- Reliability: a hung or malicious submission must not block the queue; runaway processes get killed
- Cost: judge runners are the dominant cost; pool them across languages and problems to maximize utilization
- Fairness: in a contest, two users submitting the same problem at the same time should get verdicts in similar time regardless of queue position
Capacity estimation
Public-scale assumption: ~1M+ daily active users, ~5 submissions/user/day, so ~5M submissions/day = ~60/sec average. Contest spikes are the operational challenge: a popular weekly contest with 100K participants who each submit ~4 times in a 90-minute window = ~400K submissions in 5400 seconds = ~75/sec average across the contest, with peaks at start-of-contest (when everyone reads the first problem and submits a brute-force attempt) reaching ~5000/sec for short bursts.
Per-submission compute: a typical submission runs against ~20 test cases. Each test case has a CPU time limit (typically 1-5 seconds) and a memory limit (typically 256 MB-1 GB). A worst-case submission consumes ~100 seconds of CPU time. With a 5000/sec burst and 100s of work per submission, instantaneous compute demand = 500K CPU-seconds/sec = ~500K CPU cores. Pool of ~10K-50K cores typical; contest queue depth absorbs the difference.
Storage: each submission stores ~5-50 KB (source code) + verdict metadata (~500 bytes) + test-case-level results (~2 KB). At 5M submissions/day × ~10 KB average = ~50 GB/day, ~18 TB/year. Sharded by user_id in a durable store. Test data (the hidden test suite for each problem) is static — ~10 GB total across ~10K problems — cached on every judge runner.
Queue: at 5000/sec submission burst, the work queue depth can grow to ~50K items during a contest spike. Queue must persist (no submission lost on a server crash) and support priority (premium-tier users skip normal-tier queue).
Result-storage growth: contest leaderboards require fast reads of all verdicts per user per problem. With 1M users × 10K problems × ~50 bytes per (user, problem, best_verdict) tuple = ~500 GB sparse data, mostly compressible.
High-level design
Four core services: submission API, work queue, judge-runner pool, and result store. Plus a sandbox layer that wraps every code execution.
Submission API: user POSTs source code + problem_id + language. Server validates, writes submission row to durable store, returns submission_id, and enqueues a job with the submission_id + problem_id + language. Response is immediate (the judging happens asynchronously); the client polls or subscribes via websocket for the verdict.
Work queue: a persistent priority queue (Redis-backed or specialized queue service). Jobs have priority levels: contest > premium-tier user > normal-tier user. A job is (submission_id, problem_id, language, priority, enqueued_at). Workers pull jobs from the queue with priority ordering. The queue persists across worker crashes — a job in flight gets re-enqueued if its worker dies without acking.
Judge-runner pool: a fleet of worker hosts, each running ~4-16 sandbox slots in parallel. A worker pulls a job, hydrates the submission source code and the problem's test suite, compiles (for compiled languages), then runs the binary against each test case inside a sandbox slot. After all test cases run (or one fails on a non-AC verdict), the worker writes the verdict + per-test-case details to the result store and acks the queue job.
Result store: persistent storage for (submission_id, verdict, per_test_case_results, timing, memory_usage). Indexed by submission_id for direct lookup and by (user_id, problem_id) for 'show me my submissions for this problem.' Contest leaderboards read aggregated views derived from this store.
Problem store: each problem has a public statement, sample test cases, hidden test cases (the judging suite), reference solution(s), and per-language time/memory limits. Test cases live in object storage; judge runners pull them on demand and cache locally. Updates to problems (new test cases added to catch edge cases) trigger cache invalidation across the fleet.
Sandbox: see deep dive. Each code execution runs inside a sandbox that enforces resource limits (CPU, memory, wall time, disk write, network) and isolation (no access to other submissions, the host filesystem, or external services).
Deep dive — the hard problem
Three deep dives: the sandbox layer, queue back-pressure during contest spikes, and plagiarism detection.
Sandbox layer: untrusted code execution is the core security problem. A naive 'just run it in a subprocess' approach lets a malicious submission read the test data, exfiltrate the reference solution, or crash the host. Production judges use layered isolation.
The innermost layer is a kernel-level sandbox — Linux namespaces, cgroups, and seccomp filters (or equivalent on other OSes). Namespaces isolate the process's view of filesystem, network, PIDs, and users. cgroups enforce CPU time, CPU shares, memory, and disk I/O limits. Seccomp filters restrict which system calls the process can make — a sandboxed process typically can't open arbitrary files, fork unbounded child processes, or make network connections.
One layer up is a containerized runtime. Each submission runs in a fresh container (or a pooled container that's reset between submissions). The container has a minimal filesystem with just the language runtime, the test input, and a writable scratch directory. No access to other containers or the host.
For the highest-risk languages (anything that compiles to native code, or interpreted languages with broad syscall surfaces), a virtual-machine-level sandbox adds another layer. A microVM or lightweight hypervisor wraps the container, so even a container escape doesn't reach the host kernel. The cost is startup time (~hundreds of milliseconds for a VM vs ~tens of milliseconds for a container), which is acceptable for code-execution latencies of multiple seconds.
Resource limits are enforced per-test-case, not per-submission. If a single test case exceeds 2 seconds of CPU time, the verdict for that case is TLE (time-limit-exceeded), the runner kills the process, and the next test case starts in a fresh sandbox slot. This prevents one runaway test from poisoning later tests.
For compiled languages, the compile step also runs inside a sandbox — a malicious source file can trigger compiler bugs or consume unbounded resources during compilation. Compile time itself is capped (typically 10-30 seconds).
Network access is denied by default. Some problem types (mock-interview-style problems where the submission queries an API the judge exposes) need a controlled network path, but the default is no network — neither inbound nor outbound — so a malicious submission can't phone home with stolen test data.
Queue back-pressure during contest spikes: the start-of-contest moment is the worst spike. 100K participants click 'Start Contest' within a 30-second window; thousands of them immediately submit a quick first attempt. The queue grows to tens of thousands of pending jobs.
Four mitigations work together. (1) Auto-scaling: the runner pool grows on queue-depth signals. Most clouds add capacity in 30-90 seconds, which is too slow for instant spikes but catches up within minutes. (2) Pre-warming: for scheduled contests, the runner pool is scaled up 5-10 minutes before the start. (3) Priority queues: paying users and contest participants get priority over background submissions, so the contest experience is protected even when overall queue depth spikes. (4) Verdict latency degradation: during peaks, the SLA loosens (verdicts in 30s instead of 10s) gracefully rather than failing — the queue keeps absorbing, the runners keep working, and users see slower but still-functional verdicts.
Fairness in a contest: two users submitting the same problem at similar times shouldn't get verdicts at wildly different times based on which runner picked them up. Sharding the queue by problem_id (or by submission_hash) plus fair-share scheduling per user prevents one heavy submitter from monopolizing runners. Mention fairness as a contest-specific concern.
Plagiarism detection: submissions are compared against (a) other submissions on the same problem, (b) public code repositories indexed for popular problems, and (c) known cheat-sheet aggregators. The detection runs asynchronously after the verdict — a flag isn't an automatic rejection, it's an escalation to a moderation queue.
The algorithm typically uses code-similarity metrics. Token-level normalization (rename variables to placeholders, normalize whitespace) followed by Winnowing or MOSS-style fingerprinting catches simple copy-paste with renamed variables. AST-level comparison catches structural similarity even when syntax is rearranged. Cross-language detection (someone translates a Python solution to C++ to obscure the source) requires a higher-level semantic embedding — modern systems use code-embedding models that map functionally-equivalent code to nearby vectors.
In contests, plagiarism flags result in time penalties or disqualification post-contest, not real-time rejection. Real-time enforcement would be too disruptive (false positives are common) and would tip off the cheater that detection happened.
Fourth tradeoff: language support breadth. Every supported language adds operational cost — a compiler/runtime to maintain, a sandboxed image to build, language-specific timing constants to calibrate (Python is 5-10x slower than C++ at the same algorithm). Most judges pick ~20 languages and accept that obscure-language solvers are out of scope. Mention the calibration constant — problem time limits are typically stated in C++ seconds and multiplied by a language factor for slower languages.
Fifth: a 'special judge' or 'checker.' Some problems don't have unique correct outputs (e.g., 'find any path from A to B' has many valid answers). The judge runs a problem-specific checker against the submission's output to determine correctness instead of byte-exact comparison. The checker itself is a piece of untrusted code (written by the problem author) and runs in the same sandbox as submissions.
Common mistakes
- Treating sandboxing as 'just run it in a subprocess' — untrusted code execution needs layered isolation (kernel sandbox + container + optionally VM)
- Ignoring contest spikes — start-of-contest queue depth can be 50x baseline and needs explicit handling
- Forgetting per-test-case isolation — one runaway test should be killed and the next test should start fresh, not poisoned
- Skipping plagiarism detection — at LeetCode/Codeforces scale, cheating is rampant and detection is a first-class feature
- Setting one time limit for all languages — Python is much slower than C++ at the same algorithm and needs a language-adjusted limit
Likely follow-up questions
- How would you support interactive problems where the submission and the judge communicate back-and-forth during execution?
- What changes if the platform must support GPU-accelerated submissions (e.g., a competitive ML challenge)?
- How would you design the leaderboard to update in real-time during a contest with 100K participants?
- How would you handle a 'practice mode' where the user wants to see exactly which test case their submission failed on without giving up the failing input (anti-cheat)?
- How would you implement a 'language change' feature where a user resubmits the same problem in a different language and the system carries over their progress?
Practice Design an Online Judge live with an AI interviewer
Free, no sign-up required. Get real-time feedback on your design.
Practice these liveFrequently asked questions
- How long is the Design Online Judge interview?
- 60 minutes is typical. Expect deep questions on sandboxing (the senior signal) and contest queue handling. Plagiarism detection is bonus depth.
- Do I need to know specific sandboxing primitives (namespaces, cgroups, seccomp)?
- Naming them and explaining what each does (namespaces for isolation, cgroups for resource limits, seccomp for syscall filtering) is enough. Drawing the exact seccomp filter is overkill.
- How is this different from a generic background-job system?
- Two unique constraints. (1) Sandboxing for untrusted code — generic job systems run trusted internal code. (2) Per-task strict resource limits with kill-on-exceed — generic systems retry on failure. The architecture also has unique features (test-case-level verdicts, contest fairness scheduling) that don't exist in generic queues.
- What is the most important concept for Design Online Judge?
- Sandbox isolation. The senior signal is recognizing that 'run untrusted code' is the central security problem and proposing layered isolation (kernel sandbox + container + optionally VM) rather than treating it as a subprocess problem. Contest spike handling is a close second.