Skip to main content

System Design Questions

Design a Stock Exchange — System Design Interview Guide

Design a stock exchange asks you to build the matching engine + order book + market-data distribution at the heart of an electronic trading venue. Buyers submit bids, sellers submit asks, the engine matches them under strict price-time priority, and confirmations + market data flow back out — all in single-digit microseconds. The hard part is deterministic matching at extreme low latency with perfect order replay.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Citadel
  • Jane Street
  • Two Sigma
  • Bloomberg
  • Nasdaq

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Accept orders from member firms (limit, market, IOC, FOK, stop) over a low-latency wire protocol
  • Maintain an order book per symbol with bids and asks sorted by price-time priority
  • Match incoming orders against the book under strict price-time priority and emit trade confirmations
  • Cancel, modify, and replace existing orders deterministically
  • Publish market data (best bid/ask, full depth, trade prints) to subscribed participants in real time
  • Maintain an append-only audit log of every order, modification, and trade for regulatory replay

Non-functional requirements

  • Matching latency: <10 microseconds p99 from order ingress to match decision (tick-to-trade end-to-end is single-digit microseconds for co-located members)
  • Throughput: ~1M orders/sec across all symbols at peak, ~100K trades/sec at peak
  • Determinism: identical input sequences must produce identical match outputs (regulatory replay requirement)
  • Fairness: price-time priority must be strict — no order is matched ahead of a same-price earlier order
  • Availability: 99.999%+ during trading hours; any downtime is a market halt with regulatory consequences

Capacity estimation

Anchor on US equities scale. A major exchange processes ~10M order events/sec across all listed symbols on the worst day (market open + close are the bursty windows). Trade events are ~10% of order events — most orders are cancels and modifies, not matches. Peak symbol-level burst on a high-volume stock can hit ~100K events/sec on that one symbol, which sets the per-symbol throughput requirement.

Latency budget breakdown for the 10-microsecond target: ~1us network ingress (kernel-bypass NIC), ~1us protocol decode, ~3us order-book lookup + match decision, ~1us trade-confirmation message construction, ~2us market-data publication, ~2us audit-log write (asynchronous, off the hot path). Every step is measured in nanoseconds-to-microseconds. The matching engine runs single-threaded per symbol — locking is the enemy of latency.

Storage estimates: ~10M events/sec × 23,400 trading-seconds/day = ~234B events/day. Each event ~100-200 bytes (order_id, symbol, side, price, qty, type, timestamp, member_id) = ~50 TB/day raw event log. Compressed and indexed for regulatory replay, ~15 TB/day net. Annual retention (typically 7 years for US equities under SEC Rule 17a-4) means petabyte-scale archives.

Memory footprint for live order books: ~10K listed symbols × ~1 MB per symbol (active orders + price levels + metadata) = ~10 GB. Easily in memory on one node per symbol-class shard. The active-order count per symbol is typically thousands-to-tens-of-thousands at any moment.

Market-data volume: best bid/ask updates fire on every level change. ~10M order events/sec × ~30% of which change the book top = ~3M top-of-book updates/sec. Full-depth feed is heavier (~10M+ updates/sec). Subscribers consume different feed slices — a co-located HFT subscribes to full depth; a retail broker subscribes to top-of-book + trades only.

The shape that matters: this is not a high-throughput problem in raw bytes. It's a determinism + latency problem. ~1M orders/sec is small compared to web traffic, but every order must match deterministically and the matching engine must be replayable bit-for-bit from the audit log.

High-level design

Five core components: order gateway, matching engine, market-data publisher, audit-log persistence, and the regulatory-replay system.

The order gateway is the member-firm-facing surface. Members connect over a session-based binary protocol (FIX or a proprietary low-latency variant) — typically over a kernel-bypass NIC for co-located members. The gateway authenticates the session, validates order parameters (price tick rules, lot-size rules, symbol exists), assigns a sequence number, and forwards the order to the matching engine. Sequencing is critical: the assigned sequence number is the authoritative ordering for all downstream processing, including replay. Gateways sit in front of the matching engine and shed load by rate-limiting per member.

The matching engine is the heart of the system. It runs single-threaded per symbol (or per small group of related symbols) to eliminate locking. The engine holds the order book in memory as two sorted price-level structures: bids descending by price, asks ascending by price. Each price level is a FIFO queue of orders at that price. On an incoming order: if marketable (a buy crosses the best ask or vice versa), match against the opposite side starting at the best price and the oldest order at that price; emit trade events for each match; reduce or remove matched orders; if any quantity remains and the order is a limit, place the residual on the book; if the order is IOC, cancel any residual.

Determinism is achieved by single-threaded processing of a sequenced input stream. Two engine instances given the same input sequence produce bit-identical outputs. This enables hot-standby failover: a secondary engine consumes the same sequenced input and runs in lockstep; on primary failure the secondary becomes primary with zero data divergence.

The market-data publisher consumes the engine's output stream (trade events + book-update events) and broadcasts to subscribers. The hot path uses multicast to co-located subscribers — single packet, many recipients, no per-subscriber CPU on the publisher. Slower subscribers receive a TCP feed with sequencing and retransmit support. Both feeds carry the same sequence numbers; a subscriber detecting a gap requests the missing range via the retransmit channel.

Audit-log persistence writes every input event and every output event to an append-only log on durable storage. The log is sequenced, time-stamped (often to nanoseconds via a hardware time source), and replicated across multiple data centers. The log is the source of truth for regulatory replay: given any sequence range, the system must be able to re-run the matching engine and produce the exact same trades. The audit log is also the foundation for post-trade reconciliation against member-firm records.

Regulatory replay: a separate offline cluster can rebuild any historical book state by replaying the audit log from a snapshot. Regulators query specific trades and require the venue to produce the full context (the book state immediately before the match, the matched orders, the timing). The replay system must be bit-identical to the live engine — the same code path, just driven by recorded input rather than live input.

Deep dive — the hard problem

Three deep dives: order-book data structures, matching-engine determinism, and market-data multicast with gap recovery.

Order-book data structures — the active book has two operations dominating latency: insert a new order at a price (potentially a new price level), and remove an order (on cancel or full fill). Both must be O(log n) or better in the number of price levels.

The standard implementation is a sorted map from price to a FIFO queue, plus a hash map from order_id to the queue node for O(1) cancellation. For bid side, prices are sorted descending; for ask side, ascending. The best bid is the first element of the bids map; the best ask is the first element of the asks map. Inserting at a new price is O(log n) in number of price levels (typically a tree-backed map). Inserting at an existing price is O(1) (push onto the FIFO). Canceling by order_id is O(1) lookup + O(1) removal from the FIFO via a doubly-linked list.

For very hot symbols with tight tick rules (lots of price levels), some engines use a flat array indexed by price-ticks-from-mid (array index = (price - mid_anchor) / tick_size). This collapses the price-level lookup to O(1) at the cost of memory proportional to the price range. Hybrid implementations switch between tree and array based on observed price density.

Matching is a loop on marketable orders: take the best price on the opposite side, take the oldest order at that level, match min(incoming_remaining, resting_remaining), emit a trade, advance. The loop is tight and branch-predictable; production implementations hand-tune it for cache locality and branch prediction.

Matching-engine determinism — the single-threaded constraint is non-negotiable. Once threading enters, the schedule depends on OS scheduling and the output is non-deterministic. Production engines are usually pinned to a single CPU core, with the core isolated from other workloads (no scheduler interference) and IRQs steered away. Memory allocation on the hot path is forbidden — all order objects come from pre-allocated pools, freed back to the pool on completion. The hot path never touches the heap.

The input to the engine must also be deterministically ordered. The order gateway assigns sequence numbers atomically; the engine processes orders strictly in sequence. If two members submit orders at literally the same moment, the gateway's sequencer breaks the tie deterministically (typically by member-id + sub-microsecond timestamp). The tie-break rule is published and consistent — members write algorithms that depend on this rule, so it must not change between releases.

Failover: a secondary engine consumes the same sequenced input stream and runs the same code. When the primary fails, the secondary becomes the primary. Because both engines are deterministic on the same input, the secondary's book state is bit-identical to the primary's at the failover point. The cutover is a session-level switch on the order gateways; existing orders survive failover unchanged.

Market-data multicast with gap recovery — multicast is unreliable transport. A subscriber can miss a packet for many reasons (NIC drop, switch buffer overflow, multicast tree reconvergence). Every market-data message carries a strictly-monotonic sequence number per symbol; subscribers detect a gap when the next received sequence is not previous+1. On gap detection, the subscriber requests the missing range over a TCP retransmit channel. The publisher maintains a short retransmit buffer (last few seconds of messages per symbol) and serves the request from that buffer.

The retransmit channel must not interfere with the live multicast path. Production exchanges run the multicast publisher and the retransmit responder as separate processes (sometimes separate machines) so a slow retransmit request can never back up the live feed. Subscribers that fall too far behind reset by snapshot — a periodic full-book snapshot is published, and the subscriber discards stale state and rebuilds from the snapshot plus incremental updates from the snapshot sequence number forward.

Fourth deep dive: timestamps. Sub-microsecond timing matters for regulatory reporting (US Reg NMS, MiFID II). The matching engine uses a hardware-derived timestamp (PTP-disciplined, often a precision time card on each server) and writes the timestamp to every event. The timestamp is not just for display — it's used in best-execution analysis by member firms and in cross-venue order routing.

Common mistakes

  • Treating the matching engine as multithreaded — locking destroys both latency and determinism
  • Skipping the failover mechanism — without a deterministic hot-standby, an engine failure is a market halt
  • Forgetting the audit log + replay system — regulators require bit-identical replay of any historical period
  • Designing market data as point-to-point TCP per subscriber — at 10K+ subscribers this doesn't scale; multicast is the only answer
  • Ignoring the gap-recovery channel — multicast is lossy and subscribers must be able to fill gaps without delaying the live feed

Likely follow-up questions

  • How would you add support for opening and closing auctions where all orders accumulate and match at a single clearing price?
  • What changes if you need to support cross-venue best-execution routing where orders may be routed away from your venue?
  • How would you detect and prevent self-trade (a member's buy crossing their own sell)?
  • How would you implement circuit breakers that halt trading when prices move beyond a threshold in a short window?
  • How would you build a regulatory-replay system that can reproduce any historical book state on demand?

Practice Design a Stock Exchange live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

How long is a Design Stock Exchange interview?
60-90 minutes at trading-firm and exchange interviews (Citadel, Jane Street, Nasdaq). Expect deep coverage of matching-engine determinism + order-book data structures + market-data distribution. The bar is higher than a typical web-scale system-design — expect to discuss microseconds, cache lines, and kernel bypass.
Do I need to know FIX protocol in detail?
Naming FIX as the typical wire protocol is enough. Specific tag numbers and message types are bonus signal but not required unless this is an exchange-platform-team interview where FIX expertise is the role.
What's the senior-bar topic?
Determinism. Specifically: explain why the matching engine is single-threaded, why memory allocation is forbidden on the hot path, and how a hot-standby achieves zero-divergence failover by running the same deterministic code on the same sequenced input. If you propose a multithreaded matcher, that's a hard no at senior+.
Should I discuss order types in detail?
Cover the major types (limit, market, IOC, FOK, stop, peg). Specific exotic types (iceberg, midpoint-peg, discretionary) are bonus signal but not required unless the interviewer probes.