Why isn't pure Elo good enough for modern matchmakers?

Elo treats every player as a single deterministic point on a number line, but new players' ratings are highly uncertain — early matches with them are essentially random. A Bayesian model with explicit uncertainty produces dramatically better early-game matches and converges faster to true skill.

Do I need to deep-dive on the rating math?

Mention the model family (Elo or Bayesian-with-uncertainty) and the key inputs (rating-difference, outcome, optionally margin-of-victory). Drilling into the update equation is out of scope unless the interviewer pushes — focus on the system properties: deterministic, replayable, server-attested.

How is matchmaking different from a dating-app match (Design Tinder)?

Dating matches are bilateral and unconstrained on count (one user matches with one other). Game matchmaking is N-way (10 players form a single lobby) and has strict balance constraints (the two teams must be near-equal in summed skill). The queue search is fundamentally different — N-way constraint satisfaction, not pairwise scoring.

What is the most important concept for this question?

Queue dynamics with widening windows. The naive design uses a fixed skill window and stalls; the production design adapts the window to player density at each rating band. Demonstrating this tradeoff signals senior-level understanding.

System Design Questions

Design a Real-Time Gaming Matchmaker — System Design Interview Guide

Design a Real-Time Gaming Matchmaker is a system-design interview that asks you to pair skill-rated online players into balanced matches: a player clicks 'find game', and within 10-30 seconds they're in a lobby with 9 other players of similar skill, on a server in their region, with acceptable network latency. The hard part is the queue dynamics — too strict on skill matching and queues stall; too loose and matches are one-sided — plus the rating computation that has to be incremental and tamper-resistant.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Microsoft
Amazon
Roblox
Activision
Riot Games

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Player enters matchmaking queue with game mode, region preference, party composition
System produces a balanced lobby (e.g., 10 players, 5-vs-5) within a target wait time
Compute and update each player's skill rating after each match
Allow parties of friends to queue together (the party's combined skill must still produce a balanced match)
Support multiple game modes, each with its own queue and rating
Allow players to cancel queue or leave a lobby pre-game

Non-functional requirements

Scale: 10M concurrent online players at peak, ~1M concurrent in queue, 500K matches/hour
Queue latency: median wait time <30 seconds; p95 <90 seconds; p99 <3 minutes
Match-quality: at least 80% of matches within 100 rating-point skill spread
Network latency: server selected within 80ms RTT of all 10 players for >95% of matches
Availability: 99.9% on matchmaking; players tolerate brief queue freezes more than mid-match drops

Capacity estimation

Public assumptions for a popular online competitive game: ~50M MAU, ~10M peak concurrent, ~1M concurrently in matchmaking. Match rate: 1M / 10 players-per-match / 30-second-avg-wait = ~3,300 matches starting per second peak; over an hour, ~500K matches.

Queue state: 1M players × ~200 bytes per queue entry (player_id, rating, region, party_id, joined_at, preferences) = 200 MB. Easily fits in memory across a queue tier.

Match history: 500K matches/hour × 24 hours = 12M matches/day × ~5 KB per match record (player ids, scores, duration, server, replay pointer) = 60 GB/day, ~22 TB/year. The replay binary lives in object storage (separate from the relational match-record store); a typical replay is 10-50 MB.

Rating writes: 500K matches/hour × 10 players = 5M rating updates/hour = ~1,400/sec. Each update is a small row write keyed by player_id, sharded by player_id. Trivial.

Region partitioning: assume 6 regions (NA-East, NA-West, EU-West, EU-Central, Asia-East, OCE). Each region runs its own queue tier locally. Cross-region queuing happens only after wait exceeds a threshold and we accept higher network latency to find a match. Per-region peak: 1M / 6 = ~170K in queue per region.

Queue dynamics: wait time is governed by player density at a rating band. At a popular rating (say, the median, 1500), the band [1450, 1550] sees thousands of players join per minute — matches form in seconds. At the tail (rating 2700+, top 0.1%), the band might see one player join every 30 seconds — wait times balloon to minutes. Production systems handle this by widening the skill window over time: start at ±50 rating, widen by +25 every 15 seconds until a match is found or max-spread is hit.

High-level design

Four core services: queue, matcher, rating, match-coordinator.

Queue service maintains the live in-memory queue per region per game mode. It's not a FIFO — it's a multi-dimensional index keyed by (rating, region, party_size, mode). Each player entry holds metadata: arrival time, skill rating, party_id (if grouped), region, preferences. The queue is sharded by region and replicated within a region for fault tolerance; a queue node failure should not eject players mid-search.

Matcher runs the actual pairing algorithm. It pulls candidate players from the queue, applies the matching rules (skill window, party-composition constraints, region constraint), and produces a lobby. The matcher does NOT scan the entire queue per cycle — it operates on rating bands. Each band has its own search loop; players in adjacent bands are considered only when their wait time exceeds the widening threshold. The matcher emits lobby-formed events to the match-coordinator.

Rating service computes and stores each player's skill rating (commonly an Elo-derived rating or a Bayesian rating that tracks both a skill estimate and an uncertainty value). After each match, it ingests the result, computes the rating update for each of the 10 players, writes back to the player-rating store (sharded by player_id, eventually-consistent). Rating updates are based on (expected outcome | rating_difference) vs. (actual outcome); a player who loses to a much lower-rated opponent loses more rating than one who loses to a much higher-rated opponent.

Match-coordinator handles the lifecycle from lobby-formed to match-ended. It selects a game server (a stateless game-server-fleet manager handles allocation), confirms all 10 players are still online and ready, opens the connection, listens for match-result events, then triggers the rating update. If a player disconnects pre-match, the coordinator reverts the lobby and re-queues the remaining 9 (or fails the whole lobby if the disconnect was the host).

Party handling: a party of friends enters the queue as a single unit. The party's effective rating is a function of member ratings (commonly the average, or a weighted blend that penalizes high-spread parties to prevent 'smurfing' — a high-rated player carrying a low-rated friend). The matcher treats the party as a single queue entry with that effective rating, and ensures the opposing party (or a balanced collection of solo+small-parties on the other team) hits the same effective rating.

Deep dive — the hard problem

Two deep dives: the skill-rating model, and queue dynamics with region partitioning.

Skill rating model — three viable patterns.

Pattern A: Elo. The classic chess rating. Each player has a single number; after each game, the winner takes points from the loser proportional to the rating difference. Simple, well-understood, easy to debug. Tradeoff: doesn't model uncertainty, so a brand-new player and a 1000-game player with the same rating are treated identically. New players churn through provisional ratings that lag their true skill.

Pattern B: Bayesian rating (mean + variance). Each player has a skill mean and an uncertainty variance. New players start with high variance, which decreases as games accumulate. The matcher prefers pairing players with overlapping skill distributions, not just identical means — a player at 1500±50 is a fine match for 1500±50, but a poor match for 1500±300 (the high-variance player could actually be 1200 or 1800). Production-grade and widely used.

Pattern C: per-mode skill plus team skill. The skill rating is multidimensional: rating in (mode_A, mode_B, ...) and a separate rating for team-based vs. solo play. Adds complexity but matches reality — a player who's great at 1-vs-1 may be mediocre in 5-vs-5. Pick this only if the product genuinely has very different modes; otherwise it adds queue fragmentation (each mode's queue is smaller, wait times grow).

Rating-update path is the critical surface: it must be incremental (O(1) per game per player) and tamper-resistant (clients cannot self-report results). Production systems take the match result from the game-server-attested replay log, not from the client. The rating-update is computed deterministically from (player_ratings_before, match_outcome) — a re-runnable function. If a cheating report leads to a retroactive disqualification, the rating service replays the update with the cheater removed; subsequent matches the cheater played get their updates rolled back and reapplied in order. Replayability matters more than absolute correctness on the first pass.

Queue dynamics: the central tension is skill spread vs. wait time. Tight skill matching produces balanced, fun matches but stalls queues; loose matching produces lopsided matches that drive churn. Production tactic: skill-window widening over time. Player joins at rating 1500. First 15 seconds: search [1450, 1550]. Next 15: [1425, 1575]. Next 15: [1400, 1600]. Cap at, say, [1300, 1700]. If still no match at the cap, broaden region constraints (allow cross-region matches with up to +50ms RTT) before broadening skill further.

Region partitioning: each region runs its own queue and matcher. 95%+ of matches form within-region with good network latency. The cross-region escape hatch kicks in for tail players (top 0.1% by rating in low-population regions). Implementation: each region's queue periodically publishes a 'wait-time pressure' metric per rating band; a cross-region matcher monitors these and proactively pairs players from two regions when one has surplus and the other has deficit. The opening latency hit is well-defined (each player gets a server with ~+50ms RTT vs. their preferred region) and accepted as the alternative to a 5-minute wait.

Third tradeoff: party imbalance. A party of (1800, 1800) vs. (1500, 2100) has the same average rating but very different match dynamics — the 2100 will dominate while the 1500 gets carried. Production matchers compute a 'party spread penalty' and refuse to pair high-spread parties against low-spread parties. Mentioning this signals you understand that average-of-ratings is an insufficient summary statistic.

Fourth: queue manipulation and smurfing. Players intentionally lose to drop rating and stomp lower-rated opponents. Detection: track win-rate vs. expected win-rate over a rolling window; a player who loses 20 games then wins 20 with a 90% rate is suspicious. Production response: rating decay on suspected accounts, shadow-queue them with other smurfs. Mention this as an operational concern; interviewers reward the candidate who raises it.

Common mistakes

Treating the queue as a FIFO — matchmaking is a multi-dimensional search problem, not a linear one
Forgetting party composition — a 5-friend party with a 600-rating spread is unfair against a balanced 5 solo players
Skipping the widening-window dynamic — a static skill window stalls queues at the rating tails
Trusting the client for match results — rating updates must come from the game server, not the player
Designing one global queue without region partitioning — network latency would tank match quality

Likely follow-up questions

How would you design a 'ranked' season system where ratings reset every 3 months and rewards are tier-gated?
What changes if you have to match across multiple platforms (PC + console + mobile) with different input latencies?
How would you support a 'role queue' where each lobby needs specific roles filled (1 tank, 2 healers, 2 damage)?
How would you detect and shadow-ban smurfs without breaking legitimate alt-accounts?
How would you handle a scenario where 20% of players in a region simultaneously disconnect due to an outage?

Related system design scenarios

Frequently asked questions

Why isn't pure Elo good enough for modern matchmakers?: Elo treats every player as a single deterministic point on a number line, but new players' ratings are highly uncertain — early matches with them are essentially random. A Bayesian model with explicit uncertainty produces dramatically better early-game matches and converges faster to true skill.
Do I need to deep-dive on the rating math?: Mention the model family (Elo or Bayesian-with-uncertainty) and the key inputs (rating-difference, outcome, optionally margin-of-victory). Drilling into the update equation is out of scope unless the interviewer pushes — focus on the system properties: deterministic, replayable, server-attested.
How is matchmaking different from a dating-app match (Design Tinder)?: Dating matches are bilateral and unconstrained on count (one user matches with one other). Game matchmaking is N-way (10 players form a single lobby) and has strict balance constraints (the two teams must be near-equal in summed skill). The queue search is fundamentally different — N-way constraint satisfaction, not pairwise scoring.
What is the most important concept for this question?: Queue dynamics with widening windows. The naive design uses a fixed skill window and stalls; the production design adapts the window to player density at each rating band. Demonstrating this tradeoff signals senior-level understanding.