Design Zoom — System Design Interview Guide
Design Zoom is a system-design interview that asks you to build a video-conferencing platform: hundreds of millions of users join real-time meetings, audio and video streams flow between participants with sub-300ms latency, and the system scales to meetings of thousands. The hard part is the media routing topology and the bandwidth math.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Zoom
- Microsoft
- Meta
- Cisco
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Host or schedule a meeting with a meeting ID, optional password, and waiting room
- Join a meeting via meeting ID and stream audio/video/screen-share
- Real-time chat within the meeting
- Recording: cloud recording of audio, video, and chat for later playback
- Breakout rooms: dynamically partition a meeting into smaller sub-rooms
- Optional: live transcription and noise suppression
Non-functional requirements
- Scale: ~300M daily meeting participants, average meeting ~5 participants, max meeting ~10K participants in webinar mode
- Audio/video latency: <300ms p99 end-to-end mouth-to-ear
- Packet-loss tolerance: usable audio at 5% packet loss, usable video at 15%
- Availability: 99.99%; meetings cannot drop mid-session
- Bandwidth: adaptive bitrate from 50 kbps (audio-only fallback) to 3 Mbps (HD video)
Capacity estimation
Public 2024 scale anchors: Zoom handles ~300M daily meeting participants, peak concurrency in the tens of millions. Average meeting ~5 participants × 30 minutes; peak meetings can run to 10K participants in webinar mode.
Bandwidth math is the dominant constraint. Per-participant uplink at 720p video + audio is ~1.5 Mbps; downlink for receiving the other N-1 streams is ~1.5 Mbps × (N-1). In a naive full-mesh of 100 participants, each user uploads 99 streams × 1.5 Mbps ≈ 150 Mbps. This is infeasible: most consumer connections cannot sustain 150 Mbps of uplink. The architecture must reduce per-participant bandwidth (see deep dive).
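The per-participant arithmetic can be checked with a few lines of Python. This is a minimal sketch that uses only the ~1.5 Mbps per-stream figure from the estimate; the function names are illustrative, and the single-upload case anticipates the forwarding-server topology covered in the deep dive.

```python
STREAM_MBPS = 1.5  # 720p video + audio, the per-stream figure used in the estimate


def mesh_uplink_mbps(n: int, stream_mbps: float = STREAM_MBPS) -> float:
    """Full mesh: each participant uploads one copy per other participant."""
    return max(n - 1, 0) * stream_mbps


def single_upload_mbps(n: int, stream_mbps: float = STREAM_MBPS) -> float:
    """Forwarding server: each participant uploads one copy, whatever the meeting size."""
    return stream_mbps if n > 1 else 0.0


def downlink_mbps(n: int, stream_mbps: float = STREAM_MBPS) -> float:
    """Receiving every other participant at full quality costs N-1 streams."""
    return max(n - 1, 0) * stream_mbps


if __name__ == "__main__":
    for n in (5, 25, 100):
        print(n, mesh_uplink_mbps(n), single_upload_mbps(n), downlink_mbps(n))
```

The uplink column is what changes between topologies; the downlink stays at N-1 streams until the server starts forwarding only a subset.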
Server-side bandwidth: at peak ~30M concurrent participants × 1.5 Mbps uplink = ~45 Tbps inbound. Even if compression and multi-stream optimization cut that by 10x, the platform still ingests several Tbps continuously, before counting the outbound copies the servers forward. Servers are distributed across hundreds of geographic points-of-presence (POPs) to keep network paths short.
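The aggregate figure is a straight multiplication, sketched below. The POP count of 300 is a hypothetical stand-in for "hundreds", and the even split across POPs is an assumption made only to get an order of magnitude.

```python
def aggregate_inbound_tbps(concurrent_participants: int, uplink_mbps: float) -> float:
    """Total inbound media bandwidth in Tbps (1 Tbps = 1,000,000 Mbps)."""
    return concurrent_participants * uplink_mbps / 1_000_000


def per_pop_gbps(total_tbps: float, pop_count: int) -> float:
    """Average inbound load per POP in Gbps, assuming an even split (hypothetical)."""
    return total_tbps * 1_000 / pop_count


if __name__ == "__main__":
    total = aggregate_inbound_tbps(30_000_000, 1.5)
    print(total, per_pop_gbps(total, 300))
```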
Storage: cloud recordings are voluminous — a typical 1-hour meeting at 720p with 5 active participants compresses to ~500 MB. At 1M recorded meetings/day × 500 MB = 500 TB/day = ~180 PB/year of recordings, the dominant storage cost. Recordings live in tiered object storage with hot/warm/cold lifecycle.
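The recording estimate works out as follows. The sketch uses decimal units (1 TB = 1,000,000 MB) and ignores replication and lifecycle deletion, both of which would change the real footprint.

```python
def recording_storage(meetings_per_day: int, mb_per_meeting: float):
    """Return (TB per day, PB per year) of raw recording bytes."""
    tb_per_day = meetings_per_day * mb_per_meeting / 1_000_000
    pb_per_year = tb_per_day * 365 / 1_000
    return tb_per_day, pb_per_year


if __name__ == "__main__":
    print(recording_storage(1_000_000, 500))
```

At 1M recorded meetings per day this gives 500 TB/day and 182.5 PB/year, which is where the ~180 PB/year figure comes from.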
Metadata storage is small: meeting records ~1 KB/meeting × millions of meetings/day = a few GB/day. Chat messages and reactions during meetings are ephemeral by default, persisted only on recording-enabled meetings.
High-level design
Four core services: signaling, media routing, recording, and meeting metadata. The media-routing topology is the architectural centerpiece.
Signaling service handles meeting joining, password validation, and participant state. Clients connect via a persistent connection (WebSocket) to the nearest regional signaling node. The signaling node tracks meeting membership, mute states, and roles (host, panelist, attendee). Participant join/leave events propagate to all participants in the meeting via the signaling fanout.
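The membership tracking and join/leave fanout can be sketched in memory. This is an illustration under simplifying assumptions, not a production design: the `inbox` list stands in for a participant's WebSocket connection, and the class and event names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Participant:
    user_id: str
    role: str = "attendee"  # host, panelist, or attendee
    muted: bool = False
    inbox: list = field(default_factory=list)  # stand-in for a WebSocket connection


class SignalingNode:
    def __init__(self):
        # meeting_id -> {user_id: Participant}
        self.meetings = defaultdict(dict)

    def _fanout(self, meeting_id, event, exclude=None):
        """Deliver an event to every member of the meeting except `exclude`."""
        for user_id, participant in self.meetings[meeting_id].items():
            if user_id != exclude:
                participant.inbox.append(event)

    def join(self, meeting_id, participant):
        self.meetings[meeting_id][participant.user_id] = participant
        self._fanout(meeting_id, {"type": "join", "user": participant.user_id},
                     exclude=participant.user_id)

    def leave(self, meeting_id, user_id):
        if self.meetings[meeting_id].pop(user_id, None) is not None:
            self._fanout(meeting_id, {"type": "leave", "user": user_id})

    def set_muted(self, meeting_id, user_id, muted):
        self.meetings[meeting_id][user_id].muted = muted
        self._fanout(meeting_id, {"type": "mute", "user": user_id, "muted": muted})
```

A real signaling tier would also need authentication, reconnect handling, and state replication across nodes, none of which is shown here.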
Media routing is the heart. Each meeting is assigned to a Selective Forwarding Unit (SFU) server in a regional POP near most participants. Participants send their audio/video streams once to the SFU; the SFU forwards copies of each participant's stream to other participants. This is fundamentally different from a Multipoint Control Unit (MCU), which decodes, mixes, and re-encodes every stream (high CPU). An SFU only forwards packets, so it needs little CPU but a lot of bandwidth per server, and it scales far better.
Client uplink: each participant sends ONE outgoing stream to the SFU (one upload). Client downlink: receives N-1 streams (or a subset for large meetings — see deep dive). This is the bandwidth reduction the architecture provides.
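The forwarding behavior can be illustrated with a toy in-memory model. It is a sketch only: the per-participant lists stand in for network sockets, the class name is hypothetical, and real SFUs also handle RTP/RTCP, congestion feedback, and encryption.

```python
class ToySfu:
    """Receives each packet once and forwards copies without decoding them."""

    def __init__(self):
        self.outbound = {}  # participant_id -> list of (sender_id, packet)

    def add_participant(self, participant_id):
        self.outbound[participant_id] = []

    def on_packet(self, sender_id, packet: bytes) -> int:
        """Forward one inbound packet to everyone except the sender.

        Returns the number of copies sent (N-1 for an N-person meeting).
        """
        copies = 0
        for participant_id, queue in self.outbound.items():
            if participant_id != sender_id:
                queue.append((sender_id, packet))  # payload is passed through untouched
                copies += 1
        return copies
```

The sender pays for one upload; the server pays for the N-1 copies, which is why SFU capacity is bounded by bandwidth rather than CPU.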
For very large meetings (1K+ participants), a single SFU can't handle all forwarding. The system uses a cascaded SFU mesh: multiple SFU servers exchange streams among themselves, each serving a subset of participants. Cascading adds one hop of latency but lets meeting size grow well beyond what a single server can forward.
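A rough sizing sketch for the cascade is shown below. Both the per-SFU capacity of 500 participants and the full-mesh interconnect between SFUs are assumptions for illustration; the source does not specify either.

```python
import math


def plan_cascade(participants: int, per_sfu_capacity: int) -> dict:
    """Estimate how many SFUs a meeting needs and how many SFU-to-SFU links result.

    Assumes every SFU peers directly with every other one, so any stream
    reaches any participant with at most one extra server hop.
    """
    sfus = max(1, math.ceil(participants / per_sfu_capacity))
    links = sfus * (sfus - 1) // 2
    return {"sfus": sfus, "inter_sfu_links": links}


if __name__ == "__main__":
    print(plan_cascade(10_000, 500))
```

With these hypothetical numbers a 10K-participant webinar needs 20 SFUs and 190 inter-SFU links, though only the streams actually being forwarded need to cross each link.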
Recording service runs as a 'virtual participant' that joins a meeting, receives all streams via the SFU like any other participant, and composes them into a recorded file. Recording servers handle the decode-compose-encode that the SFU avoids. Recordings are written to object storage and indexed by meeting metadata for later playback.
Meeting metadata (room, scheduled time, host, participants, recording URL) lives in a sharded relational store. This is small data, dwarfed by the recording bytes in object storage.
Deep dive — the hard problem
Four topics, in priority order: the SFU vs MCU topology choice, the simulcast/SVC bandwidth optimization, regional routing with cascading, and noise suppression and AI features.
SFU vs MCU topology: the textbook conferencing options are full-mesh (every participant sends to every other directly), MCU (server decodes all streams, mixes into one composite, sends the composite to each participant), and SFU (server receives each stream once, forwards copies to each participant without re-encoding).
Full-mesh fails above ~6 participants — uplink bandwidth explodes per participant.
MCU works for small meetings but is CPU-bound at the server: for the whole duration of each meeting it must decode every incoming video stream, mix the frames into a composite, and re-encode the result, which is expensive at scale. MCUs are still used in dial-in audio bridges (audio mixing is cheap) but rarely for video at scale.
SFU is the production answer for video conferencing at scale. Each client encodes once (cheap on the client) and the server forwards packets (cheap on the server). The tradeoff is that each client receives N-1 streams and must decode all of them in parallel. This holds up to roughly 20-50 active video streams depending on the device; beyond that, clients struggle. So above ~20 active streams the SFU applies 'active speaker' selection: only the top ~9 active speakers are forwarded as full video, while the other participants remain in the meeting with audio but their video is not forwarded (or is sent at very low resolution).
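Active-speaker selection reduces to a top-N pick over recent audio levels. The sketch below assumes the SFU already has a smoothed audio level per participant (how that level is measured is not covered here); the function name and the input dictionary are hypothetical.

```python
import heapq


def select_video_senders(audio_levels: dict, max_video: int = 9) -> set:
    """Return the ids of the `max_video` loudest participants.

    `audio_levels` maps participant id -> recent smoothed audio level.
    Everyone else keeps audio but has video withheld or downgraded.
    """
    top = heapq.nlargest(max_video, audio_levels.items(), key=lambda item: item[1])
    return {participant_id for participant_id, _ in top}
```

A production version would add hysteresis so the video grid does not reshuffle every time someone coughs.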
Simulcast and SVC: to support participants with varying network conditions, each video sender encodes the same stream at multiple resolutions simultaneously (simulcast: typically 3 layers, such as 180p, 360p, and 720p). The SFU receives all 3 layers and selects which to forward to each receiver based on receiver bandwidth. A weak-network participant receives the lowest layer; a strong-network participant receives 720p, without the sender having to know about either.
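Per-receiver layer selection can be sketched as picking the highest layer that fits a bandwidth budget. The layer names and bitrates below are hypothetical placeholders, and the budget is assumed to be the bandwidth available for this one stream, not the receiver's whole downlink.

```python
# (layer name, approximate bitrate in kbps), ordered lowest to highest.
# These values are illustrative assumptions, not measured figures.
SIMULCAST_LAYERS = [("low", 150), ("mid", 500), ("high", 1500)]


def pick_layer(budget_kbps: float, layers=SIMULCAST_LAYERS) -> str:
    """Return the highest layer whose bitrate fits the per-stream budget.

    Falls back to the lowest layer when nothing fits, so the receiver
    still gets some video rather than none.
    """
    chosen = layers[0][0]
    for name, kbps in layers:
        if kbps <= budget_kbps:
            chosen = name
    return chosen
```

The sender is unaware of this decision; the SFU simply forwards packets from whichever layer was chosen for each receiver.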
SVC (Scalable Video Coding) goes further: a single encoded stream contains multiple decodable layers — the SFU strips off higher layers for weak networks without re-encoding. SVC is more efficient than simulcast on the network side but more complex on the codec side. Production systems often use simulcast for video and a single layer for audio.
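Layer stripping for SVC can be illustrated as a filter over tagged packets. This sketch assumes each packet carries a numeric layer id where layer k depends only on layers below it; the dictionary packet shape is hypothetical and real codecs signal layers in codec-specific headers.

```python
def strip_layers(packets: list, max_layer: int) -> list:
    """Keep only packets at or below `max_layer`, preserving order.

    No decoding or re-encoding happens: higher layers are simply dropped,
    and the remaining packets still form a decodable lower-quality stream.
    """
    return [packet for packet in packets if packet["layer"] <= max_layer]


if __name__ == "__main__":
    stream = [
        {"seq": 1, "layer": 0},
        {"seq": 2, "layer": 1},
        {"seq": 3, "layer": 2},
        {"seq": 4, "layer": 0},
    ]
    print(strip_layers(stream, max_layer=0))
```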
Third tradeoff: regional routing and cascading. Each meeting is assigned to a primary SFU in the region with the most participants. For globally-distributed meetings, additional SFUs in other regions handle their local participants and exchange streams with the primary via a cascading backbone. This trades one extra hop of latency (~30-80ms depending on regions) for vastly reduced inter-region traffic. The cascading mesh is the architecture used for very large webinars and global meetings.
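The primary-region choice described above can be sketched as a simple count. The region names are hypothetical, and the alphabetical tie-break is an assumption added only to make the result deterministic.

```python
from collections import Counter


def assign_sfu_regions(participant_regions: dict) -> dict:
    """Pick the region with the most participants as primary.

    `participant_regions` maps participant id -> region name. Every other
    region with at least one participant gets a secondary SFU that cascades
    to the primary.
    """
    counts = Counter(participant_regions.values())
    # Iterating in sorted order makes max() return the alphabetically
    # first region when several regions tie on participant count.
    primary = max(sorted(counts), key=counts.__getitem__)
    secondaries = sorted(region for region in counts if region != primary)
    return {"primary": primary, "secondaries": secondaries}
```

A real placement decision would also weigh server load and measured latency, which this count ignores.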
Fourth: noise suppression and AI features. Modern conferencing platforms run noise suppression and echo cancellation on the client side (cheap CPU per device) rather than on the server. Server-side ML features (live transcription, background blur) run only when explicitly enabled, and consume a separate pool of media-processing servers. Mention but don't drill in unless asked.
Common mistakes
- Defaulting to a full-mesh design — fails above 6 participants because uplink bandwidth doesn't scale
- Proposing an MCU as the primary topology — too CPU-expensive at scale; a single machine can host only ~20 concurrent meetings
- Forgetting simulcast/SVC — different participants have wildly different network conditions, and serving a single resolution to all is broken
- Treating recording as 'just object storage' — recording is an active participant that joins the meeting, consumes server capacity, and decodes-composes-encodes
- Ignoring the active-speaker selection for large meetings — without it, clients with 50+ active video streams crash
Likely follow-up questions
- How would your design support a 10K-participant webinar with one active speaker and 10K passive viewers?
- What changes if you have to support end-to-end encrypted meetings where the server can't see the media?
- How would you implement breakout rooms that dynamically partition an active meeting?
- How would you handle a global meeting where participants are spread across 5 continents with very different network conditions?
- How would you implement real-time captions that scale to thousands of concurrent meetings?
Frequently asked questions
- How long is the Design Zoom system-design round?
- 60 minutes typical. Senior+ rounds expect the SFU vs MCU discussion plus simulcast/SVC and the recording architecture.
- Do I need to know specific video codecs (VP8, H.264, AV1)?
- Naming one or two is fine; the architecture discussion lives above the codec layer. Saying 'VP8 or H.264 with hardware acceleration where available' is enough. Drawing the codec specifics is overkill.
- Should I cover Zoom-specific features like virtual backgrounds?
- Mention them as client-side ML features running on the device. Drilling into the model architecture wastes time you need for the media-routing discussion.
- What is the single most important concept for Design Zoom?
- SFU as the central routing topology, with simulcast for heterogeneous receivers. Almost every senior signal hinges on whether the candidate avoids full-mesh and proposes selective forwarding correctly.