Design Discord — System Design Interview Guide
Design Discord is a system-design interview that asks you to build community voice + text chat: users join servers (each server has many channels), exchange text messages in real time, and hop into low-latency voice rooms with dozens of concurrent speakers. The hard part is voice infrastructure plus channel fanout at server scale.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Discord
- Meta
- Roblox
- Twitch
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Users join servers; each server has many text and voice channels
- Send text messages to a channel; messages are delivered in real time to all online channel members
- Join a voice channel; speak and hear other participants with sub-300ms latency
- Direct messages between users (1:1 and small group DMs)
- Roles and permissions per channel (who can post, who can voice, who can moderate)
Non-functional requirements
- Text message delivery: <500ms p99 end-to-end for connected users
- Voice latency: <300ms mouth-to-ear p99 for participants on the same continent
- Scale: ~200M MAU, ~15M concurrent users, ~25B messages/day, peak ~1.5M voice participants concurrently
- Availability: 99.9% for voice (brief drops acceptable), 99.99% for text
Capacity estimation
Public Discord scale (2022-2024): ~200M+ monthly active users, ~15M+ concurrent online at peak, ~25B+ messages/day = ~290K messages/sec average. Servers (called 'guilds') range from 2 friends to 10M+ members — the largest few thousand servers dwarf the millions of small ones in traffic. Average user is in ~10-15 servers; most active in 3-5.
Message storage: 25B × ~150 bytes (smaller than Slack on average — gaming chats lean shorter) = ~4 TB/day text. Annual: ~1.5 PB. Attachments are uploaded separately to object storage and referenced by URL.
Voice is the unique scaling dimension. At peak, ~1.5M users are in voice channels simultaneously. Voice traffic per user is ~25-50 kbps (Opus codec), so aggregate participant uplink is roughly 40-75 Gbps at peak; server egress is a multiple of that, because each stream is forwarded to the other participants in its room. Voice is latency-critical (<300ms mouth-to-ear), which dictates geographically distributed voice servers: every voice room runs on a server geographically close to its participants. Voice servers are stateful and per-room; one server can host hundreds of rooms with up to ~25 active speakers each.
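The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the inputs are the figures quoted in this section, the variable names are illustrative, and the voice number counts participant uplink only (one stream per user), not forwarded egress.

```python
# Back-of-envelope check of the capacity figures; inputs are taken from the text.
MESSAGES_PER_DAY = 25e9
AVG_MESSAGE_BYTES = 150
PEAK_VOICE_USERS = 1.5e6
OPUS_KBPS_RANGE = (25, 50)   # per-user voice bitrate, low and high estimate
SECONDS_PER_DAY = 86_400

messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
text_tb_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12
text_pb_per_year = text_tb_per_day * 365 / 1000

# Sum of all participant uplinks, in Gbps (kbps * users / 1e6).
voice_uplink_gbps = tuple(PEAK_VOICE_USERS * kbps / 1e6 for kbps in OPUS_KBPS_RANGE)

print(f"{messages_per_sec:,.0f} messages/sec average")
print(f"{text_tb_per_day:.2f} TB/day of text, {text_pb_per_year:.2f} PB/year")
print(f"{voice_uplink_gbps[0]:.1f}-{voice_uplink_gbps[1]:.1f} Gbps voice uplink at peak")
```

The unrounded results (3.75 TB/day, about 1.37 PB/year) sit slightly below the rounded ~4 TB/day and ~1.5 PB/year quoted above, which is the level of precision an interviewer expects.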
High-level design
Four core domains: text channels, voice channels, user/server graph, and connection routing.
Text: clients hold a persistent connection (WebSocket) to a gateway, joined to all their server channels. The gateway authenticates and joins the user to channel rooms in memory. Sent messages route through the gateway to a message-handler service which writes the message to a durable sharded store (sharded by channel_id), then broadcasts to all connected channel members via the gateway. The message store partitioned by channel_id makes channel scrollback (range query on channel_id ORDER BY seq DESC LIMIT 50) a single-shard read.
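A minimal in-memory sketch of the channel-sharded store described above. The shard count, the CRC32 hash, and the names `MessageStore`, `append`, and `scrollback` are assumptions made for illustration; a real deployment would use a distributed database rather than Python lists.

```python
from collections import defaultdict
from zlib import crc32

NUM_SHARDS = 8  # assumption: small fixed shard count for illustration


def shard_for(channel_id: str) -> int:
    """Map a channel to one shard so all of its messages live together."""
    return crc32(channel_id.encode()) % NUM_SHARDS


class MessageStore:
    def __init__(self):
        # One dict per shard: channel_id -> list of (seq, author_id, body).
        self.shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

    def append(self, channel_id, author_id, body):
        """Write a message and return its per-channel sequence number."""
        log = self.shards[shard_for(channel_id)][channel_id]
        seq = len(log) + 1
        log.append((seq, author_id, body))
        return seq

    def scrollback(self, channel_id, limit=50, before_seq=None):
        """Newest-first page; touches exactly one shard."""
        log = self.shards[shard_for(channel_id)][channel_id]
        end = len(log) if before_seq is None else min(before_seq - 1, len(log))
        end = max(0, end)
        start = max(0, end - limit)
        return list(reversed(log[start:end]))
```

Passing the oldest `seq` of one page as `before_seq` fetches the next older page, which is the same access pattern as the `ORDER BY seq DESC LIMIT 50` query.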
Voice: voice channels are not routed through the same message gateway. When a user joins a voice channel, they're handed off to a voice server: a stateful media server that holds an active SFU (Selective Forwarding Unit) session per channel. The voice server receives each participant's audio packets, decodes only enough to know speaker activity, and forwards each packet to every other participant in the same channel. SFU is the standard pattern because each client sends a single uplink regardless of channel size, whereas in a peer-to-peer mesh every client uploads to every other client (O(N²) connections overall). Voice servers are geographically distributed; the gateway picks the voice server that gives the lowest median latency across the joining users.
User/server graph: server membership, roles, and per-channel permissions live in a relational store. Permissions are computed per (user_id, channel_id) and cached. On role changes, cache entries are invalidated for affected users.
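One way to sketch the cache-and-invalidate behavior described above. `PermissionCache` and the `compute` callback are hypothetical names; the callback stands in for whatever query resolves a user's roles against the relational store.

```python
from collections import defaultdict


class PermissionCache:
    def __init__(self, compute):
        # compute(user_id, channel_id) -> iterable of allowed actions.
        # Hypothetical callback representing the relational-store lookup.
        self._compute = compute
        self._cache = {}                   # (user_id, channel_id) -> frozenset
        self._by_user = defaultdict(set)   # user_id -> cached channel_ids

    def can(self, user_id, channel_id, action):
        key = (user_id, channel_id)
        if key not in self._cache:
            self._cache[key] = frozenset(self._compute(user_id, channel_id))
            self._by_user[user_id].add(channel_id)
        return action in self._cache[key]

    def invalidate_users(self, user_ids):
        """Call after a role change; drops every cached entry for those users."""
        for user_id in user_ids:
            for channel_id in self._by_user.pop(user_id, ()):
                self._cache.pop((user_id, channel_id), None)
```

Indexing cached keys by user keeps invalidation proportional to the number of affected users rather than to the size of the whole cache.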
Connection routing: the gateway tier maintains the user_id → connected_gateway_id mapping in an in-memory routing tier. When a message is sent, the routing tier locates each channel member's gateway and forwards. This is identical in shape to Design WhatsApp connection routing — that's the reusable pattern.
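The routing lookup can be sketched as grouping recipients by gateway, so that each gateway receives one forwarded copy together with its list of local recipients. The function name `plan_fanout` and the plain-dict routing table are assumptions for illustration.

```python
from collections import defaultdict


def plan_fanout(member_ids, user_to_gateway):
    """Group online channel members by the gateway holding their connection.

    user_to_gateway: user_id -> gateway_id for connected users only
    (offline users have no entry and are skipped).
    """
    by_gateway = defaultdict(list)
    for user_id in member_ids:
        gateway_id = user_to_gateway.get(user_id)
        if gateway_id is not None:
            by_gateway[gateway_id].append(user_id)
    return dict(by_gateway)


routing = {"alice": "gw-1", "bob": "gw-2", "carol": "gw-1"}
print(plan_fanout(["alice", "bob", "carol", "dave"], routing))
```

Batching per gateway means the message handler sends one request per gateway rather than one per recipient.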
Deep dive — the hard problem
Three deep dives: voice infrastructure, the giant-server fanout problem, and presence.
Voice deep dive: an SFU (Selective Forwarding Unit) is the standard architecture for many-to-many voice. Each participant sends a single uplink stream to the voice server, and the server forwards that stream to every other participant. For a channel of size N, the server handles N inbound and N(N-1) outbound streams. At N=25 participants in a voice channel, that's 25 inbound + 600 outbound streams, easily handled by one server. Discord channels can grow to ~99 voice participants; at that size, the server selects the few loudest speakers (typically 3-5 active at a time) and forwards only those, sending silence frames for the rest. This selection, driven by voice activity detection, is what keeps bandwidth bounded at large channel sizes.
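The stream counts above, and the effect of forwarding only the loudest speakers, can be checked with a short sketch. The function names and the per-user audio-level dictionary are assumptions; the model counts forwarded audio streams only and ignores the silence frames mentioned above.

```python
import heapq


def sfu_streams(n, max_forwarded=None):
    """Return (inbound, outbound) stream counts on the voice server.

    Without a cap every participant hears every other: n * (n - 1) outbound.
    With a cap of k forwarded speakers, each of the k speakers receives k - 1
    streams and each of the n - k others receives k, which sums to k * (n - 1).
    """
    k = n if max_forwarded is None else min(max_forwarded, n)
    return n, k * (n - 1)


def select_loudest(levels, k):
    """Pick the k user ids with the highest audio level, loudest first."""
    return heapq.nlargest(k, levels, key=levels.get)


print(sfu_streams(25))        # (25, 600)
print(sfu_streams(99))        # (99, 9702)
print(sfu_streams(99, 5))     # (99, 490)
```

Capping forwarding at 5 speakers turns outbound growth from quadratic in N into linear in N, which is why the selection step matters at 99 participants.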
Voice servers must be close to participants. The server selection problem: a voice channel with 20 users in Europe and 5 in North America should run on a European server (median latency wins) with cross-region forwarding to the NA participants. Discord runs hundreds of voice servers worldwide; the gateway picks one based on participant geolocation and current load. Mention this geographic-routing problem; it's the interviewer's favorite voice probe.
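A hedged sketch of median-latency server selection with a load cutoff. The region names, the latency figures, and the 0.85 load ceiling are made-up illustrative values, and `pick_voice_region` is a hypothetical helper, not Discord's actual algorithm.

```python
from statistics import median


def pick_voice_region(latency_ms, load, max_load=0.85):
    """Choose the region with the lowest median participant latency.

    latency_ms: region -> list of estimated latencies, one per participant
    load:       region -> current utilization in [0, 1]
    Regions at or above max_load are skipped.
    """
    candidates = {
        region: median(samples)
        for region, samples in latency_ms.items()
        if load.get(region, 0.0) < max_load
    }
    if not candidates:
        raise ValueError("no voice region has spare capacity")
    return min(candidates, key=candidates.get)


# 20 participants in Europe and 5 in North America, as in the example above.
latency = {
    "eu-west": [20] * 20 + [110] * 5,
    "us-east": [110] * 20 + [25] * 5,
}
print(pick_voice_region(latency, {"eu-west": 0.4, "us-east": 0.3}))  # eu-west
```

Using the median rather than the mean keeps a handful of distant participants from pulling the room away from where most users are.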
The second deep dive is the giant-server fanout problem. A small fraction of servers have millions of members (e.g. Midjourney, MEE6 community, large game studios). A single message in a popular text channel could fan out to 500K+ connected members. Naïve fanout, which looks up every channel member and pushes to each one, saturates the gateway tier. The standard solution mirrors Design Twitter's celebrity problem: hybrid push/pull. Small channels get push (the gateway forwards on every send). Large channels (>~10K active members) get pull: clients periodically request 'new messages since seq X' from the message store. Even in a large channel, users actively viewing it still get pushes to their client; passive lurkers see messages on their next poll. The threshold is tunable and reflects the cost of a push to N clients vs. a pull from M clients.
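The push/pull split can be sketched as a per-message delivery decision plus a catch-up query. The names are hypothetical, the threshold is the ~10K figure from the text, and "viewing" is an assumed signal for which users currently have the channel open.

```python
PUSH_THRESHOLD = 10_000  # tunable; the ~10K active-member figure from the text


def plan_delivery(connected, viewing, threshold=PUSH_THRESHOLD):
    """Return (push_set, pull_set) for one message in one channel.

    connected: user ids online in the channel
    viewing:   user ids with the channel open right now (assumed signal)
    """
    connected = set(connected)
    if len(connected) <= threshold:
        return connected, set()            # small channel: push to everyone
    push = connected & set(viewing)        # large channel: push to viewers only
    return push, connected - push          # everyone else polls


def messages_since(log, last_seq):
    """Pull path: messages with seq greater than the client's last seen seq."""
    return [msg for msg in log if msg[0] > last_seq]


push, pull = plan_delivery(range(20), viewing={1, 2, 99}, threshold=10)
print(sorted(push), len(pull))
```

The pull path relies on the same per-channel sequence numbers used for scrollback, so a polling client only needs to remember the last `seq` it saw.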
Third: presence and typing indicators. Per-server presence (who's online) is a hot read. At a server with 10M members, computing 'who's online right now' for the sidebar is expensive. Discord renders 'recent activity' (typing in a channel, just sent a message) rather than full member-list status. Mention this UX-level optimization — it sidesteps an O(N) problem.
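A rough sketch of the 'recent activity' idea: track only users who did something lately, so the cost scales with recent actors rather than total membership. The class name and the five-minute window are assumptions for illustration.

```python
class RecentActivity:
    """Tracks users who recently typed or sent a message in one server."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s      # assumption: five-minute activity window
        self._last_seen = {}          # user_id -> timestamp of last activity

    def touch(self, user_id, now):
        """Record activity (typing, message sent) at time `now` in seconds."""
        self._last_seen[user_id] = now

    def active(self, now):
        """Return recently active users, most recent first; prunes stale ones."""
        cutoff = now - self.window_s
        self._last_seen = {
            user: ts for user, ts in self._last_seen.items() if ts >= cutoff
        }
        return sorted(self._last_seen, key=self._last_seen.get, reverse=True)


tracker = RecentActivity(window_s=300)
tracker.touch("alice", now=0)
tracker.touch("bob", now=200)
print(tracker.active(now=400))   # alice is outside the window by now
```

Memory here is bounded by the number of users active within the window, which stays small even when the server has millions of members.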
Common mistakes
- Routing voice through the same gateway as text — voice needs low-latency SFU servers, not message routers
- Designing voice as peer-to-peer mesh — collapses at >5 participants per channel
- Pushing every text message to every member of a 1M-member server — needs hybrid push/pull
- Treating voice activity detection as optional; at large channel sizes it's the only way to bound bandwidth
- Forgetting geographic routing for voice servers — interviewer pushes 'what about users in different regions'
Likely follow-up questions
- How would you support video calls in addition to voice in the same SFU architecture?
- What changes if a server reaches 50M members (the largest public servers)?
- How would you implement message search across years of channel history?
- How would you handle a large 'announcement-only' channel where 1 admin broadcasts to millions of passive members?
- How would you scale voice to a 1000-person 'stage' or audio-event channel?
Frequently asked questions
- How long is the Design Discord system-design interview?
- 45-60 minutes. Discord's loop covers messaging + voice + permissions explicitly. Source: Glassdoor Discord 2022-2024 reports plus Levels.fyi senior-engineer interview writeups.
- Do I need to know WebRTC in detail for the voice part?
- Knowing the SFU concept is enough. Naming WebRTC as the browser/client-side protocol is fine. Drawing ICE/STUN/TURN negotiation is bonus signal but not required unless this is a media-team-specific interview.
- Is Design Discord harder than Design WhatsApp?
- Slightly different shape. WhatsApp is heavy on persistent connections and e2e encryption; Discord adds voice infrastructure. Most interviewers consider Discord slightly harder because voice + large-server fanout both need to be solved within the time box.
- Should I cover screen share or video?
- Briefly in the deep dive — same SFU architecture as voice, higher bandwidth per stream. Spending 10 minutes on video specifics steals time from the core voice + text design that interviewers grade hardest.