Design Spotify — System Design Interview Guide
Design Spotify is a system-design interview that asks you to build a music streaming platform: hundreds of millions of users browse a catalog of ~100M tracks, stream audio with sub-second start time, and get personalized recommendations. The hard part is low-latency audio delivery plus a recommendation pipeline at catalog scale.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Spotify
- Apple
- Amazon
- Meta
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Search and browse a catalog of ~100M tracks by title, artist, album, or playlist
- Stream a track on demand with adaptive bitrate (quality adjusts to network)
- Create, share, and follow playlists; collaborative playlists where multiple users add tracks
- Personalized recommendations (Discover Weekly, Daily Mix, Release Radar style feeds)
- Offline downloads for premium users (track encrypted on device, expires on subscription end)
Non-functional requirements
- Time to first byte: <200ms p99 from track tap to the first audio byte arriving at the client
- Availability: 99.99% for playback; degraded recommendations acceptable, broken playback is not
- Scale: ~600M MAU, ~250M paid, ~100M tracks in catalog, peak ~50M concurrent streams
- Rebuffer rate: <1% of playback time across all sessions
Capacity estimation
Anchor on Spotify's 2024 public scale: ~640M monthly active users, ~250M premium subscribers, ~100M tracks in catalog, ~10B playlists, ~6M podcast titles. Daily stream count: ~5B track plays/day = ~58K streams started/sec on average (round to ~60K), peak ~150K. An average track runs ~3-4 minutes; at ~128 kbps for the free tier and ~320 kbps for premium, that is ~3-10 MB per stream.
Storage for the audio catalog: 100M tracks × ~10 MB per track (one high-bitrate encoding of a 3-4 minute track) = ~1 PB raw. Add the lower-bitrate variants and segment packaging (a 96/160/320 kbps ladder, HLS/DASH segments) and the realized catalog is ~3-5 PB in object storage, served via a globally distributed edge cache. Bandwidth at peak: 50M concurrent streams × 128 kbps average = ~6.4 Tbps egress. This is the dominant infrastructure cost line and the reason every streaming product builds heavily on edge caching.
Metadata is also large: 100M tracks × ~2 KB = ~200 GB of track metadata, which is small next to the playlist graph. The ~10B playlists and their track edges dominate and bring hot metadata to ~25 TB. The search index over track, artist, and album text is another ~500 GB of inverted index.
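The arithmetic in this section can be reproduced in a few lines. This is a minimal sketch that only restates the estimates above as variables; the variable names are illustrative, and every input is a rough figure, not a measured value.

```python
# Back-of-envelope capacity figures. Every input is an estimate from the text above.
SECONDS_PER_DAY = 86_400

plays_per_day = 5e9
tracks = 100e6
bytes_per_track = 10e6           # ~10 MB per track
peak_concurrent_streams = 50e6
avg_bitrate_bps = 128e3          # 128 kbps
metadata_bytes_per_track = 2e3   # ~2 KB


def stream_size_mb(minutes, kbps):
    """Size of one stream in MB: seconds x bits per second / 8 bits per byte."""
    return minutes * 60 * kbps * 1e3 / 8 / 1e6


avg_stream_starts_per_sec = plays_per_day / SECONDS_PER_DAY
raw_catalog_pb = tracks * bytes_per_track / 1e15
peak_egress_tbps = peak_concurrent_streams * avg_bitrate_bps / 1e12
track_metadata_gb = tracks * metadata_bytes_per_track / 1e9

if __name__ == "__main__":
    print(f"avg stream starts/sec: {avg_stream_starts_per_sec:,.0f}")
    print(f"raw catalog: {raw_catalog_pb:.1f} PB")
    print(f"peak egress: {peak_egress_tbps:.1f} Tbps")
    print(f"track metadata: {track_metadata_gb:.0f} GB")
    print(f"3 min @ 128 kbps: {stream_size_mb(3, 128):.2f} MB")
    print(f"4 min @ 320 kbps: {stream_size_mb(4, 320):.2f} MB")
```

In an interview, saying the units out loud (bits vs bytes, per second vs per day) matters more than the exact figures.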
High-level design
Separate three axes: catalog metadata, audio bytes, and personalization. Each scales differently and lives in different stores.
The catalog metadata service holds the track/artist/album/playlist graph in a sharded relational store, fronted by an in-memory cache for hot lookups. Search runs on an inverted-index cluster fed by a change-data-capture stream off the metadata store. Browse and recommendation requests hit the metadata service; it returns track IDs and a signed URL for the audio.
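One way to picture the signed URL that the metadata service hands back is an HMAC over the chunk path plus an expiry timestamp. The sketch below is an assumption for illustration only: the `SECRET` key, the `expires`/`sig` query parameters, and the path layout are hypothetical and do not describe any particular CDN's signing scheme.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by the metadata service and the edge


def _signature(path, expires):
    payload = f"{path}?expires={expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(path, ttl_seconds, now=None):
    """Metadata service side: attach an expiry and an HMAC signature to a chunk path."""
    issued_at = int(now if now is not None else time.time())
    expires = issued_at + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url, now=None):
    """Edge side: reject URLs that are expired, malformed, or tampered with."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    # Constant-time comparison avoids leaking signature bytes through timing.
    return hmac.compare_digest(sig, _signature(parsed.path, expires))
```

The point of the design is that the edge can authorize a request with no call back to the metadata service: it only needs the shared key and a clock.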
Audio bytes live in object storage as pre-encoded segments. Each track is transcoded once per bitrate into a set of small (~2-10 sec) chunks suitable for HTTP adaptive streaming. Clients fetch chunks over HTTP from an edge cache; the origin object store is rarely hit because the cache hit rate on hot tracks is ~99%+. The client decides which bitrate to fetch based on observed bandwidth and adjusts mid-stream — this is the standard adaptive-streaming pattern.
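The client-side bitrate decision described above can be sketched as a throughput estimate plus a ladder lookup. This is a simplified illustration, not how any real player works: the 96/160/320 kbps ladder comes from the capacity section, while the smoothing factor, the 0.8 safety margin, and the class and function names are assumptions.

```python
BITRATE_LADDER_KBPS = (96, 160, 320)  # encodings named in the capacity section


class ThroughputEstimator:
    """Exponentially weighted moving average of per-segment download throughput."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha          # weight given to the newest sample
        self.estimate_kbps = None

    def observe(self, segment_bytes, download_seconds):
        sample_kbps = segment_bytes * 8 / 1000 / download_seconds
        if self.estimate_kbps is None:
            self.estimate_kbps = sample_kbps
        else:
            self.estimate_kbps = (
                self.alpha * sample_kbps + (1 - self.alpha) * self.estimate_kbps
            )
        return self.estimate_kbps


def pick_bitrate(throughput_kbps, ladder=BITRATE_LADDER_KBPS, safety=0.8):
    """Highest bitrate that fits within a safety fraction of measured throughput."""
    budget = throughput_kbps * safety
    fitting = [b for b in sorted(ladder) if b <= budget]
    # If nothing fits, fall back to the lowest rung rather than stalling.
    return fitting[-1] if fitting else min(ladder)
```

The client would call `observe` after each segment download and `pick_bitrate` before requesting the next one, which is what lets quality change mid-stream.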
Personalization is its own offline pipeline: user listening events stream into an event log, batch jobs compute user embeddings and track embeddings (collaborative filtering), recommendations are precomputed per user and written to a key-value store keyed by user_id. The home feed and Discover Weekly are reads against this precomputed key-value store — fast, no live ML serving on the hot path. Editorial playlists and 'New Releases' are managed through a CMS-fed table in the metadata service. Offline downloads (premium) are signed-URL fetches of the same audio chunks with a DRM key that expires; the client respects the expiry.
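The read path for the precomputed feed can be sketched as a single key lookup with a fallback. Here a plain dict stands in for the key-value store; the `feed:{user_id}` key format, the `FeedStore` class, and the editorial fallback IDs are hypothetical names for illustration.

```python
EDITORIAL_FALLBACK = ["track:editorial-1", "track:editorial-2"]  # hypothetical IDs


class FeedStore:
    """In-memory stand-in for the per-user key-value store."""

    def __init__(self):
        self._data = {}

    def write_batch(self, recs_by_user):
        # Called by the offline batch job; overwrites each user's previous feed.
        for user_id, track_ids in recs_by_user.items():
            self._data[f"feed:{user_id}"] = list(track_ids)

    def read_feed(self, user_id, limit=30):
        # Hot path: one key lookup, no model inference.
        tracks = self._data.get(f"feed:{user_id}")
        if not tracks:
            # New or not-yet-processed user: serve editorial content instead.
            return EDITORIAL_FALLBACK[:limit]
        return tracks[:limit]
```

Falling back to editorial content when a key is missing fits the availability requirement that degraded recommendations are acceptable while broken playback is not.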
Deep dive — the hard problem
The deep dive is the recommendation pipeline at catalog scale, plus playback start-time latency.
Recommendations: a naïve approach would compute 'top N tracks for user X' live on every home-page load — infeasible at 600M users and 100M items. The standard solution is offline batch + precomputed reads. Collaborative filtering produces an embedding for each user and each track in the same vector space (~64-256 dims). Discover Weekly is the top-K nearest tracks to the user vector, minus tracks the user already played, refreshed once a week. Daily Mix segments the user's listening history into K clusters and recommends from each cluster. Release Radar joins new releases (last 14 days) with the artists the user follows. All of these are precomputed by batch jobs and written to a per-user key-value store, so the home feed reads a single key.
The tradeoff is freshness vs cost. Live recommendation (ML serving on every request) is fresh-to-the-second but expensive at 600M users. Batch recomputation is cheap but stale by hours/days. Production systems blend: precomputed base feed + a small live re-ranker that boosts very recent listens. Mention this hybrid as your answer.
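The hybrid can be sketched as a cheap re-ranking pass over the precomputed feed. This is one possible shape under stated assumptions: the base feed carries a batch score per track, "very recent listens" is represented as a set of artist IDs from the current session, and the flat `boost` value is arbitrary.

```python
def rerank(base_feed, track_artist, recent_artists, boost=0.2):
    """Reorder a precomputed feed, boosting tracks by recently played artists.

    base_feed: list of (track_id, batch_score) pairs from the offline pipeline.
    track_artist: dict mapping track_id to artist_id.
    recent_artists: set of artist IDs the user listened to very recently.
    """

    def live_score(item):
        track_id, batch_score = item
        bonus = boost if track_artist.get(track_id) in recent_artists else 0.0
        return batch_score + bonus

    # sorted() is stable, so tracks with equal scores keep their batch order.
    ranked = sorted(base_feed, key=live_score, reverse=True)
    return [track_id for track_id, _ in ranked]
```

The live step touches only the few dozen items already in the feed, so its cost per request stays small compared with scoring the whole catalog.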
Playback start-time: 200ms first-byte is hard because the network round-trip alone can be 50-100ms on mobile. Three techniques compound. First, edge caching: every popular track is cached at the edge node closest to the user, so the round-trip is to the edge, not to origin. Second, segment prefetching: when the user hovers over a track or queues the next one, the client starts fetching the first segment before the tap. Third, optimistic transcoding, which is better described as a low-bitrate fast start: serve a low-bitrate first segment immediately while the high-bitrate segment downloads in parallel, so playback starts on the cheap chunk and seamlessly upgrades. Nothing is transcoded at request time; both segments are already pre-encoded. Mention all three and their tradeoffs (prefetch costs bandwidth on hover; the fast start adds client complexity).
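The second and third techniques (prefetching the first segment, then starting on a low-bitrate segment) can be sketched from the client's side. The `fetch_segment` callable, the class name, and the small cache limit are hypothetical; a real client would also handle cancellation and metered connections.

```python
from collections import OrderedDict

LADDER_KBPS = (96, 160, 320)


def startup_plan(target_kbps, total_segments, fast_start_segments=1, ladder=LADDER_KBPS):
    """Return (segment_index, bitrate_kbps) pairs: lowest bitrate first, then the target."""
    low = min(ladder)
    return [
        (i, low if i < fast_start_segments else target_kbps)
        for i in range(total_segments)
    ]


class FirstSegmentPrefetcher:
    """Keeps the low-bitrate first segment of a few likely-next tracks in memory."""

    def __init__(self, fetch_segment, max_tracks=3):
        # fetch_segment is a hypothetical callable: (track_id, index, kbps) -> bytes
        self._fetch = fetch_segment
        self._max_tracks = max_tracks
        self._cache = OrderedDict()

    def prefetch(self, track_id):
        """Call on hover or when a track is queued; repeated calls are no-ops."""
        if track_id in self._cache:
            return
        self._cache[track_id] = self._fetch(track_id, 0, min(LADDER_KBPS))
        while len(self._cache) > self._max_tracks:
            self._cache.popitem(last=False)  # evict the oldest prefetch

    def first_segment(self, track_id):
        """Call on tap; returns (bytes, was_prefetched)."""
        cached = self._cache.pop(track_id, None)
        if cached is not None:
            return cached, True
        return self._fetch(track_id, 0, min(LADDER_KBPS)), False
```

Capping the cache at a few tracks is the bandwidth tradeoff in miniature: every prefetched segment that is never played is wasted egress.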
Last hard problem: the celebrity-artist hot-spot. When a new Taylor Swift album drops, 50M users open the same album at the same time. The catalog metadata for that album becomes a hot key; the audio chunks become hot objects. The metadata side is fixed by the in-memory cache (one backing-store read per cache node, not one per user). The audio side is fixed by edge caching with consistent hashing on the chunk URL: the same chunk always goes to the same edge cluster, which holds it in memory after the first fetch. Discuss the hot-key problem explicitly; interviewers reliably ask 'what about a viral track'.
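Consistent hashing on the chunk URL can be sketched as a hash ring with virtual nodes. This is a minimal illustration, assuming SHA-256 as the hash and 100 virtual nodes per edge cluster; the node names and URL layout are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Maps chunk URLs to edge clusters using a ring of virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at many points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, chunk_url):
        """First ring point clockwise from the URL's hash, wrapping at the end."""
        idx = bisect.bisect_right(self._points, self._hash(chunk_url))
        return self._ring[idx % len(self._ring)][1]
```

The property worth stating in the interview is that adding an edge cluster only moves the chunks that now belong to the new cluster; every other chunk stays where it was and its cache entry stays warm.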
Common mistakes
- Streaming audio through application servers instead of directly from edge caches — burns application-tier CPU and bandwidth on the hottest serving path
- Live ML serving on every home-page load — doesn't scale to 600M users; missing the precomputed-feed pattern
- Treating the catalog as one giant table instead of separating metadata from audio bytes
- Ignoring adaptive bitrate — assuming all users have constant bandwidth produces an unlistenable mobile experience
- Forgetting offline downloads + DRM — premium users expect tracks to keep working in airplane mode and to expire on cancellation
Likely follow-up questions
- How would you support real-time collaborative playlists where multiple users can add tracks simultaneously?
- What changes if you add live audio (Spotify-Live-style real-time DJ broadcasts)?
- How would you detect and prevent fraud — bots streaming a track 24/7 to inflate royalty payouts?
- How would you size the edge cache footprint to keep hit rate above 99% on the top 10% of catalog?
- How would you build a 'Year in Review' feature that summarizes a user's listening across 12 months?
Frequently asked questions
- How long is a Design Spotify system-design round?
- 45-60 minutes is the norm. Spotify's own loop runs ~60 minutes and explicitly expects catalog scale, audio delivery, and recommendation pipeline coverage. Source: Glassdoor Spotify 2022-2024 reports.
- Do I need to know HLS/DASH specifically for Design Spotify?
- Naming 'HTTP adaptive streaming with small segments' is enough. Knowing HLS vs DASH manifest differences is bonus signal but not required. The interviewer wants to hear 'segmented audio, adaptive bitrate, fetched over HTTP from an edge cache'.
- Is Design Spotify easier than Design Netflix?
- Slightly. The audio bytes are 50-100x smaller than video, so the bandwidth and storage math is gentler. The recommendation pipeline is comparable. Netflix adds encoding-ladder + DRM-key-rotation complexity that Spotify can mostly skip.
- Should I cover the recommendation algorithm in detail?
- Mention collaborative filtering + embedding-based nearest-neighbor + offline batch recompute. Drawing the actual ALS or matrix-factorization math is overkill in 45 minutes. The signal is 'I know it's batch-precomputed and read live as a key-value lookup,' not 'I can derive the gradient.'