Design YouTube — System Design Interview Guide
Design YouTube is a system-design interview that asks you to build a video-sharing platform: anyone uploads, anyone watches, and the recommendation feed surfaces relevant content from a 500-hour-per-minute upload firehose. The hard parts are upload ingest, multi-bitrate encoding at scale, and recommendations.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Meta
- Amazon
- Netflix
- TikTok
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Upload a video of arbitrary length (seconds to hours)
- Watch a video at adaptive bitrates, with seek and playback controls
- Search videos by title, description, tags, and channel
- Get personalized recommendations on the home page and after-video
- Comment, like/dislike, subscribe to a channel
Non-functional requirements
- Playback startup: <2 seconds from tap-to-play (p95)
- Upload throughput: 500 hours of video uploaded per minute globally
- Watch concurrency: ~200M concurrent viewers at peak
- Availability: 99.99%; degraded quality acceptable, full outage is not
Capacity estimation
Public scale (2023): ~2.7B monthly users, ~500 hours of video uploaded per minute = ~30K hours/hour = ~700K hours/day of new content. At average bitrate ~2 Mbps for the raw upload, daily upload byte volume is 700K × 3600 × 2 Mb / 8 = ~630 TB/day of original uploads. Each upload is encoded into ~30 variants (resolutions × codecs × audio tracks), inflating storage by ~10× post-encoding to ~6 PB/day of encoded artifacts.
Watch egress: ~1B hours of video watched/day. At blended ~3 Mbps average bitrate, egress is 1B × 3600 × 3 / 8 ≈ 1.3 EB/day. Peak concurrent viewers ~200M; at 3 Mbps average, peak egress is ~600 Tbps — only achievable via a global CDN with edge presence in most major ISPs.
Metadata: ~10B total videos × ~2 KB metadata = ~20 TB; modest compared to media bytes. Comments: 1B videos × avg ~10 comments × ~200 bytes = ~2 TB of comment text, and growing (the same averages applied to all ~10B videos would give ~20 TB).
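The arithmetic above can be reproduced with a short script. This is a sketch that uses only the estimates quoted in this section; the constant names are illustrative. Without the rounding of 720K hours/day down to ~700K, the upload volume comes out slightly higher than the figure in the text (about 648 TB/day rather than ~630 TB/day).

```python
# Back-of-envelope capacity figures. Every input is an estimate from the text.
UPLOAD_HOURS_PER_MIN = 500
UPLOAD_BITRATE_MBPS = 2          # assumed average bitrate of raw uploads
ENCODING_INFLATION = 10          # encoded ladder is ~10x the original bytes
WATCH_HOURS_PER_DAY = 1e9
WATCH_BITRATE_MBPS = 3           # assumed blended playback bitrate
PEAK_CONCURRENT_VIEWERS = 200e6
SECONDS_PER_HOUR = 3600


def megabits_to_bytes(megabits: float) -> float:
    """Convert megabits to bytes (1 Mb = 1e6 bits, 8 bits per byte)."""
    return megabits * 1e6 / 8


upload_hours_per_day = UPLOAD_HOURS_PER_MIN * 60 * 24
upload_bytes_per_day = megabits_to_bytes(
    upload_hours_per_day * SECONDS_PER_HOUR * UPLOAD_BITRATE_MBPS
)
encoded_bytes_per_day = upload_bytes_per_day * ENCODING_INFLATION
egress_bytes_per_day = megabits_to_bytes(
    WATCH_HOURS_PER_DAY * SECONDS_PER_HOUR * WATCH_BITRATE_MBPS
)
peak_egress_bps = PEAK_CONCURRENT_VIEWERS * WATCH_BITRATE_MBPS * 1e6

print(f"new content:    {upload_hours_per_day:,} hours/day")
print(f"raw uploads:    {upload_bytes_per_day / 1e12:.0f} TB/day")
print(f"encoded output: {encoded_bytes_per_day / 1e15:.2f} PB/day")
print(f"watch egress:   {egress_bytes_per_day / 1e18:.2f} EB/day")
print(f"peak egress:    {peak_egress_bps / 1e12:.0f} Tbps")
```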
High-level design
Three planes: control (metadata, search, recommendations), ingest (upload and encode), data (delivery via CDN).
Upload + ingest: the client requests a resumable upload URL; the upload service writes the original video to object storage as it arrives, in chunks (so a network interruption resumes from the last received chunk, not from byte zero). On upload completion, an event drops into an encoding queue. A fleet of encoding workers consumes the queue and transcodes each video into the standard ladder: ~6 resolutions (144p, 240p, 360p, 480p, 720p, 1080p, plus 4K for sources that warrant it) × ~3 codecs (H.264, H.265, AV1) plus separate audio tracks. Each variant is segmented into ~6-second chunks and uploaded to object storage; manifests (DASH/HLS) describe the available chunks. The video is then marked 'published' in metadata. Encoding is the heaviest compute load in the system: 500 hours/minute uploaded means hundreds of thousands of compute cores doing nothing but transcoding 24/7.
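To make the fan-out concrete, here is a minimal sketch of how one published upload expands into encoding work. The resolution and codec lists and the 6-second segment length come from the paragraph above; the function names and the `2160p` label for 4K are assumptions for illustration, and audio tracks are left out.

```python
from itertools import product
from math import ceil

RESOLUTIONS = ["144p", "240p", "360p", "480p", "720p", "1080p"]
CODECS = ["h264", "h265", "av1"]
SEGMENT_SECONDS = 6


def video_variants(include_4k: bool = False) -> list[tuple[str, str]]:
    """Every (resolution, codec) pair the encoder fleet must produce."""
    resolutions = RESOLUTIONS + (["2160p"] if include_4k else [])
    return list(product(resolutions, CODECS))


def segments_per_variant(duration_seconds: float) -> int:
    """Number of ~6-second chunks; the last chunk may be shorter."""
    return ceil(duration_seconds / SEGMENT_SECONDS)


def total_segments(duration_seconds: float, include_4k: bool = False) -> int:
    """Total chunk objects written to object storage for one upload."""
    return len(video_variants(include_4k)) * segments_per_variant(duration_seconds)


# A 10-minute upload without 4K: 18 video variants x 100 segments each.
print(len(video_variants()), segments_per_variant(600), total_segments(600))
```

The multiplication is the point: a single upload becomes thousands of small objects, which is why the manifest (DASH/HLS) is needed to tell the player what exists.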
Watch path: the client hits the watch service for video metadata and the manifest URL. The manifest is fetched from a CDN edge cache; chunks are then fetched as the player needs them, with adaptive bitrate selection per chunk. The CDN is multi-tier: edge caches inside ISPs hold the most popular chunks, regional caches hold the long tail, and origin object storage holds the cold archive. A popular video is pushed to edge caches within minutes of becoming popular; a niche video is served from regional cache or origin.
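Per-chunk adaptive bitrate selection can be sketched as a simple throughput rule: pick the highest rendition whose bitrate fits within a safety fraction of the measured throughput. The bitrates in `LADDER_KBPS` and the 0.8 safety factor are illustrative assumptions, not values from this guide, and real players typically also weigh buffer occupancy.

```python
# Hypothetical per-rendition video bitrates in kbps (illustrative values only).
LADDER_KBPS = {
    "144p": 100,
    "240p": 300,
    "360p": 700,
    "480p": 1200,
    "720p": 2500,
    "1080p": 5000,
}


def choose_rendition(throughput_kbps: float, safety: float = 0.8) -> str:
    """Pick the highest rendition that fits within safety * measured throughput.

    Falls back to the lowest rendition when nothing fits, so playback
    degrades in quality instead of stalling.
    """
    budget = throughput_kbps * safety
    by_bitrate = sorted(LADDER_KBPS.items(), key=lambda item: item[1])
    best = by_bitrate[0][0]
    for name, kbps in by_bitrate:
        if kbps <= budget:
            best = name
    return best


# The player re-runs this before each chunk request using its latest estimate.
for measured in (50, 4000, 10000):
    print(measured, choose_rendition(measured))
```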
Control plane: search uses an inverted index over title, description, tags, and channel; ranking blends text relevance with engagement signals. Recommendations run as a two-stage pipeline (candidate generation + ranking; see the deep dive). Comments and likes live in sharded relational stores keyed by video_id (so all comments for a video live on one shard).
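Sharding by video_id can be sketched as a stable hash of the key modulo the shard count. The shard count of 64 and the function name are hypothetical. The sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same video to different shards on different servers.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(video_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a video_id to a shard so all of its comments land together."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every comment write and read for the same video resolves to the same shard.
print(shard_for("video-123"), shard_for("video-123"))
```

Plain modulo routing reshuffles most keys when the shard count changes; consistent hashing or a lookup table are common ways around that, and a very popular video can still make its shard hot.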
Deep dive — the hard problem
Three deep dives: ingest pipeline elasticity, CDN tier organization, and the recommendation pipeline.
Ingest elasticity: upload rate spikes 10× during regional prime times and after major events. The encoding queue is the elastic component: it absorbs bursts by accumulating depth, while the worker fleet auto-scales on queue length. Tradeoff: a longer queue means longer time-to-published; a shorter queue means more idle workers. The production-style answer is to prioritize encoding by predicted view count. A creator with 100M+ subscribers gets their upload encoded ahead of an anonymous user, because head-of-line latency on a high-demand video directly affects visible product quality. Present this priority-aware queue as your answer to the head-of-line problem.
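A minimal sketch of the priority-aware queue and the queue-length scaling rule, assuming an in-memory heap stands in for a real distributed queue. The class name, the `predicted_views` input, and the scaling parameters are all hypothetical.

```python
import heapq
import itertools
import math


class EncodingQueue:
    """Pops the job with the highest predicted view count first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order among equals

    def push(self, video_id: str, predicted_views: int) -> None:
        # heapq is a min-heap, so negate the priority.
        heapq.heappush(self._heap, (-predicted_views, next(self._seq), video_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def desired_workers(queue_depth: int, jobs_per_worker_per_min: float,
                    target_drain_minutes: float,
                    min_workers: int, max_workers: int) -> int:
    """Workers needed to drain the current backlog within the target time."""
    needed = math.ceil(queue_depth / (jobs_per_worker_per_min * target_drain_minutes))
    return max(min_workers, min(max_workers, needed))


queue = EncodingQueue()
queue.push("anonymous-upload", predicted_views=100)
queue.push("mega-creator-upload", predicted_views=10_000_000)
queue.push("mid-channel-upload", predicted_views=50_000)
print([queue.pop() for _ in range(3)])
```

Strict priority can starve low-priority uploads during a sustained burst; aging the priority with time spent in the queue is one way to bound that wait.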
CDN tier organization: view popularity is severely skewed (a Pareto-style distribution), with the top 0.01% of videos serving ~30% of view hours. Caching strategy: edge caches near viewers hold this 'hot 0.01%' explicitly; regional caches hold the next 1%; cold object-storage origin serves the rest. Eviction is LFU with a recency boost so a newly trending video can move up the tiers within hours. Predicting hotness for prewarming is partly engagement-signal-driven (likes/min, comment velocity) and partly creator-driven (channel subscriber count, viral score). Custom CDN appliances inside ISPs (same model as Design Netflix) keep last-mile bandwidth manageable; in regions without ISP-embedded appliances, third-party CDNs absorb overflow.
Recommendations: too rich to design end-to-end in 45 minutes, but the two-stage pattern is the expected answer. Stage 1 (candidate generation): from a corpus of billions of videos, narrow to ~500 candidates relevant to a user. This uses cheap heuristics (subscribed channels, similar-watched videos, recent uploads, trending in the user's region) plus learned embedding retrieval (cosine similarity between a user vector and video vectors, computed offline and served from a vector index). Stage 2 (ranking): score the 500 candidates with a heavyweight model that maps (user_features, video_features, context) to a predicted score combining expected watch time with feedback signals. Sort and return the top 20. Discuss the offline-train / online-serve split, mention the heavyweight-model serving latency budget (~50ms), and acknowledge the cold-start problem (new users, new videos).
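The two-stage shape can be sketched in a few lines. This is a toy under stated assumptions: brute-force cosine similarity stands in for the vector index, and a lookup table of predicted watch seconds stands in for the heavyweight ranking model. All names and data are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_candidates(user_vec: list[float],
                        video_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Stage 1: cheap retrieval of the k videos nearest to the user vector."""
    by_similarity = sorted(video_vecs,
                           key=lambda vid: cosine(user_vec, video_vecs[vid]),
                           reverse=True)
    return by_similarity[:k]


def rank(candidates: list[str], score_fn, top_n: int) -> list[str]:
    """Stage 2: apply the expensive scorer only to the surviving candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]


video_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
predicted_watch_seconds = {"a": 30, "b": 120, "c": 5, "d": 999}  # stand-in model

candidates = generate_candidates([1.0, 0.0], video_vecs, k=3)
feed = rank(candidates, predicted_watch_seconds.get, top_n=2)
print(candidates, feed)
```

Video `d` has the highest stand-in score but never reaches the ranker because stage 1 filtered it out. That is the cost of the split: the ranker can only be as good as the candidate set it receives.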
Finally, idempotent upload. A 4 GB upload over a flaky mobile connection might retry the same chunk twice; the upload service must dedupe via a chunk-hash + upload-session-id. Same chunk hash + same session = treat as already received. This is small but interviewers love it as a robustness signal.
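The dedupe rule can be sketched as a set of keys the upload service has already accepted. The paragraph above keys on session and chunk hash; this sketch also includes the chunk's byte offset, which is an added assumption so that two identical chunks at different positions in the same file are not collapsed. The class and method names are hypothetical, and an in-memory set stands in for durable state.

```python
import hashlib


class ChunkReceiver:
    """Accepts each (session, offset, content-hash) at most once."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, int, str]] = set()

    def receive(self, session_id: str, offset: int, data: bytes) -> bool:
        """Return True if the chunk was newly stored, False if it was a retry."""
        key = (session_id, offset, hashlib.sha256(data).hexdigest())
        if key in self._seen:
            return False  # duplicate delivery: acknowledge but do not rewrite
        self._seen.add(key)
        # A real service would write the bytes to object storage here.
        return True


receiver = ChunkReceiver()
chunk = b"\x00" * 1024
print(receiver.receive("session-1", 0, chunk))  # first delivery
print(receiver.receive("session-1", 0, chunk))  # client retry of the same chunk
```

Either way the client gets a success response, so a retry after a lost acknowledgement is harmless.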
Common mistakes
- Designing single-resolution playback: modern viewers expect adaptive bitrate, and omitting it is a negative hiring signal
- Encoding inline during upload (synchronous) instead of async on a queue — kills upload throughput and elasticity
- Treating the recommendation system as one monolithic model — the two-stage candidate-then-rank split is the expected answer
- Forgetting CDN tier structure: at exabyte-per-day egress, a single-tier CDN melts
- Conflating control-plane and data-plane infrastructure: their loads differ by roughly 1000×, so they should be designed and scaled separately
Likely follow-up questions
- How would your design support a live-stream concert watched by 50M concurrent users?
- What changes if you have to add real-time chat during video playback?
- How would you implement watch-time analytics that scale to 1B hours/day?
- How would you support an offline-download feature for mobile users on metered data?
- How would you detect and remove copyrighted content uploaded by users?
Frequently asked questions
- Is Design YouTube the same as Design Netflix?
- Related but distinct. Netflix is a curated catalog with pre-positioned content (~20K titles); YouTube is open-upload with ~700K hours of new content per day. Netflix optimizes for predictability; YouTube optimizes for ingest scale and discovery.
- How long is the Design YouTube interview at Google?
- 60 minutes is standard; senior rounds (L6+) push to 75 minutes and expect detailed coverage of recommendations and CDN tiers. Source: Glassdoor Google L5/L6 reports 2023–2024.
- Do I need to design the recommendation algorithm in detail?
- The two-stage candidate + ranking split is the expected answer. Drilling into specific ranking model architectures is bonus signal but not required unless interviewing for the recommendation team.
- What's the single most important deep dive in Design YouTube?
- The encoding pipeline. Upload-to-published latency is a make-or-break product property, and the priority-aware queue + auto-scaling worker fleet is the cleanest 'I understand elasticity' signal.