Design a Recommendation System — System Design Interview Guide
Design a Recommendation System is a system-design interview question that asks you to build the engine behind a personalized feed: products on a marketplace, videos on a streaming service, posts on a social platform. The hard part is the two-stage architecture: candidate generation across billions of items, then ranking the top thousand with a heavier model, all under ~100ms p99 per user request.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Meta
- Amazon
- TikTok
- Netflix
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Return a personalized list of N items (videos, products, posts) for a given user and surface
- Update recommendations as users interact (clicks, watches, likes, skips) within a session
- Cold-start new users (no interaction history) and new items (no engagement signal)
- Honor business rules: blocklists, diversity constraints, sponsored slots, regional availability
- Surface trending or breakout items even if they're outside the user's typical interest cluster
- Provide explainability: why was this item recommended (debugging, regulatory requirements)
Non-functional requirements
- Latency: <100ms p99 from user-feed request to ranked list returned
- Throughput: ~1M+ feed requests/sec at peak across the platform
- Freshness: new user interaction reflected in next request within ~seconds (session-level), new items recommendable within ~minutes of upload
- Catalog size: ~1B+ items in candidate pool
- User base: ~1B+ active users
- Quality: recommendation engagement rate (click-through, watch-completion) as the north-star metric, A/B testable continuously
Capacity estimation
Public-scale assumption: ~1B daily active users across a major platform, ~5 feed requests per user per day (one per session entry), so ~5B feed requests/day = ~60K/sec average, ~1M/sec at peak. Each request returns 50-200 items.
Catalog: ~1B items across video, products, or posts. Per-item features (embedding vectors of dimension ~200, plus metadata like creator, language, age) at ~2 KB each = ~2 TB total catalog feature store. The item embeddings are loaded into an in-memory index for retrieval.
User feature store: ~1B users × ~5 KB per user (embedding + recent-interaction history + demographics + computed features) = ~5 TB. Sharded by user_id in a key-value store.
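The request-rate and storage figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in this section; the variable names are illustrative, and decimal units (1 KB = 1000 bytes) are an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Inputs taken from the estimates above.
daily_active_users = 1_000_000_000
feed_requests_per_user = 5
catalog_items = 1_000_000_000
bytes_per_item = 2_000   # ~2 KB, assuming decimal units (1 KB = 1000 bytes)
users = 1_000_000_000
bytes_per_user = 5_000   # ~5 KB

requests_per_day = daily_active_users * feed_requests_per_user
average_qps = requests_per_day / SECONDS_PER_DAY
catalog_store_tb = catalog_items * bytes_per_item / 1e12
user_store_tb = users * bytes_per_user / 1e12

print(f"feed requests/day: {requests_per_day:,}")
print(f"average requests/sec: {average_qps:,.0f}")
print(f"catalog feature store: {catalog_store_tb:.1f} TB")
print(f"user feature store: {user_store_tb:.1f} TB")
```

The average works out to roughly 58K requests/sec, which the estimate rounds to ~60K. The ~1M/sec peak is an input assumption rather than something this arithmetic derives.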
Candidate generation: each request needs to surface ~1000 candidates from 1B items in <30ms. Brute-force scoring of 1B items per request is infeasible: finishing a single request inside 30ms would take ~33B item scores/sec, and at ~60K requests/sec the fleet would need ~6×10^13 item scores/sec. The architecture instead uses approximate-nearest-neighbor (ANN) indexes (HNSW, IVF, ScaNN-style) built over the item embeddings: convert user → query vector, retrieve the top-K most similar item vectors in ~10ms.
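To make the ANN idea concrete, here is a toy IVF-style (inverted file) index in pure Python. It is a sketch under stated assumptions, not how HNSW or ScaNN work internally: centroids are simply sampled items rather than trained with k-means, vectors are tiny, and the names `build_ivf_index`, `ann_search`, and `nprobe` are chosen for illustration. The point is only that a query scans a few partitions instead of the whole catalog.

```python
import heapq
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def build_ivf_index(item_vectors, num_lists, seed=0):
    """Partition items by their closest centroid (centroids are sampled items)."""
    rng = random.Random(seed)
    centroid_ids = rng.sample(sorted(item_vectors), num_lists)
    centroids = [item_vectors[i] for i in centroid_ids]
    lists = [[] for _ in centroids]
    for item_id, vec in item_vectors.items():
        best = max(range(num_lists), key=lambda c: dot(vec, centroids[c]))
        lists[best].append(item_id)
    return centroids, lists

def ann_search(user_vec, item_vectors, centroids, lists, k, nprobe):
    """Score only the items in the nprobe partitions closest to the user vector."""
    probe = heapq.nlargest(
        nprobe, range(len(centroids)), key=lambda c: dot(user_vec, centroids[c])
    )
    candidates = [item_id for c in probe for item_id in lists[c]]
    return heapq.nlargest(k, candidates, key=lambda i: dot(user_vec, item_vectors[i]))

def exact_search(user_vec, item_vectors, k):
    """Brute-force baseline: score every item."""
    return heapq.nlargest(k, item_vectors, key=lambda i: dot(user_vec, item_vectors[i]))

# Toy data: 2000 random unit vectors standing in for item embeddings.
rng = random.Random(42)
items = {f"item_{n}": normalize([rng.gauss(0, 1) for _ in range(8)]) for n in range(2000)}
centroids, lists = build_ivf_index(items, num_lists=20)
user = normalize([rng.gauss(0, 1) for _ in range(8)])

approx = ann_search(user, items, centroids, lists, k=10, nprobe=5)
exact = exact_search(user, items, k=10)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@10 when probing 5 of 20 lists: {recall:.2f}")
```

Raising `nprobe` trades latency for recall; probing every list degenerates to the brute-force scan.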
Ranker: top ~1000 candidates from candidate generation get scored by a heavier ranker model (~50ms total budget for 1000 items at ~50μs each, batched on GPU). Output is a per-item score; final feed is the top N after diversity post-processing.
Model training: the ranker is retrained daily on the last 30 days of interaction logs. With 1B users × 5 sessions × 100 items shown × 1 KB per interaction event = ~500 TB/day of raw training data, or ~15 PB across the 30-day window, so the log is sampled and aggregated before training. Stored in a columnar warehouse for the offline training pipeline.
High-level design
Two-stage architecture: candidate generation (recall-oriented, fast) and ranking (precision-oriented, slower per item but on a smaller set).
Candidate generation: from 1B items, retrieve ~1000-10000 candidates that are plausibly relevant. Multiple parallel retrieval sources contribute candidates, each with a different generation strategy:
- Two-tower model: a user-tower and item-tower trained jointly produce embeddings such that high-engagement (user, item) pairs are close in vector space. At query time, the user vector is computed online from the user's recent history, then an ANN index over item vectors returns the top-K closest items.
- Collaborative-filtering candidates: items that users similar to this user engaged with recently.
- Recency candidates: trending items in the user's region and language.
- Creator-affinity candidates: items from creators the user follows or has previously engaged with.
- Diversity candidates: items from interest clusters underrepresented in the user's recent history.
Each generator returns its top-K with a candidate-source tag. The candidates are deduplicated and forwarded to the ranker.
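One way to merge tagged candidates is sketched below. The round-robin interleaving across sources is an assumption (the guide only says candidates are tagged and deduplicated), and `merge_candidates` is a hypothetical helper name. Item IDs are assumed never to be `None`.

```python
from itertools import zip_longest

def merge_candidates(source_results, limit):
    """Interleave per-source top-K lists, dedupe, and keep every source tag.

    source_results: dict of source name -> item IDs in that source's rank order.
    Returns a dict of item_id -> list of sources, in first-seen order.
    """
    merged = {}
    # Walk rank 0 of every source, then rank 1, and so on, so that no single
    # generator fills the whole candidate set on its own.
    for rank_slice in zip_longest(*source_results.values()):
        for source, item_id in zip(source_results, rank_slice):
            if item_id is None:
                continue  # this source had fewer results than the longest one
            if item_id in merged:
                merged[item_id].append(source)  # duplicate: record the extra tag
            elif len(merged) < limit:
                merged[item_id] = [source]
    return merged

sources = {
    "two_tower": ["a", "b", "c"],
    "trending": ["b", "d"],
    "creator_affinity": ["e"],
}
candidates = merge_candidates(sources, limit=4)
print(candidates)
```

Keeping all source tags per item (rather than only the first) makes the later impression log more useful for judging which generators earn their slots.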
Ranker: a deeper model (gradient-boosted trees, deep neural network, or transformer-based) takes the candidate list plus a richer feature set per (user, item) pair and outputs a calibrated engagement-probability score. Features include: user embedding, item embedding, recent-interaction history, item age, item creator's average engagement, time-of-day, device type, and dozens of computed cross-features (e.g., 'how often has this user engaged with items in this category in the last 7 days'). The ranker is the most expensive layer per item, which is why candidate generation must aggressively prune.
Post-processing: the ranker's top scores aren't the final feed. A diversity layer enforces constraints — no more than N items from the same creator, no more than M items from the same category in the top 20, sponsored slots interleaved at predetermined positions, blocklist filtering (user-blocked creators, age-inappropriate content). The post-processed list is the final feed.
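The creator and category caps can be sketched as a single greedy pass over the ranked list. This is an illustrative sketch: the function name, the dictionary layout, and the choice to enforce the category cap only inside a leading window (20 positions in the text) are assumptions, and sponsored-slot interleaving is left out.

```python
from collections import Counter

def apply_diversity(ranked_items, blocked_creators, max_per_creator,
                    max_per_category, category_window, feed_size):
    """Greedy pass over items already sorted by ranker score (best first)."""
    feed = []
    per_creator = Counter()
    per_category = Counter()
    for item in ranked_items:
        creator, category = item["creator"], item["category"]
        if creator in blocked_creators:
            continue  # hard filter: user-blocked creator
        if per_creator[creator] >= max_per_creator:
            continue  # creator cap reached
        in_window = len(feed) < category_window
        if in_window and per_category[category] >= max_per_category:
            continue  # category cap applies only inside the top window
        feed.append(item["item_id"])
        per_creator[creator] += 1
        if in_window:
            per_category[category] += 1
        if len(feed) == feed_size:
            break
    return feed

ranked = [
    {"item_id": "v1", "creator": "c1", "category": "cooking"},
    {"item_id": "v2", "creator": "c1", "category": "cooking"},
    {"item_id": "v3", "creator": "c1", "category": "travel"},
    {"item_id": "v4", "creator": "c2", "category": "cooking"},
    {"item_id": "v5", "creator": "c3", "category": "cooking"},
    {"item_id": "v6", "creator": "blocked_creator", "category": "music"},
    {"item_id": "v7", "creator": "c4", "category": "music"},
]
feed = apply_diversity(ranked, {"blocked_creator"}, max_per_creator=2,
                       max_per_category=3, category_window=20, feed_size=5)
print(feed)
```

Because skipped items are never revisited, the feed can come back shorter than `feed_size`; a production layer would typically request more candidates than it needs to absorb that.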
Feature serving: features come from a feature store with both batch (offline-computed user-aggregate features updated daily) and streaming (real-time features like 'item viewed in last 60 seconds') paths. Online feature retrieval is part of the request latency budget — typically ~10ms for the full feature fetch via a sharded in-memory key-value store.
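A quick check that the per-stage budgets quoted in this guide fit the 100ms p99 target. It assumes the three stages run strictly one after another, which is a simplification (real systems overlap feature fetches with retrieval), and the stage names are illustrative.

```python
P99_BUDGET_MS = 100

stage_budget_ms = {
    "candidate_generation": 30,  # upper bound quoted for retrieving ~1000 candidates
    "feature_fetch": 10,         # sharded in-memory key-value lookups
    "ranking": 50,               # 1000 candidates at ~50 microseconds each
}

# Cross-check the ranking budget: candidates * microseconds each -> milliseconds.
ranking_ms = 1000 * 50 / 1000
assert ranking_ms == stage_budget_ms["ranking"]

total_ms = sum(stage_budget_ms.values())
headroom_ms = P99_BUDGET_MS - total_ms
print(f"sequential total: {total_ms} ms, headroom: {headroom_ms} ms")
```

The remaining headroom has to cover post-processing, serialization, and network hops, which is why each stage's budget is treated as a hard ceiling.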
Logging and feedback: every item shown is logged with (user_id, item_id, position, candidate_source, ranker_score, timestamp). User interactions on shown items (click, watch-duration, like, skip) are joined back to the impression log to produce training data. The full impression log volume is enormous; sampling and aggregation are standard.
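The impression-to-interaction join can be sketched as a left join in which unmatched impressions become negative examples. This is a simplified illustration: it joins on (user_id, item_id) only, whereas a production pipeline would more likely key on an impression ID and an attribution time window. The label names and the `action` field are assumptions.

```python
def build_training_rows(impressions, interactions):
    """Left-join impressions with interactions; no match means a negative label."""
    engaged = {}
    for event in interactions:
        key = (event["user_id"], event["item_id"])
        engaged.setdefault(key, set()).add(event["action"])

    rows = []
    for imp in impressions:
        actions = engaged.get((imp["user_id"], imp["item_id"]), set())
        rows.append({
            **imp,  # keep position, candidate_source, ranker_score as features
            "label_click": int("click" in actions),
            "label_like": int("like" in actions),
        })
    return rows

impressions = [
    {"user_id": "u1", "item_id": "i1", "position": 0,
     "candidate_source": "two_tower", "ranker_score": 0.91, "timestamp": 1000},
    {"user_id": "u1", "item_id": "i2", "position": 1,
     "candidate_source": "trending", "ranker_score": 0.74, "timestamp": 1000},
]
interactions = [
    {"user_id": "u1", "item_id": "i1", "action": "click"},
    {"user_id": "u1", "item_id": "i1", "action": "like"},
]
rows = build_training_rows(impressions, interactions)
for row in rows:
    print(row["item_id"], row["label_click"], row["label_like"])
```

Carrying `position` into the training row matters because items shown higher get more engagement regardless of quality, and the ranker needs that signal to correct for it.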
Deep dive — the hard problem
The deep dives below cover cold-start, the freshness vs personalization tradeoff, long-term vs short-term engagement, exploration vs exploitation, ranker training cadence, and business-rule overrides.
Cold-start: new users have no interaction history; new items have no engagement signal. Both cases break the standard embedding-similarity retrieval.
New user cold-start: the system uses demographic and registration-context signals (location, age range, device type, referral source, language) to assign the user to a 'starter persona' cluster, and seeds early recommendations from that cluster's popular items. As the user interacts with even 5-10 items, a partial user embedding becomes computable and the system transitions to personalized retrieval. Mature systems also use survey-style onboarding ('pick 3 interests') to bootstrap the embedding faster.
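The persona-to-personalized handoff might look like the sketch below. Several details are assumptions made for illustration: the persona is keyed on (country, language) only, the threshold uses the lower end of the 5-10 range, and the partial user embedding is a plain average of the interacted items' vectors rather than the output of a trained user tower.

```python
MIN_INTERACTIONS = 5  # lower end of the "5-10 items" range, an assumed cutoff

def starter_persona(signup):
    """Hypothetical bucketing on registration context."""
    return (signup["country"], signup["language"])

def user_query(signup, history, item_vectors, persona_popular):
    """Choose persona-based seeding or a partial embedding from history."""
    if len(history) < MIN_INTERACTIONS:
        persona = starter_persona(signup)
        items = persona_popular.get(persona, persona_popular["default"])
        return {"mode": "persona", "items": items}
    vectors = [item_vectors[item_id] for item_id in history]
    dim = len(vectors[0])
    # Assumed stand-in for the user tower: mean of interacted item vectors.
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return {"mode": "personalized", "user_vector": mean}

signup = {"country": "BR", "language": "pt"}
persona_popular = {("BR", "pt"): ["p1", "p2"], "default": ["g1"]}
item_vectors = {f"i{n}": [float(n), 1.0] for n in range(5)}

new_user = user_query(signup, [], item_vectors, persona_popular)
warm_user = user_query(signup, ["i0", "i1", "i2", "i3", "i4"], item_vectors, persona_popular)
print(new_user["mode"], warm_user["mode"])
```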
New item cold-start: a brand-new item has no engagement history, so collaborative-filtering won't surface it. The two-tower model handles this through content features — the item-tower takes content embeddings (image features, title text, creator embedding, category) as input, not just collaborative signal. A new item from a popular creator with similar content features to past hits gets a reasonable initial score and can be surfaced to a fraction of users; their engagement signal then bootstraps the item into mainstream candidate generation. This is exploration: deliberately showing low-data items to gather signal.
Freshness vs personalization tradeoff: a strongly personalized system over-fits to the user's existing interests — they see more of what they've already seen, the filter bubble. A purely recency-driven system shows everyone the same trending items regardless of fit. The mix is a tuned parameter: typically 60-80% personalized candidates, 10-20% recency candidates, 5-10% diversity candidates, 5-10% exploration candidates. The mix is A/B tested continuously; engagement metrics and longer-horizon retention metrics measure whether the user is being well-served or filter-bubbled.
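Turning a source mix into per-source candidate quotas is a small allocation problem. The sketch below picks one point inside the ranges above (70/15/8/7, an illustrative choice, not a recommended setting) and uses largest-remainder rounding so the quotas always sum to the requested total. Percentages are integers to avoid floating-point rounding surprises.

```python
def allocate_slots(total, mix_percent):
    """Split `total` candidate slots across sources by integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix must sum to 100 percent")
    slots, remainders = {}, {}
    for source, pct in mix_percent.items():
        slots[source], remainders[source] = divmod(total * pct, 100)
    # Hand leftover slots to the sources with the largest fractional remainder.
    leftover = total - sum(slots.values())
    for source in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        slots[source] += 1
    return slots

mix = {"personalized": 70, "recency": 15, "diversity": 8, "exploration": 7}
quotas_1000 = allocate_slots(1000, mix)
quotas_50 = allocate_slots(50, mix)
print(quotas_1000)
print(quotas_50)
```

Keeping the mix as a config value rather than hard-coding it is what makes the continuous A/B testing described above practical.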
Long-term vs short-term engagement: optimizing only for short-term engagement (the next click) is a known anti-pattern that selects for clickbait and outrage content. Production rankers blend a short-term engagement prediction with a long-term value model (will this user still be active in 30 days if shown this item). The long-term signal is harder to learn (sparse, delayed), so it gets blended into the ranker score with a weight tuned to balance immediate engagement against retention.
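A linear blend is the simplest way to picture the weighting. The formula, the example probabilities, and the weight of 0.3 are all assumptions for illustration; the guide only says the long-term signal is blended in with a tuned weight.

```python
def blended_score(p_engage, p_retain, long_term_weight):
    """Linear blend of short-term engagement and long-term value predictions."""
    return (1 - long_term_weight) * p_engage + long_term_weight * p_retain

# Hypothetical predictions: (probability of click, probability of 30-day retention).
items = {
    "clickbait": (0.30, 0.10),
    "solid_content": (0.22, 0.60),
}

def top_item(long_term_weight):
    return max(items, key=lambda name: blended_score(*items[name], long_term_weight))

print("weight 0.0 ->", top_item(0.0))
print("weight 0.3 ->", top_item(0.3))
```

With no long-term weight the higher click probability wins; a modest weight is enough to flip the order in this toy example.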
Exploration vs exploitation: pure exploitation always shows the highest-predicted-engagement item, which means low-data items never get shown and the catalog ossifies. Exploration deliberately reserves slots for items the model is uncertain about, using multi-armed-bandit techniques: epsilon-greedy (e.g., 5% of slots are filled from the low-data pool instead of by predicted score), upper-confidence-bound selection (favoring items with high uncertainty), or Thompson sampling. The exploration budget is small but essential for surfacing new creators and new content.
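Thompson sampling can be sketched with a Beta posterior over each item's click-through rate. The flat Beta(1, 1) prior, the click/impression counts, and the item names are assumptions; a production system would likely use an informed prior and confine this to the exploration slots.

```python
import random
from collections import Counter

def thompson_pick(stats, rng):
    """Sample a plausible CTR per item from its posterior; show the highest sample.

    stats: item_id -> (clicks, impressions). Uses a flat Beta(1, 1) prior.
    """
    samples = {
        item: rng.betavariate(1 + clicks, 1 + impressions - clicks)
        for item, (clicks, impressions) in stats.items()
    }
    return max(samples, key=samples.get)

stats = {
    "established_hit": (3000, 10000),  # well-measured, CTR near 0.30
    "established_dud": (100, 10000),   # well-measured, CTR near 0.01
    "new_upload": (0, 0),              # no data: posterior is the flat prior
}

rng = random.Random(7)
wins = Counter(thompson_pick(stats, rng) for _ in range(1000))
print(wins)
```

The new upload gets shown because its posterior is wide, not because it is predicted to be good; as impressions accumulate the posterior narrows and the item either earns its slot or drops out. The flat prior makes exploration very aggressive in this toy setup, which is one reason to prefer an informed prior in practice.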
Ranker training cadence: a daily-retrained ranker captures yesterday's trends. Some platforms move to hourly retraining on streaming data for trend-sensitive surfaces (short-form video). The cost is operational complexity: model serving infrastructure must hot-swap models without downtime, and online evaluation must catch a bad model before it serves real users. Production systems use shadow deployment: a new model scores live traffic in parallel with the current model, and its predictions are logged but not shown to users. Because shadow predictions never reach the feed, the promotion decision rests on comparing those logged predictions against the engagement users actually produced (calibration and ranking quality on the served impressions), often followed by a small live A/B test.
Business-rule overrides: sponsored slots, regulatory blocks (age-restricted content in regions that ban it), and policy-violation-flagged content all bypass the ranker score. The post-processing layer enforces these as hard constraints: sponsored items go in predetermined positions, blocked items are filtered, and downranked content (e.g., creator on probation) has its score multiplied by a constant <1. Mention business-rule layers explicitly; interviewers expect candidates to recognize that the model isn't the final arbiter.
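The three override types can be sketched as one pass over the ranker output. The function name, the data shapes, and the 0.5 multiplier are hypothetical; the sketch only illustrates that filtering, score multiplication, and fixed-position insertion happen after the model has scored.

```python
def apply_business_rules(ranked, blocked_ids, downrank, sponsored_by_position, feed_size):
    """ranked: list of (item_id, ranker_score).

    blocked_ids: items removed outright (regulatory or policy blocks).
    downrank: item_id -> multiplier below 1 applied to the ranker score.
    sponsored_by_position: feed position -> sponsored item pinned there.
    """
    adjusted = [
        (item_id, score * downrank.get(item_id, 1.0))
        for item_id, score in ranked
        if item_id not in blocked_ids
    ]
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    organic = iter(item_id for item_id, _ in adjusted)

    feed = []
    for position in range(feed_size):
        if position in sponsored_by_position:
            feed.append(sponsored_by_position[position])  # pinned slot
            continue
        next_item = next(organic, None)
        if next_item is None:
            break  # ran out of organic items
        feed.append(next_item)
    return feed

ranked = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
final_feed = apply_business_rules(
    ranked,
    blocked_ids={"b"},
    downrank={"a": 0.5},
    sponsored_by_position={1: "ad_1"},
    feed_size=4,
)
print(final_feed)
```

Item "a" had the best ranker score but lands last after the multiplier, and "b" never appears: the model proposes, the rule layer disposes.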
Common mistakes
- Trying to score all 1B items in the ranker — candidate generation must aggressively prune before ranking
- Forgetting cold-start — new users and new items break embedding-similarity retrieval and need a content-based fallback
- Ignoring exploration — pure exploitation ossifies the catalog and starves new content
- Optimizing only for short-term engagement — clickbait and outrage content win the next-click metric and destroy retention
- Skipping the diversity post-processing layer — ranker scores alone produce a feed dominated by one creator or one category
Likely follow-up questions
- How would you implement explainability — why was this item recommended to this user?
- What changes if the platform must support a 'private mode' where the user's interactions aren't used for personalization?
- How would you A/B test a new ranker model without exposing users to a worse experience during the test?
- How would you handle a coordinated effort to inflate an item's engagement (bot views, like-farms) so it surfaces in candidate generation?
- How would you support multi-objective optimization — maximize engagement and creator-diversity and revenue simultaneously?
Frequently asked questions
- How long is the Design a Recommendation System interview?
- 60-75 minutes is typical at recommendation-heavy companies (Meta, Google, Amazon, streaming platforms). Expect deep questions on candidate generation strategies, the ranker architecture, and cold-start.
- Do I need to know specific ML model details (deep nets vs gradient boosting)?
- Naming the two-stage retrieve-then-rank architecture, explaining the embedding-based ANN retrieval, and discussing the feature set is enough. The model-architecture choice (DNN vs GBDT vs transformer) is a tunable detail and varies by platform.
- How is this different from a search system?
- Search is query-driven: the user provides explicit intent. Recommendation is interest-driven: the system infers intent from history. The candidate-generation step is similar (both can use ANN retrieval over embeddings), but recommendation puts more weight on diversity, exploration, and personalization, whereas search retrieval is keyed primarily on the query rather than on a user embedding.
- What is the most important concept for the Design a Recommendation System interview?
- The two-stage architecture: candidate generation prunes 1B items to ~1000 with cheap retrieval, then a heavy ranker scores the ~1000 with rich features. The senior signal is recognizing the latency budget forces this decomposition and explaining the candidate-source mix (collab + content + recency + diversity + exploration).