How long is the Design Recommendation System interview?

60-75 minutes is typical at recommendation-heavy companies (Meta, Google, Amazon, streaming platforms). Expect deep questions on candidate generation strategies, the ranker architecture, and cold-start.

Do I need to know specific ML model details (deep nets vs gradient boosting)?

Naming the two-stage retrieve-then-rank architecture, explaining the embedding-based ANN retrieval, and discussing the feature set is enough. The model-architecture choice (DNN vs GBDT vs transformer) is a tunable detail and varies by platform.

How is this different from a search system?

Search is query-driven: the user provides explicit intent. Recommendation is interest-driven: the system infers intent from history. The candidate-generation step is similar (both can use ANN retrieval over embeddings), but recommendation puts more weight on diversity, exploration, and personalization. The user embedding is the primary retrieval input in recommendation, whereas in search the query is primary and any user signal is secondary.

What is the most important concept for Design Recommendation System?

The two-stage architecture: candidate generation prunes 1B items to ~1000 with cheap retrieval, then a heavy ranker scores the ~1000 with rich features. The senior signal is recognizing the latency budget forces this decomposition and explaining the candidate-source mix (collab + content + recency + diversity + exploration).

Design a Recommendation System — System Design Interview Guide

Design a Recommendation System is a system-design interview that asks you to build the engine behind a personalized feed — products on a marketplace, videos on a streaming service, posts on a social platform. The hard part is the two-stage architecture: candidate generation across billions of items, then ranking the top thousand with a heavier model, all under ~100ms p99 per user request.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Meta
Google
Amazon
TikTok
Netflix

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Return a personalized list of N items (videos, products, posts) for a given user and surface
Update recommendations as users interact (clicks, watches, likes, skips) within a session
Cold-start new users (no interaction history) and new items (no engagement signal)
Honor business rules: blocklists, diversity constraints, sponsored slots, regional availability
Surface trending or breakout items even if they're outside the user's typical interest cluster
Provide explainability: why was this item recommended (debugging, regulatory requirements)

Non-functional requirements

Latency: <100ms p99 from user-feed request to ranked list returned
Throughput: ~1M+ feed requests/sec at peak across the platform
Freshness: a new user interaction is reflected in the next request within seconds (session-level); new items become recommendable within minutes of upload
Catalog size: ~1B+ items in candidate pool
User base: ~1B+ active users
Quality: recommendation engagement rate (click-through, watch-completion) as the north-star metric, A/B testable continuously

Capacity estimation

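Public-scale assumption: ~1B daily active users across a major platform, ~5 feed requests per user per day (one per session entry), so ~5B feed requests/day, or ~58K/sec on average. The ~1M/sec peak figure is a provisioning target that leaves generous headroom over that average for diurnal spikes and bursts. Each request returns 50-200 items.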
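The request-rate arithmetic can be checked with a few lines of Python. This is a back-of-envelope sketch that uses only the figures stated above; the constant names are illustrative.

```python
# Back-of-envelope request rate from the stated assumptions.
DAILY_ACTIVE_USERS = 1_000_000_000
REQUESTS_PER_USER_PER_DAY = 5
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_TARGET_QPS = 1_000_000  # the stated peak figure

requests_per_day = DAILY_ACTIVE_USERS * REQUESTS_PER_USER_PER_DAY
average_qps = requests_per_day / SECONDS_PER_DAY
peak_to_average = PEAK_TARGET_QPS / average_qps

print(f"{requests_per_day:,} requests/day")
print(f"{average_qps:,.0f} requests/sec on average")
print(f"peak target is {peak_to_average:.1f}x the average")
```

Stating the peak-to-average ratio out loud is useful in an interview, because it shows the peak number is a deliberate provisioning choice.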

Catalog: ~1B items across video, products, or posts. Per-item features (embedding vectors of dimension ~200, plus metadata like creator, language, age) at ~2 KB each = ~2 TB total catalog feature store. The item embeddings are loaded into an in-memory index for retrieval.

User feature store: ~1B users × ~5 KB per user (embedding + recent-interaction history + demographics + computed features) = ~5 TB. Sharded by user_id in a key-value store.
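The two feature-store sizes follow directly from the per-record figures. The sketch below reproduces them and also estimates the embedding-only footprint of the retrieval index. The float32 storage format is an assumption; the text gives only the embedding dimension.

```python
# Storage estimate from the stated per-record sizes (decimal units).
KB = 1_000
TB = 1_000_000_000_000

NUM_ITEMS = 1_000_000_000
NUM_USERS = 1_000_000_000
ITEM_RECORD_BYTES = 2 * KB
USER_RECORD_BYTES = 5 * KB

# Assumption: embeddings are stored as float32 (4 bytes per dimension).
EMBEDDING_DIM = 200
BYTES_PER_FLOAT32 = 4

catalog_tb = NUM_ITEMS * ITEM_RECORD_BYTES / TB
user_store_tb = NUM_USERS * USER_RECORD_BYTES / TB
item_embedding_bytes = EMBEDDING_DIM * BYTES_PER_FLOAT32
ann_index_tb = NUM_ITEMS * item_embedding_bytes / TB

print(f"catalog feature store: {catalog_tb:.1f} TB")
print(f"user feature store:    {user_store_tb:.1f} TB")
print(f"raw item embeddings:   {ann_index_tb:.1f} TB ({item_embedding_bytes} B per item)")
```

Under that assumption the raw vectors are well under half of the 2 KB item record, which is consistent with the rest of the record being metadata.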

Candidate generation: each request needs to surface ~1000 candidates from 1B items in <30ms. Brute-force scoring is not feasible: touching 1B items inside a 30ms window works out to ~33B item scores per second for a single in-flight request, before multiplying by the request rate. The architecture instead uses approximate-nearest-neighbor (ANN) indexes (HNSW, IVF, ScaNN-style) over the shared user-item embedding space: convert the user to a query vector, then retrieve the top-K most similar item vectors in ~10ms.
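To make the ANN idea concrete, here is a toy inverted-file (IVF-style) index in pure Python. It is a minimal sketch, not a production index: centroids are sampled items where a real system would run k-means, and the class name, sizes, and parameters are all illustrative assumptions. The point is that a query scans only the few buckets whose centroids are closest to the user vector.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

class ToyIVFIndex:
    """Bucket items by nearest centroid; at query time probe only a few buckets."""

    def __init__(self, item_vectors, num_buckets, rng):
        self.items = item_vectors
        # Assumption: centroids are sampled items; real systems train them.
        self.centroids = rng.sample(item_vectors, num_buckets)
        self.buckets = [[] for _ in range(num_buckets)]
        for item_id, vec in enumerate(item_vectors):
            best = max(range(num_buckets), key=lambda c: dot(vec, self.centroids[c]))
            self.buckets[best].append(item_id)

    def search(self, query, k, nprobe):
        # Rank buckets by centroid similarity and scan only the top nprobe.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dot(query, self.centroids[c]), reverse=True)
        scanned = [i for c in order[:nprobe] for i in self.buckets[c]]
        scanned.sort(key=lambda i: dot(query, self.items[i]), reverse=True)
        return scanned[:k], len(scanned)

rng = random.Random(0)
items = [random_unit_vector(16, rng) for _ in range(2000)]
index = ToyIVFIndex(items, num_buckets=40, rng=rng)
user_vector = random_unit_vector(16, rng)
top_ids, num_scanned = index.search(user_vector, k=10, nprobe=5)
print(f"scanned {num_scanned} of {len(items)} items, returned {len(top_ids)}")
```

Raising nprobe trades latency for recall, which is the same knob real IVF indexes expose.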

Ranker: top ~1000 candidates from candidate generation get scored by a heavier ranker model (~50ms total budget for 1000 items at ~50μs each, batched on GPU). Output is a per-item score; final feed is the top N after diversity post-processing.
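A quick way to sanity-check the latency budget is to add up the per-stage figures. The sketch below assumes the stages run sequentially and uses the ~10ms feature-fetch figure quoted elsewhere in this guide; real systems often overlap stages, so treat this as a worst-case view.

```python
# Latency budget check, assuming the stages run one after another.
TOTAL_BUDGET_MS = 100
CANDIDATES_TO_RANK = 1000
RANKER_COST_PER_ITEM_US = 50

ranking_ms = CANDIDATES_TO_RANK * RANKER_COST_PER_ITEM_US / 1000
stage_budget_ms = {
    "candidate_generation": 30,
    "feature_fetch": 10,
    "ranking": ranking_ms,
}
used_ms = sum(stage_budget_ms.values())
headroom_ms = TOTAL_BUDGET_MS - used_ms

for stage, ms in stage_budget_ms.items():
    print(f"{stage:>22}: {ms:5.1f} ms")
print(f"{'headroom':>22}: {headroom_ms:5.1f} ms")
```

Whatever headroom remains has to cover post-processing, serialization, and network time, which is why doubling the candidate count is not free.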

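Model training: the ranker is retrained daily on the last 30 days of interaction logs. With 1B users × 5 sessions × 100 items shown × 1 KB per interaction event, the raw log is ~500 TB/day (5 × 10^11 events at 1 KB each), or ~15 PB over a 30-day window. At that volume the impression log is sampled and aggregated before training, and the result is stored in a columnar warehouse for the offline training pipeline.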

High-level design

Two-stage architecture: candidate generation (recall-oriented, fast) and ranking (precision-oriented, slower per item but on a smaller set).

Candidate generation: from 1B items, retrieve ~1000-10000 candidates that are plausibly relevant. Multiple parallel retrieval sources contribute candidates, each with a different generation strategy:

(1) Two-tower model: a user-tower and item-tower trained jointly produce embeddings such that high-engagement (user, item) pairs are close in vector space. At query time, the user vector is computed online from the user's recent history, then an ANN index over item vectors returns the top-K closest items.
(2) Collaborative-filtering candidates: items that users similar to this user engaged with recently.
(3) Recency candidates: trending items in the user's region and language.
(4) Creator-affinity candidates: items from creators the user follows or has previously engaged with.
(5) Diversity candidates: items from interest clusters underrepresented in the user's recent history.

Each generator returns its top-K with a candidate-source tag. The candidates are deduplicated and forwarded to the ranker.
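The merge step can be sketched as follows. This is a minimal illustration with hypothetical source names and item ids: each item appears once in the output, and the list of sources that proposed it is kept so the tag survives deduplication.

```python
def merge_candidates(source_results):
    """source_results maps a source name to its item ids, best first.

    Returns one entry per unique item, in first-seen order, with every
    source that proposed it recorded as the candidate-source tag.
    """
    merged = {}
    for source, item_ids in source_results.items():
        for item_id in item_ids:
            merged.setdefault(item_id, []).append(source)
    return [{"item_id": i, "sources": s} for i, s in merged.items()]

# Hypothetical generator outputs.
source_results = {
    "two_tower": [101, 102, 103],
    "collaborative": [102, 104],
    "recency": [105, 101],
}
candidates = merge_candidates(source_results)
for c in candidates:
    print(c)
```

Keeping all source tags, not just the first, lets the ranker use "proposed by several generators" as a feature and lets logging attribute an impression to each source.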

Ranker: a deeper model (gradient-boosted trees, deep neural network, or transformer-based) takes the candidate list plus a richer feature set per (user, item) pair and outputs a calibrated engagement-probability score. Features include: user embedding, item embedding, recent-interaction history, item age, item creator's average engagement, time-of-day, device type, and dozens of computed cross-features (e.g., 'how often has this user engaged with items in this category in the last 7 days'). The ranker is the most expensive layer per item, which is why candidate generation must aggressively prune.

Post-processing: the ranker's top scores aren't the final feed. A diversity layer enforces constraints — no more than N items from the same creator, no more than M items from the same category in the top 20, sponsored slots interleaved at predetermined positions, blocklist filtering (user-blocked creators, age-inappropriate content). The post-processed list is the final feed.
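A greedy pass over the ranked list is one simple way to enforce the creator and category caps. The sketch below is a simplification under stated assumptions: items that violate a cap are dropped where a real system might defer them to a later position, and the field names are hypothetical.

```python
from collections import Counter

def apply_diversity_caps(ranked, max_per_creator, max_per_category, window):
    """ranked: dicts with item_id, creator, category, sorted by score descending.

    Enforces a per-creator cap on the whole feed and a per-category cap
    within the first `window` positions.
    """
    feed = []
    creator_counts = Counter()
    category_counts = Counter()
    for item in ranked:
        if creator_counts[item["creator"]] >= max_per_creator:
            continue  # simplification: dropped, not deferred
        in_window = len(feed) < window
        if in_window and category_counts[item["category"]] >= max_per_category:
            continue
        feed.append(item)
        creator_counts[item["creator"]] += 1
        if in_window:
            category_counts[item["category"]] += 1
    return feed

ranked = [
    {"item_id": "a1", "creator": "A", "category": "music"},
    {"item_id": "a2", "creator": "A", "category": "music"},
    {"item_id": "a3", "creator": "A", "category": "music"},
    {"item_id": "b1", "creator": "B", "category": "music"},
    {"item_id": "c1", "creator": "C", "category": "sports"},
]
feed = apply_diversity_caps(ranked, max_per_creator=2, max_per_category=3, window=20)
print([item["item_id"] for item in feed])
```

Here the third item from creator A is skipped even though it outscores the items below it, which is exactly the behavior the ranker alone would not produce.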

Feature serving: features come from a feature store with both batch (offline-computed user-aggregate features updated daily) and streaming (real-time features like 'item viewed in last 60 seconds') paths. Online feature retrieval is part of the request latency budget — typically ~10ms for the full feature fetch via a sharded in-memory key-value store.

Logging and feedback: every item shown is logged with (user_id, item_id, position, candidate_source, ranker_score, timestamp). User interactions on shown items (click, watch-duration, like, skip) are joined back to the impression log to produce training data. The full impression log volume is enormous; sampling and aggregation are standard.
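The impression-to-interaction join can be illustrated with in-memory records. This sketch assumes a click is the positive label and joins on (user_id, item_id); a production pipeline would more likely join on a unique impression id and handle delayed events, so the keys and the event field here are assumptions.

```python
def build_training_rows(impressions, interactions):
    """Label each logged impression 1 if the user clicked the item, else 0."""
    clicked = {
        (event["user_id"], event["item_id"])
        for event in interactions
        if event["event"] == "click"
    }
    rows = []
    for imp in impressions:
        key = (imp["user_id"], imp["item_id"])
        rows.append({**imp, "label": 1 if key in clicked else 0})
    return rows

impressions = [
    {"user_id": 1, "item_id": 10, "position": 0, "candidate_source": "two_tower", "ranker_score": 0.91},
    {"user_id": 1, "item_id": 11, "position": 1, "candidate_source": "recency", "ranker_score": 0.74},
    {"user_id": 2, "item_id": 10, "position": 0, "candidate_source": "collaborative", "ranker_score": 0.66},
]
interactions = [
    {"user_id": 1, "item_id": 11, "event": "click"},
    {"user_id": 2, "item_id": 10, "event": "skip"},
]
rows = build_training_rows(impressions, interactions)
print([row["label"] for row in rows])
```

Impressions with no matching interaction become negatives, which is why logging every item shown, not just the clicked ones, matters for training.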

Deep dive — the hard problem

This section covers cold-start, the freshness vs personalization tradeoff, long-term vs short-term engagement, exploration vs exploitation, ranker training cadence, and business-rule overrides.

Cold-start: new users have no interaction history; new items have no engagement signal. Both cases break the standard embedding-similarity retrieval.

New user cold-start: the system uses demographic and registration-context signals (location, age range, device type, referral source, language) to assign the user to a 'starter persona' cluster, and seeds early recommendations from that cluster's popular items. As the user interacts with even 5-10 items, a partial user embedding becomes computable and the system transitions to personalized retrieval. Mature systems also use survey-style onboarding ('pick 3 interests') to bootstrap the embedding faster.
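The hand-off from persona-based to personalized retrieval can be sketched as a simple threshold check. The threshold of 5 is the low end of the 5-10 range above; the persona key (region, language) and the function names are illustrative assumptions.

```python
MIN_INTERACTIONS_FOR_PERSONALIZATION = 5  # low end of the 5-10 range

def starter_persona(user):
    # Assumption: a persona is keyed by region and language only.
    return (user["region"], user["language"])

def candidates_for(user, persona_popular, personalized_retriever):
    """Serve persona-popular items until the user has enough history."""
    if len(user["history"]) < MIN_INTERACTIONS_FOR_PERSONALIZATION:
        persona = starter_persona(user)
        return persona_popular.get(persona, persona_popular["default"])
    return personalized_retriever(user)

persona_popular = {
    ("US", "en"): ["us_hit_1", "us_hit_2"],
    "default": ["global_hit_1", "global_hit_2"],
}

def personalized_retriever(user):
    # Stand-in for the embedding-based retrieval path.
    return [f"similar_to_{item}" for item in user["history"][-2:]]

new_user = {"region": "US", "language": "en", "history": []}
warm_user = {"region": "US", "language": "en", "history": ["v1", "v2", "v3", "v4", "v5", "v6"]}
print(candidates_for(new_user, persona_popular, personalized_retriever))
print(candidates_for(warm_user, persona_popular, personalized_retriever))
```

A real system would blend the two paths gradually instead of switching at a hard threshold, but the threshold version makes the transition easy to explain.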

New item cold-start: a brand-new item has no engagement history, so collaborative-filtering won't surface it. The two-tower model handles this through content features — the item-tower takes content embeddings (image features, title text, creator embedding, category) as input, not just collaborative signal. A new item from a popular creator with similar content features to past hits gets a reasonable initial score and can be surfaced to a fraction of users; their engagement signal then bootstraps the item into mainstream candidate generation. This is exploration: deliberately showing low-data items to gather signal.

Freshness vs personalization tradeoff: a strongly personalized system over-fits to the user's existing interests — they see more of what they've already seen, the filter bubble. A purely recency-driven system shows everyone the same trending items regardless of fit. The mix is a tuned parameter: typically 60-80% personalized candidates, 10-20% recency candidates, 5-10% diversity candidates, 5-10% exploration candidates. The mix is A/B tested continuously; engagement metrics and longer-horizon retention metrics measure whether the user is being well-served or filter-bubbled.
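The mix translates into per-source candidate quotas. The sketch below picks one point inside each stated range (70% / 15% / 7.5% / 7.5%) as an assumption and uses integer per-mille shares to avoid floating-point rounding; any rounding leftover goes to the personalized source.

```python
# One point inside the stated ranges, expressed in parts per thousand.
MIX_PER_MILLE = {
    "personalized": 700,
    "recency": 150,
    "diversity": 75,
    "exploration": 75,
}

def allocate_slots(total, mix_per_mille):
    """Split `total` candidate slots across sources according to the mix."""
    quotas = {source: total * share // 1000 for source, share in mix_per_mille.items()}
    leftover = total - sum(quotas.values())
    quotas["personalized"] += leftover  # rounding remainder
    return quotas

quotas_1000 = allocate_slots(1000, MIX_PER_MILLE)
quotas_50 = allocate_slots(50, MIX_PER_MILLE)
print(quotas_1000)
print(quotas_50)
```

Because the mix is a plain configuration value, an A/B test can vary it per experiment arm without touching the generators themselves.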

Long-term vs short-term engagement: optimizing only for short-term engagement (the next click) is a known anti-pattern that selects for clickbait and outrage content. Production rankers blend a short-term engagement prediction with a long-term value model (will this user still be active in 30 days if shown this item). The long-term signal is harder to learn (sparse, delayed), so it gets blended into the ranker score with a weight tuned to balance immediate engagement against retention.
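One simple form of the blend is a weighted sum of the two predictions. This is a sketch under assumptions: the linear form, the weight of 0.3, and the example probabilities are illustrative, not values from any production system.

```python
def blended_score(p_short_term, p_long_term, long_term_weight):
    """Linear blend of next-click probability and a long-term value estimate."""
    return (1 - long_term_weight) * p_short_term + long_term_weight * p_long_term

# Hypothetical predictions: (short-term engagement, long-term value).
items = {
    "clickbait": (0.30, 0.05),
    "quality": (0.20, 0.60),
}

def rank(items, long_term_weight):
    return sorted(
        items,
        key=lambda name: blended_score(*items[name], long_term_weight),
        reverse=True,
    )

print("short-term only:", rank(items, long_term_weight=0.0))
print("blended:        ", rank(items, long_term_weight=0.3))
```

With the weight at zero the clickbait item ranks first; a modest long-term weight is enough to flip the order in this example.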

Exploration vs exploitation: pure exploitation always shows the highest-predicted-engagement item, which means low-data items never get shown and the catalog ossifies. Exploration deliberately reserves slots for items the model is uncertain about. The usual tools come from multi-armed bandits: epsilon-greedy slot reservation (e.g., 5% of slots go to exploration items instead of the highest predicted score), with the exploration item picked at random or by an uncertainty-aware method such as Thompson sampling. The exploration budget is small but essential for surfacing new creators and new content.
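The slot-reservation idea can be sketched by combining the two techniques named above: each slot becomes an exploration slot with probability epsilon, and the exploration item is chosen by Thompson sampling over a Beta posterior on click-through rate. The uniform Beta(1, 1) prior, the function names, and the assumption that exploration items are disjoint from the ranked list are all illustrative.

```python
import random

def thompson_pick(stats, rng):
    """stats maps item_id to (clicks, impressions). Returns the item whose
    sampled click-through rate is highest under a Beta(1, 1) prior."""
    def sampled_ctr(item_id):
        clicks, impressions = stats[item_id]
        return rng.betavariate(1 + clicks, 1 + impressions - clicks)
    return max(stats, key=sampled_ctr)

def fill_feed(ranked_ids, explore_stats, num_slots, epsilon, rng):
    """Fill slots from the ranked list, diverting a slot to exploration
    with probability epsilon while exploration items remain.
    Assumes ranked_ids has at least num_slots entries."""
    feed = []
    exploit = iter(ranked_ids)
    remaining = dict(explore_stats)
    for _ in range(num_slots):
        if remaining and rng.random() < epsilon:
            pick = thompson_pick(remaining, rng)
            del remaining[pick]
        else:
            pick = next(exploit)
        feed.append(pick)
    return feed

rng = random.Random(7)
ranked_ids = list(range(100))
explore_stats = {"new_a": (0, 0), "new_b": (1, 20)}
feed = fill_feed(ranked_ids, explore_stats, num_slots=20, epsilon=0.05, rng=rng)
print(feed)
```

Items with few impressions have wide posteriors, so they occasionally sample a high rate and win the slot, which is how low-data items get their first exposure.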

Ranker training cadence: a daily-retrained ranker captures yesterday's trends. Some platforms move to hourly retraining on streaming data for trend-sensitive surfaces (short-form video). The cost is operational complexity: model-serving infrastructure must hot-swap models without downtime, and evaluation must catch a bad model before it serves real users. Production systems use shadow deployment: a new model scores live requests in parallel with the current model, and its predictions are logged but not shown to users. Because shadow predictions never reach the feed, the comparison relies on prediction quality against the logged outcomes (calibration, ranking metrics) plus latency and error rates; a small-traffic canary or A/B test then measures the actual engagement rate before full promotion.
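Since a shadow model's predictions are logged but never shown, one way to compare it with the live model is to score both sets of logged predictions against the outcomes users actually produced. The sketch below uses log loss as the metric; the metric choice, the promotion threshold, and the function names are assumptions for illustration.

```python
import math

def log_loss(predictions, labels):
    """Mean binary cross-entropy; lower is better."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def ready_for_canary(live_preds, shadow_preds, labels, min_improvement):
    """True if the shadow model beats the live model by at least the margin."""
    improvement = log_loss(live_preds, labels) - log_loss(shadow_preds, labels)
    return improvement >= min_improvement

# Hypothetical logged outcomes (1 = engaged) and both models' predictions.
labels = [1, 0, 0, 1]
live_preds = [0.6, 0.4, 0.5, 0.5]
shadow_preds = [0.8, 0.2, 0.3, 0.7]

print(f"live log loss:   {log_loss(live_preds, labels):.3f}")
print(f"shadow log loss: {log_loss(shadow_preds, labels):.3f}")
print("promote to canary:", ready_for_canary(live_preds, shadow_preds, labels, 0.01))
```

The logged outcomes were produced under the live model's feed, so this comparison is biased toward items the live model chose to show; that bias is one reason a live-traffic test still follows.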

Business-rule overrides: sponsored slots, regulatory blocks (age-restricted content in regions that ban it), and policy-violation flagged content all bypass the ranker score. The post-processing layer enforces these as hard constraints: sponsored items go in predetermined positions, blocked items are filtered, and downranked content (e.g., creator on probation) has its score multiplied by a constant <1. Mention business-rule layers explicitly; interviewers expect candidates to recognize that the model isn't the final arbiter.
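The three override types can be sketched as one post-processing function: filter blocked items, apply downrank multipliers, re-sort, then pin sponsored items at fixed positions. The function name, the data shapes, and the example multiplier are illustrative assumptions.

```python
def apply_business_rules(scored, blocked_ids, downrank, sponsored_by_position):
    """scored: list of (item_id, ranker_score) pairs.

    blocked_ids: items removed outright.
    downrank: item_id -> multiplier below 1 applied to the ranker score.
    sponsored_by_position: feed index -> sponsored item pinned there.
    """
    kept = [
        (item_id, score * downrank.get(item_id, 1.0))
        for item_id, score in scored
        if item_id not in blocked_ids
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    feed = [item_id for item_id, _ in kept]
    # Insert in ascending position order so earlier pins do not shift later ones.
    for position in sorted(sponsored_by_position):
        feed.insert(min(position, len(feed)), sponsored_by_position[position])
    return feed

scored = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
final_feed = apply_business_rules(
    scored,
    blocked_ids={"b"},
    downrank={"a": 0.5},
    sponsored_by_position={1: "ad1"},
)
print(final_feed)
```

Item "a" has the highest ranker score but ends up last after the multiplier, and the sponsored item holds its position regardless of any score, which is the "model isn't the final arbiter" point in concrete form.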

Common mistakes

Trying to score all 1B items in the ranker — candidate generation must aggressively prune before ranking
Forgetting cold-start — new users and new items break embedding-similarity retrieval and need a content-based fallback
Ignoring exploration — pure exploitation ossifies the catalog and starves new content
Optimizing only for short-term engagement — clickbait and outrage content win the next-click metric and destroy retention
Skipping the diversity post-processing layer — ranker scores alone produce a feed dominated by one creator or one category

Likely follow-up questions

How would you implement explainability — why was this item recommended to this user?
What changes if the platform must support a 'private mode' where the user's interactions aren't used for personalization?
How would you A/B test a new ranker model without exposing users to a worse experience during the test?
How would you handle a coordinated effort to inflate an item's engagement (bot views, like-farms) so it surfaces in candidate generation?
How would you support multi-objective optimization — maximize engagement and creator-diversity and revenue simultaneously?
