Why is candidate generation a separate stage from ranking?

Cost. The ranker is expensive — 50 microseconds per (user, item) pair × 10K candidates is 500ms, which is the entire latency budget. Candidate generation is cheap (vector lookups, precomputed lists) and reduces the pool to ~500 items so the ranker only scores what matters. Same pattern as search retrieval-then-ranking.

Is the feature store the same as a regular cache?

Architecturally similar (key-value, in-memory) but operationally distinct. The feature store has multiple write paths (batch, streaming, sync), a freshness SLA per feature, and a strict point-in-time-consistency requirement between serving and training. A regular cache doesn't have any of these.

Do I need to discuss the ML model architecture?

Mention the family (neural net or tree ensemble), the input-output shape (user_features + item_features → engagement score), and the multi-objective nature (likes, comments, dwell, not-hide). Drilling into the network architecture is out of scope; the system question is about feature serving and the train-serve pipeline.

What is the senior signal for this question?

Three: (1) you separate candidate generation from ranking with a clear cost/quality tradeoff; (2) you architect the feature store as a first-class subsystem with point-in-time consistency; (3) you describe the online-offline split: training on logged labels, serving with shadow tests and gradual rollout. Missing the offline-online distinction is the most common signal that lowers a hiring recommendation.

Design a News Feed Ranking System — System Design Interview Guide

Design a News Feed Ranking System is a system-design interview that asks you to build the personalized ranking and serving layer behind a social feed: hundreds of millions of users open the app, and within 500ms they see a ranked list of 50 stories tailored to them by an ML model. The hard parts are candidate generation at scale (which 500 items to rank from a pool of millions), the feature store that hydrates user and item features for the ranker, and cache warmup so the cold-start request doesn't blow the latency budget.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Meta
ByteDance
Snapchat
Pinterest
LinkedIn

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Generate a personalized ranked feed of 50 items for a given user on feed open
Re-rank on subsequent scrolls (paginate through ranked candidates)
Refresh the feed when the user pulls to refresh (newer items + rerank)
Filter out items the user has already seen (within a recency window)
Honor per-user preferences (muted topics, blocked authors)
Mix multiple content sources: posts from followees, sponsored items, recommended (non-follow) content

Non-functional requirements

Scale: 500M DAU, ~5 feed loads per user per day = 2.5B feed requests/day, peak ~80K/sec
Feed-load latency: <500ms p95 from request to ranked top-N returned
Freshness: a post published within the last 5 minutes should be a candidate for the next feed load
Availability: 99.99%; on ranker failure, fall back to a recency-only feed
Model serving: <50ms p99 for scoring 500 candidates against a user
Personalization: ranker considers ~200 features per (user, item) pair

Capacity estimation

Public assumptions: 500M DAU, 5 feed loads per user per day, 80K requests/sec peak. Each request scores ~500 candidates and returns the top 50. Aggregate scoring: 500 × 80K = 40M scores per second peak.

Candidate pool: for a typical user with 300 follows, the recent-posts pool (followee posts in the last 48 hours) is ~5,000 items. Adding recommended (non-follow) candidates pushes the pool to ~10,000. Candidate generation must narrow this 10K to ~500 in <100ms — full ML scoring on 10K is too expensive.

Feature store: 200 features per (user, item) pair, 500 candidates per request, 80K requests/sec = 8B feature lookups/sec in aggregate if every feature is fetched individually. Realistically the features are organized as ~50 user features and ~150 item features, stored as one row per entity and fetched once per request (user features) and once per candidate (item features). Net lookups: 80K × (1 + 500) = ~40M feature-row reads/sec. That load calls for a sharded in-memory key-value store with sub-millisecond p99 reads.
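
The arithmetic above is easy to sanity-check in a few lines. This is a back-of-envelope sketch that uses only the figures quoted in this section; the variable names are illustrative.

```python
# Back-of-envelope check of the peak-load figures quoted above.
PEAK_RPS = 80_000              # peak feed requests per second
CANDIDATES_PER_REQUEST = 500   # candidates scored per request
FEATURES_PER_PAIR = 200        # features per (user, item) pair

# Ranker load: every candidate in every request gets one score.
scores_per_sec = PEAK_RPS * CANDIDATES_PER_REQUEST

# Naive feature access: one lookup per feature per (user, item) pair.
naive_lookups_per_sec = PEAK_RPS * CANDIDATES_PER_REQUEST * FEATURES_PER_PAIR

# Row-oriented access: one user row plus one row per candidate item.
row_reads_per_sec = PEAK_RPS * (1 + CANDIDATES_PER_REQUEST)

print(f"scores/sec:        {scores_per_sec:,}")         # 40,000,000
print(f"naive lookups/sec: {naive_lookups_per_sec:,}")  # 8,000,000,000
print(f"row reads/sec:     {row_reads_per_sec:,}")      # 40,080,000
```

Grouping features into one row per entity is what turns 8B lookups into roughly 40M reads, a 200x reduction that makes the sub-millisecond target plausible.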

Feature storage: 500M daily active users (the DAU figure from the requirements) × 50 features × 50 bytes/feature = 1.25 TB for user features. 1B items × 150 features × 50 bytes/feature = 7.5 TB for item features. Item features have a long tail (old posts are rarely scored); the active hot set is ~100M items, or ~750 GB at the same per-item size.

Ranker model: a neural net with ~10M parameters (or a tree ensemble with thousands of trees). Per-candidate scoring cost: ~50 microseconds on a CPU, faster on a GPU. Scoring 500 candidates per request: 25ms p50, 50ms p99. The ranker is served from a dedicated inference fleet, not co-located with the API tier.

Freshness: a post created at T should be eligible for feed loads by T + 5 minutes, so the candidate generation index must update within 5 minutes of post creation. Item features (engagement velocity, content quality) update on a 1-minute aggregation window.

Seen-content filter: 500M users × ~500 items seen per day × 30-day retention × 16 bytes per (user_id, item_id) pair = ~120 TB stored raw. Probabilistic data structures (Bloom filters, count-min sketch) compress this by 10-50x with tolerable false-positive rates. A false positive here means an unseen item is occasionally treated as seen and dropped from the feed, which is acceptable at a low rate.
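
To make the compression claim concrete, here is a minimal Bloom filter sketch. It assumes one filter per user sized for 30 days of ~500 items per day and a 1% false-positive target; the per-user layout, the 1% target, and the class name are illustrative assumptions, not something the design above prescribes.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, fp_rate: float) -> None:
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item_id: str):
        # Double hashing: derive k bit positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item_id.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item_id: str) -> None:
        for pos in self._positions(item_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item_id: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item_id))


# One filter per user: ~500 items/day over 30 days at a 1% false-positive rate.
seen = BloomFilter(expected_items=500 * 30, fp_rate=0.01)
for n in range(1000):
    seen.add(f"item-{n}")

bytes_per_item = seen.num_bits / 8 / (500 * 30)
print("item-42" in seen)  # True: items that were added are always found
print(f"{bytes_per_item:.2f} bytes per item vs 16 bytes raw")
```

Under these assumptions the filter needs about 1.2 bytes per item instead of 16, roughly a 13x saving, which sits inside the 10-50x range quoted above. A looser false-positive target shrinks it further.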

High-level design

Three-stage ranking pipeline: candidate generation → ranking → policy.

Stage 1: candidate generation. Reduces the pool from ~10K eligible items to ~500 candidates that go into the ranker. Multiple parallel candidate sources contribute: (a) followee-recents — posts from followees in the last 48 hours, pulled from a precomputed timeline cache; (b) recommended-similar — items similar to those the user has engaged with, served from a vector-similarity index; (c) trending — high-engagement items in the user's region/language/topic affinities; (d) sponsored — ad candidates pre-matched to the user's targeting. Each source returns a few hundred items with a candidate-source score; the union is deduplicated and passed to the ranker.

Stage 2: ranking. The ranker is a neural net (or tree ensemble) that takes (user_features, item_features, interaction_features) → predicted engagement score per candidate. The 500 candidates are scored in parallel, sorted by score, and the top ~100 move on to the policy stage. Engagement is multi-objective: not just 'will the user click' but also 'will they like, comment, dwell, and not hide or report the item'. The model is trained offline on logged engagement data and served online from an inference fleet.
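
One common way to turn multi-objective predictions into a single sortable score is a weighted sum. The sketch below assumes the model emits one probability per objective; the field names and the weights are hypothetical and would be tuned in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictions:
    """Per-candidate model outputs; each field is a probability in [0, 1]."""
    p_like: float
    p_comment: float
    p_long_dwell: float
    p_hide: float
    p_report: float


# Hypothetical weights: positive for desired actions, negative for unwanted ones.
WEIGHTS = {
    "p_like": 1.0,
    "p_comment": 2.0,
    "p_long_dwell": 0.5,
    "p_hide": -5.0,
    "p_report": -20.0,
}


def engagement_score(pred: Predictions) -> float:
    """Collapse the multi-objective predictions into one sortable score."""
    return sum(weight * getattr(pred, name) for name, weight in WEIGHTS.items())


def top_n(scored: dict[str, Predictions], n: int) -> list[str]:
    """Return the n item ids with the highest combined score."""
    return sorted(scored, key=lambda item_id: engagement_score(scored[item_id]), reverse=True)[:n]


candidates = {
    "post-a": Predictions(0.30, 0.05, 0.40, 0.01, 0.001),
    "post-b": Predictions(0.60, 0.02, 0.10, 0.20, 0.010),  # liked often, hidden often
    "post-c": Predictions(0.10, 0.01, 0.20, 0.00, 0.000),
}
print(top_n(candidates, 2))  # ['post-a', 'post-c']
```

Note that post-b has the highest like probability but drops out because of its hide and report penalties, which is the point of scoring on more than clicks.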

Stage 3: policy. Applies business rules and diversity constraints to the ranker's top-100: (a) filter seen items (last 30 days); (b) enforce diversity (no more than 3 consecutive items from the same author, mix content types); (c) honor per-user mutes and blocks; (d) inject sponsored items at fixed slots (e.g., positions 3, 8, 14) with their own ad-policy budget; (e) cap repeated topics. The final 50 items are returned to the client.
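
The policy rules above can be sketched as a single pass over the ranker's output. This is a simplified illustration: the function name and argument shapes are assumptions, and an item that would break the diversity cap is simply skipped rather than demoted to a later position.

```python
def apply_policy(ranked, seen_ids, blocked_authors, sponsored,
                 ad_slots=(3, 8, 14), max_author_run=3, feed_size=50):
    """ranked: list of (item_id, author_id) in ranker order, best first.
    sponsored: list of ad item ids. ad_slots: 1-based feed positions for ads."""
    feed = []      # final item ids, in display order
    authors = []   # author of each feed entry (None for ads)
    ads = list(sponsored)
    for item_id, author_id in ranked:
        if len(feed) >= feed_size:
            break
        # Fill a fixed ad slot before placing the next organic item.
        while ads and len(feed) + 1 in ad_slots and len(feed) < feed_size:
            feed.append(ads.pop(0))
            authors.append(None)
        if len(feed) >= feed_size:
            break
        # Seen-content filter plus per-user mutes and blocks.
        if item_id in seen_ids or author_id in blocked_authors:
            continue
        # Diversity: skip an item that would extend a same-author run past the cap.
        recent = authors[-max_author_run:]
        if len(recent) == max_author_run and all(a == author_id for a in recent):
            continue
        feed.append(item_id)
        authors.append(author_id)
    return feed


ranked = [("a1", "alice"), ("a2", "alice"), ("b1", "bob"), ("a3", "alice"),
          ("a4", "alice"), ("a5", "alice"), ("a6", "alice"), ("c1", "carol"),
          ("b2", "bob")]
feed = apply_policy(ranked, seen_ids={"b1"}, blocked_authors={"carol"}, sponsored=["ad1"])
print(feed)  # ['a1', 'a2', 'ad1', 'a3', 'a4', 'a5', 'b2']
```

In the example, b1 is dropped as already seen, c1 is dropped because its author is blocked, a6 is dropped because it would be a fourth consecutive item from the same author, and the ad lands in position 3.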

The feature store sits to the side of the pipeline. It holds user features (long-term: account age, follow graph stats, topic affinities; short-term: last 24h engagement signals) and item features (content type, language, author reputation, engagement velocity, age). The store has two tiers: a hot in-memory layer for active users and recent items, and a cold persistent layer for everything else. The ranker hydrates features in a single batched read per request.

Online-offline split. Training is offline: a daily batch job reads logged feed-load events with engagement labels, retrains the model, validates it on a holdout, and ships it to a model registry. Serving is online: the latest model from the registry is loaded by the inference fleet on a periodic refresh (every 1-6 hours). New features are added in stages: first log the feature for a few weeks (no ranking effect), then retrain with the feature included and ship. This decoupling lets model iteration move fast without coordinating online code changes.

Deep dive — the hard problem

Three deep dives: candidate generation and the ranker model serving, the feature store architecture, and cache warmup for cold-start requests.

Candidate generation. The ranker is too expensive to run on 10K candidates per request: at 50 microseconds per score, 10K candidates cost 500ms for scoring alone, which consumes the entire latency budget. The candidate generator must be cheap (sub-10ms) and produce a high-recall shortlist, meaning the items the user actually wants to see make it into the ~500 that reach the ranker. Precise ordering is the ranker's job.

Four candidate sources, each optimized differently.

Source A: followee-recents. The user's follow graph is precomputed (followee list). For each followee, recent posts are pulled from a precomputed per-user timeline cache (the same cache as in the Design Twitter / Design Instagram hybrid push-pull model). Cost: O(F × P) where F is followee count and P is recent posts per followee. Capped at top-K per followee to keep the candidate count bounded.
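
A minimal sketch of the per-followee cap and merge, assuming each author's recent posts are already cached newest first. The function name, the cap of 5, and the data layout are illustrative assumptions.

```python
import heapq
from itertools import islice


def followee_recents(followees, recent_posts, per_followee_cap=5, limit=200):
    """recent_posts maps author_id -> list of (timestamp, post_id), newest first.
    Returns up to `limit` post ids, newest first across all followees."""
    # Cap each followee's contribution so one prolific author cannot flood the pool.
    capped = (islice(recent_posts.get(author, []), per_followee_cap) for author in followees)
    # Each per-author list is already sorted newest first, so a k-way merge suffices.
    merged = heapq.merge(*capped, key=lambda post: post[0], reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]


recent_posts = {
    "alice": [(105, "a3"), (101, "a2"), (90, "a1")],
    "bob": [(103, "b1")],
    "carol": [(110, "c1")],  # not followed, so never considered
}
print(followee_recents(["alice", "bob"], recent_posts, per_followee_cap=2))
# ['a3', 'b1', 'a2']
```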

Source B: recommended-similar. The user has a learned embedding (derived from their engagement history). The candidate generator looks up the K nearest items in a vector-similarity index. This index is built offline from item embeddings (each item has a learned vector from its content + metadata). Query cost: roughly O(log N) on a graph-based approximate-nearest-neighbor index such as HNSW; other ANN families such as IVF are also sublinear but scale differently.
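
The lookup itself is a top-K similarity query. The sketch below is an exact brute-force version, not an ANN index: it scans every item, which is O(N) per query, and is shown only to make the operation that an index such as HNSW approximates concrete. The embeddings are made-up three-dimensional vectors.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_items(user_vec, item_vecs, k):
    """Exact top-k item ids by cosine similarity to the user embedding."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: cosine(user_vec, item_vecs[item_id]))


item_vecs = {
    "cooking": [0.9, 0.1, 0.0],
    "baking": [0.8, 0.3, 0.1],
    "football": [0.0, 0.2, 0.9],
}
user_vec = [1.0, 0.2, 0.0]
print(nearest_items(user_vec, item_vecs, k=2))  # ['cooking', 'baking']
```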

Source C: trending. Items with high engagement velocity in the last few minutes. Precomputed in a streaming aggregation tier — incoming engagement events update a per-region/per-topic top-K list every minute. Query cost: O(1) lookup.
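
A single-process sketch of that aggregation, assuming a scheduler calls `roll_window` once per minute. The class and method names are hypothetical; a production version would run in the streaming tier across many workers.

```python
from collections import Counter, defaultdict


class TrendingTracker:
    """Counts engagement events per (region, topic) bucket for the current window."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.window_counts = defaultdict(Counter)  # bucket -> item_id -> event count
        self.top_k = {}                            # bucket -> precomputed top-k list

    def record(self, region: str, topic: str, item_id: str) -> None:
        self.window_counts[(region, topic)][item_id] += 1

    def roll_window(self) -> None:
        """Publish top-k lists for the finished window and start a new one."""
        self.top_k = {
            bucket: [item for item, _ in counts.most_common(self.k)]
            for bucket, counts in self.window_counts.items()
        }
        self.window_counts = defaultdict(Counter)

    def trending(self, region: str, topic: str) -> list[str]:
        # The query path is a single dictionary lookup of a precomputed list.
        return self.top_k.get((region, topic), [])


tracker = TrendingTracker(k=2)
for item_id in ["x", "x", "x", "y", "z", "z"]:
    tracker.record("us", "sports", item_id)
tracker.roll_window()
print(tracker.trending("us", "sports"))  # ['x', 'z']
```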

Source D: sponsored. Ad candidates that match the user's targeting (demographics, interests, behavioral signals). Pulled from the ad-serving subsystem with its own ranking; treated as a separate slot in the policy stage.

The candidate set is deduplicated, given source-attribution scores, and passed to the ranker. The ranker doesn't need to know which source an item came from, but the source signal can be a feature.
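
The union-and-deduplicate step can be sketched as follows. It assumes each source reports a score in [0, 1] that is roughly comparable across sources, which is a simplifying assumption; the function name and record shape are illustrative.

```python
def merge_candidates(sources, limit=500):
    """sources maps source name -> list of (item_id, source_score).
    Returns up to `limit` candidates, each with its best score and all its sources."""
    merged = {}
    for source_name, items in sources.items():
        for item_id, score in items:
            entry = merged.setdefault(
                item_id, {"item_id": item_id, "best_score": score, "sources": []}
            )
            entry["best_score"] = max(entry["best_score"], score)
            entry["sources"].append(source_name)  # kept so it can become a ranker feature
    ranked = sorted(merged.values(), key=lambda e: e["best_score"], reverse=True)
    return ranked[:limit]


sources = {
    "followee_recents": [("p1", 0.9), ("p2", 0.4)],
    "recommended_similar": [("p2", 0.7), ("p3", 0.6)],
    "trending": [("p3", 0.5)],
}
for candidate in merge_candidates(sources):
    print(candidate["item_id"], candidate["best_score"], candidate["sources"])
# p1 0.9 ['followee_recents']
# p2 0.7 ['followee_recents', 'recommended_similar']
# p3 0.6 ['recommended_similar', 'trending']
```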

Feature store architecture. Three concerns: storage layout, freshness, and online-offline consistency.

Storage: features are stored as (entity_id, feature_name) → value, where entity_id is user_id or item_id. Sharded by entity_id (so all features for one entity hit one shard). The hot tier is in-memory across shards; the cold tier is on SSD. Eviction from hot is LRU.
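
A toy version of the two-tier layout with LRU eviction, using a plain dict as a stand-in for the SSD-backed cold tier. The class name and the in-process design are assumptions for illustration; the real store is sharded across machines.

```python
from collections import OrderedDict


class TieredFeatureStore:
    """Hot tier: bounded in-memory LRU. Cold tier: any dict-like persistent store."""

    def __init__(self, cold_store: dict, hot_capacity: int) -> None:
        self.cold = cold_store
        self.hot = OrderedDict()   # entity_id -> {feature_name: value}
        self.hot_capacity = hot_capacity

    def get(self, entity_id: str) -> dict:
        if entity_id in self.hot:
            self.hot.move_to_end(entity_id)   # mark as most recently used
            return self.hot[entity_id]
        row = self.cold.get(entity_id, {})    # slow path: cold-tier read
        self._promote(entity_id, row)
        return row

    def get_many(self, entity_ids: list[str]) -> dict:
        """Batched read: one call hydrates every entity needed by a request."""
        return {entity_id: self.get(entity_id) for entity_id in entity_ids}

    def _promote(self, entity_id: str, row: dict) -> None:
        self.hot[entity_id] = row
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)      # evict the least recently used entry


cold = {
    "user:1": {"account_age_days": 900},
    "item:1": {"language": "en"},
    "item:2": {"language": "fr"},
}
store = TieredFeatureStore(cold, hot_capacity=2)
store.get_many(["user:1", "item:1"])
store.get("user:1")   # touch user:1 so item:1 becomes least recently used
store.get("item:2")   # promotes item:2 and evicts item:1
print(list(store.hot))  # ['user:1', 'item:2']
```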

Freshness: features have varying update cadences. Some are static (account creation date — never changes). Some are slow-moving (follow count — updates per follow event). Some are fast-moving (last-5-minutes engagement counts — needs streaming ingest). The feature store's write path comes from multiple sources: batch jobs (daily refresh of slow-moving features), streaming pipelines (incoming engagement events → fast-moving feature updates), and synchronous writes from the API tier (user just clicked something → bump their engagement-count feature). Each feature has a defined freshness SLA.

Online-offline consistency. The feature values the model trained on must match what it sees at serving time, or you get train-serve skew (the model's behavior in production differs from its offline-evaluated behavior). The standard pattern: log every feature value used in a serving request to the same store that batch training reads from. At training time, the training job reads the logged feature values at that exact request's timestamp — point-in-time correct features. This avoids the trap where the feature value changed between serving and training and the model learns the wrong signal.
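
The logging pattern can be sketched in a few lines: snapshot the feature values at serving time, then join them with labels that arrive later. The class, the key shape (request_id, item_id), and the in-memory storage are assumptions for illustration.

```python
import time


class FeatureLog:
    """Append-only log of the exact feature values used at serving time."""

    def __init__(self) -> None:
        self.rows = {}   # (request_id, item_id) -> logged row

    def log_serving(self, request_id, item_id, features, served_at=None):
        self.rows[(request_id, item_id)] = {
            "features": dict(features),   # copy, so later store updates cannot leak in
            "served_at": time.time() if served_at is None else served_at,
        }

    def training_examples(self, labels):
        """labels maps (request_id, item_id) -> engagement label observed later."""
        return [
            (self.rows[key]["features"], label)
            for key, label in labels.items()
            if key in self.rows
        ]


log = FeatureLog()
live_features = {"engagement_velocity": 0.2}
log.log_serving("req-1", "post-9", live_features, served_at=1000.0)

# The live value moves on after serving; the logged snapshot must not change.
live_features["engagement_velocity"] = 0.9

examples = log.training_examples({("req-1", "post-9"): 1})
print(examples)  # [({'engagement_velocity': 0.2}, 1)]
```

The training example carries the value 0.2 that the model actually saw, not the 0.9 the feature drifted to afterwards, which is the point-in-time guarantee described above.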

Cache warmup for cold-start. When a user opens the app, the system needs to assemble user features, candidate set, and item features, all within 500ms. A naive design fetches everything on demand and accumulates latency at each step. Production tactics:

Tactic 1: precomputed timeline cache (same as Twitter/Instagram). The user's followee-recents are precomputed and cached at the user-id level. Feed open: fetch the precomputed list (1 read), it's already there.

Tactic 2: user-feature warm cache. Active users' features are kept in the hot tier. The first feed load of the day for a returning user pulls features from the cold tier (50ms penalty) and promotes them to hot; subsequent loads in the session are <5ms.

Tactic 3: anticipatory prefetch. Background jobs identify users likely to open the app in the next hour (based on historical patterns) and pre-warm their features and candidate set. Hit rate is moderate but the latency win on hits is substantial.

Tactic 4: fallback to recency-only. If any stage fails (ranker fleet degraded, feature store partial outage), fall back to a chronological feed of followee-recents. Users still see something rather than nothing. The fallback is wired into the API tier as a circuit breaker: any stage that errors or fails to respond within X ms triggers the fallback path.
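
Here is a sketch of the per-request half of that fallback: call the ranker with a deadline and serve newest-first on any error or timeout. The function names and the thread-pool approach are assumptions; a full circuit breaker would also track the recent failure rate and skip the ranker call entirely while the circuit is open.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def load_feed(candidates, ranker, timeout_s):
    """candidates: list of (item_id, created_at). ranker: callable returning ranked ids.
    Returns (feed, mode) where mode is "ranked" or "recency_fallback"."""
    future = _executor.submit(ranker, candidates)
    try:
        return future.result(timeout=timeout_s), "ranked"
    except Exception:
        # Timeout or ranker error: serve newest-first instead of an empty page.
        future.cancel()
        newest_first = sorted(candidates, key=lambda c: c[1], reverse=True)
        return [item_id for item_id, _ in newest_first], "recency_fallback"


def healthy_ranker(cands):
    return ["mid", "new", "old"]


def slow_ranker(cands):
    time.sleep(0.2)   # simulates a degraded inference fleet
    return []


candidates = [("old", 100), ("new", 300), ("mid", 200)]
print(load_feed(candidates, healthy_ranker, timeout_s=0.5))
# (['mid', 'new', 'old'], 'ranked')
print(load_feed(candidates, slow_ranker, timeout_s=0.05))
# (['new', 'mid', 'old'], 'recency_fallback')
```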

A fourth surface beyond the three deep dives: model freshness and shipping cadence. The ranker model trains daily on a rolling window of engagement logs. New model candidates are evaluated offline (replay on a holdout day, check that key metrics improve) and shadow-tested online (the new model scores requests but the old model's output is returned; engagement deltas are estimated). Promotion is gradual: 1% traffic, then 10%, then 50%, then 100%, with automatic rollback on any monitored metric regression. Mention this as the model-ops surface; interviewers reward the candidate who treats the model as an evolving artifact, not a fixed asset.
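
Gradual promotion needs a stable way to decide which users see the new model. One common approach, sketched here with hypothetical function names and model ids, is to hash the user id into a fixed bucket from 0 to 99 and compare it with the current rollout percentage.

```python
import hashlib


def in_rollout(user_id: str, model_version: str, percent: int) -> bool:
    """Deterministically place a user in or out of the rollout cohort."""
    digest = hashlib.sha256(f"{model_version}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 100)
    return bucket < percent


def pick_model(user_id: str, candidate: str, stable: str, percent: int) -> str:
    """Route a user to the candidate model if they fall inside the rollout."""
    return candidate if in_rollout(user_id, candidate, percent) else stable


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    exposed = sum(in_rollout(u, "ranker-v2", percent) for u in users)
    print(f"{percent:>3}% target -> {exposed} of {len(users)} users on the candidate")
```

Because each user's bucket is fixed for a given model version, anyone exposed at 1% stays exposed at 10% and beyond, so a user's experience does not flip back and forth between stages. Rollback is the same call with the percentage set to 0.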

Common mistakes

Running the ML ranker on every candidate in the pool — at 10K candidates per request the latency budget blows in scoring alone
Skipping candidate generation as a distinct stage — interviewers want a clean (cheap-generate, expensive-rank) split
Treating the feature store as a queryable database — feature serves have to be sub-millisecond, which means in-memory with batched reads, not SQL
Forgetting train-serve skew — features at training time and serving time must be point-in-time consistent
No fallback path on ranker failure — a degraded ML tier should drop the feed to recency-only, not return an empty page

Likely follow-up questions

How would you A/B-test a new ranker model against the existing one without exposing risky regressions to too many users?
What changes if you have to incorporate real-time signals (the user just liked something 30 seconds ago) into the next feed load?
How would you handle a 'cold-start' user who just signed up and has no engagement history?
How would you detect that the ranker is over-fitting to clickbait (high CTR but low dwell-time and high not-interested rate)?
How would you support a 'why am I seeing this' explainer feature that shows the user why a specific item was ranked highly?
