Skip to main content

System Design Questions

Design Typeahead Suggestion — System Design Interview Guide

Design Typeahead Suggestion is a system-design interview that asks you to build a personalized prefix-completion engine: as a user types, the system returns the top suggestions ranked not just by global popularity but tailored to the user's history, friend network, and context. The hard part is balancing personalization quality against the sub-100ms latency budget.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Meta
  • Amazon
  • LinkedIn
  • Pinterest
  • Spotify

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • For each typed prefix, return the top N personalized completions
  • Rank completions by a blend of global popularity, recency, user-personal history, and contextual signals (location, time, current session)
  • Update suggestions in near-real-time as the user types each character
  • Support multiple entity types in one response: queries, users, products, places (depending on platform)
  • Filter inappropriate content and apply platform-specific quality rules
  • Handle typos and approximate prefix matching

Non-functional requirements

  • Latency: <100ms p99 from keypress to suggestions visible (network + compute)
  • Personalization: the user's own past queries and recent activity must influence top results
  • Read-heavy: ~50K+ typeahead requests/sec/region peak (each typed character potentially triggers a request)
  • Freshness: user's most-recent activity should influence suggestions within minutes
  • Scale: ~500M users, ~5B+ typeahead requests/day globally

Capacity estimation

Scale anchors for a major social or e-commerce platform: ~500M MAU, average ~20 search sessions/user/month, average 3-5 typed characters per session triggering 3-5 typeahead requests. Total typeahead requests: ~30B/month = ~12K/sec average, ~50K/sec peak globally.

Per-region peak: with traffic distributed across 5 major regions, each region sees ~10K-50K typeahead requests/sec peak.

Corpus size: the personalization layer combines a global candidate pool (top 1-10M popular queries) with per-user candidate pools (each user's past 1000 queries). Global pool: ~10M queries × 100 bytes = 1 GB; per-user pool: ~1000 queries × 100 bytes × 500M users = 50 TB total but accessed only per-user.

Storage breakdown. Global trie: ~1 GB per region (replicated). Per-user history: ~50 KB per user, sharded by user_id across many nodes. User embedding (for personalization): a few hundred float values per user, ~2 KB/user × 500M = 1 TB across all users.

Write QPS: each user search emits an event used for personalization. ~30B events/month = ~12K/sec. Updates flow into per-user history (recent queries) and into the global aggregation pipeline (popularity counters).

Latency budget: 100ms total budget breaks down as ~30ms network + ~30ms global candidate retrieval + ~20ms personalization scoring + ~20ms response assembly. Every component must stay sharply within budget.

High-level design

Three serving layers: global candidates, personalization, and ranking. Plus an offline aggregation pipeline feeding all three.

Global candidates: an in-memory prefix structure (compressed trie or DAWG) holding the top ~1-10M global queries per region. Each terminal node precomputes the top-K completions for the prefix. On a typeahead request, the service traverses the trie to the user's prefix and reads the precomputed top-K candidates — O(prefix length) in trie traversal, O(1) to read the precomputed list. Latency target: <30ms.

Personalization layer: every user has a personal candidate set built from their recent queries and activity. The personal candidate set lives in a sharded store keyed by user_id and is fetched on each request in parallel with the global candidate lookup. The personal set is small (typically <100 entries) so the fetch is sub-10ms.

The personal candidates include the user's own past queries with prefixes matching the current input. They also include entities the user has recently engaged with (people the user messaged, items the user viewed) that share a prefix with the input. The personal candidate pool is much smaller than the global pool but much more relevant for the user.

Ranking layer: combines global and personal candidates into a final ranked list. The ranker scores each candidate using a learned model with features: global popularity, personal affinity (have they engaged with this entity before), recency, contextual signals (location, time-of-day, current session topic), and entity-type weights (queries vs users vs products). The top N candidates from the combined pool are returned.

The ranker runs online per request, so it must be fast (<20ms). Production systems use a small ranking model (gradient-boosted trees or a shallow neural network) with cached features rather than a deep model that would blow the latency budget.

Offline aggregation pipeline: every search event flows into a streaming pipeline. Aggregations include global popularity counters (which queries are trending), per-user history updates, and per-user embedding refreshes. The global trie is rebuilt periodically (every few minutes for trending-sensitive features, every hour for stable popularity); user-specific data is updated continuously.

Third optional layer: friend-graph personalization. For social platforms, suggestions can include entities popular among the user's friends or followed accounts. Friend popularity is precomputed offline per user (aggregating recent activity across the user's social graph) and stored as additional candidate input.

Deep dive — the hard problem

Two deep dives: personalization without blowing the latency budget, and the cold-start problem for new users.

Personalization within budget: the naive approach is to score every global candidate per user — but with ~50 candidates per global prefix lookup and a model scoring at ~1ms/candidate, that's 50ms just for ranking. The latency budget doesn't allow it.

Two tactics combine.

Candidate pruning: only score the top ~20 candidates from the combined global+personal pool. The global lookup already returns precomputed top-K (typically K=10-50); the personal lookup returns <100. After deduplication, the ranker scores at most ~50 unique candidates, well within budget.

Feature caching: features for each candidate (global popularity score, content type, entity metadata) are precomputed and cached. Features for each user (user embedding, recent engagement vector) are cached per user with hourly refresh. The online ranker just combines cached features — no expensive lookups in the request path. Mention feature caching explicitly — interviewers reward the candidate who recognizes the latency vs personalization tradeoff.

For very-low-latency cases (sub-30ms requirements), some platforms skip the full ranker and use a simple linear scoring (global_popularity × global_weight + personal_match × personal_weight) computed in microseconds. The full ranker activates only when latency budget allows.

Cold-start: a brand-new user with no history has no personal candidate set. The system falls back to global suggestions only, weighted toward defaults that work for everyone (popular trending queries, popular entities in the user's region). After the user issues their first few searches, the personal candidate set begins to populate and personalization kicks in.

A more sophisticated cold-start uses demographic and contextual proxies: a new user's geography, device, and inferred age bracket can suggest a 'cluster' of users with similar early behavior; their aggregate behavior becomes the new user's initial personal profile. This is a separate offline pipeline producing per-cluster default suggestions.

Third tradeoff: privacy. Personalization based on prior searches is sensitive — users in private/incognito mode expect no personalization. The serving layer must accept a 'no personalization' flag from the client and respect it (return global candidates only). Mention this as a configuration concern.

Fourth tradeoff: freshness of trending content. If a query suddenly spikes (breaking news, viral event), the global trie's hourly rebuild may miss it. Two solutions. (1) Decoupled trending overlay: a real-time aggregation pipeline tracks queries that have spiked in the last 10 minutes; the serving layer consults both the trie and the trending overlay, merging results. (2) More aggressive rebuilding: rebuild the trie every few minutes instead of hourly. Production typically uses (1) — the overlay is small, dedicated to fresh content, and supplements the broader trie.

Fifth tradeoff: multi-entity-type ranking. For a platform like LinkedIn (people + companies + jobs + posts) or Pinterest (pins + boards + users), the suggestions span multiple entity types. The ranker must produce a sensible mix — not all 'people' results above all 'jobs' or vice versa. Production uses 'slot-based' rendering: the top result is always the best overall match; lower slots are filled with diversity rules ensuring at least one of each entity type that has any strong candidates. Mention this — multi-entity diversity is a frequent senior question.

Common mistakes

  • Treating typeahead as pure prefix lookup against a global trie — ignores the personalization which is the entire point of the question
  • Scoring all global candidates per request with a heavy model — blows the latency budget by 5-10x
  • Forgetting feature caching — without cached features, the ranker can't run in <20ms
  • Skipping cold-start — new users get poor suggestions until cluster-based defaults are explained
  • Treating multi-entity-type results as 'just rank everything together' — without diversity rules the response collapses into one type

Likely follow-up questions

  • How would you support typo-tolerance (BK-tree or edit-distance fuzzy matching) without exceeding the latency budget?
  • What changes if you have to support 50 languages including non-Latin scripts (CJK, RTL)?
  • How would you A/B test a new ranker model in production without impacting the 99th percentile latency?
  • How would you implement a 'recent searches' section that's purely chronological alongside the ranked suggestions?
  • How would your design handle the autocomplete request volume during a viral global event (10x normal traffic)?

Practice Design Typeahead Suggestion live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

How is Design Typeahead Suggestion different from Design Search Autocomplete?
Search Autocomplete typically emphasizes the global trie and trending updates. Design Typeahead Suggestion emphasizes personalization — combining user-specific history and contextual signals with the global candidates. Same surface, deeper personalization focus.
How long is the Design Typeahead interview?
45-60 minutes. Senior+ rounds expect the personalization vs latency tradeoff to be discussed explicitly, plus at least one of (cold-start, multi-entity, freshness).
Do I need to know specific ranking model architectures?
Naming 'gradient-boosted trees or a shallow neural network for fast online scoring' is enough. The interview signal is in the system architecture, not the model design.
What is the most important concept for Design Typeahead?
The latency vs personalization tradeoff. The senior signal hinges on whether the candidate keeps personalization within the latency budget via candidate pruning, feature caching, and offline aggregation — not by trying to score everything online with a heavy model.