Design Search Autocomplete — System Design Interview Guide
Design Search Autocomplete is a system-design interview that asks you to build the suggestion engine behind a search box: as the user types each character, return the top 10 most likely query completions in under 100ms. The hard part is the prefix data structure and updating it from a stream of new queries.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Amazon
- Meta
- Microsoft
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- For each typed prefix, return the top N (e.g., 10) most likely completions
- Rank completions by popularity (frequency of past queries with that prefix)
- Update the suggestion data continuously from new user queries
- Personalize suggestions when user context (history, location) is available
- Filter out inappropriate, illegal, or recently-spiked spam queries
Non-functional requirements
- Latency: <100ms p99 from keypress to suggestions visible (network + compute combined)
- Read-heavy: ~10K typed-keystroke queries/sec/region peak
- Freshness: trending queries should appear in suggestions within an hour of starting to trend
- Scale: ~5B searches/day globally for a major engine, ~10B keystrokes/day generating suggestion requests
Capacity estimation
Public scale for a major search engine (2023): ~5B queries/day. Each query generates ~10 autocomplete requests as the user types (avg query length 20 chars, with debouncing requests fire roughly every 2 chars). Total autocomplete request volume: ~50B/day = ~580K requests/sec average, peak ~2M/sec globally. Per-region peak is much lower — sharding by region gives each region ~50K-200K requests/sec to handle.
Unique query corpus: ~100M distinct queries account for >95% of all search traffic (Pareto distribution); the long tail of ~1B+ rare queries collectively accounts for <5%. The data structure only needs to serve the top 1-10M queries per region efficiently.
Storage for the prefix trie: ~10M entries × average 30 bytes (query string, frequency, rank metadata) = ~300 MB compressed. Easily fits in RAM per node. With replication and per-region copies (US, EU, APAC, etc.), total RAM across the fleet is ~5-10 GB. Query log for offline aggregation: ~5B queries × 200 bytes = ~1 TB/day, written to a streaming log for analytics.
High-level design
Two pipelines: serving (read path) and aggregation (write path). They are intentionally decoupled.
Serving path: client sends typed prefix to the autocomplete service. The service holds an in-memory prefix trie (or compressed equivalent — see deep dive) where each terminal node has a precomputed sorted list of top-N completions. Lookup: traverse the trie for the prefix, return the top-N at the prefix node. This is O(prefix length) for traversal and O(1) to read the precomputed top-N — both bounded and fast. The serving node is replicated per region for redundancy and latency; clients hit the nearest replica through a region-aware load balancer.
Aggregation path: every search query (or even every typed keystroke) emits an event to a streaming log. A batch aggregation pipeline runs on a rolling window — typically hourly — and computes (prefix, query, count) tuples for the top 1M queries. The aggregation produces a new version of the trie data, which is then loaded by the serving nodes in a rolling deploy.
The trie is rebuilt from scratch every hour (or every few minutes for trending-sensitive engines) rather than mutated in place. This keeps the serving nodes' data structure simple and immutable per version — readers don't need locks. The new version is published to an in-memory data plane that serving nodes pull and atomically swap into place.
Deep dive — the hard problem
Two deep dives: the data structure and the trending-update problem.
Data structure: a classical trie is the textbook answer but is memory-hungry — naïvely, each node holds 26+ pointers. Real systems use either a compressed trie (radix tree / Patricia trie — collapse single-child chains into a single node holding a substring) or a DAWG (directed acyclic word graph — share common suffixes across queries). For the top-1M query corpus, a compressed trie typically uses ~5× less memory than a basic trie. Each terminal node holds the top-10 completions precomputed; the precompute step picks the 10 highest-frequency descendants of that prefix node. Storing precomputed top-N at every node (not just leaves) is what makes lookup O(prefix length) — without precomputed lists, you'd need to traverse all descendants at query time.
Updates and freshness: rebuilding the trie hourly is fine for stable popularity but misses trending events (breaking news where a query spikes from zero to a million in 10 minutes). Two solutions. (1) Decoupled hot-trending overlay: a tiny secondary structure tracks queries that spiked in the last 10 minutes; serving nodes consult both and merge results. (2) Real-time aggregation: a streaming aggregation pipeline computes top-K with sketch data structures (Count-Min Sketch + heavy-hitters) on the live query stream; serving nodes pull updates every few minutes. (1) is simpler; (2) is what large engines actually run.
Third tradeoff: personalization vs caching. Pure personalized suggestions can't share cache entries across users — every user's results differ. Production systems split into a 'global suggestions' tier (cached, shared) and a 'personal re-ranking' tier (cheap, post-process the global top-50 with user features to pick a personalized top-10). Discuss this tradeoff: full personalization at autocomplete latency is expensive; the global + rerank pattern is the standard answer.
Finally, content safety. Autocomplete shouldn't suggest hate speech, illegal content, or product-defaming spam. Solution: maintain a blocklist applied at trie-build time (filter out blacklisted queries before they reach the trie) plus an ML safety classifier scoring marginal suggestions. The blocklist is the floor; the classifier handles the long tail. Mention both layers.
Common mistakes
- Building a basic trie without acknowledging memory cost at 10M+ entries — the compressed trie is the production answer
- Computing top-N at query time by traversing all descendants — kills latency on common prefixes ('a', 'th')
- Designing the aggregation pipeline as synchronous with serving — couples freshness to serving stability
- Forgetting trending — interviewers will ask 'what about breaking news queries'
- Skipping content-safety filtering — autocomplete is a known liability surface, and missing it is a hire-reduce signal
Likely follow-up questions
- How would you support autocomplete in 50 languages including non-Latin scripts (Chinese, Arabic)?
- How would you implement typo tolerance ('teh' → 'the')?
- What changes if you have to support 1B users typing simultaneously?
- How would you A/B test a new ranking model in autocomplete without harming the user experience?
- How would you support entity completion ('barack' → 'barack obama' with rich snippet) vs raw text?
Practice Design Search Autocomplete live with an AI interviewer
Free, no sign-up required. Get real-time feedback on your design.
Practice these liveFrequently asked questions
- Is Design Search Autocomplete a common interview question?
- Yes — it's a frequent senior+ system design at Google and Amazon, and the canonical 'show me you can design a data-structure-heavy system' question. Less common at Meta and snap, more common at search-product companies.
- Do I need to know the trie variant by name (radix, Patricia, DAWG)?
- Naming 'compressed trie' or 'radix tree' is enough. Drawing the exact compression algorithm is bonus signal; mostly interviewers want you to know why a plain trie wastes memory and to mention an alternative.
- How long is the Design Autocomplete round?
- 45–60 minutes. Senior rounds extend it with personalization, multilingual support, and content safety. Source: Glassdoor Google L5/L6 reports 2023–2024.
- Should I cover spell-correction in Design Autocomplete?
- If time permits in the last 10 minutes. Spell-correction adds an edit-distance lookup layer (e.g., a BK-tree) that runs in parallel with the trie lookup; merge results. Not required for hire, but signals depth.