Skip to main content

System Design Questions

Design Rate Limiter — System Design Interview Guide

Design Rate Limiter is a system-design interview that asks you to build a distributed component that allows or denies incoming requests based on per-key rate limits: a user, an API key, or an IP address is allowed N requests per window across a fleet of servers. The hard part is enforcing the limit globally without serializing every request through a central store.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Google
  • Amazon
  • Meta
  • LinkedIn
  • Microsoft

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Allow or deny each incoming request based on a configurable rate limit per key
  • Support multiple limit policies: per-second, per-minute, per-hour, per-day
  • Support multiple keying strategies: user ID, API key, IP address, endpoint, or combinations
  • Return rate-limit metadata in response headers (limit, remaining, reset time)
  • Configurable rejection behavior: hard reject with 429, queue, or degrade
  • Operate at the edge or as a sidecar in front of every service

Non-functional requirements

  • Decision latency: <2ms p99 added to each request
  • Accuracy: limits should be enforced within ~5% of configured (small over-allow during edge conditions is acceptable)
  • Scale: ~10M+ unique keys, ~100K+ decisions/sec per data center
  • Availability: 99.99%; the rate limiter must fail open or closed depending on policy, never disrupt the call path on its own
  • Storage: minimal — limiter state is ephemeral per-window, not durable history

Capacity estimation

Scale anchors: a single mid-sized API at peak handles ~100K requests/sec; a large platform handles ~1M+/sec across data centers. The rate limiter sits in front of every request, so its decision QPS equals the request QPS.

Unique keys: depends on keying strategy. Per-user limits on a 10M-user platform yield ~10M keys. Per-IP limits yield ~100M+ keys (mobile carrier NATs share IPs across thousands of users). Per-API-key limits yield ~1M keys for an API platform.

State size: each key needs a counter for each active window. A token-bucket-style limiter stores ~50 bytes per key (current token count, last refill timestamp, configured rate, configured burst). At 10M keys × 50 bytes = 500 MB of active state — easily in-memory per node.

Counter writes: every allowed request decrements the counter. At 100K req/sec that's 100K writes/sec to the limiter state. A single in-memory store node can handle ~1M counter operations/sec easily, but if state is shared across data center nodes, the cross-node coordination cost is the bottleneck (see deep dive).

Window rotation: fixed-window counters need O(num_keys) work each window boundary to reset. Sliding-window or token-bucket avoids this by lazy-rotation on each request. Production systems prefer the latter.

High-level design

The architecture depends on whether the rate limiter is centralized (one shared store all nodes consult) or distributed (each node enforces locally with periodic sync). Production systems use a hybrid.

Local-first hybrid: each application node maintains its own in-memory counters per key, enforcing limits for the slice of traffic it sees. A periodic sync (every 100-500ms) reports counter deltas to a shared coordination layer; the shared layer redistributes adjusted counters so the system converges on the global limit.

The rate-limit decision is made entirely in local memory per request: <1ms decision latency, no network call required. The sync is asynchronous and out of the request path.

Key routing: when 'per-user' limits are configured, requests for a given user can land on any application node. Two options. (1) Sticky routing: route all requests for user X to node N via consistent hashing; node N owns user X's counter completely; no sync needed. (2) Sharded sync: any node can serve any user, all nodes share counter state via the periodic sync. Sticky routing is cleaner and the standard answer when traffic distribution is even; sharded sync handles bursty single-user traffic better but adds sync cost.

For very-high-fan-out keys (a single global API key serving millions of requests/sec), even sticky routing can saturate one node. Production systems detect this and shard the heavy key across multiple nodes with per-shard sub-limits that sum to the global limit. Mention this as a degenerate case.

Algorithms: see deep dive for the choice between fixed-window, sliding-window, leaky-bucket, and token-bucket.

Fail-open vs fail-closed: if the rate limiter itself becomes unavailable, do we allow all requests (fail-open) or deny all (fail-closed)? Production default is fail-open — the rate limiter is a defense, not a dependency. Drilling into this tradeoff is a senior signal.

Deep dive — the hard problem

Two deep dives: the algorithm choice and the distributed coordination problem.

Algorithm choice — four canonical algorithms.

Fixed window: divide time into discrete buckets (e.g. each minute). Maintain a counter per key per bucket; allow the request if the current bucket's counter is below the limit, then increment. Simple, O(1) per request, but suffers from boundary bursting: a user can send N requests at minute-second-59 and another N at minute-second-00 — 2N in one second when the configured limit is N/minute.

Sliding window log: store a timestamp for every request in the window; the decision is 'count of timestamps in the last 60 seconds is below N'. Accurate but O(N) memory per key — at 10K req/min/key the state blows up.

Sliding window counter: store the previous bucket's count and the current bucket's count; estimate the count over the last window as (previous_count × (1 - elapsed_fraction)) + current_count. Approximate but bounded in memory; the standard answer for high-traffic systems.

Token bucket: each key has a bucket that fills at rate R tokens/sec up to capacity B. Each request consumes one token; if no tokens, deny. Allows bursts up to B requests, then sustained rate R. Production-favorite because it handles bursts well — most APIs want to allow a brief spike but cap the long-term rate.

Leaky bucket: similar to token bucket but with a queue interpretation; requests in excess of the rate are queued, not rejected. Useful for traffic-shaping (network routers, ingestion pipelines) but less common for API rate limiting.

Distributed coordination: the hardest part of the problem.

Option A: centralized in-memory store. All nodes consult a shared in-memory counter store (e.g. an in-memory cache cluster) per request. Decision is exact but requires a network round-trip per request — 1-2ms added latency, and the central store is a hot path that must scale linearly with QPS.

Option B: local enforcement, no sync. Each node enforces locally based on the traffic it sees. Decision is sub-millisecond and zero network calls, but the total system can overshoot by N_nodes × configured_limit when traffic is concentrated.

Option C: hybrid sync — local enforcement plus periodic aggregation. Each node enforces locally; every 100-500ms, all nodes sync deltas via a shared coordination layer; the layer redistributes counters to converge on the global limit. Decision is local (fast); accuracy is within ~5% of configured limits even under uneven traffic distribution. This is the standard production answer.

For very strict limits (e.g. 'exactly 100 requests/sec, never more'), option A is required despite latency. For 'approximately N requests/sec under load', option C is the practical answer.

Third tradeoff: the response. A denied request returns HTTP 429 with retry-after metadata. Production limiters also include rate-limit info in headers on every response (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients can self-pace. Mention this — the response shape is part of the design.

Fourth tradeoff: hierarchical limits. A user has a per-second limit, an API key has a per-day limit, an endpoint has a per-IP limit. Requests are checked against all applicable limits; deny if any fails. Production limiters compose the limits in a deterministic order. Mention this as a configuration concern.

Common mistakes

  • Defaulting to a synchronous centralized store on every request — adds latency to the entire call path
  • Using fixed-window without acknowledging the boundary-bursting problem
  • Forgetting the fail-open vs fail-closed decision — the rate limiter must not become a single point of failure for the system it's protecting
  • Treating the rate limiter as exact — at scale, the choice is between latency cost (centralized) and small over-allow (local), and exact is rarely worth the latency
  • Skipping response headers — well-behaved clients self-pace using rate-limit headers, reducing limiter pressure

Likely follow-up questions

  • How would your design support per-endpoint limits where each API endpoint has a different rate limit?
  • What changes if a single 'whale' API key generates 50% of all traffic on its own?
  • How would you implement a 'burst then rate' limit (allow 100 in any 1-sec window, then 50/sec sustained)?
  • How would your design handle a denial-of-service attack from 1M distinct IPs each sending 10 req/sec?
  • How would you support per-tier limits where free users get 1K/day, paid users 100K/day, enterprise users unlimited?

Practice Design Rate Limiter live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

How long is the Design Rate Limiter interview?
30-45 minutes typical. Often used as a warm-up before a harder problem. The expectation is one algorithm with tradeoffs, plus a clear distributed-state strategy.
Do I need to name every algorithm by name?
Name 2-3 (fixed window, token bucket, sliding window counter) and pick one with a reason. Listing all four without picking one is a weak signal.
Should I cover hierarchical limits (per-user × per-endpoint × per-IP)?
Mention them as a composition pattern in the last 5 minutes. They don't change the underlying architecture; they're a configuration concern.
What is the most important concept for Design Rate Limiter?
The latency-vs-accuracy tradeoff in distributed coordination. The senior signal hinges on whether the candidate proposes local enforcement with periodic sync (the hybrid) rather than defaulting to a synchronous centralized store on every request.