Design Rate Limiter — System Design Interview Guide
Design Rate Limiter is a system-design interview that asks you to build a distributed component that allows or denies incoming requests based on per-key rate limits: a user, an API key, or an IP address is allowed N requests per window across a fleet of servers. The hard part is enforcing the limit globally without serializing every request through a central store.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Amazon
- Meta
- Microsoft
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Allow or deny each incoming request based on a configurable rate limit per key
- Support multiple limit policies: per-second, per-minute, per-hour, per-day
- Support multiple keying strategies: user ID, API key, IP address, endpoint, or combinations
- Return rate-limit metadata in response headers (limit, remaining, reset time)
- Configurable rejection behavior: hard reject with 429, queue, or degrade
- Operate at the edge or as a sidecar in front of every service
Non-functional requirements
- Decision latency: <2ms p99 added to each request
- Accuracy: limits should be enforced within ~5% of configured (small over-allow during edge conditions is acceptable)
- Scale: ~10M+ unique keys, ~100K+ decisions/sec per data center
- Availability: 99.99%; the rate limiter must fail open or closed depending on policy, never disrupt the call path on its own
- Storage: minimal — limiter state is ephemeral per-window, not durable history
Capacity estimation
Scale anchors: a single mid-sized API at peak handles ~100K requests/sec; a large platform handles ~1M+/sec across data centers. The rate limiter sits in front of every request, so its decision QPS equals the request QPS.
Unique keys: depends on keying strategy. Per-user limits on a 10M-user platform yield ~10M keys. Per-IP limits yield ~100M+ keys (mobile carrier NATs share IPs across thousands of users). Per-API-key limits yield ~1M keys for an API platform.
State size: each key needs a counter for each active window. A token-bucket-style limiter stores ~50 bytes per key (current token count, last refill timestamp, configured rate, configured burst). At 10M keys × 50 bytes = 500 MB of active state — easily in-memory per node.
Counter writes: every allowed request decrements the counter. At 100K req/sec that's 100K writes/sec to the limiter state. A single in-memory store node can handle ~1M counter operations/sec easily, but if state is shared across data center nodes, the cross-node coordination cost is the bottleneck (see deep dive).
Window rotation: fixed-window counters need O(num_keys) work each window boundary to reset. Sliding-window or token-bucket avoids this by lazy-rotation on each request. Production systems prefer the latter.
High-level design
The architecture depends on whether the rate limiter is centralized (one shared store all nodes consult) or distributed (each node enforces locally with periodic sync). Production systems use a hybrid.
Local-first hybrid: each application node maintains its own in-memory counters per key, enforcing limits for the slice of traffic it sees. A periodic sync (every 100-500ms) reports counter deltas to a shared coordination layer; the shared layer redistributes adjusted counters so the system converges on the global limit.
The rate-limit decision is made entirely in local memory per request: <1ms decision latency, no network call required. The sync is asynchronous and out of the request path.
Key routing: when 'per-user' limits are configured, requests for a given user can land on any application node. Two options. (1) Sticky routing: route all requests for user X to node N via consistent hashing; node N owns user X's counter completely; no sync needed. (2) Sharded sync: any node can serve any user, all nodes share counter state via the periodic sync. Sticky routing is cleaner and the standard answer when traffic distribution is even; sharded sync handles bursty single-user traffic better but adds sync cost.
For very-high-fan-out keys (a single global API key serving millions of requests/sec), even sticky routing can saturate one node. Production systems detect this and shard the heavy key across multiple nodes with per-shard sub-limits that sum to the global limit. Mention this as a degenerate case.
Algorithms: see deep dive for the choice between fixed-window, sliding-window, leaky-bucket, and token-bucket.
Fail-open vs fail-closed: if the rate limiter itself becomes unavailable, do we allow all requests (fail-open) or deny all (fail-closed)? Production default is fail-open — the rate limiter is a defense, not a dependency. Drilling into this tradeoff is a senior signal.
Deep dive — the hard problem
Two deep dives: the algorithm choice and the distributed coordination problem.
Algorithm choice — four canonical algorithms.
Fixed window: divide time into discrete buckets (e.g. each minute). Maintain a counter per key per bucket; allow the request if the current bucket's counter is below the limit, then increment. Simple, O(1) per request, but suffers from boundary bursting: a user can send N requests at minute-second-59 and another N at minute-second-00 — 2N in one second when the configured limit is N/minute.
Sliding window log: store a timestamp for every request in the window; the decision is 'count of timestamps in the last 60 seconds is below N'. Accurate but O(N) memory per key — at 10K req/min/key the state blows up.
Sliding window counter: store the previous bucket's count and the current bucket's count; estimate the count over the last window as (previous_count × (1 - elapsed_fraction)) + current_count. Approximate but bounded in memory; the standard answer for high-traffic systems.
Token bucket: each key has a bucket that fills at rate R tokens/sec up to capacity B. Each request consumes one token; if no tokens, deny. Allows bursts up to B requests, then sustained rate R. Production-favorite because it handles bursts well — most APIs want to allow a brief spike but cap the long-term rate.
Leaky bucket: similar to token bucket but with a queue interpretation; requests in excess of the rate are queued, not rejected. Useful for traffic-shaping (network routers, ingestion pipelines) but less common for API rate limiting.
Distributed coordination: the hardest part of the problem.
Option A: centralized in-memory store. All nodes consult a shared in-memory counter store (e.g. an in-memory cache cluster) per request. Decision is exact but requires a network round-trip per request — 1-2ms added latency, and the central store is a hot path that must scale linearly with QPS.
Option B: local enforcement, no sync. Each node enforces locally based on the traffic it sees. Decision is sub-millisecond and zero network calls, but the total system can overshoot by N_nodes × configured_limit when traffic is concentrated.
Option C: hybrid sync — local enforcement plus periodic aggregation. Each node enforces locally; every 100-500ms, all nodes sync deltas via a shared coordination layer; the layer redistributes counters to converge on the global limit. Decision is local (fast); accuracy is within ~5% of configured limits even under uneven traffic distribution. This is the standard production answer.
For very strict limits (e.g. 'exactly 100 requests/sec, never more'), option A is required despite latency. For 'approximately N requests/sec under load', option C is the practical answer.
Third tradeoff: the response. A denied request returns HTTP 429 with retry-after metadata. Production limiters also include rate-limit info in headers on every response (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients can self-pace. Mention this — the response shape is part of the design.
Fourth tradeoff: hierarchical limits. A user has a per-second limit, an API key has a per-day limit, an endpoint has a per-IP limit. Requests are checked against all applicable limits; deny if any fails. Production limiters compose the limits in a deterministic order. Mention this as a configuration concern.
Common mistakes
- Defaulting to a synchronous centralized store on every request — adds latency to the entire call path
- Using fixed-window without acknowledging the boundary-bursting problem
- Forgetting the fail-open vs fail-closed decision — the rate limiter must not become a single point of failure for the system it's protecting
- Treating the rate limiter as exact — at scale, the choice is between latency cost (centralized) and small over-allow (local), and exact is rarely worth the latency
- Skipping response headers — well-behaved clients self-pace using rate-limit headers, reducing limiter pressure
Likely follow-up questions
- How would your design support per-endpoint limits where each API endpoint has a different rate limit?
- What changes if a single 'whale' API key generates 50% of all traffic on its own?
- How would you implement a 'burst then rate' limit (allow 100 in any 1-sec window, then 50/sec sustained)?
- How would your design handle a denial-of-service attack from 1M distinct IPs each sending 10 req/sec?
- How would you support per-tier limits where free users get 1K/day, paid users 100K/day, enterprise users unlimited?
Practice Design Rate Limiter live with an AI interviewer
Free, no sign-up required. Get real-time feedback on your design.
Practice these liveFrequently asked questions
- How long is the Design Rate Limiter interview?
- 30-45 minutes typical. Often used as a warm-up before a harder problem. The expectation is one algorithm with tradeoffs, plus a clear distributed-state strategy.
- Do I need to name every algorithm by name?
- Name 2-3 (fixed window, token bucket, sliding window counter) and pick one with a reason. Listing all four without picking one is a weak signal.
- Should I cover hierarchical limits (per-user × per-endpoint × per-IP)?
- Mention them as a composition pattern in the last 5 minutes. They don't change the underlying architecture; they're a configuration concern.
- What is the most important concept for Design Rate Limiter?
- The latency-vs-accuracy tradeoff in distributed coordination. The senior signal hinges on whether the candidate proposes local enforcement with periodic sync (the hybrid) rather than defaulting to a synchronous centralized store on every request.