Design API Gateway — System Design Interview Guide

Design API Gateway is a system-design interview question that asks you to build the north-south entry point for a multi-service platform: every external client request lands at the gateway, which authenticates the caller, applies rate limits, routes to the right backend service, transforms the request and response, and emits observability data (logs, metrics, traces). The hard part is keeping the gateway's added per-request latency to a few milliseconds while supporting thousands of distinct API contracts, per-tenant policies, and dynamic config reloads.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Amazon
  • Google
  • Microsoft
  • Netflix
  • Apple

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Route incoming requests to backend services based on hostname, path, headers, and method
  • Authenticate callers: API keys, OAuth tokens, mTLS client certs, signature schemes
  • Apply rate limits: per-API-key, per-tenant, per-IP, per-endpoint, with tiered quotas
  • Transform requests and responses: protocol translation (REST↔gRPC), header injection, payload reshaping
  • Emit observability: per-request logs, metrics, traces; track per-tenant usage for billing
  • Optional: response caching, request/response validation against schemas, plugin pipeline for custom logic

Non-functional requirements

  • Latency overhead: <5ms p99 added by gateway processing (excluding backend response time)
  • Throughput: ~1M+ requests/sec aggregate, ~10K+ requests/sec per instance
  • Scale: thousands of distinct API contracts, hundreds of thousands of tenants
  • Config push latency: new routes/policies live within 30 seconds
  • Availability: 99.99%; gateway failure takes down all external API traffic, so redundancy and fast failover across the fleet matter
  • Per-tenant isolation: a misbehaving tenant must not degrade other tenants (noisy-neighbor protection)

Capacity estimation

Scale anchors: ~1M req/sec at peak and ~10K req/sec per gateway instance (a conservative figure; many real gateways handle 50K+ per instance with tuning) give ~100 instances at full utilization, or ~100-200 once headroom for failures and deploys is included. The fleet sits behind a global edge CDN that terminates TLS and routes to the nearest gateway region.

Latency budget: total external-API latency budget is typically ~200-500ms p99 (depending on the API). The gateway eats ~5ms of that. Inside the gateway: ~1ms for routing decision (radix-tree lookup of the path against the route table), ~2ms for auth validation (token verify, often involving a cache lookup), ~1ms for rate-limit check (atomic counter increment), ~1ms for backend connection setup and request forwarding.

Memory: route table for 10K APIs × 100 routes each × ~1KB per route entry = ~1 GB hot route table in memory. Per-tenant state (rate-limit counters, recent-request tracking) at 100K tenants × ~1KB = ~100 MB. Easily fits in a single gateway instance's RAM.
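
The sizing figures in the paragraphs above can be re-derived in a few lines. This is only a back-of-envelope check; the 50% target utilization is an assumption added here to show one way the upper end of the 100-200 instance range could arise.

```python
# Back-of-envelope check of the capacity numbers quoted above.
PEAK_RPS = 1_000_000          # aggregate peak from the requirements
PER_INSTANCE_RPS = 10_000     # conservative per-instance throughput
TARGET_UTILIZATION = 0.5      # assumption: run instances at ~50% for headroom

min_instances = PEAK_RPS // PER_INSTANCE_RPS
instances_with_headroom = int(min_instances / TARGET_UTILIZATION)

# Per-stage latency budget in milliseconds (routing, auth, rate limit, forward).
latency_budget_ms = {"route": 1, "auth": 2, "rate_limit": 1, "forward": 1}
total_overhead_ms = sum(latency_budget_ms.values())

KB = 1_000
route_table_bytes = 10_000 * 100 * 1 * KB      # APIs x routes per API x ~1 KB
tenant_state_bytes = 100_000 * 1 * KB          # tenants x ~1 KB

print(min_instances, instances_with_headroom)   # 100 200
print(total_overhead_ms)                        # 5
print(route_table_bytes / 1e9, "GB")            # 1.0 GB
print(tenant_state_bytes / 1e6, "MB")           # 100.0 MB
```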

Config change rate: ~100 route or policy changes per minute across the fleet (new routes, deploy-time canary shifts, rate-limit adjustments). With 100-200 instances, the config-push fan-out is small — a central config service can push diffs to all instances in seconds.

Auth cache: most requests reuse an authenticated identity (a single token serves many requests from the same client). Auth-cache hit rate is typically 95%+ at steady state; cache miss costs ~10ms (token introspection against an identity provider). With a hit, auth check is <1ms (cache lookup + signature verify of cached JWT).
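
A quick calculation shows what these numbers imply. The sketch below uses the figures from the paragraph above (95% hit rate, under 1ms on a hit, about 10ms on a miss) and treats the hit cost as exactly 1ms, which is a simplifying upper bound. Under those assumptions the mean auth cost is about 1.45ms, but the 99th-percentile request is still a cache miss unless misses drop to 1% or less, which is worth raising when defending a p99 budget.

```python
def expected_auth_ms(hit_rate, hit_ms, miss_ms):
    """Mean auth cost when a fraction `hit_rate` of requests hit the cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

def p99_served_from_cache(hit_rate):
    """True when misses are rare enough (at most 1%) that p99 is a cache hit."""
    return hit_rate >= 0.99

mean_ms = expected_auth_ms(hit_rate=0.95, hit_ms=1.0, miss_ms=10.0)
print(round(mean_ms, 2))               # 1.45
print(p99_served_from_cache(0.95))     # False
print(p99_served_from_cache(0.995))    # True
```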

Rate-limit volume: every request increments a counter. At 1M req/sec, the rate-limit storage tier sees 1M ops/sec — this is the most write-heavy component of the gateway data path. Production designs use sharded in-memory counters with periodic reconciliation, not a per-request hit to a centralized store.
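
The "local counters with periodic reconciliation" idea can be sketched as follows. `SharedStore` is a hypothetical stand-in for whatever central counter store a real deployment uses, and window resets are left out to keep the example short. The point it illustrates is the tradeoff: each instance decides from its last known fleet-wide total plus its own unflushed increments, so the limit can be overshot by a bounded amount between syncs.

```python
from collections import defaultdict

class SharedStore:
    """Hypothetical central counter store; only touched during reconciliation."""
    def __init__(self):
        self.totals = defaultdict(int)

    def add_and_get(self, key, delta):
        self.totals[key] += delta
        return self.totals[key]

class LocalCounter:
    """Per-instance counter that answers from memory on the request path."""
    def __init__(self, store, limit):
        self.store = store
        self.limit = limit
        self.pending = defaultdict(int)      # increments not yet flushed
        self.last_global = defaultdict(int)  # fleet-wide total at last sync

    def allow(self, key):
        if self.last_global[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def reconcile(self):
        # Called on a timer, not per request: flush deltas, pull fresh totals.
        for key in list(self.pending):
            self.last_global[key] = self.store.add_and_get(key, self.pending[key])
            self.pending[key] = 0

store = SharedStore()
a, b = LocalCounter(store, limit=10), LocalCounter(store, limit=10)
admitted = sum(a.allow("tenant_1") for _ in range(6)) + sum(b.allow("tenant_1") for _ in range(6))
print(admitted)             # 12: both instances admitted 6 before syncing
a.reconcile(); b.reconcile()
print(b.allow("tenant_1"))  # False: b now sees the fleet-wide total of 12
```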

High-level design

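Four-layer pipeline per request: edge termination, gateway processing, plugin chain, and backend invocation (the forward step inside gateway processing). A separate control plane distributes configuration to the gateway fleet.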

Edge termination: a global edge CDN tier (separate from the gateway) terminates TLS at the closest geographic point of presence and forwards the decrypted HTTP request to a gateway region over the platform's internal backbone. Edge handles DDoS absorption and TLS offload; the gateway sees clean HTTP.

Gateway processing: stateless gateway instances arranged behind a regional load balancer. Each instance handles a request through the pipeline:

1. Parse and route: extract method/path/host/headers, look up the matching route in the route table (typically a radix tree for fast prefix matching). Route specifies the backend, the plugin chain, and per-route policies.
2. Authenticate: extract the credential (API key from header, bearer token, signature). Check the auth cache; on miss, validate against the identity service. Result: an authenticated principal (tenant_id, scopes).
3. Authorize: check per-route authorization (does this principal's scopes include access to this route?). Reject with 403 if not authorized.
4. Rate-limit: increment the per-(tenant, route) counter. If counter exceeds the configured quota for the principal's tier, reject with 429.
5. Transform request: apply route-specific transforms (e.g., translate REST request body to gRPC message, inject auth context headers for the backend, strip client-facing headers the backend shouldn't see).
6. Forward: pick a backend instance (gateway-internal service discovery or via a sidecar to the service mesh) and send the request. Hold the connection open for the response.
7. Transform response: similar to step 5 in reverse. Strip backend-internal headers, inject CORS headers, optionally reshape the body.
8. Observability: emit per-request log entry, increment per-(tenant, route) counters for usage/billing, emit a trace span.
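
Step 1 is the piece most often probed in follow-ups, so here is a minimal sketch of route lookup. It uses a per-segment trie rather than a compressed radix tree (same idea, simpler code), supports `{name}` path parameters, and prefers a literal segment over a parameter with no backtracking. The route patterns and the `backend` field are illustrative assumptions.

```python
class RouteNode:
    def __init__(self):
        self.children = {}       # literal segment -> RouteNode
        self.param_child = None  # (param_name, RouteNode) for "{name}" segments
        self.handlers = {}       # HTTP method -> route config

class Router:
    def __init__(self):
        self.root = RouteNode()

    def add(self, method, pattern, route):
        node = self.root
        for seg in pattern.strip("/").split("/"):
            if seg.startswith("{") and seg.endswith("}"):
                if node.param_child is None:
                    node.param_child = (seg[1:-1], RouteNode())
                node = node.param_child[1]
            else:
                node = node.children.setdefault(seg, RouteNode())
        node.handlers[method] = route

    def match(self, method, path):
        """Return (route, params) or None; cost grows with path depth, not table size."""
        node, params = self.root, {}
        for seg in path.strip("/").split("/"):
            if seg in node.children:
                node = node.children[seg]
            elif node.param_child is not None:
                name, node = node.param_child[0], node.param_child[1]
                params[name] = seg
            else:
                return None
        route = node.handlers.get(method)
        return (route, params) if route is not None else None

router = Router()
router.add("GET", "/v1/users/{id}", {"backend": "users-svc"})
router.add("GET", "/v1/users/me", {"backend": "profile-svc"})
print(router.match("GET", "/v1/users/42"))   # ({'backend': 'users-svc'}, {'id': '42'})
print(router.match("GET", "/v1/users/me"))   # ({'backend': 'profile-svc'}, {})
print(router.match("POST", "/v1/users/42"))  # None
```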

Plugin chain: at multiple points in the pipeline (pre-auth, post-auth, pre-forward, post-response), pluggable middleware can run. Plugins are how the gateway extends — custom auth schemes, custom rate-limit logic, request validation against an OpenAPI schema, response caching, JWT-to-internal-token exchange. Plugins are configured per-route, not globally, so each API can opt in to its needed middleware without paying the cost on unrelated routes.
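
A per-route plugin chain can be sketched as a small registry keyed by the four hook points named above. The plugin functions, the context dictionary shape, and the convention that a plugin returns a status code to short-circuit are all assumptions made for illustration.

```python
PHASES = ("pre_auth", "post_auth", "pre_forward", "post_response")

class PluginChain:
    """Plugins attached to one route, grouped by pipeline phase."""
    def __init__(self):
        self.by_phase = {phase: [] for phase in PHASES}

    def register(self, phase, plugin):
        if phase not in self.by_phase:
            raise ValueError(f"unknown phase: {phase}")
        self.by_phase[phase].append(plugin)

    def run(self, phase, ctx):
        """Run plugins in order; a non-None return value stops the chain."""
        for plugin in self.by_phase[phase]:
            status = plugin(ctx)
            if status is not None:
                return status
        return None

def require_json(ctx):
    # Hypothetical validation plugin: reject non-JSON request bodies.
    if ctx["headers"].get("content-type") != "application/json":
        return 415
    return None

def tag_request(ctx):
    # Hypothetical transform plugin: inject a header for the backend.
    ctx["headers"]["x-request-source"] = "gateway"
    return None

orders_chain = PluginChain()
orders_chain.register("pre_forward", require_json)
orders_chain.register("pre_forward", tag_request)
health_chain = PluginChain()   # this route opts in to nothing

print(orders_chain.run("pre_forward", {"headers": {"content-type": "text/plain"}}))  # 415
print(health_chain.run("pre_forward", {"headers": {}}))                              # None
```

Because each route owns its chain, the health route above executes zero plugin calls, which is the cost property the paragraph describes.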

Control plane: stores the route table, plugin configs, rate-limit policies, and auth provider config. Changes propagate to gateway instances via a streaming subscription (similar to the service-mesh control plane). Gateway instances run on the last cached config if the control plane is unreachable — failing closed on config would block all traffic and is unacceptable.
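
The "keep serving the last cached config" behavior might look like the sketch below. `fetch` is a hypothetical callable standing in for the control-plane subscription, and the monotonically increasing `version` field is an assumption; a real gateway would typically also alert when the config has been stale for too long.

```python
class ConfigHolder:
    """Holds the active config and keeps it when the control plane is down."""
    def __init__(self, initial):
        self.active = initial
        self.stale = False

    def refresh(self, fetch):
        try:
            new = fetch()
        except ConnectionError:
            self.stale = True      # keep serving the last known-good config
            return False
        if new["version"] > self.active["version"]:
            self.active = new      # swap in the newer config
        self.stale = False
        return True

def control_plane_down():
    raise ConnectionError("control plane unreachable")

holder = ConfigHolder({"version": 1, "routes": ["/v1/users"]})
holder.refresh(lambda: {"version": 2, "routes": ["/v1/users", "/v1/orders"]})
holder.refresh(control_plane_down)
print(holder.active["version"], holder.stale)   # 2 True
```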

Deep dive — the hard problem

Two deep dives: auth latency budget management, and per-tenant isolation against noisy neighbors. Three further tradeoffs follow: response caching, plugin pipeline cost, and back-pressure.

Auth latency budget. The gateway can't add 100ms to every request validating tokens. The auth path must be optimized aggressively.

Layered auth cache. The first layer is in-process per-instance: an LRU cache of (token_hash → validated_principal, expires_at). A cache hit costs <100µs (a hashmap lookup plus a signature verify on the cached JWT to ensure the cached entry wasn't tampered with). Hit rates of 95%+ are typical because real workloads have client-side connection pooling and request batching that reuse the same token many times.

The second layer is a shared regional cache (an in-memory data store fronting the identity service). A miss in L1 falls through to L2; a miss in L2 falls through to the identity service. With this setup, hot tokens (the most-used credentials) are served from L1 even at very high QPS, so only cold tokens ever reach L2 or the identity service.
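
A minimal sketch of the two-layer lookup, under stated assumptions: L1 is an in-process LRU with expiry, L2 is modeled as a plain dictionary standing in for the regional cache, and `introspect` is a hypothetical callable representing the identity service. The `stats` dictionary only exists to show which layer answered.

```python
import hashlib
from collections import OrderedDict

class L1Cache:
    """In-process LRU: token_hash -> (principal, expires_at)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key, now):
        item = self.entries.get(key)
        if item is None:
            return None
        principal, expires_at = item
        if expires_at <= now:
            del self.entries[key]          # expired entries are never served
            return None
        self.entries.move_to_end(key)      # mark as most recently used
        return principal

    def put(self, key, principal, expires_at):
        self.entries[key] = (principal, expires_at)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

def authenticate(token, now, l1, l2, introspect, stats):
    key = hashlib.sha256(token.encode()).hexdigest()
    principal = l1.get(key, now)
    if principal is not None:
        stats["l1"] += 1
        return principal
    item = l2.get(key)
    if item is not None and item[1] > now:
        stats["l2"] += 1
    else:
        stats["idp"] += 1
        item = introspect(token)           # (principal, expires_at) or None
        if item is None:
            return None
        l2[key] = item
    l1.put(key, item[0], item[1])
    return item[0]

stats = {"l1": 0, "l2": 0, "idp": 0}
regional = {}
instance_a, instance_b = L1Cache(1000), L1Cache(1000)
idp = lambda token: ("tenant_42", 3600.0) if token == "good-token" else None
authenticate("good-token", 0.0, instance_a, regional, idp, stats)   # identity service
authenticate("good-token", 1.0, instance_a, regional, idp, stats)   # L1 hit
authenticate("good-token", 2.0, instance_b, regional, idp, stats)   # L2 hit
print(stats)   # {'l1': 1, 'l2': 1, 'idp': 1}
```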

JWT vs opaque tokens. Bearer tokens come in two flavors. JWTs are self-contained: the signature can be verified locally without network calls to the identity provider, so a JWT validation is purely CPU-bound (sub-millisecond). Opaque tokens require introspection against the identity provider (a network call). JWT-style is much faster on the gateway side but has tradeoffs around revocation (revoked JWTs are valid until expiry unless the gateway checks a denylist) and token size (JWTs are larger than opaque tokens, eating header bandwidth).

Production pattern: prefer JWTs for high-volume APIs where revocation latency of minutes is acceptable. Maintain a short-lived denylist (in-memory, refreshed every 30 seconds) for forcibly revoked JWTs. Use opaque tokens for sensitive operations where instant revocation matters.
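
To make "local verification plus a denylist" concrete, here is a standard-library sketch. It uses HS256 (HMAC-SHA256) only because that needs no third-party packages; production gateways more commonly verify asymmetric signatures (RS256 or ES256) through a vetted JWT library. Keying the denylist on the `jti` claim is an assumption for this example, and `sign_hs256` exists only to produce a test token.

```python
import base64
import hashlib
import hmac
import json

def b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_hs256(claims, secret):
    """Test helper: build a signed token so the verifier has something to check."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"

def verify_hs256(token, secret, now, denylist):
    """Return the claims if the token is valid, unexpired, and not revoked."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        sig = b64url_decode(sig_b64)
    except (ValueError, TypeError):
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":       # never trust "none" or a swapped alg
        return None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None
    if claims.get("jti") in denylist:      # in-memory set, refreshed periodically
        return None
    return claims

secret = b"demo-secret"
token = sign_hs256({"sub": "tenant_42", "exp": 2000, "jti": "abc"}, secret)
print(verify_hs256(token, secret, now=1000, denylist=set()))     # the claims dict
print(verify_hs256(token, secret, now=1000, denylist={"abc"}))   # None (revoked)
```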

Third trick: signature-verified-once. After the first signature verify of a JWT (sub-millisecond but not free), cache the parsed claims in L1 keyed by a cryptographic hash of the token. Subsequent requests with the same token skip the signature verify entirely: hash the token, look it up, and use the cached claims. Tradeoff: a cached entry must never outlive the token's own expiry, and it must be invalidated if the JWT's underlying signing key rotates. The discipline is the same as mTLS cert rotation in a service mesh.
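
A sketch of the verified-once cache, including the key-rotation invalidation. The `verify` callable is a hypothetical stand-in for the real signature check and is assumed to return the claims together with the ID of the signing key (`kid`); the cache is unbounded here purely for brevity.

```python
import hashlib

class VerifiedClaimsCache:
    def __init__(self, verify):
        self.verify = verify   # callable(token) -> (claims, kid) or None
        self.entries = {}      # sha256(token) -> (claims, kid)

    def claims_for(self, token, now):
        key = hashlib.sha256(token.encode()).digest()
        hit = self.entries.get(key)
        if hit is not None:
            claims, _kid = hit
            if claims["exp"] > now:
                return claims              # no signature verify on this path
            del self.entries[key]          # cached entry outlived the token
        result = self.verify(token)
        if result is None:
            return None
        claims, kid = result
        if claims["exp"] <= now:
            return None
        self.entries[key] = (claims, kid)
        return claims

    def on_key_rotated(self, kid):
        """Drop every entry that was verified with the retired signing key."""
        self.entries = {k: v for k, v in self.entries.items() if v[1] != kid}

verify_calls = []
def slow_verify(token):
    verify_calls.append(token)
    return ({"sub": "tenant_42", "exp": 100}, "key-2024") if token == "tok" else None

cache = VerifiedClaimsCache(slow_verify)
cache.claims_for("tok", now=1)
cache.claims_for("tok", now=2)
print(len(verify_calls))        # 1: second request skipped the verify
cache.on_key_rotated("key-2024")
cache.claims_for("tok", now=3)
print(len(verify_calls))        # 2: rotation forced a fresh verify
```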

Per-tenant isolation against noisy neighbors. The biggest operational pain at gateway scale is one tenant's bad behavior degrading service for everyone else. Three mechanisms.

Per-tenant rate limits with hard quotas. Every tenant has a per-API quota (e.g., 1000 req/sec for tier-basic, 10000 req/sec for tier-pro). Counters are tracked per-tenant, and exceeding triggers 429 responses. This caps the request load any single tenant can produce.
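
One common way to enforce such quotas is a token bucket per (tenant, route); the text above only says "counters", so treat the algorithm choice as an assumption. The tier values mirror the example quotas, the burst size equal to one second of quota is also an assumption, and the clock is passed in explicitly so the behavior is deterministic.

```python
TIER_RPS = {"basic": 1000, "pro": 10000}   # example quotas from the text above

class TokenBucket:
    def __init__(self, rate_per_sec, now):
        self.rate = rate_per_sec
        self.capacity = rate_per_sec        # assumption: burst = one second of quota
        self.tokens = float(rate_per_sec)
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class TenantLimiter:
    def __init__(self, tier_of):
        self.tier_of = tier_of              # tenant_id -> tier name
        self.buckets = {}                   # (tenant_id, route) -> TokenBucket

    def check(self, tenant_id, route, now):
        """Return 200 if the request is within quota, 429 otherwise."""
        key = (tenant_id, route)
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(TIER_RPS[self.tier_of[tenant_id]], now)
        return 200 if self.buckets[key].allow(now) else 429

limiter = TenantLimiter({"tenant_basic": "basic", "tenant_pro": "pro"})
statuses = [limiter.check("tenant_basic", "/v1/orders", now=0.0) for _ in range(1001)]
print(statuses.count(200), statuses[-1])                     # 1000 429
print(limiter.check("tenant_pro", "/v1/orders", now=0.0))    # 200: separate bucket
```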

Per-tenant connection limits. A misbehaving client could hold open thousands of connections, each making slow requests, exhausting the gateway's connection pool. Set a per-tenant connection cap. Reject new connections from the same tenant beyond the cap with a 503 carrying a Retry-After header.

Backend-side per-tenant queuing. When a backend service is slow (e.g., a database degradation affecting tenant_42's specific shard), tenant_42's requests queue up at the backend and the gateway's connection pool to that backend fills with tenant_42 traffic, starving other tenants. Mitigation: maintain per-tenant outbound connection budgets at the gateway. If a tenant has more than N in-flight backend requests, reject further requests from that tenant with 503 — better to fail one tenant fast than to drag down the whole platform.
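
The per-tenant outbound budget amounts to a small in-flight counter checked before forwarding. The sketch below is single-threaded for clarity (a real gateway would need atomic updates or a per-worker budget), and the class and function names are illustrative.

```python
from collections import defaultdict

class TenantInflightBudget:
    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = defaultdict(int)    # tenant_id -> requests awaiting a backend

    def try_acquire(self, tenant_id):
        if self.inflight[tenant_id] >= self.max_inflight:
            return False
        self.inflight[tenant_id] += 1
        return True

    def release(self, tenant_id):
        if self.inflight[tenant_id] > 0:
            self.inflight[tenant_id] -= 1

def forward(budget, tenant_id, call_backend):
    """Fail fast with 503 when the tenant already has too many requests in flight."""
    if not budget.try_acquire(tenant_id):
        return 503
    try:
        return call_backend()
    finally:
        budget.release(tenant_id)           # always free the slot, even on error

budget = TenantInflightBudget(max_inflight=2)
budget.try_acquire("tenant_42")
budget.try_acquire("tenant_42")             # tenant_42 is now stuck on a slow shard
print(forward(budget, "tenant_42", lambda: 200))   # 503: over budget, fail fast
print(forward(budget, "tenant_7", lambda: 200))    # 200: other tenants unaffected
```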

Third tradeoff: caching. A subset of GET responses are cacheable (lookup-heavy, low-volatility data). The gateway can cache these per-route, keyed by request signature (path + query + relevant headers), with a TTL per the route's cache policy. Cache hits skip the backend entirely — massive throughput win for cacheable APIs. Cache invalidation when the underlying data changes is the hard part; production setups typically rely on short TTLs (10-60 seconds) rather than explicit invalidation for most APIs.
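
A response cache keyed by request signature with a per-route TTL can be sketched as below. Which headers participate in the key (`vary_headers`) is an assumption that would come from the route's cache policy, header names are assumed to be lowercased already, and the clock is injected for determinism.

```python
import hashlib

def cache_key(method, path, query, headers, vary_headers):
    """Build a key from method, path, sorted query params, and selected headers."""
    query_part = "&".join(f"{k}={v}" for k, v in sorted(query.items()))
    header_part = "|".join(f"{h}={headers.get(h, '')}" for h in sorted(vary_headers))
    raw = "\n".join([method, path, query_part, header_part])
    return hashlib.sha256(raw.encode()).hexdigest()

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}                   # key -> (response, stored_at)

    def get(self, key, now):
        item = self.entries.get(key)
        if item is None:
            return None
        response, stored_at = item
        if now - stored_at >= self.ttl:
            del self.entries[key]           # expired: caller goes to the backend
            return None
        return response

    def put(self, key, response, now):
        self.entries[key] = (response, now)

cache = TTLCache(ttl_seconds=30)
key = cache_key("GET", "/v1/products", {"page": "2", "sort": "price"},
                {"accept-language": "en"}, vary_headers=["accept-language"])
cache.put(key, {"status": 200, "body": "..."}, now=0)
print(cache.get(key, now=10))   # served from cache, backend skipped
print(cache.get(key, now=31))   # None: TTL elapsed
```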

Fourth tradeoff: plugin pipeline cost. Every plugin adds latency. A route with 10 plugins pays 10× the per-plugin cost. Production discipline: keep the hot path lean (auth + rate-limit + route + forward are the only mandatory plugins); add transformation and validation plugins per-route only where they're needed. Avoid global plugins that run on every request; they're a tax on every API.

Fifth: back-pressure. When a backend is slow, requests pile up at the gateway. Without back-pressure, the gateway's memory grows unbounded and it eventually crashes. Hard upper bounds on per-backend in-flight requests (with 503 responses when exceeded) are the standard defense. This is a different concern from per-tenant queuing — it's per-backend-service, protecting the gateway from a single bad backend regardless of which tenants are calling it.
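
The per-backend cap can be expressed with a non-blocking semaphore per backend, so a request that cannot get a slot is rejected immediately instead of queuing in gateway memory. The backend names and limits here are illustrative assumptions.

```python
import threading

class BackendLimiter:
    """Hard upper bound on in-flight requests per backend service."""
    def __init__(self, limits):
        self.slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    def call(self, backend, send):
        slot = self.slots[backend]
        if not slot.acquire(blocking=False):   # never wait: shed load instead
            return 503
        try:
            return send()
        finally:
            slot.release()

limiter = BackendLimiter({"orders-svc": 1, "users-svc": 100})

# Simulate a second request arriving while the first is still in flight.
def first_request():
    return limiter.call("orders-svc", lambda: 200)

print(limiter.call("orders-svc", first_request))    # 503: the only slot was taken
print(limiter.call("orders-svc", lambda: 200))      # 200: slot was released
```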

Common mistakes

  • Treating auth as a per-request call to the identity service — at 1M req/sec the identity service melts; layered caches and JWT-style local validation are mandatory
  • Skipping per-tenant isolation — without per-tenant connection caps and rate limits, one noisy tenant can take down service for everyone
  • Globalizing every plugin: running response validation on every API call when only 10% of routes need it makes the other 90% of traffic pay its CPU and latency cost for nothing
  • Forgetting back-pressure — when a backend slows, gateway memory grows until the gateway crashes; hard caps on in-flight per-backend are required
  • Conflating north-south (this question) with east-west (service mesh) — they have different latency budgets, different identity models, and different failure modes

Likely follow-up questions

  • How would you support GraphQL APIs where a single request can fan out to many backends in one operation?
  • How would you implement API versioning at the gateway — old clients calling v1 while new clients call v2, with the gateway translating?
  • What changes if you need to support WebSocket and other long-lived connections in addition to short-lived REST/gRPC?
  • How would you debug a latency regression in the gateway — is it auth, rate-limit, a misbehaving plugin, a slow backend, or the network?
  • How would you handle a sudden 100x burst from one tenant (legitimate traffic spike) — block them as a noisy neighbor, or scale to absorb it?

Frequently asked questions

Isn't API Gateway just a reverse proxy?
A reverse proxy is the bottom layer. API Gateway adds auth, rate-limiting, transformation, plugin pipeline, multi-tenant policy enforcement, and observability — all of which a plain reverse proxy doesn't do natively. Production gateways are typically built on top of a high-performance proxy (Envoy, NGINX) with the gateway logic added as plugins or extensions.
How is API Gateway different from Service Mesh?
API Gateway handles north-south traffic (external clients calling into the platform) with public-internet auth schemes (API keys, OAuth, mTLS) and rate-limiting per tenant. Service Mesh handles east-west traffic (services inside the platform calling each other) with internal identities (workload mTLS) and resilience policies (retries, circuit breakers). Different volume profiles, latency budgets, and threat models. Most production platforms run both.
Do I need to name a specific gateway product?
Mentioning the proxy-plus-plugin-pipeline architecture and naming the high-performance proxies that gateways are commonly built on (Envoy, NGINX, HAProxy) is enough. Going deeper on specific gateway products (Kong, AWS API Gateway, Apigee) is an optional bonus.
What is the most important concept for Design API Gateway?
Auth latency budget plus per-tenant isolation. The senior signal hinges on (a) recognizing that naive per-request token introspection doesn't scale and proposing layered caching with JWT-local-verify, and (b) explaining how the gateway prevents a noisy tenant from degrading other tenants through per-tenant connection caps and backend-side outbound budgets.