Design Autoscaling System — System Design Interview Guide

Design Autoscaling System is a system-design interview question that asks you to build the infrastructure that automatically resizes a workload's capacity in response to demand: scale out when load rises, scale in when load drops, scale up resources per instance when needed, and do all of it without dropping traffic or thrashing. The hard part is reacting fast enough to absorb traffic spikes yet slowly enough to avoid oscillation, and combining horizontal and vertical scaling without conflict.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Amazon
  • Google
  • Microsoft
  • Netflix
  • Uber

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Horizontal scaling: increase or decrease the number of pod/instance replicas based on observed load
  • Vertical scaling: adjust per-pod CPU and memory requests/limits based on actual usage
  • Support multiple scaling signals: CPU utilization, memory, request rate, queue depth, custom application metrics
  • Pluggable scaling policies per workload: aggressive (web tier) vs conservative (batch jobs)
  • Predictive scaling: pre-warm capacity for known traffic patterns (e.g., daily peak at 9am, annual Black Friday)
  • Cooldown periods to prevent oscillation; safe scale-down with connection draining

Non-functional requirements

  • Scale-out latency: from breach of scale-up threshold to new pods serving traffic, <60 seconds p95
  • Scale-in latency: less urgent; safe to take 5-10 minutes for orderly drain
  • Decision frequency: scaling decisions every 15-30 seconds per workload
  • Scale: ~10K workloads, ~1M pods at peak, decisions across the fleet evaluated continuously
  • Cost efficiency: target 70-80% utilization average across the fleet; over-provisioning is often the largest source of wasted cloud spend
  • Availability: autoscaler failure must NEVER cause running pods to be killed; safe fallback is 'do nothing'

Capacity estimation

Scale anchors: ~10K workloads (each workload is a service deployment), ~100 pods average per workload, ~1M pods at peak across the fleet. With scaling decisions every 15 seconds per workload: ~10K decisions / 15 sec = ~700 decisions/sec. Modest by ingest standards but each decision has fan-out (querying metrics, checking quotas, calling orchestrator API).

Metric pipeline volume: each pod emits ~10 core metrics every 15 seconds = ~700K metric events/sec at fleet scale. The autoscaler reads aggregates (per-workload averages and percentiles), not raw events, so it consumes ~10K aggregate metrics/sec — manageable.
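The two headline rates above are simple enough to check directly. A small sketch of the arithmetic follows; the constant names are illustrative, and the values are the scale anchors quoted above. The exact results round to the ~700 and ~700K figures in the text.

```python
# Back-of-envelope fleet numbers, taken from the scale anchors above.
WORKLOADS = 10_000
PODS_AT_PEAK = 1_000_000
DECISION_INTERVAL_S = 15
METRICS_PER_POD = 10
EMIT_INTERVAL_S = 15


def per_second(count: int, interval_s: int) -> float:
    """Convert 'count events every interval_s seconds' into events per second."""
    return count / interval_s


# One scaling decision per workload per interval.
decisions_per_sec = per_second(WORKLOADS, DECISION_INTERVAL_S)

# Every pod emits METRICS_PER_POD samples per interval.
raw_events_per_sec = per_second(PODS_AT_PEAK * METRICS_PER_POD, EMIT_INTERVAL_S)

if __name__ == "__main__":
    print(f"decisions/sec:  {decisions_per_sec:,.0f}")
    print(f"raw events/sec: {raw_events_per_sec:,.0f}")
```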

Scaling action cost: when scaling out, the orchestrator pulls a container image (often hundreds of MB), schedules onto a node, allocates resources, starts the container, runs the readiness probe. End-to-end ~30-90 seconds for a cold start, ~5-15 seconds if the image is cached on the target node. This is the bottleneck on scale-out latency.

Cold-start budget: a sudden 2x traffic spike on a workload sized at 100 pods needs ~100 new pods within minutes. If new-pod cold-start is 60 seconds and you can start 10 pods/second (limited by image-pull parallelism on the target nodes), you can launch up to 600 pods per minute, each becoming ready about 60 seconds after launch, so the 100 extra pods are serving within roughly 70 seconds. That is enough for most spikes; sustained or larger spikes need warm pools.
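A rough timing model for this budget can be sketched as below. It assumes pods launch at a steady rate and that every pod takes the same fixed cold-start time; both are simplifications, and the function name is hypothetical.

```python
def seconds_until_all_ready(new_pods: int,
                            start_rate_per_s: float,
                            cold_start_s: float) -> float:
    """Time from the scaling decision until the last new pod is serving.

    Assumes a steady launch rate and a fixed per-pod cold-start time.
    """
    if start_rate_per_s <= 0:
        raise ValueError("start_rate_per_s must be positive")
    # Time needed just to issue all the launches.
    launch_window_s = new_pods / start_rate_per_s
    # The last pod launched still needs a full cold start before it is ready.
    return launch_window_s + cold_start_s


if __name__ == "__main__":
    # The 2x spike from the text: 100 extra pods, 10 pods/s, 60 s cold start.
    print(seconds_until_all_ready(100, 10, 60))
    # A full minute of launches (600 pods) at the same rate.
    print(seconds_until_all_ready(600, 10, 60))
```

Under these assumptions the first new pod starts serving only after one full cold start, which is the gap that warm pools or predictive scaling are meant to cover.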

Cost sensitivity: over-provisioning often dominates the cloud bill. A 1000-pod workload running at 30% CPU leaves ~70% of its paid-for compute idle; at a 75% utilization target the same load fits on about 400 pods. The autoscaler's job is to push average utilization toward 70-80% without crossing into the 'one bad spike kills us' zone.

High-level design

Three layers: metric pipeline, decision engine, and action executor.

Metric pipeline: every pod emits utilization metrics (CPU, memory) and application metrics (request rate, queue depth, custom counters) to a metrics aggregation system. The autoscaler subscribes to per-workload aggregates: average CPU across all pods in workload X over the last 30 seconds, p95 latency, queue depth.

Decision engine: a stateful service that, every 15-30 seconds per workload, runs the scaling algorithm and decides whether to add, remove, or hold steady. Inputs: current pod count, target metric vs observed metric, last scaling action timestamp (for cooldown enforcement), workload-specific policy.

The core algorithm — proportional scaling — is straightforward: desired_replicas = ceil(current_replicas × (observed_metric / target_metric)). If target CPU is 70% and observed is 105%, desired = current × 1.5. Round up, clamp to the workload's min/max replicas, and compare to current — if different by more than a tolerance band (e.g., ±10%), trigger an action.
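The formula, clamp, and tolerance check described above can be sketched as a single function. This is a minimal illustration rather than any orchestrator's actual implementation; the function and parameter names are assumptions, and the tolerance is applied to the relative change in replica count as the paragraph describes.

```python
import math


def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.10) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped.

    Returns `current` unchanged when the proposed change is within the
    tolerance band, so small metric noise does not trigger an action.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    desired = math.ceil(current * (observed / target))
    # Clamp to the workload's configured bounds.
    desired = max(min_replicas, min(max_replicas, desired))
    # Ignore changes smaller than the tolerance band.
    if abs(desired - current) / current <= tolerance:
        return current
    return desired


if __name__ == "__main__":
    # Target CPU 70%, observed 105%: 10 pods become 15.
    print(desired_replicas(10, observed=105, target=70,
                           min_replicas=2, max_replicas=50))
```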

Cooldown enforcement: after a scale-out, lock the workload out of scale-in for several minutes (typically 5-10) so a momentary dip in load doesn't immediately remove the pods you just added. Scale-out lockout is much shorter (30-60 seconds) — you want to be able to react fast on a real spike.
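A cooldown check of this kind might look like the sketch below. It tracks a single "last scaling action" timestamp and applies a long lockout to scale-in and a short one to scale-out. The class and function names, and the specific default values (300 s and 30 s, picked from the ranges above), are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CooldownPolicy:
    scale_in_lockout_s: float = 300.0   # assumed: 5 minutes
    scale_out_lockout_s: float = 30.0   # assumed: 30 seconds


def is_action_allowed(direction: str, now_s: float,
                      last_action_at_s: Optional[float],
                      policy: CooldownPolicy) -> bool:
    """Return True if a scaling action in `direction` may run at `now_s`."""
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    if last_action_at_s is None:
        # No previous action recorded: nothing to cool down from.
        return True
    elapsed_s = now_s - last_action_at_s
    lockout_s = (policy.scale_out_lockout_s if direction == "out"
                 else policy.scale_in_lockout_s)
    return elapsed_s >= lockout_s
```

For example, 100 seconds after a scale-out, a further scale-out is allowed but a scale-in is still locked out.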

Action executor: calls the orchestrator API to set the replica count (in Kubernetes, the scale subresource of a Deployment; other orchestrators expose an equivalent). The orchestrator handles the actual pod creation/deletion, scheduling, and rollout.

Vertical scaling runs as a separate flow: a recommender service observes pod resource usage over a longer window (hours to days), computes a recommended CPU/memory request based on the usage distribution (typically the 95th percentile plus a buffer), and emits a recommendation. Applying the recommendation requires restarting the pod with new resource requests. That is a high-risk operation, so it is typically applied during deployment rollouts rather than continuously.
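The "percentile plus a buffer" recommendation can be sketched as follows. The nearest-rank percentile method and the 15% default buffer are assumptions chosen for illustration; the text specifies neither.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(usage_samples: Sequence[float],
                      pct: float = 95.0,
                      buffer_fraction: float = 0.15) -> float:
    """Recommended resource request: usage percentile plus headroom.

    `usage_samples` would be per-pod usage over hours to days, in
    whatever unit the request uses (e.g. CPU millicores).
    """
    return percentile(usage_samples, pct) * (1.0 + buffer_fraction)
```

Sizing to a high percentile rather than the maximum keeps a single outlier sample from inflating every pod's request.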

Critical separation: horizontal and vertical scaling on the same workload at the same time can conflict. If the horizontal autoscaler (HPA, in Kubernetes terms) scales out because CPU is high while the vertical autoscaler (VPA) simultaneously raises CPU per pod, you double-scale. Production rule: pick one as the primary scaling axis for a given workload and use the other only for periodic right-sizing (e.g., HPA for elasticity, VPA recommendations applied weekly).

Deep dive — the hard problem

Four deep dives: predictive vs reactive scaling, scale-down hysteresis, connection draining, and custom metrics.

Predictive vs reactive scaling. Pure reactive scaling waits for a metric to breach a threshold before adding capacity. This works for gradual changes but fails on sharp spikes: by the time CPU hits 90%, the spike has already overwhelmed the existing pods and response latency has degraded. The delay between 'metric breached' and 'new pods serving traffic' is the scaling lag.

Production setups layer predictive scaling on top of reactive. The predictive layer learns daily/weekly patterns from historical traffic: every Tuesday at 9am, the API tier serves 3x the overnight traffic. The predictor adds capacity at 8:55am in anticipation, so the spike at 9am is absorbed without breaching the reactive threshold.

Predictive models. A simple but effective approach: store the last N days of traffic at 1-minute granularity, compute the expected traffic at time t as the median (or 90th percentile) of the same time-of-day across the last N days. For weekly patterns, segment by day-of-week. For special events (Black Friday, product launches), accept manual scaling overrides — the model can't predict events it hasn't seen.
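The same-time-of-day median can be sketched in a few lines. The `DayTraffic` record and the function name are hypothetical; the history is assumed to be already bucketed at 1-minute granularity, and manual overrides for special events are left out.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class DayTraffic:
    weekday: int             # 0 = Monday ... 6 = Sunday
    per_minute: List[float]  # traffic per minute-of-day bucket


def expected_traffic(history: List[DayTraffic], minute_of_day: int,
                     weekday: Optional[int] = None) -> float:
    """Median traffic at `minute_of_day` across the stored days.

    Pass `weekday` to segment by day-of-week for weekly patterns.
    """
    samples = [day.per_minute[minute_of_day]
               for day in history
               if weekday is None or day.weekday == weekday]
    if not samples:
        raise ValueError("no history matches the requested weekday")
    return median(samples)
```

The median is what keeps one anomalous day (an incident, a one-off launch) from skewing the forecast; swapping in a higher percentile trades cost for extra headroom.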

For latency-sensitive workloads, predictive scaling shines. For workloads with chaotic traffic that doesn't follow a daily pattern (event-driven, batch processing), predictive offers little — stay reactive and accept the scaling lag.

Third option: warm pools. Keep a small number of pre-started but unused pods in a warm pool, ready to receive traffic in seconds. The warm pool absorbs the first 30-60 seconds of a spike while the orchestrator brings up cold-start replicas to backfill. Pay for the warm pool capacity (a fraction of normal capacity, e.g., 5%) in exchange for cold-start mitigation. Used for tiers where seconds matter.

Scale-down hysteresis is the other critical concern. Scaling down too aggressively causes thrashing: you remove a pod, the per-pod load on the remaining pods goes up, the CPU metric crosses the scale-up threshold, and you add a pod back. The resulting oscillation churns pods and destabilizes the cluster.

Three mechanisms prevent thrashing.

Long scale-down cooldown: after any scaling decision, lock scale-down for several minutes (typically 5-10). This gives the metric time to stabilize at the new pod count before considering another change. Kubernetes HPA, for example, defaults to a 5-minute scale-down stabilization window; scale-up cooldowns are measured in seconds because spike response is more urgent.

Asymmetric tolerance bands: only scale down if observed_metric is below target by a clear margin (e.g., target 70%, scale down only if observed <55%). A tight band (scale down at 65% with a 70% target) causes thrashing because ordinary metric noise of about 5 points routinely crosses the threshold in both directions.

Step-limited scaling: cap the magnitude of a single scaling action. Even if the formula says 'scale from 100 to 30 pods', cap the step at 'remove at most 10% per cooldown cycle'. Prevents over-correcting on a momentary lull. Larger workloads use smaller step sizes (a 1000-pod workload caps step at 5%); smaller workloads use larger step sizes (a 5-pod workload caps step at 1 pod or 20%).
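The asymmetric band and the step limit combine naturally into one decision function. The sketch below is an illustration under assumed names and thresholds; cooldown enforcement and min/max clamping are omitted to keep the two mechanisms visible.

```python
import math


def next_replicas(current: int, observed: float, target: float,
                  scale_out_above: float, scale_in_below: float,
                  max_scale_in_pct: int) -> int:
    """One decision with an asymmetric dead zone and a capped scale-in step."""
    if observed > scale_out_above:
        # Scale-out is not step-limited here: spikes need a fast response.
        return math.ceil(current * observed / target)
    if observed >= scale_in_below:
        # Inside the dead zone between the two thresholds: hold steady.
        return current
    desired = math.ceil(current * observed / target)
    # Remove at most max_scale_in_pct percent of pods per cycle,
    # but always allow removing at least one.
    max_removed = max(1, current * max_scale_in_pct // 100)
    # Never drop below one replica in this sketch.
    return max(1, desired, current - max_removed)


if __name__ == "__main__":
    # The formula alone says 100 -> 30 pods; the 10% cap yields 90.
    print(next_replicas(100, observed=21, target=70,
                        scale_out_above=77, scale_in_below=55,
                        max_scale_in_pct=10))
```

With a 70% target, a 77% scale-out threshold, and a 55% scale-in threshold, an observed value of 60% changes nothing, which is the behavior that stops a pod being removed and immediately re-added.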

Third tradeoff: connection draining on scale-down. When a pod is selected for removal, terminating it abruptly drops every in-flight request. Graceful drain: the orchestrator sends SIGTERM to the pod, the pod's signal handler tells the load balancer 'stop sending me new requests', existing requests complete (typically within the request's normal duration), then the pod exits. The drain window is typically 30-60 seconds. For workloads with long-lived connections (WebSocket, streaming), drains can take several minutes — production tracks active connection count and waits for it to reach zero before removal.
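Inside the pod, graceful drain comes down to three pieces: stop accepting new work on SIGTERM, count in-flight requests, and wait for the count to reach zero. A minimal sketch using only the Python standard library follows; the `DrainController` class and its method names are hypothetical, and how `accepting` is exposed to the load balancer (for example through a readiness endpoint) is left as an assumption.

```python
import signal
import threading


class DrainController:
    """Tracks in-flight requests and switches to draining on SIGTERM."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight at startup

    @property
    def accepting(self) -> bool:
        # A readiness check could return this value so the load
        # balancer stops routing new requests to this pod.
        return not self._draining

    def begin_drain(self, signum=None, frame=None) -> None:
        # Signal-handler compatible: only sets a flag, takes no locks.
        self._draining = True

    def request_started(self) -> bool:
        """Register a new request; returns False if the pod is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def request_finished(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def wait_until_drained(self, timeout_s: float) -> bool:
        """Block until no requests are in flight or the window expires."""
        return self._idle.wait(timeout_s)


def install_sigterm_handler(controller: DrainController) -> None:
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, controller.begin_drain)
```

A server's shutdown path would call `wait_until_drained` with the drain window (30-60 seconds for short requests, longer for WebSocket or streaming workloads) and exit once it returns.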

Fourth tradeoff: custom metrics. CPU and memory are blunt signals. The real signal for a web service is request rate or p99 latency; for a background worker, queue depth. Custom-metric scaling reads from the metrics pipeline (e.g., the queue depth from the message queue's own admin API) and feeds it to the proportional scaling formula with a per-workload target. The senior signal: recognize that CPU isn't always the right scaling signal, and propose custom-metric scaling for the workloads where CPU is misleading.
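For a queue-driven worker, the proportional formula simplifies when the target is expressed per pod: the observed per-pod value is queue_depth / current, so current × (observed / target) reduces to queue_depth / target_per_pod. A sketch with hypothetical names follows; how the queue depth is fetched from the message queue is out of scope here.

```python
def replicas_for_queue(queue_depth: int, target_per_pod: int,
                       min_replicas: int, max_replicas: int) -> int:
    """Replica count so each pod handles about target_per_pod queued items."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    desired = (queue_depth + target_per_pod - 1) // target_per_pod
    # Clamp to the workload's configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4,500 queued messages at a target of 100 per pod.
    print(replicas_for_queue(4500, 100, min_replicas=1, max_replicas=100))
```

An empty queue falls back to `min_replicas`, while a backlog larger than the fleet can absorb is capped at `max_replicas`, which also bounds the damage from a bad metric reading.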

Common mistakes

  • Scaling only on CPU — for queue-driven workloads, queue depth is the real signal; for latency-sensitive web tiers, p99 latency matters more than CPU
  • No cooldown on scale-down — causes thrashing where you remove pods and immediately add them back
  • Forgetting connection draining — abruptly killing pods drops in-flight requests; production needs graceful SIGTERM with a drain window
  • Combining HPA and VPA on the same workload without coordination — they double-scale and conflict
  • Assuming cold-start is instant — image-pull and readiness-probe latency means new pods take 30-90 seconds; production needs warm pools or predictive scaling for spike-sensitive tiers

Likely follow-up questions

  • How would your design handle a sudden 10x traffic spike that lasts only 60 seconds? Pre-warm? Burst capacity? Accept degradation?
  • How would you scale a stateful workload (e.g., a sharded database) where adding a pod requires rebalancing data?
  • What changes if you're scaling across multiple regions and want to shift capacity between regions based on regional demand?
  • How would you detect a runaway scale-out — a bug that causes the autoscaler to keep adding pods indefinitely?
  • How would you implement scaling for GPU workloads where each pod costs 10-100x a CPU pod and cold-start is several minutes?

Frequently asked questions

Is autoscaling just Kubernetes HPA?

Kubernetes HPA is one production implementation of the horizontal-scaling layer. Mention it by name to anchor the discussion, but a senior-bar answer covers the broader system: HPA + VPA + predictive + warm pools + custom metrics + drain handling. HPA out-of-the-box doesn't give you predictive or warm pools.

How is autoscaling different from load balancing?

Load balancing distributes incoming requests across the existing pods. Autoscaling changes how many pods exist. They work together: the load balancer routes traffic to the current pod set; the autoscaler grows or shrinks that pod set. Both are needed for elastic capacity.

Should I always use predictive scaling?

No. Predictive only helps when traffic has predictable patterns (daily peaks, weekly cycles). For chaotic event-driven traffic, predictive offers no signal and adds complexity. Use reactive + warm pools for those cases. The senior signal is knowing when each tool fits.

What is the most important concept for Design Autoscaling System?

Scale-down hysteresis. The naive design (scale up when metric high, scale down when metric low) oscillates in production. Cooldown, asymmetric bands, and step limits are the discipline that makes autoscaling stable. The senior signal hinges on whether the candidate proactively raises thrashing as a concern.