Design Service Mesh — System Design Interview Guide
Design Service Mesh is a system-design interview that asks you to build the infrastructure layer that handles east-west traffic between microservices: mTLS encryption between every service-to-service hop, dynamic routing rules, retries, circuit breaking, and observability — all without requiring application code changes. The hard part is the data-plane vs control-plane separation, certificate rotation at scale, and keeping sidecar overhead low.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Meta
- Lyft
- Uber
- Netflix
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Intercept all inbound and outbound traffic for every service via a sidecar proxy, with zero application-code changes
- Encrypt every service-to-service call with mTLS using short-lived certificates
- Apply dynamic routing rules: canary deploys, blue-green, traffic-weighted A/B, fault injection
- Apply resilience policies: retries with backoff, circuit breakers, timeouts, outlier ejection
- Emit per-request observability: latency, status code, source/destination service, mTLS identity
- Enforce authorization policies: which services can call which other services (zero-trust network model)
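To make the resilience requirement concrete, here is a minimal circuit-breaker sketch of the kind a sidecar might keep per upstream endpoint. The class name, the thresholds, and the consecutive-failure trip rule are illustrative assumptions, not taken from any specific proxy; real proxies typically use richer signals such as error rates and outlier detection.

```python
from typing import Optional


class CircuitBreaker:
    """Simplified breaker: trips after N consecutive failures, probes after a cool-down."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.open_seconds = open_seconds            # hypothetical default
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent to the upstream at time `now`."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the cool-down elapses, then let probes through.
        return now - self._opened_at >= self.open_seconds

    def record(self, success: bool, now: float) -> None:
        """Record the outcome of a request that was allowed through."""
        if success:
            self._consecutive_failures = 0
            self._opened_at = None  # close the breaker again
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = now  # trip, or re-trip after a failed probe
```

The clock is passed in explicitly so the behavior is easy to test. Because the check runs inside the sidecar, the application sees only a fast failure instead of a hung call.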
Non-functional requirements
- Scale: ~10K services, ~100K sidecar instances, ~10M+ in-mesh requests/sec at peak
- Sidecar latency overhead: <2ms p99 added per hop (proxy + mTLS handshake amortized)
- Sidecar memory: <100 MB per sidecar at steady state
- Control-plane config push: new routing rules propagate to all sidecars within 30 seconds
- Certificate rotation: every service identity rotates daily without service disruption
- Availability: control-plane failure must not block data-plane traffic (sidecars run on last-known-good config)
Capacity estimation
Scale anchors: ~10K services, average 10 instances each = ~100K application pods. Each pod runs one sidecar = ~100K sidecar instances. Steady-state in-mesh traffic ~10M requests/sec across the fleet.
Resource cost: each sidecar uses ~50-100 MB RAM and ~5-10% of one CPU core under typical load. At 100K sidecars × 75 MB = ~7.5 TB cluster-wide RAM dedicated to sidecars. At 0.1 CPU per sidecar × 100K = ~10K CPU cores dedicated to mesh data plane. This is not free — the mesh tax is real and forces a sizing conversation upfront.
Latency overhead: each sidecar traversal pays ~1ms for the proxy hop plus up to ~1ms amortized for the mTLS session, so ~1-2ms. Two sidecars per service-to-service call (outbound sidecar on source pod, inbound sidecar on destination pod) means ~2-4ms added per call. For a request that fans through 10 services, that's 20-40ms of mesh tax, which is significant and sits at or above the <2ms p99 per-hop target from the requirements. Production sidecars are optimized aggressively to close that gap: connection pooling, mTLS session resumption, and zero-copy proxying keep the overhead manageable.
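The resource and latency arithmetic above is easy to sanity-check in code. The sketch below uses only the figures quoted in this section; the function and constant names are illustrative.

```python
# Back-of-envelope mesh tax, using the figures quoted in this section.
SIDECARS = 100_000        # ~10K services x ~10 instances each
RAM_MB_PER_SIDECAR = 75   # midpoint of the 50-100 MB range
CPU_PER_SIDECAR = 0.1     # upper end of the 5-10% of one core range


def fleet_ram_tb(sidecars: int = SIDECARS, mb_each: float = RAM_MB_PER_SIDECAR) -> float:
    """Total sidecar RAM in decimal terabytes (1 TB = 1,000,000 MB)."""
    return sidecars * mb_each / 1_000_000


def fleet_cpu_cores(sidecars: int = SIDECARS, cpu_each: float = CPU_PER_SIDECAR) -> float:
    """Total CPU cores consumed by the data plane."""
    return sidecars * cpu_each


def added_latency_ms(calls_in_path: int, per_sidecar_ms: float) -> float:
    """Each service-to-service call crosses two sidecars (outbound + inbound)."""
    return calls_in_path * 2 * per_sidecar_ms


if __name__ == "__main__":
    print(f"RAM: {fleet_ram_tb():.1f} TB, CPU: {fleet_cpu_cores():.0f} cores")
    print(f"10-call chain: {added_latency_ms(10, 1.0):.0f}-{added_latency_ms(10, 2.0):.0f} ms")
```

In an interview, stating these numbers out loud early makes the later sidecar vs sidecarless discussion much easier to motivate.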
Control-plane scale: ~10K services × ~100 routing rules per service = ~1M config objects. The control plane pushes config diffs to all 100K sidecars; a naive design would melt under the broadcast load. Production designs use a pub/sub fan-out tree where each control-plane node serves ~1K sidecars (about 100 nodes for this fleet), with diffs computed once and gossiped.
Certificate volume: 100K sidecars × daily rotation = ~100K new certs/day, or roughly 1 cert/sec issued on average, plus CSR validation and old-cert revocation traffic. The internal CA must handle this volume; production setups use an issuer hierarchy with workload CAs that hold short-lived intermediate certs from a long-lived root.
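The control-plane and certificate numbers can be derived the same way. This is a rough sketch using only the figures above; the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def config_objects(services: int = 10_000, rules_per_service: int = 100) -> int:
    """Total routing-rule objects the control plane must store and distribute."""
    return services * rules_per_service


def control_plane_nodes(sidecars: int = 100_000, sidecars_per_node: int = 1_000) -> int:
    """Nodes needed if each control-plane node serves a fixed number of sidecars."""
    return math.ceil(sidecars / sidecars_per_node)


def certs_per_second(sidecars: int = 100_000, rotations_per_day: float = 1.0) -> float:
    """Average issuance rate; real traffic is burstier than the average."""
    return sidecars * rotations_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(config_objects(), control_plane_nodes(), round(certs_per_second(), 2))
```

The issuance rate is an average. If many sidecars start at the same moment (for example after a large deploy), their rotations tend to cluster unless each sidecar adds jitter to its rotation time, so it is worth sizing the CA for bursts rather than for the mean.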
High-level design
Two-layer architecture: data plane (the sidecars) and control plane (the management brain).
Data plane: every application pod runs a sidecar proxy (an Envoy-style or Linkerd-proxy-style L4/L7 proxy) in the same network namespace as the application container. Traffic is transparently intercepted: iptables rules (or eBPF in newer setups) redirect outbound traffic from the application to the sidecar's outbound port, and inbound traffic to the sidecar's inbound port. The application talks plain HTTP/gRPC over localhost; the sidecar adds mTLS, routing, and observability on the wire.
For each outbound request: the sidecar resolves the destination service via the mesh service discovery (a name like checkout.production.svc resolves to a list of pod IPs), picks an endpoint per the configured load-balancing policy, establishes an mTLS connection (or reuses a pooled one), applies routing rules (e.g., 'send 5% of /api/v2/* traffic to canary'), and forwards.
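A rule like 'send 5% of traffic to canary' comes down to a weighted random choice on each request. The sketch below shows one simple way to do it; the names WeightedBackend and pick_backend are hypothetical, and real proxies may use other selection schemes.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WeightedBackend:
    name: str    # hypothetical subset name, e.g. "stable" or "canary"
    weight: int  # relative share of traffic


def pick_backend(backends: Sequence[WeightedBackend], rng: random.Random) -> WeightedBackend:
    """Choose a backend with probability proportional to its weight."""
    total = sum(b.weight for b in backends)
    if total <= 0:
        raise ValueError("at least one backend must have a positive weight")
    point = rng.random() * total  # uniform in [0, total)
    cumulative = 0
    for backend in backends:
        cumulative += backend.weight
        if point < cumulative:
            return backend
    return backends[-1]  # guard against floating-point edge cases


if __name__ == "__main__":
    split = [WeightedBackend("stable", 95), WeightedBackend("canary", 5)]
    rng = random.Random()
    print(pick_backend(split, rng).name)
```

Because the split is applied independently per request, the observed canary share only approaches 5% over many requests. If a user must consistently land on one version, the choice would need to be keyed on something stable such as a header hash instead.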
For each inbound request: the sidecar terminates mTLS, validates the source identity against authorization policies, applies rate limits and quotas, and forwards to the local application.
Control plane: a fleet of services that owns the source-of-truth config and pushes it to sidecars. Components: service registry (which services exist, which pods are healthy), policy store (routing rules, mTLS policies, authz rules), and certificate authority (issues short-lived certs to sidecars). Sidecars connect to the control plane over a streaming RPC channel (a gRPC subscription) and receive config diffs as they change.
Config push is incremental: when a routing rule changes, only the affected sidecars receive the diff (typically the sidecars of the services that call the destination the rule applies to, since routing is applied on the outbound side). Full-fleet broadcasts are reserved for rare global policy changes.
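One way to picture incremental push is an index from destination service to the sidecars that have subscribed to its config. The sketch below assumes routing rules are applied by the calling side's sidecar, so a rule change for a destination only needs to reach that destination's callers. SubscriptionIndex and the sidecar IDs are hypothetical names.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class SubscriptionIndex:
    """Tracks which sidecars subscribed to config for which destination service."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, Set[str]] = defaultdict(set)

    def subscribe(self, sidecar_id: str, destination: str) -> None:
        """Record that a sidecar wants config for calls to `destination`."""
        self._subscribers[destination].add(sidecar_id)

    def affected_by(self, destination: str) -> Set[str]:
        """Sidecars that must receive a diff when a rule for `destination` changes."""
        return set(self._subscribers.get(destination, ()))


if __name__ == "__main__":
    index = SubscriptionIndex()
    index.subscribe("cart-7f9", "checkout")
    index.subscribe("web-1a2", "checkout")
    index.subscribe("web-1a2", "search")
    print(sorted(index.affected_by("checkout")))
```

With an index like this, the cost of a rule change scales with the number of callers of one service, not with the size of the fleet.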
Key insight: the data plane never blocks on the control plane. If the control plane is unreachable, sidecars run on their last cached config. Certificates carry a validity buffer beyond the rotation interval (e.g., 24h validity with rotation attempted every 12h), so a brief control-plane outage does not leave sidecars holding expired certificates. A control-plane failure degrades management (no new config, no new certs), not the data path.
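The last-known-good behavior can be sketched as a small cache that a background loop refreshes and the request path only reads. This is a simplification: it polls through a hypothetical fetch callable instead of holding a streaming subscription, and ConfigCache is an illustrative name.

```python
from typing import Callable, Optional


class ConfigCache:
    """Keeps serving the last successfully received config snapshot."""

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch  # hypothetical call to the control plane
        self._last_good: Optional[dict] = None
        self.stale = False   # True while the control plane is unreachable

    def refresh(self) -> None:
        """Run off the request path, e.g. from a background loop."""
        try:
            self._last_good = self._fetch()
            self.stale = False
        except ConnectionError:
            self.stale = True  # keep the cached snapshot; do not clear it

    def current(self) -> dict:
        """Read path: never calls the control plane."""
        if self._last_good is None:
            raise RuntimeError("no config received yet")
        return self._last_good
```

The important property is that current() has no network dependency. The stale flag is there so the sidecar can surface the outage as a metric or alert while traffic keeps flowing.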
Deep dive — the hard problem
Two deep dives: mTLS certificate rotation, and the data-plane vs control-plane separation discipline.
mTLS certificate rotation at scale is the production headache. Every sidecar holds a workload certificate that identifies its service (the SPIFFE-style identity, e.g., spiffe://platform.internal/ns/production/sa/checkout). Certificates are short-lived (24h-72h typical) to limit blast radius if a workload is compromised.
The rotation pipeline: each sidecar generates a fresh key pair locally, builds a Certificate Signing Request, sends it to the workload CA, and receives a signed cert. Validation: the CA checks the requester's identity using the pod's service-account token (provided by the orchestrator) and signs only for the identity that matches the token.
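The validation rule in this pipeline (sign only for the identity the token attests to) can be modeled without any real cryptography. In the sketch below the token lookup table stands in for asking the orchestrator to validate a service-account token, the public key is an opaque placeholder, and the returned dict stands in for a signed certificate. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SigningRequest:
    requested_identity: str     # e.g. a SPIFFE-style URI
    public_key: bytes           # placeholder; a real CSR carries a signed public key
    service_account_token: str  # provided to the pod by the orchestrator


class WorkloadCA:
    """Toy CA that only enforces the identity-matching rule."""

    def __init__(self, token_to_identity: Dict[str, str]):
        # Stand-in for token validation against the orchestrator.
        self._token_to_identity = token_to_identity

    def sign(self, csr: SigningRequest) -> dict:
        attested = self._token_to_identity.get(csr.service_account_token)
        if attested is None:
            raise PermissionError("unknown or invalid service-account token")
        if attested != csr.requested_identity:
            raise PermissionError("token does not match requested identity")
        # A real CA would return a signed X.509 certificate here.
        return {"subject": attested, "public_key": csr.public_key}
```

The private key never appears in the request, which matches the pipeline above: the key pair is generated locally and only the public half travels to the CA.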
Gotchas. First: clock skew. mTLS validation includes notBefore/notAfter checks; a sidecar whose clock runs 30 seconds behind can reject a freshly issued cert from a CA with the correct clock, because the cert's notBefore still lies in the sidecar's future. Production setups use clock-skew tolerance (typically 60-300 seconds) on cert validity windows.
Second: in-flight connections during rotation. When a sidecar rotates its cert, existing mTLS sessions keep using the old cert (sessions are pinned to the cert that established them). New sessions use the new cert. Both must chain to a root the peer trusts. Routine workload-cert rotation leaves the root unchanged, so nothing changes on the peer. When the root itself is rotated, the peer's trust store must hold both the current and the recent past root certs for the overlap period. Trust-anchor rotation is the rarest and most dangerous operation, typically done once every few years with a multi-week overlap window.
Third: revocation. CRL and OCSP are operationally painful at this scale (a sidecar would need to check revocation on every handshake). Production mTLS at this scale relies on short-lived certs instead of revocation: if a workload is compromised, the bad cert stays usable for at most its remaining lifetime (24h with the shortest lifetime above). For faster response, the CA can push a revocation list directly to the data plane via the existing control-plane channel.
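Clock-skew tolerance amounts to widening the validity window on both ends before comparing it with the local clock. The sketch below is a minimal illustration; is_within_validity and the 60-second default are assumptions chosen from the range mentioned above.

```python
from datetime import datetime, timedelta, timezone


def is_within_validity(
    now: datetime,
    not_before: datetime,
    not_after: datetime,
    skew: timedelta = timedelta(seconds=60),  # illustrative default
) -> bool:
    """Accept a cert whose validity window covers `now`, allowing for clock skew."""
    return not_before - skew <= now <= not_after + skew


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=24)
    lagging_clock = issued - timedelta(seconds=30)  # verifier clock lags the CA
    print(is_within_validity(lagging_clock, issued, expires, skew=timedelta(0)))
    print(is_within_validity(lagging_clock, issued, expires))
```

A verifier whose clock lags the CA sees notBefore in its own future, so a zero-tolerance check rejects the brand-new cert while the tolerant check accepts it. The tolerance also extends acceptance slightly past notAfter, which is the price paid for robustness.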
Data-plane vs control-plane separation is the discipline that makes the mesh survivable. The control plane is allowed to be slow (config changes within 30 seconds is plenty); it's allowed to be down briefly (sidecars fall back to cached config). The data plane must NEVER take a control-plane dependency in the hot path.
This means: no sidecar makes a synchronous call to the control plane to authorize a request. The authz policy is pushed to the sidecar; the sidecar enforces locally. Same for routing rules, rate limits, and mTLS validation. The control plane is a config-distribution system, not a request-processing system.
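Local enforcement can be as simple as an in-memory table that the control plane replaces on push and the request path only reads. The sketch below assumes a default-deny allow-list keyed by destination service; LocalAuthorizer and the policy shape are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable


class LocalAuthorizer:
    """Authorization table pushed by the control plane, evaluated in-process."""

    def __init__(self) -> None:
        self._allowed: Dict[str, FrozenSet[str]] = {}

    def apply_push(self, policy: Dict[str, Iterable[str]]) -> None:
        """Swap in a new policy snapshot (destination -> allowed source identities)."""
        self._allowed = {dest: frozenset(srcs) for dest, srcs in policy.items()}

    def is_allowed(self, source_identity: str, destination_service: str) -> bool:
        """Hot path: a dictionary lookup, no network call. Unknown pairs are denied."""
        return source_identity in self._allowed.get(destination_service, frozenset())


if __name__ == "__main__":
    authz = LocalAuthorizer()
    authz.apply_push({"payments": ["spiffe://platform.internal/ns/production/sa/checkout"]})
    print(authz.is_allowed("spiffe://platform.internal/ns/production/sa/checkout", "payments"))
```

The decision costs a hash lookup per request, and if the control plane goes away the sidecar keeps enforcing the last snapshot it received.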
When this discipline slips, the consequences are catastrophic. A real-world example: a mesh implementation early in its history made every authz decision a synchronous call to a central policy server. When the policy server slowed down at peak load, every in-mesh request slowed with it, and the slowness cascaded back into the policy server (because the policy server's own traffic ran through the mesh). The fix was to push policies to sidecars and enforce locally; the lesson is core mesh design philosophy.
A further tradeoff: sidecar vs sidecarless. The pure sidecar model has a per-pod cost (memory and CPU). Newer designs explore 'ambient mesh' or 'sidecarless' patterns where mTLS termination happens at a per-node L4 proxy and L7 features at a separate per-namespace shared proxy. This trades lower resource cost for some loss of per-pod isolation. Mention this as an emerging tradeoff; interviewers reward awareness of the direction the industry is moving.
Another tradeoff: observability cost. Every sidecar emits per-request metrics (RED: requests, errors, duration). At 10M req/sec × 5 dimensions, that is ~50M raw data points per second, a staggering volume. Production setups aggregate at the sidecar (e.g., per-minute histograms instead of per-request events) and ship aggregates to the metrics pipeline. Don't propose per-request metric emission directly to a centralized store; it doesn't scale.
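Aggregating at the sidecar can be sketched as a histogram per (destination, status) pair that is flushed on a timer. The bucket bounds and the LatencyAggregator name below are hypothetical; a real pipeline would choose buckets to match its latency SLOs.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical bucket upper bounds in milliseconds; one extra overflow bucket follows.
BUCKET_BOUNDS_MS = (1, 5, 10, 50, 100, 500)


class LatencyAggregator:
    """Counts requests into latency buckets instead of emitting one event each."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, int], List[int]] = defaultdict(
            lambda: [0] * (len(BUCKET_BOUNDS_MS) + 1)
        )

    def observe(self, destination: str, status: int, latency_ms: float) -> None:
        """Per-request hot path: one bucket lookup and one increment."""
        bucket = bisect.bisect_left(BUCKET_BOUNDS_MS, latency_ms)
        self._counts[(destination, status)][bucket] += 1

    def flush(self) -> Dict[Tuple[str, int], List[int]]:
        """Called periodically (e.g. once a minute); returns and resets the counts."""
        snapshot = {key: list(counts) for key, counts in self._counts.items()}
        self._counts.clear()
        return snapshot
```

Each flush ships one small fixed-size record per (destination, status) pair no matter how many requests were served, so the volume sent to the metrics pipeline depends on label cardinality, not on request rate.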
Common mistakes
- Treating the mesh as application code instead of infrastructure: the whole point is zero app-code change; if your design requires app-code instrumentation you've reinvented a library
- Skipping the data-plane vs control-plane separation — putting authz decisions in the request hot path melts the system at scale
- Forgetting mTLS rotation strategy — naive certs that never rotate are a security antipattern; describe short-lived certs explicitly
- Ignoring sidecar resource cost — at 100K sidecars × 75 MB each, the mesh tax is a real budget item that drives sizing
- Proposing per-request metric events emitted to a central store — aggregate at the sidecar; ship aggregates, not raw events
Likely follow-up questions
- How would you migrate an existing 1000-service platform onto the mesh without a big-bang cutover?
- How would the mesh handle a service that talks a non-HTTP protocol (e.g., a custom binary protocol over TCP)?
- What changes if you need to extend the mesh across multiple Kubernetes clusters or multiple data centers?
- How would you debug a service-to-service latency spike — is it the application, the sidecar, the network, or the mesh control plane?
- How would you implement cross-cluster federation where service A in cluster X calls service B in cluster Y, with mTLS preserved end-to-end?
Frequently asked questions
- Is Service Mesh the same as API Gateway?
- No — different traffic directions. API gateway handles north-south traffic (external clients calling into the platform). Service mesh handles east-west traffic (services inside the platform calling each other). They share concerns (auth, observability, routing) but the volume profile, latency budget, and identity model are different. Most production platforms run both.
- Do I need to name a specific mesh product to answer this?
- Helpful but not required. Mentioning the Envoy-style data plane and naming the sidecar-proxy-plus-control-plane architecture anchors the discussion. Going deeper on specific projects (Istio, Linkerd) is bonus context but the design principles are universal.
- How does Service Mesh differ from an SDK-based RPC library?
- The SDK approach requires every application to link the library and stay updated; the mesh approach moves the concern out-of-process to a sidecar that the application doesn't know about. SDK upgrades require recompiling and redeploying every service (a multi-quarter coordination at 10K-service scale). Sidecar upgrades happen independently. The cost is the per-pod sidecar overhead.
- What is the most important concept for Design Service Mesh?
- Data-plane vs control-plane separation plus mTLS rotation discipline. The senior signal hinges on (a) the control plane is config-distribution, not request-processing, and (b) certificates are short-lived and rotated frequently with explicit handling of clock skew, in-flight sessions, and trust-anchor rollover.