Skip to main content

System Design Questions

Design Distributed Tracing System — System Design Interview Guide

Design Distributed Tracing System is a system-design interview that asks you to build the infrastructure that follows a single user request as it fans out across hundreds of microservices, captures timing and metadata at every hop, and lets engineers visualize the whole call tree in seconds. The hard part is propagating trace context with zero application-code cost, sampling at scale without losing the interesting traces, and storing tens of billions of spans per day affordably.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Google
  • Meta
  • Amazon
  • Microsoft
  • Uber

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Generate a unique trace ID for each user-facing request and propagate it through every downstream service call
  • Record a span per service operation: start_time, end_time, service_name, operation_name, parent_span_id, tags, logs
  • Provide a query API: lookup trace by ID, search traces by service + operation + tag filters, time-range filtered
  • Render the trace as a hierarchical waterfall view with timing breakdown per span
  • Optional: link traces to metrics (RED-method dashboards) and logs (jump from span to the matching log lines)
  • Optional: anomaly detection on per-operation latency distributions

Non-functional requirements

  • Scale: ~10 billion spans/day = ~120K spans/sec ingest, peak ~500K spans/sec
  • Application overhead: <1% CPU and <100µs latency added per instrumented call
  • Query latency: lookup-by-trace-id <500ms p99; search by tag <3s p99
  • Retention: 7 days hot, 30 days warm, optional 1 year cold archive for compliance
  • Availability: tracing infra failure must NEVER break the traced application (graceful degradation is non-negotiable)
  • Data loss tolerance: ~1% span loss acceptable in exchange for low overhead

Capacity estimation

Scale anchors: ~10K microservices, ~10M user-facing requests/sec at peak across the platform, ~100 spans per trace on average. That gives ~1B spans/sec at the upper bound — well beyond what's economical to store. This is why sampling exists.

With 1% head-based sampling: 10M req/sec × 1% × 100 spans = 10M spans/sec ingest. Average span size after compression ~500 bytes (service_name, operation, timestamps, tags, parent pointer). Daily ingest = 10M × 86,400 × 500 bytes = ~430 TB/day raw. Compressed at 5x = ~85 TB/day. Hot storage 7 days = ~600 TB; warm 30 days = ~2.5 PB. Significant but tractable on commodity hardware.

Reality is lower because most teams use 0.1% sampling for healthy traffic + 100% for errors + tail-based sampling for slow requests. After realistic sampling: ~120K spans/sec ingest, ~10 TB/day hot storage.

Query patterns: ~95% of queries are lookup-by-trace-id (an engineer pasting a trace ID from an error-monitoring alert or a customer support ticket). ~5% are tag-filter searches ('show me all traces where service=checkout AND error=true in the last hour'). Index design follows: a fast key-value index on trace_id, a separate search index on (service, operation, tag, timestamp) for the search path.

Network: collectors receive ~120K spans/sec × 500 bytes = ~60 MB/sec = ~480 Mbps aggregate. Single-digit collectors handle this with 10 Gbps NICs; in practice you run 50+ collectors for redundancy and geographic locality.

High-level design

Four pillars: instrumentation library, span collector, storage tier, and query/UI tier.

Instrumentation library is the foundation. Every application links a tracing library that hooks into the request lifecycle. On inbound HTTP/gRPC: extract trace context from headers (W3C Trace Context standard, e.g. traceparent header), or generate a new trace_id if absent. Start a span. On outbound calls: inject the current trace context into headers before sending. On call completion: end the span, attach status and tags, hand it off to the collector via an async buffer.

The library must be cheap: a span end is ~10 microseconds of CPU including serialization, and the async buffer ensures the application thread is never blocked waiting on collector availability. If the buffer overflows (collector down), the library drops spans silently rather than blocking the app — graceful degradation is mandatory.

Span collector is a fleet of stateless ingest services. Applications POST batches of spans (typically 100-1000 spans per batch, every 1-5 seconds). Collectors validate, decorate (add receive-time, infer service topology from parent/child pointers), and forward to a streaming queue. The queue (think a distributed log) absorbs burst traffic and decouples ingest from storage, so a storage hiccup doesn't backpressure into applications.

Storage tier consumes from the queue and writes spans to two stores. (1) A columnar trace store indexed by trace_id — every span for a given trace lives co-located so a lookup-by-trace-id reads a single row group. (2) A search index for tag-filter queries — only the high-cardinality fields are indexed (service, operation, status, latency bucket, error tag).

Query/UI tier serves two paths. Lookup-by-trace-id: fetch all spans for the trace_id, reconstruct the parent/child tree, render as a waterfall. Search-by-tag: query the search index, get matching trace_ids, then bulk-fetch the trace data. The UI is the engineer-facing surface — waterfall view, span detail panel, comparison between two traces. Most engineers spend their tracing time in lookup-by-trace-id mode after copying an ID from an error report.

Deep dive — the hard problem

Two deep dives: context propagation across boundaries, and sampling strategy.

Context propagation across boundaries is the foundation that breaks if you get it wrong. The trace context is a small bag of fields: trace_id (128 bits, globally unique), span_id (64 bits, identifies the current span), sampling decision, and optional trace state (vendor-neutral key/value baggage). The W3C Trace Context standard defines two HTTP headers: traceparent (the core fields) and tracestate (the optional baggage).

Propagation works cleanly for HTTP and gRPC because headers travel naturally. The hard cases are async boundaries. Three named gotchas.

Message queue boundaries: when service A publishes a message to a queue and service B consumes it minutes later, the trace context must travel through the queue. Convention: stuff the traceparent into the message metadata or headers, extract on consume, start a new 'follow-from' span pointing back to the producer's span. The producer-consumer relationship is shown in the waterfall as a temporal-gap link, not a parent-child.

Thread-pool and async-callback boundaries: when a request is handed off from one thread to another (e.g., async I/O completion, scheduled job), the language-level thread-local context disappears. The instrumentation library must explicitly capture the context at the dispatch point and restore it at the resume point. Most production tracing libraries provide explicit context-carrier APIs for this.

Third-party SDK boundaries: when application code calls a third-party SDK that doesn't propagate trace context (a payments SDK, an analytics SDK), the trace breaks at that hop. Standard practice: wrap the SDK call in a span, accept that the inside of the SDK is opaque, and rely on the third-party's own tracing if they offer it.

Sampling strategy is the other axis. Three classes of sampling.

Head-based sampling: at trace start, flip a biased coin (e.g., 1% probability) and propagate the decision in trace context. All services in the trace see the same decision and sample together. Pro: cheap, no coordination cost. Con: you commit to keeping or dropping the trace before you know if it's interesting (errored, slow). Most production setups start with head-based at 1-10% for normal traffic.

Tail-based sampling: every service captures every span and forwards to a sampling decision-maker. The decision-maker waits until the entire trace is complete (typically 30 seconds after trace start), then decides whether to keep based on properties: 100% of errors, 100% of slow traces (p99+), 1% of healthy traces. Pro: every interesting trace is captured. Con: requires buffering all spans for the decision window, which is memory-expensive (the decision-maker fleet is sized to hold 30 seconds of full-volume spans).

The pragmatic production pattern: head-based at ~10% as the baseline; force-sample (head_decision = 1) when the request enters with a specific debug header for ad-hoc investigation; tail-based augmentation only for the spans most likely to matter (error + slow). Mention this layered approach — interviewers reward the candidate who recognizes that pure head-OR-tail is a strawman.

Third tradeoff: storage tiering. Hot spans (last 7 days) go in fast columnar storage indexed for sub-second lookup. Warm spans (8-30 days) move to cheaper object storage with the index re-shaped for cheap-but-slower queries (a few seconds is acceptable for traces that old). Cold spans (30 days to 1 year, optional) go to archival storage; queries on cold data are minutes-to-hours and rare. Tier transitions run as background batch jobs.

Fourth tradeoff: trace ID cardinality vs query performance. trace_id is high-cardinality (billions of unique values) — it makes a great primary key but a terrible search dimension. The search index should NOT include trace_id; it should index (service, operation, time-bucket, error, latency_bucket) and return matching trace_ids for fan-out to the primary store. Some early-stage tracing systems shipped with trace_id in the search index and crashed under tag-search load — the senior signal is recognizing this index separation upfront.

Common mistakes

  • Proposing 100% sampling at scale — the storage cost is impossible; every production tracing system samples aggressively
  • Forgetting graceful degradation — if the tracing collector is down, the application MUST NOT block or crash; this is a hard requirement
  • Treating trace context as an application-code concern instead of a library concern — every team would re-implement propagation incorrectly
  • Skipping the async/queue boundary discussion — interviewers consistently probe how traces survive a message queue hop
  • Putting trace_id in the search index — it's high-cardinality and a terrible filter; the search index belongs on operation/service/tag/error/latency_bucket

Likely follow-up questions

  • How would you handle a trace that fans out to 10,000 spans (deep batch job processing) — does it still fit in your waterfall view?
  • How would you support cross-organization tracing (a trace that spans your platform plus a third-party API that has its own tracing system)?
  • How would you debug a 'broken' trace where some spans are missing — is the gap a sampling drop, a propagation bug, or a collector failure?
  • What changes if you need to track PII-tagged spans differently (e.g., spans touching healthcare data need stricter retention)?
  • How would you build a comparison view that shows the difference between two traces (one fast, one slow) for the same operation?

Practice Design Distributed Tracing System live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

Is Design Distributed Tracing the same as Design Logging?
No. Logs are unstructured per-line events with no causal relationship; traces are structured spans with explicit parent/child relationships forming a tree per request. The two systems share storage tier concerns but the data model and query patterns are very different.
Do I need to know specific OSS tracing tools to answer this?
Helpful but not required. Mentioning the W3C Trace Context standard and naming the OpenTelemetry-style abstraction (instrumentation library + collector + storage + UI) is enough to anchor the discussion. Going deeper on specific projects (Jaeger, Zipkin, Tempo) is bonus context.
How is sampling different from rate-limiting?
Rate-limiting caps how many requests the application serves; sampling caps how many traces the tracing system stores. Sampling decisions happen inside instrumented code and are usually probabilistic; rate-limiting decisions happen at the entry point and are usually deterministic per-tenant. Both shed load but at different layers.
What is the most important concept for Design Distributed Tracing?
Context propagation plus sampling strategy. The senior signal hinges on (a) explaining how trace context survives an async/queue boundary and (b) proposing a layered sampling approach (head-based baseline + tail-based augmentation for errors and slow traces).