Skip to main content

System Design Questions

Design a Telemetry Pipeline — System Design Interview Guide

Design a Telemetry Pipeline is a system-design interview that asks you to build the ingestion, processing, storage, and query layer for observability data — logs, metrics, traces from a fleet of thousands of services. The hard part is cardinality control (metric series explode under naive tagging), retention tiering (recent data is hot, week-old data is warm, year-old data is cold), and balancing ingest throughput against query latency on the same store.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Amazon
  • Google
  • Microsoft
  • Meta
  • Netflix

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Ingest logs, metrics, and distributed traces from thousands of services across a fleet
  • Store each telemetry type in a queryable form with appropriate retention
  • Support real-time queries (last 5 minutes) and historical queries (last 30 days, last year)
  • Power dashboards: latency percentiles, error-rate trends, service-dependency maps
  • Power alerts: fire when a metric crosses a threshold or a log pattern appears
  • Distributed tracing: stitch spans from multiple services into one trace, queryable by trace ID

Non-functional requirements

  • Ingest throughput: ~10M+ events/sec at peak across the platform (logs are highest volume, traces next, metrics smallest by event count but largest by series count)
  • Query latency: <2s p95 for dashboard queries over last 15 minutes; <30s for historical queries over weeks of data
  • Alert latency: <30s from a metric crossing threshold to an alert firing
  • Retention: 30 days hot, 1 year warm, 7 years cold (regulated industries) — with cost dropping by 10x at each tier
  • Cost: telemetry infrastructure should be <5% of the cost of the applications it observes
  • Availability: ingest must never reject events; queries can degrade gracefully under load

Capacity estimation

Public-scale assumption for a major platform: ~5000 services, ~50K hosts/containers running them, ~10M events/sec aggregate telemetry. Breakdown by type: ~7M logs/sec, ~2M trace spans/sec, ~1M metric points/sec.

Log volume: avg log line ~500 bytes, so ~3.5 GB/sec raw, ~300 TB/day, ~9 PB/month. Compressed ~4-5x with columnar storage, so ~2 PB/month after compression. At 30 days hot retention, ~60 PB in the hot tier.

Metric volume: 1M points/sec at ~50 bytes per point (timestamp + value + series_id reference) = 50 MB/sec, ~4 TB/day, ~120 TB/month. Smaller than logs by 25x in bytes. But metric series count is the real problem — see deep dive on cardinality. A platform with 5000 services × 10 metrics each × moderate tag cardinality can produce ~10M-100M unique series, each storing a value every 10 seconds.

Trace volume: 2M spans/sec at ~1 KB per span (operation name, duration, parent span ID, tags) = 2 GB/sec, ~170 TB/day, ~5 PB/month. Trace-tail sampling typically retains 1-10% of full traces, so post-sampling is ~500 TB/month.

Ingest bandwidth: 10M events/sec at average ~500 bytes = 5 GB/sec aggregate. Distributed across ~100 ingest endpoints, ~50 MB/sec each — comfortable.

Query patterns: dashboards are ~95% of query volume but ~5% of compute (small time windows, pre-aggregated metrics). Historical queries are ~5% of volume but ~95% of compute (scanning weeks of log data). The store must serve both well.

High-level design

Four-stage pipeline: collect, buffer, process, store. Plus a separate query layer that fans out across the storage tiers.

Collect: a lightweight agent runs on every host or in every container. The agent tails log files, scrapes metric endpoints, and receives trace spans from instrumented application libraries. The agent batches events locally (~1 second or ~1 MB whichever first) and ships batches to an ingest endpoint over a persistent connection.

Buffer: ingest endpoints accept agent batches, do minimal validation, and write to a high-throughput append-only log (Kafka-style or equivalent). Buffering decouples ingest spikes from downstream processing — when the processing layer is slow, the buffer absorbs minutes-to-hours of backlog without dropping events. The buffer is partitioned by service or tenant for parallel downstream consumption.

Process: streaming workers consume from the buffer and route by event type. (1) Logs: parse structured fields, apply sampling rules (keep all error logs, sample 10% of info logs), enrich with metadata (region, service version), and write to the log store. (2) Metrics: deduplicate within the time window, downsample by aggregation function (sum, count, avg, p50/p95/p99) into time buckets (1-second, 1-minute, 1-hour), and write to the metric store. (3) Traces: buffer all spans of a trace until the trace is complete (the root span and all children seen), apply tail-based sampling (keep all error traces, sample 1% of successful), stitch into a unified trace, and write to the trace store.

Store: three specialized stores per telemetry type. (1) Log store: columnar layout (event_id, timestamp, service, level, message, structured_fields) partitioned by hour and service. Recent data on SSD, older data tiered to object storage. (2) Metric store: time-series database with per-series columns. Each unique (metric_name, tag_set) combination is one series. Series are organized for fast time-range scans. (3) Trace store: each complete trace stored as a tree of spans keyed by trace_id; secondary indexes on (service, operation, duration_bucket, error_status) for trace search.

Query layer: a router that fans out queries to the appropriate store. A 'show me errors in service X in last 5 minutes' query hits the log store. A 'show me p99 latency of API endpoint Y over the last week' hits the metric store. A 'show me the trace for request Z' hits the trace store. Cross-store queries (e.g., 'what logs were emitted during this trace') do a coordinated fanout with trace_id as the join key.

Alerting: a separate evaluation engine reads from the metric store on a 10-second cadence, evaluates alert rules (PromQL-style expressions), and fires alerts when a rule's condition is true. Alert state is persisted (firing, resolved) and downstream integrations (paging, chat) get notified on state changes.

Deep dive — the hard problem

Three deep dives: cardinality control, retention tiering, and the trace-tail-sampling problem.

Cardinality control: this is the silent killer of metrics systems. Naive tagging like `request_latency_ms{user_id, request_id}` creates one time series per (user, request) — millions of series, most with one or two data points. Every series has overhead (index entry, storage allocation, query-time scan cost). A platform that loses cardinality control bleeds disk and CPU on a long tail of one-shot series.

Production systems enforce cardinality limits at ingest. Each metric name has a configured maximum series count (say, 10K series per metric). When new tag combinations push the count over the limit, the metric is dropped or its high-cardinality tags are coerced to a placeholder ('user_id=__overflow__'). Operators monitor cardinality dashboards and rework instrumentation when limits are approached.

The rule of thumb: tags should be low-cardinality dimensions (region, service, environment, status_code). High-cardinality identifiers (user_id, request_id, trace_id) belong on logs or traces, not on metrics. Mention this rule explicitly — it's a frequent senior-level discussion point.

Retention tiering: not all telemetry data is queried equally. Recent data (last 15 minutes) gets hammered by dashboards and alerts; week-old data gets occasional incident investigation; year-old data gets quarterly compliance queries. The cost structure should match.

Hot tier (last 30 days): SSDs, optimized columnar layout, fully indexed. Query latency in seconds. High cost per TB. Holds all metrics, sampled logs, sampled traces.

Warm tier (1 year): object storage with compressed columnar files. Queries pull files into a cache and scan them. Latency in tens of seconds. Cost ~10x cheaper than hot.

Cold tier (1+ year, compliance retention): archival object storage with much higher retrieval latency (minutes to hours). Cost ~10x cheaper than warm. Used only for regulatory queries that aren't time-sensitive.

Migration between tiers is automated. A background job re-encodes hot-tier data into warm-tier format and deletes hot copies after the warm copy is verified. Cold-tier migration similarly re-encodes and archives. The store presents queries with a unified view; the query layer fans out to whichever tier holds the requested time range.

Downsampling: high-frequency metrics (1-second resolution) don't stay at 1-second resolution forever. After a few hours, they downsample to 1-minute resolution. After a few days, to 5-minute. After a few weeks, to 1-hour. Each downsampling reduces storage by 60x for the lost resolution. The dashboard query layer picks the appropriate resolution for the time window (a 15-minute dashboard uses 1-second; a 1-month dashboard uses 1-hour).

Trace-tail sampling: distributed traces are voluminous and most aren't interesting. A boring trace of a successful 50ms API call adds no value to incident investigation. But you can't decide whether a trace is interesting until the trace is complete — the error span might be deep in the call tree, hours after the root span started.

Production systems use tail-based sampling: buffer all spans of a trace until the trace is complete (root span + all descendants seen, or a timeout), then apply sampling rules to the complete trace. Rules typically keep: all error traces, all slow traces (>p99 latency), a 1-10% sample of normal traces. Discarded traces don't write to storage at all.

The buffering is the cost. A trace with N services can have spans arriving over minutes (long-running async work). The buffer holds all in-flight traces until completion. This means the trace processing layer needs RAM proportional to in-flight trace count, not throughput. At 100K traces in flight × 10 spans × 1 KB = 1 GB buffer — fits in memory comfortably for a single processor, but as traces span longer or branch wider, the buffer grows.

Mature systems use a hybrid: head-based sampling at instrumentation time (a fraction of incoming requests are flagged 'trace this end-to-end' upstream) plus tail-based sampling on completed traces for the kept-flagged ones. This caps the trace volume at ingest while preserving error/slow-trace retention.

Fourth tradeoff: log ingestion is the most voluminous and least structured. Application code writes free-form log lines; the system can't pre-validate cardinality the way it can for metrics. The mitigation is sampling and tiered cost — info-level logs sample at 10%, warning logs at 50%, error logs at 100%. Sampling decisions are made at the agent (before bandwidth) when possible, at the buffer otherwise. Naming the sampling tier explicitly is the senior signal.

Common mistakes

  • Storing all telemetry types in one store — logs, metrics, and traces have wildly different access patterns and need separate stores
  • Skipping cardinality control — high-cardinality tags on metrics destroy the time-series store at scale
  • Treating retention as one tier — without hot/warm/cold tiering, the cost is 10x what it needs to be
  • Head-only trace sampling — head sampling discards traces that turn out to be errors after the root span; tail sampling is required for error retention
  • Forgetting the buffer between ingest and processing — without it, ingest spikes drop events when downstream is slow

Likely follow-up questions

  • How would you support a 'live tail' feature where an engineer watches logs from a service as they're emitted?
  • What changes if the platform must redact PII from logs before storage (GDPR-style data handling)?
  • How would you correlate a customer-reported error to the exact trace and logs that caused it?
  • How would you handle a service that suddenly increases its log volume 100x due to a bug — without that service taking down the telemetry platform?
  • How would you implement a query language that lets engineers join logs and metrics and traces in one query?

Practice Design a Telemetry Pipeline live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

How long is the Design Telemetry Pipeline interview?
60-75 minutes is typical at observability-focused teams. Expect deep questions on cardinality, retention tiers, and the trace-tail-sampling problem. The candidate is expected to know that logs/metrics/traces are separate stores, not one unified store.
Do I need to know specific telemetry-platform names?
No. The interview is about the patterns: collect → buffer → process → store, with cardinality control and retention tiering. Naming the underlying database (TSDB, columnar log store) is enough; the specific commercial product is irrelevant to the design.
How is this different from a generic data pipeline?
Telemetry has unique constraints. (1) Ingest must never reject — losing observability during an incident is catastrophic. (2) Cardinality control matters because metric series count, not byte volume, drives index cost. (3) Trace tail sampling is unique to traces — you can't decide what to keep until the trace is complete. Generic data pipelines don't face these in the same way.
What is the most important concept for Design Telemetry Pipeline?
Three things: separate stores per telemetry type, cardinality control on metrics, and tail sampling on traces. The senior signal is recognizing that the three telemetry types have different access patterns and cost structures, and explaining the cardinality limit and trace-buffering mechanisms explicitly.