Design a System for Distributed Tracing (Jaeger/Zipkin)

System Design
Medium
Google

Design a system to trace a single request across hundreds of microservices. Focus on span collection, sampling, and visualization for debugging performance bottlenecks.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to design scalable, high-throughput systems that ingest massive volumes of data without blocking user requests. At Google, they specifically test whether you understand the trade-offs between consistency and availability in distributed environments, and whether you can choose sampling strategies that control storage costs while retaining the traces that matter for debugging.

How to Answer This Question

1. Clarify requirements by defining scale (requests per second), retention policies, and latency constraints typical of Google's infrastructure.
2. Propose a high-level architecture involving client-side instrumentation, an agent for aggregation, and a centralized collector service.
3. Detail the span lifecycle: generation at microservices, propagation via context headers (like W3C Trace Context), and transmission to collectors.
4. Discuss sampling strategies, contrasting head-based sampling (the keep-or-drop decision is made when the trace starts, which is cheap but blind to the outcome) with tail-based sampling (the decision is made after the trace completes, so errors and slow requests can be retained), and explain how each reduces load on downstream storage.
5. Address visualization and querying, suggesting a wide-column store like Bigtable or Cassandra keyed by trace ID for fast lookups, and explain how to index traces for efficient filtering by service or latency thresholds.
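
The context propagation in step 3 can be sketched in a few lines. This is a minimal illustration, assuming version `00` of the W3C `traceparent` header (`00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`). The names `SpanContext`, `inject`, `extract`, and `child_of` are hypothetical helpers for this sketch, not the API of Jaeger, Zipkin, or any other library.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

# Version 00 traceparent: "00-<trace-id>-<parent-id>-<flags>", lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical container for the fields carried between services."""
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, unique per span
    sampled: bool   # lowest bit of the trace-flags byte


def new_root_context(sampled: bool) -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context; return None if the header is absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return SpanContext(trace_id, parent_id, bool(int(flags, 16) & 1))


def child_of(parent: SpanContext) -> SpanContext:
    """Create this service's span: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A receiving service would call `extract` on the incoming headers, create its own span with `child_of`, and call `inject` before each downstream call, so every span in the request shares one trace ID.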

Key Points to Cover

  • Explicitly define the difference between head-based and tail-based sampling strategies
  • Explain how asynchronous batching prevents the tracing system from blocking production code
  • Propose a storage solution optimized for write-heavy workloads and fast read queries
  • Demonstrate understanding of context propagation mechanisms like W3C Trace Context
  • Address cost implications of storing full trace data versus sampled data
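
To make the first key point concrete, the two sampling strategies can be sketched side by side. This is an illustration under stated assumptions, not how any particular tracer implements it: `head_sample`, `Span`, and `TailSampler` are hypothetical names, trace IDs are assumed to be 32-character hex strings, and the tail sampler keeps everything in memory on a single collector.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at trace start, using only the trace ID.

    Deriving the decision from the ID means every service that sees the
    same trace reaches the same answer without coordination.
    """
    low_bits = int(trace_id[-16:], 16)      # low 64 bits of the trace ID
    return low_bits < int(rate * 2**64)     # keep roughly `rate` of all traces


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class TailSampler:
    """Tail-based: buffer all spans, decide once the trace is complete."""

    def __init__(self, latency_threshold_ms: float) -> None:
        self.latency_threshold_ms = latency_threshold_ms
        self._pending: Dict[str, List[Span]] = defaultdict(list)

    def add(self, span: Span) -> None:
        self._pending[span.trace_id].append(span)

    def finish(self, trace_id: str) -> List[Span]:
        """Return the spans to store, or an empty list to drop the trace."""
        spans = self._pending.pop(trace_id, [])
        interesting = any(
            s.error or s.duration_ms >= self.latency_threshold_ms for s in spans
        )
        return spans if interesting else []
```

The trade-off shows up directly in the code: `head_sample` needs no state but cannot know whether a trace will contain an error, while `TailSampler` keeps exactly the interesting traces at the cost of buffering every span until the trace finishes.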

Sample Answer

To design a distributed tracing system like Jaeger for Google-scale services, I would start by establishing the core components: instrumentation agents, collectors, and a storage backend. First, each microservice generat…

Common Mistakes to Avoid

  • Focusing only on the UI visualization while ignoring the heavy data ingestion pipeline
  • Suggesting synchronous tracing, which would introduce unacceptable latency to user requests
  • Overlooking the need for sampling, leading to a proposal that cannot scale to billions of daily requests
  • Ignoring context propagation, making it impossible to link spans across different microservices
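
One way to avoid the synchronous-tracing mistake is a bounded queue drained by a background thread. The sketch below is a simplified illustration: `BatchingExporter` and its parameters are hypothetical names, and `send_batch` stands in for whatever call ships spans to an agent or collector. The key design choice is that `submit` never blocks; when the queue is full, spans are dropped and counted rather than slowing the request.

```python
import queue
import threading
from typing import Any, Callable, List


class BatchingExporter:
    """Hands spans to a background thread so request threads never wait on I/O."""

    def __init__(
        self,
        send_batch: Callable[[List[Any]], None],  # assumed transport callback
        max_queue: int = 10_000,
        max_batch: int = 512,
        flush_interval_s: float = 1.0,
    ) -> None:
        self._send_batch = send_batch
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_queue)
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self.dropped = 0  # approximate counter; not synchronized across threads
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, span: Any) -> bool:
        """Called on the request path: enqueue without blocking, drop if full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self, wait_s: float) -> List[Any]:
        """Wait up to wait_s for one span, then take more up to max_batch."""
        batch: List[Any] = []
        try:
            batch.append(self._queue.get(timeout=wait_s))
            while len(batch) < self._max_batch:
                batch.append(self._queue.get_nowait())
        except queue.Empty:
            pass
        return batch

    def _run(self) -> None:
        while not self._stop.is_set():
            batch = self._drain(self._flush_interval_s)
            if batch:
                self._send_batch(batch)
        # Flush whatever is still queued at shutdown.
        while True:
            batch = self._drain(0.0)
            if not batch:
                break
            self._send_batch(batch)

    def shutdown(self) -> None:
        """Stop the worker and wait for the final flush."""
        self._stop.set()
        self._worker.join()
```

In an interview, it is worth stating the consequence aloud: under overload this design loses trace data instead of adding latency to user requests, which is usually the right priority for an observability system.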
