Design a System for Monitoring API Latency
Design a system to measure, aggregate, and alert on the latency and error rates of thousands of API endpoints. Focus on sampling vs. full-data collection.
Why Interviewers Ask This
Interviewers at Salesforce ask this to evaluate your ability to balance system reliability with cost efficiency. They specifically want to see if you understand that collecting every data point for thousands of endpoints is unsustainable. This question tests your judgment in choosing between sampling strategies and full-data collection while ensuring critical alerts trigger without overwhelming infrastructure.
How to Answer This Question
1. Clarify Requirements: Immediately define scale, such as handling millions of requests per second across thousands of endpoints, and identify key metrics like p99 latency and error rates.
2. Propose Architecture: Sketch a high-level flow involving API gateways, collectors, and storage.
3. Address Sampling vs. Full Data: This is the core challenge. Explain why full collection is too expensive and propose adaptive sampling based on traffic volume or error thresholds.
4. Detail Aggregation: Describe how to compute percentiles (p50, p95, p99) using sliding windows or quantile sketches such as t-digest or DDSketch to save memory (note that HyperLogLog estimates cardinality, not quantiles).
5. Define Alerting Logic: Outline rules for triggering notifications only when anomalies exceed specific baselines to prevent alert fatigue.
6. Discuss Trade-offs: Conclude by analyzing the accuracy loss from sampling versus the cost savings, showing you understand business constraints.
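The adaptive sampling in step 3 can be sketched as a per-request keep/drop decision. The function name, the `requests_per_sec` field, and the target of roughly 100 samples per second per endpoint are illustrative assumptions, not a prescribed API:

```python
import random

def sample_decision(endpoint_stats, is_error, base_rate=0.01):
    """Decide whether to record this request's metrics.

    Errors are always captured at 100%, since they are rare and
    diagnostically valuable. Successful requests are sampled at a
    rate that shrinks as endpoint traffic grows, so low-traffic
    endpoints stay fully visible while hot endpoints are downsampled.
    """
    if is_error:
        return True
    rps = endpoint_stats.get("requests_per_sec", 1)
    # Aim for ~100 retained samples/sec per endpoint, never below base_rate.
    rate = min(1.0, max(base_rate, 100.0 / max(rps, 1)))
    return random.random() < rate
```

A low-traffic endpoint (say 10 rps) is sampled at 100%, while a 1M-rps endpoint drops to the 1% floor, keeping collection cost roughly flat per endpoint.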
Key Points to Cover
- Explicitly rejecting full-data collection in favor of adaptive sampling to manage costs
- Using sketch algorithms or sliding windows for efficient percentile calculation
- Differentiating between normal traffic sampling and 100% capture for error scenarios
- Implementing anomaly detection rather than static thresholds to reduce false positives
- Aligning the solution with enterprise-scale needs typical of large platforms like Salesforce
Sample Answer
To design a monitoring system for thousands of API endpoints, I would start by defining our SLOs, specifically targeting p99 latency under 200ms. Collecting 100% of request data is prohibitively expensive and creates unn…
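The alerting logic described above, triggering only on sustained deviation from a baseline rather than a static threshold, can be sketched as follows. The function name, the 1.5x factor, and the three-window requirement are illustrative assumptions:

```python
def should_alert(recent_p99s, baseline_p99, factor=1.5, sustained=3):
    """Fire only when p99 exceeds the rolling baseline by `factor`
    for `sustained` consecutive windows, so a single one-off spike
    does not page anyone and alert fatigue stays low."""
    if len(recent_p99s) < sustained:
        return False
    return all(p > factor * baseline_p99 for p in recent_p99s[-sustained:])
```

With a 200ms baseline, three consecutive windows above 300ms trigger an alert, while an isolated spike does not.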
Common Mistakes to Avoid
- Suggesting full data collection for all requests, which ignores scalability and cost constraints
- Focusing solely on storage without explaining how to calculate percentiles efficiently at scale
- Ignoring the difference between average latency and tail latency (p99), which matters most for user experience
- Proposing a single centralized database for ingestion, creating a bottleneck instead of a distributed pipeline
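The average-vs-tail mistake in the list above is easy to demonstrate with synthetic numbers: a handful of slow outliers barely move the mean while dominating p99, which is what users actually feel.

```python
# 99 fast requests plus one pathological outlier (latencies in ms)
latencies = [10] * 99 + [2000]

mean = sum(latencies) / len(latencies)  # dominated by the fast majority
p99 = sorted(latencies)[int(0.99 * len(latencies))]  # the outlier itself

# mean ≈ 29.9 ms looks healthy; p99 = 2000 ms reveals the real problem
```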