Design a System for Monitoring Service Health

System Design
Medium
Salesforce

Design a system to collect metrics and check the health status of thousands of microservices. Discuss pull vs. push models (Prometheus vs. StatsD).

Why Interviewers Ask This

Interviewers at Salesforce ask this to evaluate your ability to design scalable, reliable monitoring systems for complex microservice architectures. They specifically assess your understanding of data collection strategies, trade-offs between pull and push models, and how to handle high-volume metrics without overwhelming the system or losing critical health signals during outages.

How to Answer This Question

1. Clarify requirements: Ask about scale (thousands of services), latency tolerance, and whether real-time alerting is needed versus batch analysis.
2. Define the architecture: Propose a layered approach involving agents on services, a central ingestion layer, time-series storage, and an alerting engine.
3. Compare collection models: Explicitly contrast Prometheus's pull model (the collector scrapes each service, so a failed scrape doubles as a health signal) against StatsD's push model (fire-and-forget UDP with very low client overhead, but samples can be dropped silently under load).
4. Address scalability: Discuss sharding strategies, cardinality limits, and handling network partitions in a distributed environment like Salesforce's ecosystem.
5. Conclude with resilience: Explain how the monitoring system stays operational even when monitored services are failing, for example by buffering metrics locally before sending them.
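
The contrast in step 3 can be made concrete with a small in-process sketch. It is illustrative only and is not how Prometheus or StatsD are implemented: `PullCollector`, `PushCollector`, and the `stale_after_s` parameter are hypothetical names, and plain Python callables stand in for a service's metrics endpoint.

```python
import time
from typing import Callable, Dict, Optional

Metrics = Dict[str, float]


class PullCollector:
    """Scrapes every target itself, so a failed scrape is a health signal."""

    def __init__(self, targets: Dict[str, Callable[[], Metrics]]):
        self.targets = targets            # name -> stand-in for an HTTP metrics fetch
        self.up: Dict[str, int] = {}      # 1 if the last scrape succeeded, else 0
        self.samples: Dict[str, Metrics] = {}

    def scrape_all(self) -> None:
        for name, fetch in self.targets.items():
            try:
                self.samples[name] = fetch()
                self.up[name] = 1
            except Exception:
                # The collector learns about the outage directly.
                self.up[name] = 0


class PushCollector:
    """Accepts whatever services send; silence is only visible via a timeout."""

    def __init__(self, stale_after_s: float):
        self.stale_after_s = stale_after_s
        self.last_seen: Dict[str, float] = {}
        self.samples: Dict[str, Metrics] = {}

    def receive(self, service: str, metrics: Metrics,
                now: Optional[float] = None) -> None:
        self.last_seen[service] = time.monotonic() if now is None else now
        self.samples[service] = metrics

    def is_stale(self, service: str, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        seen = self.last_seen.get(service)
        # A service that never reported is indistinguishable from a dead one.
        return seen is None or current - seen > self.stale_after_s
```

In an interview, the point to draw from a sketch like this is that the pull side gets liveness for free from each scrape, while the push side needs an explicit staleness rule and cannot tell "never deployed" from "crashed".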

Key Points to Cover

  • Explicitly comparing the reliability of pull models against the efficiency of push models
  • Demonstrating awareness of cardinality explosion risks in high-scale systems
  • Proposing a resilient architecture that buffers data during service failures
  • Addressing horizontal scaling strategies for the ingestion and storage layers
  • Connecting technical choices to business needs like reduced alert fatigue
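
The cardinality point above can be illustrated with a minimal sketch of an ingestion-side guard that caps the number of distinct label combinations per metric. `CardinalityGuard` and `max_series_per_metric` are hypothetical names for this example, not part of any specific monitoring product.

```python
from typing import Dict, Set, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


class CardinalityGuard:
    """Rejects new time series once a metric reaches its series budget."""

    def __init__(self, max_series_per_metric: int):
        self.max_series = max_series_per_metric
        self.seen: Dict[str, Set[LabelKey]] = {}
        self.rejected = 0

    def admit(self, metric: str, labels: Dict[str, str]) -> bool:
        # Sort label pairs so {"a": "1", "b": "2"} and {"b": "2", "a": "1"}
        # map to the same series.
        key: LabelKey = tuple(sorted(labels.items()))
        series = self.seen.setdefault(metric, set())
        if key in series:
            return True               # existing series: always accepted
        if len(series) >= self.max_series:
            self.rejected += 1        # new series over budget: dropped
            return False
        series.add(key)
        return True
```

A guard of this kind keeps one service that puts a user ID or request ID into a label from inflating storage and query cost for everyone else; the rejected count is itself worth exporting as a metric.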

Sample Answer

To design a system for monitoring thousands of microservices, I would first clarify that we need sub-second visibility into health status with minimal overhead. The architecture should start with lightweight sidecar agen…

Common Mistakes to Avoid

  • Focusing solely on one tool without explaining the underlying architectural trade-offs
  • Ignoring the impact of high cardinality labels on storage costs and query performance
  • Assuming the monitoring system itself cannot fail and not designing for self-healing
  • Overlooking the difference between synchronous and asynchronous metric collection in high-load scenarios
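
One way to avoid the last two mistakes is to keep metric recording off the request path and tolerate backend outages with a bounded local buffer. The following is a single-threaded sketch under stated assumptions: `BufferedEmitter`, `send`, and `capacity` are hypothetical names, and in a real agent `flush` would typically be driven by a background timer rather than called inline.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Sample = Tuple[str, float]


class BufferedEmitter:
    """Records samples locally and ships them in batches when flushed."""

    def __init__(self, send: Callable[[List[Sample]], None], capacity: int):
        self.send = send                                   # stand-in for the network call
        self.buffer: Deque[Sample] = deque(maxlen=capacity)
        self.dropped = 0

    def record(self, name: str, value: float) -> None:
        # Cheap, non-blocking: never touches the network.
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1      # deque(maxlen=...) evicts the oldest sample
        self.buffer.append((name, value))

    def flush(self) -> bool:
        if not self.buffer:
            return True
        batch = list(self.buffer)
        try:
            self.send(batch)
        except Exception:
            return False           # backend down: keep samples for a later retry
        for _ in range(len(batch)):
            self.buffer.popleft()  # remove only what was actually sent
        return True
```

The bounded capacity is a deliberate trade-off: during a long outage the agent loses the oldest samples instead of exhausting memory on the service it is supposed to be monitoring.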
