Design a Cloud Cost Monitoring and Alerting System

System Design
Medium
Spotify

Design a service to track cloud spending (AWS/Azure) across different teams/projects, predict future spend, and alert on sudden spikes.

Why Interviewers Ask This

Interviewers at Spotify ask this to evaluate your ability to balance cost efficiency with engineering velocity. They want to see if you can design a system that provides real-time visibility across distributed teams without becoming a bottleneck. This tests your understanding of cloud billing APIs, data aggregation strategies, and your capacity to build scalable alerting mechanisms that prevent budget overruns while maintaining developer autonomy.

How to Answer This Question

1. Clarify Requirements: Immediately define scope, such as supporting AWS and Azure simultaneously, handling multi-team tags, and determining acceptable latency for alerts (e.g., near-real-time vs. daily). Ask about specific budget thresholds or SLOs for the monitoring service itself.
2. Define High-Level Architecture: Propose a pipeline starting with log ingestion from cloud providers, moving through a processing layer for normalization, and ending in a storage and visualization layer. Mention how you would handle data volume spikes during month-end billing cycles.
3. Detail Core Components: Discuss specific technologies like Kinesis or Kafka for streaming, a time-series database like Prometheus or InfluxDB for metrics, and a machine learning model for anomaly detection regarding spend spikes.
4. Address Edge Cases: Explain how you will handle missing tags, currency conversion, and false positives in alerting to avoid 'alert fatigue' for engineering managers.
5. Summarize Trade-offs: Conclude by discussing the balance between cost of the monitoring tool versus potential savings, ensuring the solution aligns with Spotify's culture of experimentation and ownership.
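
The normalization and edge-case steps above can be made concrete with a small sketch. The one below maps provider-specific billing rows onto one shared record, assigns untagged spend to an "unallocated" bucket, and converts costs to USD. The raw field names (`usage_date`, `meter_category`, `billing_currency`, and so on), the `CostRecord` schema, and the static exchange-rate table are all assumptions made for illustration; they are not the actual column names of any AWS or Azure export.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical static FX table; a real pipeline would look up dated rates.
FX_TO_USD = {"USD": 1.0, "EUR": 1.10}

# Bucket for spend that carries no usable team tag.
UNALLOCATED = "unallocated"

# Hypothetical per-provider field names, mapped onto the shared schema.
FIELD_MAP = {
    "aws": {"day": "usage_date", "service": "service",
            "cost": "cost", "currency": "currency"},
    "azure": {"day": "date", "service": "meter_category",
              "cost": "cost", "currency": "billing_currency"},
}


@dataclass(frozen=True)
class CostRecord:
    provider: str
    day: date
    team: str
    service: str
    cost_usd: float


def normalize(provider: str, raw: dict) -> CostRecord:
    """Map one provider-specific billing row onto the shared schema."""
    fields = FIELD_MAP[provider]
    # Match tag keys case-insensitively, since tagging conventions drift.
    tags = {k.lower(): v for k, v in (raw.get("tags") or {}).items()}
    team = str(tags.get("team") or "").strip().lower() or UNALLOCATED
    rate = FX_TO_USD[raw[fields["currency"]]]
    return CostRecord(
        provider=provider,
        day=date.fromisoformat(raw[fields["day"]]),
        team=team,
        service=raw[fields["service"]],
        cost_usd=round(float(raw[fields["cost"]]) * rate, 6),
    )


aws_row = {"usage_date": "2024-05-01", "service": "compute", "cost": 12.5,
           "currency": "USD", "tags": {"Team": "Playback"}}
azure_row = {"date": "2024-05-01", "meter_category": "storage", "cost": 10.0,
             "billing_currency": "EUR", "tags": None}

aws_record = normalize("aws", aws_row)
azure_record = normalize("azure", azure_row)
```

Keeping untagged spend in an explicit bucket, rather than dropping it, lets the dashboard show how much cost is still unattributed, which is usually the first thing teams need to fix.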

Key Points to Cover

  • Demonstrating knowledge of specific cloud provider APIs (AWS Cost Explorer, Azure Billing API) and their limitations
  • Proposing a decoupled, event-driven architecture to handle variable data ingestion loads effectively
  • Addressing the challenge of data normalization across multiple clouds and inconsistent tagging strategies
  • Incorporating predictive analytics or anomaly detection rather than just static threshold-based alerting
  • Designing for user experience by including self-service dashboards to empower individual engineering teams
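
To illustrate the difference between a static threshold and anomaly detection, here is a minimal sketch that flags a day's spend when it sits far above the recent median, measured in units of the median absolute deviation (MAD). This is one simple robust-statistics approach chosen for illustration, not a prescribed model; the function name `is_spike`, the 3.5 cutoff, the seven-day minimum history, and the 50% fallback rule are all assumptions.

```python
from statistics import median


def is_spike(history, today, threshold=3.5, min_history=7):
    """Return True if today's spend is an unusually high outlier.

    history: recent daily spend values for one team or project.
    today:   the spend value being evaluated.
    """
    if len(history) < min_history:
        # Too little data to judge; stay quiet rather than guess.
        return False
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        # Perfectly flat history: fall back to an assumed 50% jump rule.
        return today > med * 1.5
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    robust_z = 0.6745 * (today - med) / mad
    return robust_z > threshold


daily_spend = [100, 102, 98, 101, 99, 103, 97]
normal_day = is_spike(daily_spend, 104)
spike_day = is_spike(daily_spend, 150)
```

The median and MAD are used instead of the mean and standard deviation because a single earlier spike would otherwise inflate the baseline and hide the next one. Only upward deviations are flagged, since a sudden drop in spend is rarely a budget risk.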

Sample Answer

To design a Cloud Cost Monitoring and Alerting System suitable for Spotify's scale, I would start by clarifying our non-functional requirements: sub-hourly data freshness for critical alerts and support for both AWS and…

Common Mistakes to Avoid

  • Focusing solely on the UI or dashboard without explaining the underlying data pipeline and storage strategy
  • Ignoring the complexity of normalizing data from different cloud providers with different billing granularities
  • Over-engineering the solution with complex microservices when a simpler serverless approach might suffice initially
  • Failing to discuss how to handle false positives in alerting, which leads to engineers ignoring critical warnings
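
One common way to discuss false positives is to gate alerts before they reach anyone. The sketch below requires an anomaly to persist for several consecutive evaluations and then enforces a per-team cooldown so the same spike does not page repeatedly. The class name `AlertGate` and the defaults (two confirmations, a six-hour cooldown) are hypothetical values picked for the example.

```python
from datetime import datetime, timedelta


class AlertGate:
    """Suppress one-off blips and repeated alerts for the same key."""

    def __init__(self, confirmations=2, cooldown=timedelta(hours=6)):
        self.confirmations = confirmations  # consecutive anomalies required
        self.cooldown = cooldown            # minimum gap between alerts
        self._streak = {}                   # key -> current anomaly streak
        self._last_sent = {}                # key -> time of last alert

    def should_alert(self, key, anomalous, now):
        if not anomalous:
            # A normal reading resets the streak for this key.
            self._streak[key] = 0
            return False
        self._streak[key] = self._streak.get(key, 0) + 1
        if self._streak[key] < self.confirmations:
            return False
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True


gate = AlertGate()
start = datetime(2024, 5, 1, 12, 0)
decisions = [
    gate.should_alert("playback", True, start),                        # first sighting
    gate.should_alert("playback", True, start + timedelta(hours=1)),   # confirmed
    gate.should_alert("playback", True, start + timedelta(hours=2)),   # in cooldown
    gate.should_alert("playback", True, start + timedelta(hours=8)),   # cooldown over
]
```

The trade-off is latency: requiring confirmations delays the first alert by at least one evaluation interval, so the interval and confirmation count should be chosen against the alert-freshness requirement agreed during requirements gathering.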
