Design a Notification Service (Push/SMS/Email)

System Design
Medium
Meta

Design a generalized system for sending millions of notifications daily. Discuss queuing (Kafka/RabbitMQ), delivery guarantees, and handling failure modes.

Why Interviewers Ask This

Interviewers at Meta ask this to evaluate your ability to design scalable, fault-tolerant distributed systems under high load. They specifically assess how you balance throughput with delivery guarantees when handling millions of daily events. The question tests your understanding of asynchronous processing, message broker selection, and strategies for managing partial failures without blocking the entire system.

How to Answer This Question

1. Clarify requirements immediately: Define scale (requests per second), latency tolerance, and specific delivery guarantees (at-most-once vs. at-least-once) required by different channels like push versus email.
2. Outline the high-level architecture: Propose a flow where client requests hit an API gateway, which enqueues messages into a robust broker like Kafka rather than sending directly.
3. Detail the consumer logic: Explain how worker groups consume from topics, transform data for specific providers (e.g., Firebase for push, Twilio for SMS), and handle retries with exponential backoff.
4. Address failure modes: Discuss idempotency keys to prevent duplicate sends during network glitches and dead-letter queues for unresolvable errors.
5. Optimize for cost and reliability: Mention sharding strategies for partitioning traffic and monitoring metrics like lag and error rates to ensure SLA compliance.
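
The retry and dead-letter handling from steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not a production consumer: an in-memory list stands in for a real dead-letter topic, and the names Worker, TransientError, and PermanentError, as well as the default attempt count and delays, are assumptions made for the example rather than part of any provider SDK.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (timeout, rate limit)."""


class PermanentError(Exception):
    """Hypothetical: a failure retries cannot fix (invalid device token)."""


@dataclass
class Worker:
    send: Callable[[dict], None]          # wrapper around a provider call (assumed)
    max_attempts: int = 4                 # illustrative default
    base_delay: float = 0.5               # seconds, illustrative default
    max_delay: float = 30.0               # upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep
    dead_letters: list = field(default_factory=list)  # stand-in for a DLQ topic

    def backoff(self, attempt: int) -> float:
        # Exponential backoff with full jitter: wait a random time between
        # 0 and min(max_delay, base_delay * 2^attempt).
        cap = min(self.max_delay, self.base_delay * (2 ** attempt))
        return random.uniform(0, cap)

    def handle(self, message: dict) -> bool:
        reason = "retries exhausted"
        for attempt in range(self.max_attempts):
            try:
                self.send(message)
                return True
            except PermanentError as exc:
                # Retrying cannot help, so stop immediately.
                reason = f"permanent: {exc}"
                break
            except TransientError:
                # Wait before the next attempt, but not after the last one.
                if attempt + 1 < self.max_attempts:
                    self.sleep(self.backoff(attempt))
        # Park the message so it does not block the rest of the partition.
        self.dead_letters.append((message, reason))
        return False
```

A caller would construct Worker(send=some_provider_call) and invoke handle() for each consumed message. Injecting sleep keeps the backoff testable without real waiting, and the jitter is there so that many workers failing at once do not all retry at the same instant.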

Key Points to Cover

  • Explicitly defining the trade-off between latency and delivery guarantees (at-most-once vs at-least-once)
  • Using a message broker like Kafka to decouple producers from consumers for scalability
  • Implementing idempotency keys to prevent duplicate notifications during retries
  • Designing a Dead Letter Queue (DLQ) pattern for handling permanent failures
  • Addressing backpressure and auto-scaling mechanisms to handle traffic spikes
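
To make the idempotency-key point concrete, here is a small sketch under stated assumptions. The key is derived from a hypothetical (user_id, channel, event_id) triple, and DedupStore is an in-memory stand-in for whatever shared store with an atomic set-if-absent operation and expiry a real deployment would use. The function and field names are illustrative only.

```python
import hashlib
import time


def idempotency_key(user_id: str, channel: str, event_id: str) -> str:
    # Same logical notification -> same key, no matter how often it is redelivered.
    raw = f"{user_id}:{channel}:{event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class DedupStore:
    """In-memory stand-in for a shared store with set-if-absent and a TTL."""

    def __init__(self, ttl_seconds: float = 86400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}

    def claim(self, key: str) -> bool:
        # True only for the first caller within the TTL window.
        now = self.clock()
        expires = self._expires_at.get(key)
        if expires is not None and expires > now:
            return False
        self._expires_at[key] = now + self.ttl
        return True

    def release(self, key: str) -> None:
        # Undo a claim so that a later retry is allowed through.
        self._expires_at.pop(key, None)


def deliver_once(store: DedupStore, send, notification: dict) -> str:
    key = idempotency_key(
        notification["user_id"], notification["channel"], notification["event_id"]
    )
    if not store.claim(key):
        return "duplicate"          # already sent, or currently being sent
    try:
        send(notification)
    except Exception:
        store.release(key)          # the send failed, so let a retry proceed
        raise
    return "sent"
```

This narrows the duplicate window but does not close it: a worker that crashes after the provider accepts the message and before the broker offset is committed can still cause a resend once the key expires, which is why the TTL should comfortably exceed the longest expected retry horizon.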

Sample Answer

To design a notification service capable of handling millions of daily events, I would prioritize decoupling ingestion from delivery using Apache Kafka. First, we define the scope: if we need strict delivery guarantees f…

Common Mistakes to Avoid

  • Suggesting a synchronous HTTP call to the provider for every notification, which ties up request threads and turns provider latency into a bottleneck
  • Ignoring idempotency, leading to users receiving multiple copies of the same alert during network retries
  • Failing to distinguish between different notification types, treating all messages with identical priority and storage needs
  • Overlooking the need for a Dead Letter Queue, leaving failed messages stuck in the main processing loop
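
One way to avoid treating every notification identically is to route by type to separate topics, each with its own retry budget and expiry. The sketch below is illustrative only: the topic names, attempt counts, TTLs, and the choice of type labels ("otp", "transactional", "marketing") are all assumptions, and partition_for simply shows a stable per-user hash, not any broker's built-in partitioner.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    topic: str          # hypothetical topic name
    max_attempts: int   # retry budget before dead-lettering
    ttl_seconds: int    # drop the message if it is older than this


# Illustrative values: time-sensitive traffic gets its own topic and a short TTL,
# so a bulk campaign cannot delay a login code.
ROUTES = {
    "otp": Route("notifications.critical", max_attempts=5, ttl_seconds=300),
    "transactional": Route("notifications.high", max_attempts=5, ttl_seconds=86_400),
    "marketing": Route("notifications.bulk", max_attempts=2, ttl_seconds=259_200),
}
DEFAULT_ROUTE = ROUTES["marketing"]


def route_for(notification_type: str) -> Route:
    # Unknown types fall back to the lowest-priority route.
    return ROUTES.get(notification_type, DEFAULT_ROUTE)


def partition_for(user_id: str, num_partitions: int) -> int:
    # crc32 is stable across processes (unlike Python's built-in hash() for str),
    # so one user's messages always land on the same partition.
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions
```

Keeping a user's messages on one partition preserves their relative order for that user, at the cost of uneven load if a few users generate most of the traffic.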
