Design a Distributed Queue (Kafka/SQS)

System Design
Medium
Amazon
24K views

Explain the architecture of a distributed message queue. Discuss consumers, producers, partitioning, offset management, and at-least-once vs. exactly-once delivery.

Why Interviewers Ask This

Interviewers at Amazon ask this to evaluate your ability to design scalable, fault-tolerant systems that handle high-throughput data streams. They specifically assess your understanding of trade-offs between consistency and availability, your grasp of partitioning strategies for parallelism, and your knowledge of delivery semantics like at-least-once versus exactly-once in real-world distributed environments.

How to Answer This Question

1. Clarify requirements: Ask about throughput, latency needs, retention policies, and whether ordering matters within partitions or globally. 2. Define core components: Briefly outline producers, brokers, consumers, and the underlying storage mechanism like log-structured merge trees. 3. Explain partitioning strategy: Describe how topics are split into partitions for parallelism and how keys determine message routing. 4. Discuss offset management: Explain how consumers track progress via committed offsets and what happens during rebalancing. 5. Address delivery guarantees: Compare at-least-once (idempotency required) vs. exactly-once (transactional writes), noting Amazon's preference for practical solutions over theoretical perfection.

Key Points to Cover

  • Explicitly mention partitioning as the mechanism for horizontal scalability and parallel consumption
  • Explain offset tracking as the foundation for consumer progress and fault tolerance
  • Distinguish clearly between at-least-once (practical) and exactly-once (complex) delivery models
  • Reference Amazon's focus on idempotency over complex transactional guarantees
  • Discuss rebalancing protocols and how they impact consumer group stability

Sample Answer

To design a distributed queue like Kafka or SQS, I start by clarifying requirements. For Amazon-scale workloads, we need high throughput, low latency, and durability. The system consists of producers sending messages to…

Common Mistakes to Avoid

  • Confusing topic-level ordering with partition-level ordering, leading to incorrect assumptions about global sequence
  • Ignoring the role of replication factors when discussing durability and fault tolerance
  • Overcomplicating exactly-once semantics without acknowledging the operational overhead it introduces
  • Failing to address what happens during consumer failures or cluster node outages

Sound confident on this question in 5 minutes

Answer once and get a 30-second AI critique of your structure, content, and delivery. First attempt is free — no signup needed.

Try it free

Related Interview Questions

Browse all 190 System Design questionsBrowse all 184 Amazon questions