Design a Distributed Cron/Scheduler Service

System Design
Medium
Google

Design a highly available service to run scheduled tasks across a cluster of machines. Discuss distributed locks and failure detection for tasks.

Why Interviewers Ask This

Interviewers at Google ask this to evaluate your ability to design resilient distributed systems under failure conditions. They specifically assess how you handle race conditions, ensure exactly-once execution semantics, and manage leader election without a single point of failure. This question reveals whether you can balance consistency with availability when coordinating tasks across nodes connected by an unreliable network.

How to Answer This Question

1. Clarify requirements: Define scale (tasks per second), latency tolerance, and whether 'at-least-once' or 'exactly-once' execution is needed.
2. High-level architecture: Propose a centralized coordinator or a peer-to-peer model using consensus algorithms like Raft or Paxos for state management.
3. Task distribution: Explain how jobs are queued, assigned to workers, and how the system handles worker failures via heartbeat mechanisms.
4. Distributed locking: Detail strategies like Redis-based locks or ZooKeeper ephemeral nodes to prevent duplicate task execution.
5. Failure detection: Describe how the system detects dead workers and re-schedules their tasks, ensuring data integrity throughout the process.
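
The heartbeat and rescheduling steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the class name HeartbeatMonitor, the 10-second timeout, the worker and task names, and the round-robin reassignment policy are all assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks worker liveness from heartbeats (hypothetical in-memory sketch)."""
    timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)    # worker_id -> last heartbeat time
    assignments: dict[str, str] = field(default_factory=dict)    # task_id -> worker_id

    def heartbeat(self, worker_id: str, now: float) -> None:
        # Record the most recent time we heard from this worker.
        self.last_seen[worker_id] = now

    def assign(self, task_id: str, worker_id: str) -> None:
        self.assignments[task_id] = worker_id

    def live_workers(self, now: float) -> list[str]:
        # A worker counts as live if its last heartbeat is within the timeout.
        return sorted(w for w, t in self.last_seen.items() if now - t <= self.timeout_s)

    def reassign_orphans(self, now: float) -> dict[str, str]:
        """Move tasks owned by timed-out workers to live ones, round-robin."""
        live = self.live_workers(now)
        moved: dict[str, str] = {}
        if not live:
            return moved  # nobody to take over; leave assignments untouched
        orphans = sorted(t for t, w in self.assignments.items() if w not in live)
        for i, task_id in enumerate(orphans):
            new_owner = live[i % len(live)]
            self.assignments[task_id] = new_owner
            moved[task_id] = new_owner
        return moved


monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.heartbeat("worker-a", now=0.0)
monitor.heartbeat("worker-b", now=0.0)
monitor.assign("nightly-report", "worker-a")
monitor.heartbeat("worker-b", now=12.0)      # worker-a stays silent
moved = monitor.reassign_orphans(now=15.0)   # worker-a has been silent for 15 s > 10 s
```

Note that a worker that misses its timeout may only be slow or partitioned rather than dead, so reassignment can produce a second execution of the same task. That is one reason the choice between at-least-once and exactly-once semantics has to be stated up front.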

Key Points to Cover

  • Explicitly stating the choice between at-least-once versus exactly-once execution semantics
  • Proposing a consensus algorithm like Raft to eliminate single points of failure
  • Describing specific distributed locking mechanisms such as ephemeral nodes or optimistic locking
  • Explaining how heartbeat intervals and timeouts trigger automatic task reassignment
  • Mentioning idempotency as a critical defense against duplicate executions during retries
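
As one illustration of the optimistic-locking and reassignment points above, the sketch below claims a scheduled run with a version check and a time-limited lease. It is an in-memory stand-in, assuming a strongly consistent store that can perform the check-and-write atomically (for example through a conditional update). RunStore, RunRecord, the 30-second lease, and the run key are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    owner: Optional[str] = None
    lease_expires: float = 0.0
    version: int = 0        # bumped on every successful claim; acts as a fencing token
    done: bool = False


class RunStore:
    """In-memory stand-in for a strongly consistent store (assumption)."""

    def __init__(self) -> None:
        self._runs: dict[tuple[str, int], RunRecord] = {}

    def read(self, key: tuple[str, int]) -> RunRecord:
        return self._runs.setdefault(key, RunRecord())

    def try_claim(self, key, worker: str, expected_version: int,
                  now: float, lease_s: float) -> bool:
        rec = self.read(key)
        # Optimistic check: fail if the run finished or someone claimed it since we read it.
        if rec.done or rec.version != expected_version:
            return False
        # Respect a lease that is still valid.
        if rec.owner is not None and now < rec.lease_expires:
            return False
        rec.owner, rec.lease_expires = worker, now + lease_s
        rec.version += 1
        return True

    def complete(self, key, worker: str, version: int) -> bool:
        rec = self.read(key)
        # Only the current lease holder, presenting the current version, may finish the run.
        if rec.done or rec.owner != worker or rec.version != version:
            return False
        rec.done = True
        return True


store = RunStore()
key = ("nightly-report", 1_700_000_000)   # (job id, nominal fire time), illustrative
v0 = store.read(key).version
first = store.try_claim(key, "worker-a", v0, now=0.0, lease_s=30.0)
second = store.try_claim(key, "worker-b", v0, now=1.0, lease_s=30.0)    # stale version
v1 = store.read(key).version
takeover = store.try_claim(key, "worker-b", v1, now=31.0, lease_s=30.0)  # lease expired
stale_finish = store.complete(key, "worker-a", version=1)   # old owner is fenced off
finished = store.complete(key, "worker-b", version=2)
```

The lease gives at-least-once behavior: if worker-a had already performed side effects before its lease expired, worker-b repeats them. The version check only stops the stale owner from recording completion, which is why the task body itself still needs to be idempotent.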

Sample Answer

To design a highly available distributed cron service, I would start by defining the core requirement: executing scheduled tasks exactly once even if nodes fail. For the architecture, I'd avoid a single central server to…

Common Mistakes to Avoid

  • Ignoring clock synchronization issues, which can cause tasks to trigger early or late on different nodes
  • Failing to define what happens if the coordination service itself goes down during a critical window
  • Overlooking the need for idempotent task design, which leads to data corruption when tasks are retried
  • Designing a monolithic central scheduler that creates a bottleneck and violates high availability principles
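
One way to soften the clock-skew and retry mistakes above is to identify each run by its nominal fire time rather than by the local time at which a node happened to notice it. The sketch below assumes a fixed-interval schedule (not a full cron expression); latest_slot and run_key are hypothetical helper names. It does not remove the need for clock synchronization, it only makes nodes with small skew agree on which run they are talking about so that a deduplication check can work.

```python
import math


def latest_slot(now_s: float, interval_s: int) -> int:
    """Nominal fire time (epoch seconds) of the most recent slot at or before now_s."""
    return math.floor(now_s / interval_s) * interval_s


def run_key(job_id: str, now_s: float, interval_s: int) -> tuple[str, int]:
    """Identity of a run: the job plus its nominal fire time, not the observer's clock."""
    return (job_id, latest_slot(now_s, interval_s))


# Two nodes whose clocks differ by 0.5 s agree on the run identity.
key_a = run_key("nightly-report", now_s=3600.4, interval_s=60)
key_b = run_key("nightly-report", now_s=3600.9, interval_s=60)

# A node whose clock is slightly behind has not reached the new slot yet,
# so it still refers to the previous run.
key_behind = run_key("nightly-report", now_s=3599.8, interval_s=60)
```

Using this key as the deduplication or idempotency key means a second node that fires a moment later targets the same run record instead of creating a new one. Nodes straddling a slot boundary can still disagree briefly, so the skew has to stay well below the schedule interval.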
