Design a Distributed Job Scheduler (Cron Service)

Design a distributed system to schedule and execute millions of time-based jobs reliably. Discuss job persistence, handling worker failures, and preventing duplicate execution.

Why Interviewers Ask This

Interviewers at Microsoft ask this to evaluate your ability to design fault-tolerant, scalable systems under constraints. They specifically test your understanding of distributed consensus, idempotency, and clock skew across thousands of nodes, and whether you can get critical workloads to run effectively exactly once (in practice, at-least-once delivery combined with idempotent execution).

How to Answer This Question

1. Clarify requirements: Define scale (millions of jobs), latency tolerance, and consistency models up front.
2. Propose a high-level architecture: Suggest a coordinator-based approach using a consensus protocol like Raft or a leader election mechanism for the scheduler master.
3. Detail persistence: Explain storing job metadata in a durable store like Azure Cosmos DB, with TTL-based expiry to clean up finished jobs.
4. Address reliability: Describe how to use distributed locks or atomic operations to prevent duplicate executions during worker failures.
5. Discuss scaling: Outline sharding strategies based on job IDs or time windows to distribute load evenly across worker nodes.
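
The "atomic operations" in step 4 can be illustrated with a conditional update: a worker claims a job only if the job is due and no unexpired lease exists. The following is a minimal sketch, not a definitive implementation. It uses an in-memory SQLite database as a stand-in for the durable store, and the table layout, `make_store`, `claim_job`, and the 30-second lease are all assumptions made for illustration.

```python
import sqlite3


def make_store():
    """Create an in-memory table standing in for the durable job store."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " next_run_at INTEGER NOT NULL,"          # epoch seconds
        " owner TEXT,"                            # worker holding the lease
        " lease_until INTEGER NOT NULL DEFAULT 0)"  # lease expiry, epoch seconds
    )
    return db


def claim_job(db, job_id, worker_id, now, lease_seconds=30):
    """Try to claim a due job; returns True only for the single winner.

    The WHERE clause makes the claim atomic: the row is updated only if the
    job is due and any previous lease has expired.
    """
    cur = db.execute(
        "UPDATE jobs SET owner = ?, lease_until = ? "
        "WHERE job_id = ? AND next_run_at <= ? AND lease_until <= ?",
        (worker_id, now + lease_seconds, job_id, now, now),
    )
    db.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    db = make_store()
    db.execute("INSERT INTO jobs (job_id, next_run_at) VALUES ('j1', 100)")
    print(claim_job(db, "j1", "worker-a", now=100))  # first claim succeeds
    print(claim_job(db, "j1", "worker-b", now=110))  # lease still held
```

If the first worker crashes, its lease simply expires and another worker can claim the job, which is why the job's side effects still need to be idempotent.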

Key Points to Cover

  • Explicitly defining the trade-off between strong consistency for triggering and eventual consistency for execution
  • Proposing a specific consensus protocol like Raft for leader election to avoid split-brain scenarios
  • Detailing an idempotency strategy using unique IDs per job run so that retries are safe and each run takes effect exactly once
  • Describing a sharding strategy based on time buckets or job IDs to handle millions of concurrent entries
  • Outlining a heartbeat and retry mechanism to recover from partial worker failures gracefully
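
Two of the points above, sharding and idempotency, can be sketched in a few lines. This is a hedged illustration under stated assumptions: `shard_for`, `time_bucket`, `run_key`, and `IdempotentExecutor` are hypothetical names, and the in-memory set stands in for a durable table with a unique key on the run ID.

```python
import hashlib


def shard_for(job_id: str, num_shards: int) -> int:
    """Stable shard assignment derived from a hash of the job ID."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def time_bucket(run_at: int, bucket_seconds: int = 60) -> int:
    """Start (epoch seconds) of the time bucket that contains run_at."""
    return run_at - (run_at % bucket_seconds)


def run_key(job_id: str, scheduled_at: int) -> str:
    """Unique ID for one scheduled run; the same run always maps to the same key."""
    return f"{job_id}:{scheduled_at}"


class IdempotentExecutor:
    """Skips a run whose key was already recorded as completed."""

    def __init__(self):
        self._done = set()  # assumption: a durable store in a real system

    def execute(self, job_id, scheduled_at, action):
        key = run_key(job_id, scheduled_at)
        if key in self._done:
            return False  # duplicate trigger or retry; nothing to do
        action()
        self._done.add(key)  # recorded only after the action succeeds
        return True


if __name__ == "__main__":
    executor = IdempotentExecutor()
    runs = []
    executor.execute("report", 1200, lambda: runs.append("ran"))
    executor.execute("report", 1200, lambda: runs.append("ran"))
    print(len(runs), time_bucket(1234), shard_for("report", 16))
```

Because the key is recorded only after the action succeeds, a crash in between leads to a retry, so this gives at-least-once behavior and the action itself should tolerate being repeated.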

Sample Answer

To design a distributed cron service capable of handling millions of jobs, I would start by decoupling scheduling from execution. The system needs a persistent queue where every job is stored with its next run time, ID,…

Common Mistakes to Avoid

  • Focusing solely on the code logic while ignoring the need for a distributed coordination layer like ZooKeeper or etcd
  • Overlooking clock skew between nodes, which can cause jobs to run early or late
  • Designing a single-point-of-failure scheduler instead of implementing active-active redundancy or leader election
  • Neglecting to discuss how to handle duplicate executions when a worker crashes right after starting a job
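
The last two mistakes, clock skew and worker crashes, are often handled together with heartbeats: a worker periodically renews its claim, and a job is treated as abandoned only after the lease plus a safety margin has passed. The sketch below is one possible shape, assuming hypothetical names (`LeaseTable`, `skew_margin`) and caller-supplied timestamps; a real system would typically take `now` from a single clock, such as the store's, rather than from each node.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseTable:
    lease_seconds: int = 30
    skew_margin: int = 5  # extra slack so small clock differences do not steal live jobs
    _leases: dict = field(default_factory=dict)  # job_id -> (worker_id, last_heartbeat)

    def is_stale(self, job_id, now):
        """A job is stale if it has no lease or its last heartbeat is too old."""
        entry = self._leases.get(job_id)
        if entry is None:
            return True
        return now - entry[1] > self.lease_seconds + self.skew_margin

    def heartbeat(self, job_id, worker_id, now):
        """Renew (or take over) a lease; refuse if another worker's lease is live."""
        entry = self._leases.get(job_id)
        if entry is not None and entry[0] != worker_id and not self.is_stale(job_id, now):
            return False
        self._leases[job_id] = (worker_id, now)
        return True

    def stale_jobs(self, now):
        """Jobs whose owners stopped heartbeating; candidates for retry."""
        return sorted(j for j in self._leases if self.is_stale(j, now))


if __name__ == "__main__":
    table = LeaseTable()
    table.heartbeat("j1", "worker-a", now=100)
    print(table.stale_jobs(now=120))  # worker-a is still considered alive
    print(table.stale_jobs(now=140))  # worker-a missed its heartbeats
```

A job picked up from `stale_jobs` may already have started on the crashed worker, so the retry should go through the same idempotency check as any other run.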
