Design a Disaster Recovery Plan

Question

Accepted Answer

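To design a robust DR plan for a multi-region deployment, I first align with stakeholders to define explicit RPO (recovery point objective) and RTO (recovery time objective) targets. For a customer-facing payment service, we might target an RPO of less than one minute and an RTO under five minutes. Architecturally, I would implement an active-active pattern across two regions using DynamoDB Global Tables. By default, Global Tables replicate asynchronously, with writes typically reaching the other region within about a second, which fits comfortably inside a one-minute RPO. Because cross-region reads are eventually consistent and concurrent writes to the same item are resolved last-writer-wins, the design must account for conflicting writes rather than assume synchronous consistency.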
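One way to make the RPO and RTO targets checkable is to compare them against measurements on a regular basis. The following is a minimal sketch, not a definitive implementation: `RecoveryObjectives`, `check_objectives`, and the sample numbers are all hypothetical names and values introduced for illustration, and in practice the lag samples would come from whatever replication metrics the team already collects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with stakeholders (the example targets from the answer)."""
    rpo_seconds: float  # maximum tolerable window of lost writes
    rto_seconds: float  # maximum tolerable time to restore service


def check_objectives(objectives, replication_lag_samples, measured_failover_seconds):
    """Compare measurements against the targets.

    Writes that have not yet replicated when a region fails may be lost,
    so the worst observed lag approximates the RPO exposure.
    """
    worst_lag = max(replication_lag_samples)
    return {
        "worst_lag_seconds": worst_lag,
        "rpo_met": worst_lag < objectives.rpo_seconds,
        "rto_met": measured_failover_seconds < objectives.rto_seconds,
    }


# Hypothetical measurements, for illustration only.
targets = RecoveryObjectives(rpo_seconds=60, rto_seconds=300)
report = check_objectives(targets, [0.4, 0.9, 2.5, 1.1], measured_failover_seconds=170)
print(report)
```

Using the worst sample rather than the average is a deliberately conservative choice: an RPO is a promise about the bad day, not the typical one.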

For traffic management, Amazon Route 53 health checks, paired with failover or latency-based routing records, automatically route users to a healthy region when an endpoint stops responding or fails its checks. Crucially, we must eliminate manual steps: automation such as AWS Lambda functions, running outside the affected region, should carry out any remaining failover actions as soon as a region becomes unreachable.
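It helps to sanity-check the RTO against the mechanics of DNS failover. The sketch below is a rough worst-case estimate under stated assumptions: a health check must fail `failure_threshold` consecutive times before an endpoint is marked unhealthy, clients honor the record TTL, and any automation adds a fixed delay. The function name, the 60-second TTL, and the 30-second automation delay are assumptions for illustration; Route 53 health checks do offer 30-second (standard) and 10-second (fast) request intervals with a configurable failure threshold.

```python
def dns_failover_budget_seconds(check_interval_s, failure_threshold, dns_ttl_s,
                                automation_s=0.0):
    """Rough upper bound on the time until clients reach the healthy region.

    detection  = consecutive failed checks needed to mark the endpoint unhealthy
    dns_ttl_s  = time for resolvers to expire the cached record
    automation = any scripted steps (hypothetical fixed delay)
    """
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s + automation_s


RTO_TARGET_S = 300  # five-minute target from the answer

standard = dns_failover_budget_seconds(30, 3, dns_ttl_s=60, automation_s=30)
fast = dns_failover_budget_seconds(10, 3, dns_ttl_s=60, automation_s=30)

for label, budget in (("standard checks", standard), ("fast checks", fast)):
    verdict = "within" if budget < RTO_TARGET_S else "exceeds"
    print(f"{label}: ~{budget:.0f}s, {verdict} the {RTO_TARGET_S}s RTO")
```

The estimate ignores clients and resolvers that cache DNS answers longer than the TTL, so measured failover times from a test exercise should take precedence over this arithmetic.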

However, a plan is only as good as its testing. I propose a monthly Game Day exercise in which we intentionally take one region offline in a staging environment. This validates our RTO claims and uncovers configuration drift. We would also use chaos engineering tools to inject network partitions, verifying that our circuit breakers and retry logic behave correctly. Finally, we document every test result and keep the runbooks updated. This approach keeps customer impact within the agreed RPO and RTO even during a catastrophic event, directly supporting Amazon's principle of Customer Obsession.
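To illustrate what such a fault-injection test exercises, here is a minimal, self-contained sketch of a circuit breaker combined with retry and exponential backoff, plus a simulated partition. Every name here (`CircuitBreaker`, `call_with_retry`, `PartitionedService`) is hypothetical and written for this example; a real service would more likely rely on an established resilience library and a real chaos tool than on hand-rolled code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry(breaker, func, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry with exponential backoff, but stop at once if the breaker is open."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)


class PartitionedService:
    """Stand-in for a dependency behind a simulated network partition."""
    def __init__(self):
        self.partitioned = True
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.partitioned:
            raise ConnectionError("simulated network partition")
        return "ok"


fake_now = [0.0]  # controllable clock so the demo does not really wait
service = PartitionedService()
breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0,
                         clock=lambda: fake_now[0])

try:
    call_with_retry(breaker, service.fetch, attempts=5, sleep=lambda s: None)
except CircuitOpenError:
    print(f"breaker opened after {service.calls} real calls")

service.partitioned = False  # partition heals
fake_now[0] = 31.0           # reset timeout has elapsed
result = call_with_retry(breaker, service.fetch, sleep=lambda s: None)
print("after recovery:", result)
```

The point of the exercise is the pair of properties the demo checks: during the partition the breaker stops sending traffic to the failing dependency before the retries are exhausted, and after the reset timeout a single trial call is enough to close it again.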

Related Interview Questions

Design a CDN Edge Caching Strategy

Design a System for Monitoring Service Health

Design a Payment Processing System