Disaster Recovery Guide
RTO & RPO Explained
RTO
Recovery Time Objective
Maximum acceptable downtime. How quickly must the system be restored?
Example: "We must restore service within 4 hours of an outage."
RPO
Recovery Point Objective
Maximum acceptable data loss. How much data can we afford to lose?
Example: "We can lose at most 1 hour of transaction data."
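To make the two definitions concrete, the sketch below measures an incident against both objectives. The names `Objectives` and `evaluate_incident` are hypothetical, and it assumes data loss is measured from the last good backup (or replication point) to the start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data-loss window


def evaluate_incident(
    last_good_backup: datetime,
    outage_start: datetime,
    service_restored: datetime,
    objectives: Objectives,
) -> dict:
    """Compare one incident's downtime and data loss with the objectives."""
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    # Assumption: everything written after the last good backup is lost.
    data_loss = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# The objectives from the two examples above: 4 hours RTO, 1 hour RPO.
objectives = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
report = evaluate_incident(
    last_good_backup=datetime(2024, 1, 1, 9, 30),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 13, 0),
    objectives=objectives,
)
print(report)
```

In this illustrative incident the service was down for 3 hours and lost 30 minutes of data, so both objectives are met.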
DR Strategy Tiers
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Low | Simple backups to cold storage |
| Pilot Light | 30–60 min | Minutes | Medium | Core services running, scaled up on disaster |
| Warm Standby | Minutes | Seconds | Medium-High | Scaled-down replica running always |
| Active-Active | <1 min | ~0 | High | Full redundancy in multiple regions |
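One way to use the table is to pick the lowest-cost tier whose typical RTO and RPO still satisfy a service's requirements. The sketch below does that. The table gives only qualitative ranges, so the numeric bounds in `TIERS` are illustrative assumptions to replace with your own measurements, and `Tier` and `cheapest_tier` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    typical_rto: timedelta
    typical_rpo: timedelta


# Ordered from lowest to highest cost, as in the table.
# ASSUMPTION: the numbers are illustrative stand-ins for the table's
# qualitative ranges ("Hours", "Minutes", "Seconds"), not measured values.
TIERS = [
    Tier("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Tier("Pilot Light", timedelta(minutes=60), timedelta(minutes=15)),
    Tier("Warm Standby", timedelta(minutes=10), timedelta(seconds=30)),
    Tier("Active-Active", timedelta(minutes=1), timedelta(0)),
]


def cheapest_tier(required_rto: timedelta, required_rpo: timedelta) -> Optional[Tier]:
    """Return the first (cheapest) tier meeting both objectives, or None."""
    for tier in TIERS:
        if tier.typical_rto <= required_rto and tier.typical_rpo <= required_rpo:
            return tier
    return None


# Requirements from the earlier examples: 4 hours RTO, 1 hour RPO.
choice = cheapest_tier(timedelta(hours=4), timedelta(hours=1))
print(choice.name if choice else "no tier meets these objectives")
```

Under these assumed bounds, a 4-hour RTO with a 1-hour RPO rules out Backup & Restore and selects Pilot Light. Different bounds can change the result.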
DR Runbook Checklist
- Document all critical systems and dependencies
- Define RTO and RPO for each service tier
- Automate backup testing (restore quarterly)
- Conduct annual DR drills
- Maintain an out-of-band communication channel
- Document step-by-step failover procedures
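The backup-testing item can be automated along these lines: restore a backup into a scratch directory and compare its checksum with one recorded when the backup was taken. This is a minimal sketch, not a complete restore test. `verify_restore` and `copy_restore` are hypothetical names, and `copy_restore` is a stand-in that only copies a file where a real restore would invoke your backup tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(
    backup_file: Path,
    expected_sha256: str,
    restore: Callable[[Path, Path], Path],
) -> bool:
    """Restore into a scratch directory and compare against the recorded hash."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = restore(backup_file, Path(scratch))
        return sha256_of(restored) == expected_sha256


def copy_restore(backup_file: Path, target_dir: Path) -> Path:
    # ASSUMPTION: stand-in restore step; replace with a call to your backup tool.
    return Path(shutil.copy2(backup_file, target_dir / backup_file.name))
```

A scheduled job could call `verify_restore` for each backup set and raise an alert on any `False` result. The schedule and alerting are left to your environment.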