BCDR Planning Guide
Comprehensive guide to Business Continuity and Disaster Recovery (BCDR) planning to ensure your organization can maintain operations and recover quickly from disruptions.
BCDR Fundamentals
Business Continuity and Disaster Recovery planning ensures your organization can continue operating during disruptions and recover critical systems within acceptable timeframes.
Key Concepts
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
- MTTR (Mean Time To Recovery): Average recovery time
- BIA (Business Impact Analysis): Impact assessment of disruptions
- DR Tiers: Different levels of recovery capabilities
Business Impact Analysis
Critical System Classification
| Tier | Description | RTO | RPO | Examples |
|---|---|---|---|---|
| Tier 0 | Mission Critical | < 1 hour | < 15 minutes | Payment processing, core database |
| Tier 1 | Business Critical | < 4 hours | < 1 hour | Customer portal, order system |
| Tier 2 | Business Important | < 24 hours | < 4 hours | Internal tools, reporting |
| Tier 3 | Business Supporting | < 72 hours | < 24 hours | Development, testing systems |
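As a rough illustration of how these targets can be checked programmatically, the sketch below compares a measured outage against the tier targets from the table. The `Outage` record and the tier dictionary are assumptions made for this example, not part of any standard tooling.

```python
from dataclasses import dataclass
from datetime import timedelta

# Tier targets from the classification table above (illustrative values)
TIER_TARGETS = {
    0: {"rto": timedelta(hours=1), "rpo": timedelta(minutes=15)},
    1: {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    2: {"rto": timedelta(hours=24), "rpo": timedelta(hours=4)},
    3: {"rto": timedelta(hours=72), "rpo": timedelta(hours=24)},
}

@dataclass
class Outage:
    tier: int
    downtime: timedelta           # time until service was restored (actual RTO)
    data_loss_window: timedelta   # age of the last recoverable transaction (actual RPO)

def meets_objectives(outage: Outage) -> bool:
    """Return True if the outage stayed within the tier's RTO and RPO targets."""
    target = TIER_TARGETS[outage.tier]
    return outage.downtime <= target["rto"] and outage.data_loss_window <= target["rpo"]

# Example: a Tier 1 outage lasting 3 hours with 30 minutes of data loss
print(meets_objectives(Outage(tier=1, downtime=timedelta(hours=3),
                              data_loss_window=timedelta(minutes=30))))  # True
```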
Risk Assessment Matrix
```text
# Risk calculation formula
Risk Score = Likelihood × Impact   (each rated on a 1-5 scale)

# Risk categories
- Natural disasters: Earthquakes, floods, fires
- Technical failures: Hardware failure, software bugs
- Human errors: Misconfigurations, accidental deletion
- Security incidents: Ransomware, data breaches
- Infrastructure: Power outages, network failures

# Mitigation priorities
High Risk (Score 16-25): Immediate action required
Medium Risk (Score 9-15): Plan mitigation within 30 days
Low Risk (Score 1-8): Monitor and review quarterly
```
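A minimal sketch of this scoring in code, assuming the 1-5 likelihood and impact ratings and the thresholds listed above (the example risks and their ratings are illustrative):

```python
def risk_priority(likelihood: int, impact: int) -> str:
    """Bucket a likelihood x impact score into the mitigation priorities above."""
    score = likelihood * impact
    if score >= 16:
        return f"High ({score}): immediate action required"
    if score >= 9:
        return f"Medium ({score}): plan mitigation within 30 days"
    return f"Low ({score}): monitor and review quarterly"

risks = {
    "Ransomware": (3, 5),            # (likelihood, impact) -- illustrative ratings
    "Regional power outage": (2, 4),
    "Accidental deletion": (4, 3),
}
for name, (likelihood, impact) in sorted(risks.items()):
    print(f"{name}: {risk_priority(likelihood, impact)}")
```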
DR Architecture Patterns
1. Backup and Restore
Cost-effective for Tier 2-3 systems. Typical RTO: hours to days; RPO: hours.

Implementation:
- Regular automated backups to an offsite location
- Documented restore procedures
- Periodic restore testing

AWS example:

```bash
aws s3 sync /data s3://dr-backup-bucket/data --delete
aws backup create-backup-plan --backup-plan file://backup-plan.json
```
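To complement the periodic restore testing, a small check like the following can confirm the newest offsite copy is still within the RPO window. This is a sketch using boto3; the bucket name, prefix, and the 4-hour RPO are placeholders for this example.

```python
from datetime import datetime, timedelta, timezone

import boto3

def latest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return the age of the newest object under the given prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    if newest is None:
        raise RuntimeError(f"no backups found under s3://{bucket}/{prefix}")
    return datetime.now(timezone.utc) - newest

# Placeholder bucket/prefix and a 4-hour RPO (Tier 2 in the table above)
age = latest_backup_age("dr-backup-bucket", "data/")
assert age <= timedelta(hours=4), f"RPO exceeded: newest backup is {age} old"
```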
2. Pilot Light
Minimal DR footprint for Tier 1-2 systems. Typical RTO: hours; RPO: minutes to hours. Core data is continuously replicated, while compute stays at minimal size until DR is activated.

```hcl
# Pilot-light compute: kept small, scaled up during DR activation
resource "aws_instance" "dr_database" {
  ami           = data.aws_ami.database.id
  instance_type = "t3.small" # minimal size

  lifecycle {
    # Allow manual resizing during DR activation without Terraform reverting it
    ignore_changes = [instance_type]
  }
}

# Continuous data replication from the primary database
resource "aws_db_instance" "primary" {
  # ... primary configuration

  # Enable automated backups (required before creating read replicas)
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
}

resource "aws_db_instance" "dr" {
  replicate_source_db = aws_db_instance.primary.identifier
  instance_class      = "db.t3.small" # minimal size
}
```
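During DR activation the pilot light is "turned up": the replica is promoted and the compute resized. A sketch of that activation using boto3 is shown below; the region, instance identifiers, and target sizes are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # DR region (placeholder)
ec2 = boto3.client("ec2", region_name="us-west-2")

# 1. Promote the read replica to a standalone primary
rds.promote_read_replica(DBInstanceIdentifier="prod-db-replica-dr")

# 2. Resize the database once promotion completes
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-replica-dr")
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db-replica-dr",
    DBInstanceClass="db.r5.large",   # production-sized class (placeholder)
    ApplyImmediately=True,
)

# 3. Resize and start the pilot-light application instance
instance_id = "i-0123456789abcdef0"  # placeholder
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={"Value": "m5.xlarge"})
ec2.start_instances(InstanceIds=[instance_id])
```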
3. Warm Standby
Scaled-down but running environment for Tier 0-1 systems. Typical RTO: minutes; RPO: seconds. The DR region runs a reduced replica count that is scaled up when DR is activated.

```yaml
# Kubernetes multi-region setup
apiVersion: v1
kind: Service
metadata:
  name: global-app-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 443
      targetPort: 8443
---
# Primary region deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-primary
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      region: primary
  template:
    metadata:
      labels:
        app: myapp
        region: primary
    spec:
      containers:
        - name: myapp
          image: myapp:latest   # illustrative image
          ports:
            - containerPort: 8443
---
# DR region deployment (reduced capacity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-dr
  namespace: production-dr
spec:
  replicas: 3   # scale up during DR activation
  selector:
    matchLabels:
      app: myapp
      region: dr
  template:
    metadata:
      labels:
        app: myapp
        region: dr
    spec:
      containers:
        - name: myapp
          image: myapp:latest   # illustrative image
          ports:
            - containerPort: 8443
```
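Activation of a warm standby is mostly a scale-up. A sketch using the official Kubernetes Python client, against the deployment and namespace names from the manifests above (the kube context name is a placeholder):

```python
import time

from kubernetes import client, config

# Load credentials for the DR cluster (the context name is a placeholder)
config.load_kube_config(context="dr-cluster")
apps = client.AppsV1Api()

# Scale the DR deployment from its standby size to full capacity
apps.patch_namespaced_deployment_scale(
    name="myapp-dr",
    namespace="production-dr",
    body={"spec": {"replicas": 10}},
)

# Block until the new replicas report ready
while True:
    status = apps.read_namespaced_deployment_status("myapp-dr", "production-dr").status
    if (status.ready_replicas or 0) >= 10:
        break
    time.sleep(5)
```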
4. Active-Active Multi-Region
Full redundancy for Tier 0 systems, with every region able to serve traffic. RTO approaches zero and RPO is near zero. The example below uses Route53 health-checked DNS records (shown with a failover policy; for true active-active, latency-based or weighted routing lets all healthy regions receive traffic).

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "30"
}

resource "aws_route53_record" "www" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "www.example.com"
  type            = "A"
  set_identifier  = "Primary"
  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }
}
```

Database writes are replicated between regions. The snippet below sets up PostgreSQL logical replication in one direction; repeat it in the opposite direction for multi-master operation and plan for conflict handling.

```sql
-- On the primary region
CREATE PUBLICATION myapp_pub FOR ALL TABLES;

-- On the DR region
CREATE SUBSCRIPTION myapp_sub
  CONNECTION 'host=primary.region.rds.amazonaws.com dbname=myapp'
  PUBLICATION myapp_pub
  WITH (copy_data = false, synchronous_commit = 'remote_apply');
```
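A near-zero RPO only holds if replication lag stays small, so it is worth monitoring continuously. A sketch using psycopg2 and the `pg_stat_replication` view on the publisher (host, credentials, and the 5-second threshold are placeholders):

```python
from datetime import timedelta

import psycopg2

# Connection details are placeholders; in practice pull them from a secrets store
conn = psycopg2.connect(host="primary.region.rds.amazonaws.com",
                        dbname="myapp", user="monitor", password="...")
try:
    with conn.cursor() as cur:
        # replay_lag: time between commit on the primary and replay on the subscriber
        cur.execute("SELECT application_name, replay_lag FROM pg_stat_replication;")
        for name, lag in cur.fetchall():
            if lag is not None and lag > timedelta(seconds=5):
                print(f"WARNING: {name} replay lag {lag} exceeds the 5s target")
finally:
    conn.close()
```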
Backup Strategies
3-2-1 Backup Rule
- 3 copies of important data
- 2 different storage media types
- 1 offsite backup copy
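A minimal sketch of verifying an inventory against this rule; the inventory structure and location names are assumptions for illustration:

```python
# Each entry describes one copy of the data set
copies = [
    {"location": "primary-dc", "media": "disk"},
    {"location": "primary-dc", "media": "tape"},
    {"location": "aws-us-west-2", "media": "object-storage"},  # offsite copy
]

def satisfies_3_2_1(copies):
    media_types = {c["media"] for c in copies}
    offsite = [c for c in copies if c["location"] != "primary-dc"]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1

print(satisfies_3_2_1(copies))  # True
```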
Automated Backup Implementation
```bash
#!/bin/bash
# Comprehensive backup script
set -euo pipefail

# Configuration
BACKUP_ROOT="/backup"
S3_BUCKET="s3://company-backups"
GLACIER_BUCKET="s3://company-archives"
RETENTION_DAYS=30
# DB_HOST and DB_USER are expected to be provided by the environment

# Database backup
backup_database() {
    local db_name=$1
    local backup_file="${BACKUP_ROOT}/db/${db_name}_$(date +%Y%m%d_%H%M%S).sql.gz"

    pg_dump -h "${DB_HOST}" -U "${DB_USER}" -d "${db_name}" | gzip > "${backup_file}"

    # Encrypt backup
    gpg --symmetric --cipher-algo AES256 --batch --passphrase-file /etc/backup.key "${backup_file}"

    # Upload to S3
    aws s3 cp "${backup_file}.gpg" "${S3_BUCKET}/database/" --storage-class STANDARD_IA

    # Archive month-old backups to Glacier
    aws s3 sync "${S3_BUCKET}/database/" "${GLACIER_BUCKET}/database/" \
        --exclude "*" --include "*_$(date -d '30 days ago' +%Y%m)*.gpg" \
        --storage-class GLACIER

    # Prune local copies past the retention window
    find "${BACKUP_ROOT}/db" -type f -mtime +"${RETENTION_DAYS}" -delete
}

# Application backup
backup_application() {
    local app_name=$1
    local backup_file="${BACKUP_ROOT}/apps/${app_name}_$(date +%Y%m%d_%H%M%S).tar.gz"

    tar -czf "${backup_file}" "/opt/${app_name}" --exclude='*.log' --exclude='cache/*'

    # Maintain an incremental copy
    rsync -avz --delete "/opt/${app_name}/" "${BACKUP_ROOT}/incremental/${app_name}/"

    # Replicate to DR site
    rsync -avz "${backup_file}" dr-site:/backup/apps/
}

# Verify backups by restoring the most recent one into a scratch database
verify_backups() {
    local test_dir="/tmp/backup_test_$(date +%s)"
    mkdir -p "${test_dir}"

    # Fetch the latest database backup
    latest_backup=$(aws s3 ls "${S3_BUCKET}/database/" | tail -1 | awk '{print $4}')
    aws s3 cp "${S3_BUCKET}/database/${latest_backup}" "${test_dir}/"

    # Decrypt, decompress, and load into a throwaway database
    dropdb --if-exists -h localhost -U postgres test_restore
    createdb -h localhost -U postgres test_restore
    gpg --decrypt --batch --passphrase-file /etc/backup.key "${test_dir}/${latest_backup}" | \
        gunzip | psql -h localhost -U postgres -d test_restore

    # Cleanup
    dropdb -h localhost -U postgres test_restore
    rm -rf "${test_dir}"
}

# Entry point (illustrative): back up each database named on the command line, then verify.
# Application backups can be scheduled separately via backup_application.
for db in "$@"; do
    backup_database "${db}"
done
verify_backups
```
DR Procedures
Failover Runbook
# DR Activation Checklist
## 1. Assessment (0-15 minutes)
- [ ] Confirm primary site failure
- [ ] Assess impact scope
- [ ] Notify incident response team
- [ ] Decision: Activate DR?
## 2. Communication (15-30 minutes)
- [ ] Notify executive team
- [ ] Update status page
- [ ] Inform customer support
- [ ] Prepare customer communication
## 3. Technical Failover (30-60 minutes)
### Database Failover
```bash
# Promote DR database
aws rds promote-read-replica \
--db-instance-identifier prod-db-replica-dr
# Update connection strings
kubectl set env deployment/app \
DATABASE_URL=postgresql://dr-region.rds.amazonaws.com/prod
```
### Application Failover
```bash
# Scale up DR environment
kubectl scale deployment myapp-dr --replicas=10 -n production-dr
# Update DNS
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch file://failover-dns.json
```
### Verify Services
```bash
# Health checks
curl -f https://dr.example.com/health || exit 1
# Smoke tests
./scripts/dr-smoke-tests.sh
```
## 4. Monitoring (60+ minutes)
- [ ] Monitor application metrics
- [ ] Check error rates
- [ ] Verify data integrity
- [ ] Customer impact assessment
Failback Procedures
# Failback to Primary Site
## 1. Primary Site Recovery
- Repair/replace failed components
- Restore systems to operational state
- Verify all services functioning
## 2. Data Synchronization
```bash
# Reverse-replicate data changes (--clean --if-exists recreates objects on the primary)
pg_dump --clean --if-exists -h dr-database -d prod | psql -h primary-database -d prod
# Verify data consistency
./scripts/data-consistency-check.sh primary dr
```
## 3. Controlled Failback
- Schedule maintenance window
- Gradually shift traffic back
- Monitor for issues
- Complete DNS updates
## 4. Post-Failback
- Document lessons learned
- Update procedures
- Test improvements
Testing and Validation
DR Test Schedule
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Backup Verification | Daily | Automated restore test | 30 minutes |
| Component Failover | Monthly | Individual services | 2 hours |
| Partial DR Test | Quarterly | Critical systems only | 4 hours |
| Full DR Exercise | Annually | Complete failover | 8 hours |
Test Automation
```python
import time
from datetime import timedelta

# Automated DR testing framework.
# Helper methods (list_recent_backups, download_backup, verify_checksum, test_restore,
# verify_data_integrity, get_replication_lag, simulate_primary_failure, is_dr_active,
# verify_dr_health, restore_primary_site) are environment-specific and implemented elsewhere.
class DRTestSuite:
    def __init__(self):
        self.test_results = []

    def test_backup_integrity(self):
        """Verify backup files are valid and restorable."""
        backups = self.list_recent_backups()
        for backup in backups:
            # Download backup
            local_file = self.download_backup(backup)
            # Verify checksum
            assert self.verify_checksum(local_file)
            # Test restore
            restore_result = self.test_restore(local_file)
            assert restore_result.success
            # Verify data integrity
            assert self.verify_data_integrity(restore_result.database)

    def test_replication_lag(self):
        """Ensure replication lag is within acceptable limits."""
        lag = self.get_replication_lag()
        assert lag < timedelta(seconds=5)

    def test_failover_automation(self):
        """Test automated failover scripts."""
        # Create test failure condition
        self.simulate_primary_failure()

        # Wait for automatic failover
        start_time = time.time()
        while not self.is_dr_active():
            if time.time() - start_time > 300:  # 5-minute timeout
                raise Exception("Failover did not complete in time")
            time.sleep(10)

        # Verify DR site is serving traffic
        assert self.verify_dr_health()

        # Restore primary
        self.restore_primary_site()
```
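A scheduler can drive the suite on the cadence in the table above; a minimal invocation sketch (which checks are run on which schedule is an assumption for illustration):

```python
if __name__ == "__main__":
    suite = DRTestSuite()
    # Daily cadence: backup and replication checks; failover automation runs monthly
    suite.test_backup_integrity()
    suite.test_replication_lag()
    print("DR checks passed")
```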
Cloud-Specific DR
AWS Disaster Recovery
```hcl
# AWS Backup for centralized backup management
resource "aws_backup_plan" "dr_plan" {
  name = "comprehensive-dr-plan"

  rule {
    rule_name         = "hourly_snapshots"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 * * * ? *)"

    lifecycle {
      delete_after = 24 # days
    }

    recovery_point_tags = {
      Type = "Hourly"
    }
  }

  rule {
    rule_name         = "daily_backups"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 3 * * ? *)"

    lifecycle {
      cold_storage_after = 7
      delete_after       = 97 # must be at least 90 days after the cold storage transition
    }
  }

  advanced_backup_setting {
    backup_options = {
      WindowsVSS = "enabled"
    }
    resource_type = "EC2"
  }
}

# Cross-region replication (versioning must be enabled on both buckets)
resource "aws_s3_bucket_replication_configuration" "dr_replication" {
  role   = aws_iam_role.replication.arn
  bucket = aws_s3_bucket.primary.id

  rule {
    id     = "dr-replication"
    status = "Enabled"

    # Required when using replication_time (S3 Replication Time Control)
    filter {}
    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
```
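Once the plan is in place, it helps to verify that recovery points are actually being produced. A sketch using boto3 (the vault name is a placeholder that must match the `aws_backup_vault` referenced above):

```python
from datetime import datetime, timedelta, timezone

import boto3

backup = boto3.client("backup")

# List recent recovery points in the DR vault
points = backup.list_recovery_points_by_backup_vault(
    BackupVaultName="dr_vault",  # placeholder vault name
    ByCreatedAfter=datetime.now(timezone.utc) - timedelta(hours=2),
)["RecoveryPoints"]

completed = [p for p in points if p["Status"] == "COMPLETED"]
print(f"{len(completed)} completed recovery points in the last 2 hours")
assert completed, "hourly backup rule does not appear to be producing recovery points"
```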
Multi-Cloud DR
```hcl
# Terraform multi-cloud DR setup

# Primary in AWS
module "aws_primary" {
  source      = "./modules/aws-infrastructure"
  region      = "us-east-1"
  environment = "production"
  role        = "primary"
}

# DR in Azure
module "azure_dr" {
  source      = "./modules/azure-infrastructure"
  location    = "eastus2"
  environment = "production-dr"
  role        = "standby"
}

# GCP for additional redundancy
module "gcp_backup" {
  source      = "./modules/gcp-infrastructure"
  region      = "us-central1"
  environment = "production-backup"
  role        = "backup"
}

# Cross-cloud data sync
resource "null_resource" "cross_cloud_sync" {
  provisioner "local-exec" {
    command = <<-EOT
      # Sync data between clouds
      rclone sync aws:prod-data azure:dr-data --transfers 32
      rclone sync aws:prod-data gcs:backup-data --transfers 32
    EOT
  }

  triggers = {
    always_run = timestamp()
  }
}
```
Communication Plan
Stakeholder Matrix
| Stakeholder | Notification Time | Method | Information Level |
|---|---|---|---|
| Executive Team | Immediate | Phone, Email | High-level impact |
| Technical Team | Immediate | PagerDuty, Slack | Technical details |
| Customer Support | 15 minutes | Email, Slack | Customer talking points |
| Customers | 30 minutes | Status page, Email | Service impact |
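Notifications on this matrix can be partly automated. A sketch using plain HTTP calls; the Slack webhook URL, status-page endpoint, and token are placeholders, and the status-page API shape is an assumption for illustration:

```python
import requests

# Placeholders: a Slack incoming-webhook URL and a status-page API endpoint/token
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
STATUSPAGE_API = "https://api.statuspage.example/v1/incidents"
STATUSPAGE_TOKEN = "..."

def notify_dr_activation(summary: str, eta: str):
    # Technical team: immediate Slack notification
    requests.post(SLACK_WEBHOOK, json={
        "text": f":rotating_light: DR activation in progress: {summary} (ETA {eta})"
    }, timeout=10)

    # Customers: status page update within 30 minutes
    requests.post(STATUSPAGE_API, json={
        "name": "Service Disruption - DR Activation in Progress",
        "status": "identified",
        "body": f"{summary}. Estimated resolution: {eta}.",
    }, headers={"Authorization": f"Bearer {STATUSPAGE_TOKEN}"}, timeout=10)

notify_dr_activation("Primary region failure; failing over to DR", eta="60 minutes")
```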
Communication Templates
```text
# Initial notification
Subject: [URGENT] Service Disruption - DR Activation in Progress

We are currently experiencing a service disruption affecting [services].
Our team has activated disaster recovery procedures.

Current Status: Failover in progress
Estimated Resolution: [time]
Impact: [description]
Updates: [status page URL]

# Recovery notification
Subject: Service Restored - Post-Incident Report to Follow

Service has been restored as of [time].
All systems are operational.

Duration: [total time]
Impact: [summary]
Next Steps: Post-incident review scheduled for [date]
Full report will be provided within 24 hours.
```
Compliance and Audit
Regulatory Requirements
- HIPAA: Documented DR plan, annual testing
- PCI DSS: Backup testing, incident response
- SOC 2: Business continuity controls
- ISO 22301: Business continuity management
Documentation Requirements
- Business Impact Analysis (BIA)
- Risk Assessment Reports
- DR Procedures and Runbooks
- Test Results and Reports
- Training Records
- Vendor Agreements
- Insurance Policies
Continuous Improvement
Post-Incident Review
- What went well?
- What could be improved?
- Were RTOs/RPOs met?
- Communication effectiveness
- Tool and process gaps
- Training needs
Metrics and KPIs
- Actual vs. target RTO/RPO
- Test success rate
- Time to detect failures
- Backup success rate
- DR drill participation
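A small sketch of computing two of these KPIs from drill records; the record structure and the sample values are assumptions for illustration:

```python
from datetime import timedelta

# Illustrative drill records: measured values vs. the tier targets defined earlier
drills = [
    {"tier": 1, "actual_rto": timedelta(hours=3), "target_rto": timedelta(hours=4), "passed": True},
    {"tier": 0, "actual_rto": timedelta(minutes=75), "target_rto": timedelta(hours=1), "passed": False},
    {"tier": 2, "actual_rto": timedelta(hours=20), "target_rto": timedelta(hours=24), "passed": True},
]

rto_met = sum(1 for d in drills if d["actual_rto"] <= d["target_rto"])
success_rate = sum(1 for d in drills if d["passed"]) / len(drills)

print(f"RTO met in {rto_met}/{len(drills)} drills")
print(f"DR test success rate: {success_rate:.0%}")
```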
Note: This documentation is provided for reference purposes only. It reflects general best practices and industry-aligned guidelines, and any examples, claims, or recommendations are intended as illustrative—not definitive or binding.