BCDR Planning Guide

Comprehensive guide to Business Continuity and Disaster Recovery (BCDR) planning to ensure your organization can maintain operations and recover quickly from disruptions.

BCDR Fundamentals

Business Continuity and Disaster Recovery planning ensures your organization can continue operating during disruptions and recover critical systems within acceptable timeframes.

Key Concepts

  • RTO (Recovery Time Objective): Maximum acceptable downtime
  • RPO (Recovery Point Objective): Maximum acceptable data loss
  • MTTR (Mean Time To Recovery): Average time to restore service after a failure
  • BIA (Business Impact Analysis): Impact assessment of disruptions
  • DR Tiers: Different levels of recovery capabilities

Business Impact Analysis

Critical System Classification

| Tier | Description | RTO | RPO | Examples |
|------|-------------|-----|-----|----------|
| Tier 0 | Mission Critical | < 1 hour | < 15 minutes | Payment processing, core database |
| Tier 1 | Business Critical | < 4 hours | < 1 hour | Customer portal, order system |
| Tier 2 | Business Important | < 24 hours | < 4 hours | Internal tools, reporting |
| Tier 3 | Business Supporting | < 72 hours | < 24 hours | Development, testing systems |

Risk Assessment Matrix

# Risk calculation formula (likelihood and impact each rated 1-5)
Risk Score = Likelihood × Impact

# Risk categories
- Natural disasters: Earthquakes, floods, fires
- Technical failures: Hardware failure, software bugs
- Human errors: Misconfigurations, accidental deletion
- Security incidents: Ransomware, data breaches
- Infrastructure: Power outages, network failures

# Mitigation priorities
High Risk (Score 16-25): Immediate action required
Medium Risk (Score 9-15): Plan mitigation within 30 days
Low Risk (Score 1-8): Monitor and review quarterly
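
As a quick illustration, the scoring and bucketing above can be wrapped in a small script. This is a minimal sketch that assumes likelihood and impact are each rated 1-5 (so the maximum score is 25):

```bash
#!/bin/bash
# Minimal risk-scoring sketch (assumes likelihood and impact are each rated 1-5)
likelihood=$1   # e.g. 4
impact=$2       # e.g. 5
score=$(( likelihood * impact ))

if   (( score >= 16 )); then priority="High Risk - immediate action required"
elif (( score >= 9 ));  then priority="Medium Risk - plan mitigation within 30 days"
else                         priority="Low Risk - monitor and review quarterly"
fi

echo "Risk score: ${score} -> ${priority}"
```

For example, `./risk-score.sh 4 5` reports a score of 20, which lands in the high-risk band.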

DR Architecture Patterns

1. Backup and Restore

# Cost-effective for Tier 2-3 systems
# RTO: Hours to days, RPO: Hours

# Implementation
- Regular automated backups to offsite location
- Documented restore procedures
- Periodic restore testing

# AWS Example (sync mirrors the source tree; pair --delete with bucket versioning
# so older object versions remain recoverable)
aws s3 sync /data s3://dr-backup-bucket/data --delete
aws backup create-backup-plan --backup-plan file://backup-plan.json
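
Restore testing can be as simple as pulling the backup back into a scratch path and comparing it with the live data. A hedged sketch, reusing the placeholder bucket and paths from the example above:

```bash
# Periodic restore drill: pull the backup back down and compare against the source
aws s3 sync s3://dr-backup-bucket/data /restore-test/data
diff -rq /data /restore-test/data && echo "Restore drill passed" || echo "Differences found - investigate"
```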

2. Pilot Light

# Minimal DR footprint for Tier 1-2 systems
# RTO: Hours, RPO: Minutes to hours

# Architecture
resource "aws_instance" "dr_database" {
  ami           = data.aws_ami.database.id
  instance_type = "t3.small"  # Minimal size
  
  # Scale up during DR activation
  lifecycle {
    ignore_changes = [instance_type]
  }
}

# Continuous data replication
resource "aws_db_instance" "primary" {
  # ... primary configuration
  
  # Enable automated backups for restore
  backup_retention_period = 7
  backup_window          = "03:00-04:00"
}

resource "aws_db_instance_replica" "dr" {
  replicate_source_db = aws_db_instance.primary.id
  instance_class      = "db.t3.small"  # Minimal size
}
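
At DR activation time, the pilot-light resources are promoted and resized to production capacity. A hedged sketch of that step with the AWS CLI (the replica identifier, instance ID, and target size are placeholders):

```bash
# Promote the warm read replica to a standalone primary
aws rds promote-read-replica --db-instance-identifier myapp-dr-replica

# Resize the minimal application instance (it must be stopped to change type)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type Value=m5.xlarge
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```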

3. Warm Standby

# Scaled-down running environment for Tier 0-1
# RTO: Minutes, RPO: Seconds

# Kubernetes multi-region setup
apiVersion: v1
kind: Service
metadata:
  name: global-app-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 443
      targetPort: 8443

---
# Primary region deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-primary
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      region: primary

---
# DR region deployment (reduced capacity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-dr
  namespace: production-dr
spec:
  replicas: 3  # Scale up during DR
  selector:
    matchLabels:
      app: myapp
      region: dr

4. Active-Active Multi-Region

# Full redundancy for Tier 0 systems
# RTO: 0, RPO: Near 0

# Traffic management with Route53
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "30"
}

resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  
  set_identifier = "Primary"
  health_check_id = aws_route53_health_check.primary.id
  
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
  
  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Database replication with PostgreSQL logical replication
# (shown in one direction; for active-active writes, configure a matching
#  publication/subscription pair in the opposite direction as well)

-- On primary region
CREATE PUBLICATION myapp_pub FOR ALL TABLES;

-- On DR region
CREATE SUBSCRIPTION myapp_sub
CONNECTION 'host=primary.region.rds.amazonaws.com dbname=myapp'
PUBLICATION myapp_pub
WITH (copy_data = false, synchronous_commit = 'remote_apply');
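
Replication lag should be tracked against the RPO target. A hedged sketch of a lag check run against the primary (host and database names reuse the placeholders above):

```bash
# Report apply lag per standby/subscriber, as seen by the primary
psql -h primary.region.rds.amazonaws.com -U postgres -d myapp -Atc \
  "SELECT application_name, replay_lag FROM pg_stat_replication;"
```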

Backup Strategies

3-2-1 Backup Rule

  • 3 copies of important data
  • 2 different storage media types
  • 1 offsite backup copy

Automated Backup Implementation

#!/bin/bash
# Comprehensive backup script

# Configuration
BACKUP_ROOT="/backup"
S3_BUCKET="s3://company-backups"
GLACIER_BUCKET="s3://company-archives"
RETENTION_DAYS=30

# Database backup
backup_database() {
    local db_name=$1
    local backup_file="${BACKUP_ROOT}/db/${db_name}_$(date +%Y%m%d_%H%M%S).sql.gz"
    
    pg_dump -h ${DB_HOST} -U ${DB_USER} -d ${db_name} | gzip > ${backup_file}
    
    # Encrypt backup (loopback pinentry lets gpg read the passphrase file non-interactively)
    gpg --symmetric --cipher-algo AES256 --batch --pinentry-mode loopback \
        --passphrase-file /etc/backup.key ${backup_file}
    
    # Upload to S3
    aws s3 cp ${backup_file}.gpg ${S3_BUCKET}/database/ --storage-class STANDARD_IA
    
    # Archive older backups to Glacier
    aws s3 sync ${S3_BUCKET}/database/ ${GLACIER_BUCKET}/database/ \
        --exclude "*" --include "*_$(date -d '30 days ago' +%Y%m)*.gpg" \
        --storage-class GLACIER
}

# Application backup
backup_application() {
    local app_name=$1
    local backup_file="${BACKUP_ROOT}/apps/${app_name}_$(date +%Y%m%d_%H%M%S).tar.gz"
    
    tar -czf ${backup_file} /opt/${app_name} --exclude='*.log' --exclude='cache/*'
    
    # Create incremental backup
    rsync -avz --delete /opt/${app_name}/ ${BACKUP_ROOT}/incremental/${app_name}/
    
    # Replicate to DR site
    rsync -avz ${backup_file} dr-site:/backup/apps/
}

# Verify backups
verify_backups() {
    # Test restore to temporary location
    local test_dir="/tmp/backup_test_$(date +%s)"
    mkdir -p ${test_dir}
    
    # Restore and verify latest backup
    latest_backup=$(aws s3 ls ${S3_BUCKET}/database/ | tail -1 | awk '{print $4}')
    aws s3 cp ${S3_BUCKET}/database/${latest_backup} ${test_dir}/
    
    # Decrypt and restore into a scratch database
    createdb -h localhost -U postgres test_restore
    gpg --decrypt --batch --pinentry-mode loopback --passphrase-file /etc/backup.key \
        ${test_dir}/${latest_backup} | gunzip | psql -h localhost -U postgres -d test_restore
    
    # Cleanup
    dropdb -h localhost -U postgres test_restore
    rm -rf ${test_dir}
}
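
A hedged example of how these functions might be scheduled; the wrapper script names, paths, and times are assumptions:

```bash
# Illustrative crontab entries: nightly backups, morning verification
0 2 * * * /usr/local/bin/backup.sh         >> /var/log/backup.log 2>&1
0 6 * * * /usr/local/bin/verify-backups.sh >> /var/log/backup-verify.log 2>&1
```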

DR Procedures

Failover Runbook

# DR Activation Checklist

## 1. Assessment (0-15 minutes)
- [ ] Confirm primary site failure
- [ ] Assess impact scope
- [ ] Notify incident response team
- [ ] Decision: Activate DR?

## 2. Communication (15-30 minutes)
- [ ] Notify executive team
- [ ] Update status page
- [ ] Inform customer support
- [ ] Prepare customer communication

## 3. Technical Failover (30-60 minutes)
### Database Failover
```bash
# Promote DR database
aws rds promote-read-replica \
  --db-instance-identifier prod-db-replica-dr

# Update connection strings
kubectl set env deployment/app \
  DATABASE_URL=postgresql://dr-region.rds.amazonaws.com/prod
```

### Application Failover
```bash
# Scale up DR environment
kubectl scale deployment myapp-dr --replicas=10 -n production-dr

# Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch file://failover-dns.json
```
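
The change batch referenced above might look like the following. This is a hedged example written as a heredoc so it can be generated inside the runbook; the record name, TTL, and DR load balancer hostname are placeholders:

```bash
cat > failover-dns.json <<'EOF'
{
  "Comment": "Point www at the DR load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "dr-alb-123456789.us-west-2.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
EOF
```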

### Verify Services
```bash
# Health checks
curl -f https://dr.example.com/health || exit 1

# Smoke tests
./scripts/dr-smoke-tests.sh
```

## 4. Monitoring (60+ minutes)
- [ ] Monitor application metrics
- [ ] Check error rates
- [ ] Verify data integrity
- [ ] Customer impact assessment

Failback Procedures

# Failback to Primary Site

## 1. Primary Site Recovery
- Repair/replace failed components
- Restore systems to operational state
- Verify all services functioning

## 2. Data Synchronization
```bash
# Reverse-replicate data changes (simplified full dump/restore; --clean --if-exists
# drops and recreates objects on the primary so it matches the DR copy)
pg_dump -h dr-database -d prod --clean --if-exists | psql -h primary-database -d prod

# Verify data consistency
./scripts/data-consistency-check.sh primary dr
```
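
A hypothetical sketch of what `data-consistency-check.sh` could do, comparing approximate per-table row counts between the two sites (hosts are passed as arguments, matching the call above):

```bash
#!/bin/bash
# Hypothetical consistency check: compare per-table row-count estimates
primary_host=$1
dr_host=$2
query="SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY relname;"

diff <(psql -h "${primary_host}" -d prod -Atc "${query}") \
     <(psql -h "${dr_host}" -d prod -Atc "${query}") \
  && echo "Row-count estimates match" \
  || echo "Row-count drift detected - investigate before failback"
```

Note that `n_live_tup` is an estimate, so this catches gross divergence rather than guaranteeing byte-for-byte consistency.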

## 3. Controlled Failback
- Schedule maintenance window
- Gradually shift traffic back
- Monitor for issues
- Complete DNS updates

## 4. Post-Failback
- Document lessons learned
- Update procedures
- Test improvements

Testing and Validation

DR Test Schedule

| Test Type | Frequency | Scope | Duration |
|-----------|-----------|-------|----------|
| Backup Verification | Daily | Automated restore test | 30 minutes |
| Component Failover | Monthly | Individual services | 2 hours |
| Partial DR Test | Quarterly | Critical systems only | 4 hours |
| Full DR Exercise | Annually | Complete failover | 8 hours |

Test Automation

# Automated DR testing framework (helper methods such as list_recent_backups,
# download_backup, and simulate_primary_failure are assumed to exist elsewhere)
import time
from datetime import timedelta


class DRTestSuite:
    def __init__(self):
        self.test_results = []
        
    def test_backup_integrity(self):
        """Verify backup files are valid and restorable"""
        backups = self.list_recent_backups()
        
        for backup in backups:
            # Download backup
            local_file = self.download_backup(backup)
            
            # Verify checksum
            assert self.verify_checksum(local_file)
            
            # Test restore
            restore_result = self.test_restore(local_file)
            assert restore_result.success
            
            # Verify data integrity
            assert self.verify_data_integrity(restore_result.database)
    
    def test_replication_lag(self):
        """Ensure replication lag is within acceptable limits"""
        lag = self.get_replication_lag()
        assert lag < timedelta(seconds=5)
    
    def test_failover_automation(self):
        """Test automated failover scripts"""
        # Create test failure condition
        self.simulate_primary_failure()
        
        # Wait for automatic failover
        start_time = time.time()
        while not self.is_dr_active():
            if time.time() - start_time > 300:  # 5 minute timeout
                raise Exception("Failover did not complete in time")
            time.sleep(10)
        
        # Verify DR site is serving traffic
        assert self.verify_dr_health()
        
        # Restore primary
        self.restore_primary_site()

Cloud-Specific DR

AWS Disaster Recovery

# AWS Backup for centralized backup management
resource "aws_backup_plan" "dr_plan" {
  name = "comprehensive-dr-plan"
  
  rule {
    rule_name         = "hourly_snapshots"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 * * * ? *)"
    
    lifecycle {
      delete_after = 1  # lifecycle values are in days - keep hourly snapshots for one day
    }
    
    recovery_point_tags = {
      Type = "Hourly"
    }
  }
  
  rule {
    rule_name         = "daily_backups"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 3 * * ? *)"
    
    lifecycle {
      cold_storage_after = 7
      delete_after       = 97  # must be at least 90 days after the cold-storage transition
    }
  }
  
  advanced_backup_setting {
    backup_options = {
      WindowsVSS = "enabled"
    }
    resource_type = "EC2"
  }
}

# Cross-region replication (requires versioning enabled on both buckets)
resource "aws_s3_bucket_replication_configuration" "dr_replication" {
  role   = aws_iam_role.replication.arn
  bucket = aws_s3_bucket.primary.id
  
  rule {
    id     = "dr-replication"
    status = "Enabled"
    
    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"
      
      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }
      
      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
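
Whether the plan is actually producing recovery points can be spot-checked from the CLI. A hedged sketch; the vault name is a placeholder matching the Terraform resource above:

```bash
# List recent recovery points in the DR vault
aws backup list-recovery-points-by-backup-vault \
    --backup-vault-name dr_vault \
    --max-results 10
```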

Multi-Cloud DR

# Terraform multi-cloud DR setup
# Primary in AWS
module "aws_primary" {
  source = "./modules/aws-infrastructure"
  
  region      = "us-east-1"
  environment = "production"
  role        = "primary"
}

# DR in Azure
module "azure_dr" {
  source = "./modules/azure-infrastructure"
  
  location    = "eastus2"
  environment = "production-dr"
  role        = "standby"
}

# GCP for additional redundancy
module "gcp_backup" {
  source = "./modules/gcp-infrastructure"
  
  region      = "us-central1"
  environment = "production-backup"
  role        = "backup"
}

# Cross-cloud data sync
resource "null_resource" "cross_cloud_sync" {
  provisioner "local-exec" {
    command = <<-EOT
      # Sync data between clouds ('aws', 'azure', and 'gcs' are rclone remotes configured out of band)
      rclone sync aws:prod-data azure:dr-data --transfers 32
      rclone sync aws:prod-data gcs:backup-data --transfers 32
    EOT
  }
  
  triggers = {
    always_run = timestamp()
  }
}

Communication Plan

Stakeholder Matrix

| Stakeholder | Notification Time | Method | Information Level |
|-------------|-------------------|--------|-------------------|
| Executive Team | Immediate | Phone, Email | High-level impact |
| Technical Team | Immediate | PagerDuty, Slack | Technical details |
| Customer Support | 15 minutes | Email, Slack | Customer talking points |
| Customers | 30 minutes | Status page, Email | Service impact |

Communication Templates

# Initial notification
Subject: [URGENT] Service Disruption - DR Activation in Progress

We are currently experiencing a service disruption affecting [services].
Our team has activated disaster recovery procedures.

Current Status: Failover in progress
Estimated Resolution: [time]
Impact: [description]

Updates: [status page URL]

# Recovery notification
Subject: Service Restored - Post-Incident Report to Follow

Service has been restored as of [time].
All systems are operational.

Duration: [total time]
Impact: [summary]
Next Steps: Post-incident review scheduled for [date]

Full report will be provided within 24 hours.

Compliance and Audit

Regulatory Requirements

  • HIPAA: Documented DR plan, annual testing
  • PCI DSS: Backup testing, incident response
  • SOC 2: Business continuity controls
  • ISO 22301: Business continuity management

Documentation Requirements

  • Business Impact Analysis (BIA)
  • Risk Assessment Reports
  • DR Procedures and Runbooks
  • Test Results and Reports
  • Training Records
  • Vendor Agreements
  • Insurance Policies

Continuous Improvement

Post-Incident Review

  • What went well?
  • What could be improved?
  • Were RTOs/RPOs met?
  • Communication effectiveness
  • Tool and process gaps
  • Training needs

Metrics and KPIs

  • Actual vs. target RTO/RPO
  • Test success rate
  • Time to detect failures
  • Backup success rate
  • DR drill participation