BCDR Planning Guide
Comprehensive guide to Business Continuity and Disaster Recovery (BCDR) planning to ensure your organization can maintain operations and recover quickly from disruptions.
BCDR Fundamentals
Business Continuity and Disaster Recovery planning ensures your organization can continue operating during disruptions and recover critical systems within acceptable timeframes.
Key Concepts
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
- MTTR (Mean Time To Recovery): Average recovery time
- BIA (Business Impact Analysis): Impact assessment of disruptions
- DR Tiers: Different levels of recovery capabilities
Business Impact Analysis
Critical System Classification
| Tier | Description | RTO | RPO | Examples |
|---|---|---|---|---|
| Tier 0 | Mission Critical | < 1 hour | < 15 minutes | Payment processing, core database |
| Tier 1 | Business Critical | < 4 hours | < 1 hour | Customer portal, order system |
| Tier 2 | Business Important | < 24 hours | < 4 hours | Internal tools, reporting |
| Tier 3 | Business Supporting | < 72 hours | < 24 hours | Development, testing systems |
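As a rough illustration of how these targets can be checked programmatically, the sketch below compares a measured outage against the tier targets from the table. The `Outage` record and the tier dictionary are assumptions made for this example, not part of any standard tooling.

```python
from dataclasses import dataclass
from datetime import timedelta

# Tier targets from the classification table above (illustrative values)
TIER_TARGETS = {
    0: {"rto": timedelta(hours=1), "rpo": timedelta(minutes=15)},
    1: {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    2: {"rto": timedelta(hours=24), "rpo": timedelta(hours=4)},
    3: {"rto": timedelta(hours=72), "rpo": timedelta(hours=24)},
}

@dataclass
class Outage:
    tier: int
    downtime: timedelta           # time until service was restored (actual RTO)
    data_loss_window: timedelta   # age of the last recoverable transaction (actual RPO)

def meets_objectives(outage: Outage) -> bool:
    """Return True if the outage stayed within the tier's RTO and RPO targets."""
    target = TIER_TARGETS[outage.tier]
    return outage.downtime <= target["rto"] and outage.data_loss_window <= target["rpo"]

# Example: a Tier 1 outage lasting 3 hours with 30 minutes of data loss
print(meets_objectives(Outage(tier=1, downtime=timedelta(hours=3),
                              data_loss_window=timedelta(minutes=30))))  # True
```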
Risk Assessment Matrix
```text
# Risk calculation formula
Risk Score = Likelihood × Impact   (each rated on a 1-5 scale)

# Risk categories
- Natural disasters: Earthquakes, floods, fires
- Technical failures: Hardware failure, software bugs
- Human errors: Misconfigurations, accidental deletion
- Security incidents: Ransomware, data breaches
- Infrastructure: Power outages, network failures

# Mitigation priorities
High Risk (Score 16-25): Immediate action required
Medium Risk (Score 9-15): Plan mitigation within 30 days
Low Risk (Score 1-8): Monitor and review quarterly
```
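A minimal sketch of this scoring in code, assuming the 1-5 likelihood and impact ratings and the thresholds listed above (the example risks and their ratings are illustrative):

```python
def risk_priority(likelihood: int, impact: int) -> str:
    """Bucket a likelihood x impact score into the mitigation priorities above."""
    score = likelihood * impact
    if score >= 16:
        return f"High ({score}): immediate action required"
    if score >= 9:
        return f"Medium ({score}): plan mitigation within 30 days"
    return f"Low ({score}): monitor and review quarterly"

risks = {
    "Ransomware": (3, 5),            # (likelihood, impact) -- illustrative ratings
    "Regional power outage": (2, 4),
    "Accidental deletion": (4, 3),
}
for name, (likelihood, impact) in sorted(risks.items()):
    print(f"{name}: {risk_priority(likelihood, impact)}")
```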
DR Architecture Patterns
1. Backup and Restore
Cost-effective for Tier 2-3 systems. Typical RTO: hours to days; RPO: hours.

Implementation:
- Regular automated backups to an offsite location
- Documented restore procedures
- Periodic restore testing

AWS example:

```bash
aws s3 sync /data s3://dr-backup-bucket/data --delete
aws backup create-backup-plan --backup-plan file://backup-plan.json
```
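To complement the periodic restore testing, a small check like the following can confirm the newest offsite copy is still within the RPO window. This is a sketch using boto3; the bucket name, prefix, and the 4-hour RPO are placeholders for this example.

```python
from datetime import datetime, timedelta, timezone

import boto3

def latest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return the age of the newest object under the given prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    if newest is None:
        raise RuntimeError(f"no backups found under s3://{bucket}/{prefix}")
    return datetime.now(timezone.utc) - newest

# Placeholder bucket/prefix and a 4-hour RPO (Tier 2 in the table above)
age = latest_backup_age("dr-backup-bucket", "data/")
assert age <= timedelta(hours=4), f"RPO exceeded: newest backup is {age} old"
```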
2. Pilot Light
Minimal DR footprint for Tier 1-2 systems. Typical RTO: hours; RPO: minutes to hours. Core data is continuously replicated, while compute stays at minimal size until DR is activated.

```hcl
# Pilot-light compute: kept small, scaled up during DR activation
resource "aws_instance" "dr_database" {
  ami           = data.aws_ami.database.id
  instance_type = "t3.small" # minimal size

  lifecycle {
    # Allow manual resizing during DR activation without Terraform reverting it
    ignore_changes = [instance_type]
  }
}

# Continuous data replication from the primary database
resource "aws_db_instance" "primary" {
  # ... primary configuration

  # Enable automated backups (required before creating read replicas)
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
}

resource "aws_db_instance" "dr" {
  replicate_source_db = aws_db_instance.primary.identifier
  instance_class      = "db.t3.small" # minimal size
}
```
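During DR activation the pilot light is "turned up": the replica is promoted and the compute resized. A sketch of that activation using boto3 is shown below; the region, instance identifiers, and target sizes are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # DR region (placeholder)
ec2 = boto3.client("ec2", region_name="us-west-2")

# 1. Promote the read replica to a standalone primary
rds.promote_read_replica(DBInstanceIdentifier="prod-db-replica-dr")

# 2. Resize the database once promotion completes
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="prod-db-replica-dr")
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db-replica-dr",
    DBInstanceClass="db.r5.large",   # production-sized class (placeholder)
    ApplyImmediately=True,
)

# 3. Resize and start the pilot-light application instance
instance_id = "i-0123456789abcdef0"  # placeholder
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={"Value": "m5.xlarge"})
ec2.start_instances(InstanceIds=[instance_id])
```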
3. Warm Standby
Scaled-down but running environment for Tier 0-1 systems. Typical RTO: minutes; RPO: seconds. The DR region runs a reduced replica count that is scaled up when DR is activated.

```yaml
# Kubernetes multi-region setup
apiVersion: v1
kind: Service
metadata:
  name: global-app-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 443
      targetPort: 8443
---
# Primary region deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-primary
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      region: primary
  template:
    metadata:
      labels:
        app: myapp
        region: primary
    spec:
      containers:
        - name: myapp
          image: myapp:latest   # illustrative image
          ports:
            - containerPort: 8443
---
# DR region deployment (reduced capacity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-dr
  namespace: production-dr
spec:
  replicas: 3   # scale up during DR activation
  selector:
    matchLabels:
      app: myapp
      region: dr
  template:
    metadata:
      labels:
        app: myapp
        region: dr
    spec:
      containers:
        - name: myapp
          image: myapp:latest   # illustrative image
          ports:
            - containerPort: 8443
```
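Activation of a warm standby is mostly a scale-up. A sketch using the official Kubernetes Python client, against the deployment and namespace names from the manifests above (the kube context name is a placeholder):

```python
import time

from kubernetes import client, config

# Load credentials for the DR cluster (the context name is a placeholder)
config.load_kube_config(context="dr-cluster")
apps = client.AppsV1Api()

# Scale the DR deployment from its standby size to full capacity
apps.patch_namespaced_deployment_scale(
    name="myapp-dr",
    namespace="production-dr",
    body={"spec": {"replicas": 10}},
)

# Block until the new replicas report ready
while True:
    status = apps.read_namespaced_deployment_status("myapp-dr", "production-dr").status
    if (status.ready_replicas or 0) >= 10:
        break
    time.sleep(5)
```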
4. Active-Active Multi-Region
Full redundancy for Tier 0 systems, with every region able to serve traffic. RTO approaches zero and RPO is near zero. The example below uses Route53 health-checked DNS records (shown with a failover policy; for true active-active, latency-based or weighted routing lets all healthy regions receive traffic).

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "30"
}

resource "aws_route53_record" "www" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "www.example.com"
  type            = "A"
  set_identifier  = "Primary"
  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }
}
```

Database writes are replicated between regions. The snippet below sets up PostgreSQL logical replication in one direction; repeat it in the opposite direction for multi-master operation and plan for conflict handling.

```sql
-- On the primary region
CREATE PUBLICATION myapp_pub FOR ALL TABLES;

-- On the DR region
CREATE SUBSCRIPTION myapp_sub
  CONNECTION 'host=primary.region.rds.amazonaws.com dbname=myapp'
  PUBLICATION myapp_pub
  WITH (copy_data = false, synchronous_commit = 'remote_apply');
```
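A near-zero RPO only holds if replication lag stays small, so it is worth monitoring continuously. A sketch using psycopg2 and the `pg_stat_replication` view on the publisher (host, credentials, and the 5-second threshold are placeholders):

```python
from datetime import timedelta

import psycopg2

# Connection details are placeholders; in practice pull them from a secrets store
conn = psycopg2.connect(host="primary.region.rds.amazonaws.com",
                        dbname="myapp", user="monitor", password="...")
try:
    with conn.cursor() as cur:
        # replay_lag: time between commit on the primary and replay on the subscriber
        cur.execute("SELECT application_name, replay_lag FROM pg_stat_replication;")
        for name, lag in cur.fetchall():
            if lag is not None and lag > timedelta(seconds=5):
                print(f"WARNING: {name} replay lag {lag} exceeds the 5s target")
finally:
    conn.close()
```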
Backup Strategies
3-2-1 Backup Rule
- 3 copies of important data
- 2 different storage media types
- 1 offsite backup copy
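A minimal sketch of verifying an inventory against this rule; the inventory structure and location names are assumptions for illustration:

```python
# Each entry describes one copy of the data set
copies = [
    {"location": "primary-dc", "media": "disk"},
    {"location": "primary-dc", "media": "tape"},
    {"location": "aws-us-west-2", "media": "object-storage"},  # offsite copy
]

def satisfies_3_2_1(copies):
    media_types = {c["media"] for c in copies}
    offsite = [c for c in copies if c["location"] != "primary-dc"]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1

print(satisfies_3_2_1(copies))  # True
```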
Automated Backup Implementation
```bash
#!/bin/bash
# Comprehensive backup script
set -euo pipefail

# Configuration
BACKUP_ROOT="/backup"
S3_BUCKET="s3://company-backups"
GLACIER_BUCKET="s3://company-archives"
RETENTION_DAYS=30
# DB_HOST and DB_USER are expected to be provided by the environment

# Database backup
backup_database() {
    local db_name=$1
    local backup_file="${BACKUP_ROOT}/db/${db_name}_$(date +%Y%m%d_%H%M%S).sql.gz"

    pg_dump -h "${DB_HOST}" -U "${DB_USER}" -d "${db_name}" | gzip > "${backup_file}"

    # Encrypt backup
    gpg --symmetric --cipher-algo AES256 --batch --passphrase-file /etc/backup.key "${backup_file}"

    # Upload to S3
    aws s3 cp "${backup_file}.gpg" "${S3_BUCKET}/database/" --storage-class STANDARD_IA

    # Archive month-old backups to Glacier
    aws s3 sync "${S3_BUCKET}/database/" "${GLACIER_BUCKET}/database/" \
        --exclude "*" --include "*_$(date -d '30 days ago' +%Y%m)*.gpg" \
        --storage-class GLACIER

    # Prune local copies past the retention window
    find "${BACKUP_ROOT}/db" -type f -mtime +"${RETENTION_DAYS}" -delete
}

# Application backup
backup_application() {
    local app_name=$1
    local backup_file="${BACKUP_ROOT}/apps/${app_name}_$(date +%Y%m%d_%H%M%S).tar.gz"

    tar -czf "${backup_file}" "/opt/${app_name}" --exclude='*.log' --exclude='cache/*'

    # Maintain an incremental copy
    rsync -avz --delete "/opt/${app_name}/" "${BACKUP_ROOT}/incremental/${app_name}/"

    # Replicate to DR site
    rsync -avz "${backup_file}" dr-site:/backup/apps/
}

# Verify backups by restoring the most recent one into a scratch database
verify_backups() {
    local test_dir="/tmp/backup_test_$(date +%s)"
    mkdir -p "${test_dir}"

    # Fetch the latest database backup
    latest_backup=$(aws s3 ls "${S3_BUCKET}/database/" | tail -1 | awk '{print $4}')
    aws s3 cp "${S3_BUCKET}/database/${latest_backup}" "${test_dir}/"

    # Decrypt, decompress, and load into a throwaway database
    dropdb --if-exists -h localhost -U postgres test_restore
    createdb -h localhost -U postgres test_restore
    gpg --decrypt --batch --passphrase-file /etc/backup.key "${test_dir}/${latest_backup}" | \
        gunzip | psql -h localhost -U postgres -d test_restore

    # Cleanup
    dropdb -h localhost -U postgres test_restore
    rm -rf "${test_dir}"
}

# Entry point (illustrative): back up each database named on the command line, then verify.
# Application backups can be scheduled separately via backup_application.
for db in "$@"; do
    backup_database "${db}"
done
verify_backups
```
DR Procedures
Failover Runbook
# DR Activation Checklist
## 1. Assessment (0-15 minutes)
- [ ] Confirm primary site failure
- [ ] Assess impact scope
- [ ] Notify incident response team
- [ ] Decision: Activate DR?
## 2. Communication (15-30 minutes)
- [ ] Notify executive team
- [ ] Update status page
- [ ] Inform customer support
- [ ] Prepare customer communication
## 3. Technical Failover (30-60 minutes)
### Database Failover
```bash
# Promote DR database
aws rds promote-read-replica \
--db-instance-identifier prod-db-replica-dr
# Update connection strings
kubectl set env deployment/app \
DATABASE_URL=postgresql://dr-region.rds.amazonaws.com/prod
```
### Application Failover
```bash
# Scale up DR environment
kubectl scale deployment myapp-dr --replicas=10 -n production-dr
# Update DNS
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch file://failover-dns.json
```
### Verify Services
```bash
# Health checks
curl -f https://dr.example.com/health || exit 1
# Smoke tests
./scripts/dr-smoke-tests.sh
```
## 4. Monitoring (60+ minutes)
- [ ] Monitor application metrics
- [ ] Check error rates
- [ ] Verify data integrity
- [ ] Customer impact assessment
Failback Procedures
# Failback to Primary Site
## 1. Primary Site Recovery
- Repair/replace failed components
- Restore systems to operational state
- Verify all services functioning
## 2. Data Synchronization
```bash
# Reverse-replicate data changes (--clean --if-exists recreates objects on the primary)
pg_dump --clean --if-exists -h dr-database -d prod | psql -h primary-database -d prod
# Verify data consistency
./scripts/data-consistency-check.sh primary dr
```
## 3. Controlled Failback
- Schedule maintenance window
- Gradually shift traffic back
- Monitor for issues
- Complete DNS updates
## 4. Post-Failback
- Document lessons learned
- Update procedures
- Test improvements
Testing and Validation
DR Test Schedule
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Backup Verification | Daily | Automated restore test | 30 minutes |
| Component Failover | Monthly | Individual services | 2 hours |
| Partial DR Test | Quarterly | Critical systems only | 4 hours |
| Full DR Exercise | Annually | Complete failover | 8 hours |
Test Automation
```python
import time
from datetime import timedelta

# Automated DR testing framework.
# Helper methods (list_recent_backups, download_backup, verify_checksum, test_restore,
# verify_data_integrity, get_replication_lag, simulate_primary_failure, is_dr_active,
# verify_dr_health, restore_primary_site) are environment-specific and implemented elsewhere.
class DRTestSuite:
    def __init__(self):
        self.test_results = []

    def test_backup_integrity(self):
        """Verify backup files are valid and restorable."""
        backups = self.list_recent_backups()
        for backup in backups:
            # Download backup
            local_file = self.download_backup(backup)
            # Verify checksum
            assert self.verify_checksum(local_file)
            # Test restore
            restore_result = self.test_restore(local_file)
            assert restore_result.success
            # Verify data integrity
            assert self.verify_data_integrity(restore_result.database)

    def test_replication_lag(self):
        """Ensure replication lag is within acceptable limits."""
        lag = self.get_replication_lag()
        assert lag < timedelta(seconds=5)

    def test_failover_automation(self):
        """Test automated failover scripts."""
        # Create test failure condition
        self.simulate_primary_failure()

        # Wait for automatic failover
        start_time = time.time()
        while not self.is_dr_active():
            if time.time() - start_time > 300:  # 5-minute timeout
                raise Exception("Failover did not complete in time")
            time.sleep(10)

        # Verify DR site is serving traffic
        assert self.verify_dr_health()

        # Restore primary
        self.restore_primary_site()
```
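A scheduler can drive the suite on the cadence in the table above; a minimal invocation sketch (which checks are run on which schedule is an assumption for illustration):

```python
if __name__ == "__main__":
    suite = DRTestSuite()
    # Daily cadence: backup and replication checks; failover automation runs monthly
    suite.test_backup_integrity()
    suite.test_replication_lag()
    print("DR checks passed")
```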
Cloud-Specific DR
AWS Disaster Recovery
```hcl
# AWS Backup for centralized backup management
resource "aws_backup_plan" "dr_plan" {
  name = "comprehensive-dr-plan"

  rule {
    rule_name         = "hourly_snapshots"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 * * * ? *)"

    lifecycle {
      delete_after = 24 # days
    }

    recovery_point_tags = {
      Type = "Hourly"
    }
  }

  rule {
    rule_name         = "daily_backups"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 3 * * ? *)"

    lifecycle {
      cold_storage_after = 7
      delete_after       = 97 # must be at least 90 days after the cold storage transition
    }
  }

  advanced_backup_setting {
    backup_options = {
      WindowsVSS = "enabled"
    }
    resource_type = "EC2"
  }
}

# Cross-region replication (versioning must be enabled on both buckets)
resource "aws_s3_bucket_replication_configuration" "dr_replication" {
  role   = aws_iam_role.replication.arn
  bucket = aws_s3_bucket.primary.id

  rule {
    id     = "dr-replication"
    status = "Enabled"

    # Required when using replication_time (S3 Replication Time Control)
    filter {}
    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
```
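Once the plan is in place, it helps to verify that recovery points are actually being produced. A sketch using boto3 (the vault name is a placeholder that must match the `aws_backup_vault` referenced above):

```python
from datetime import datetime, timedelta, timezone

import boto3

backup = boto3.client("backup")

# List recent recovery points in the DR vault
points = backup.list_recovery_points_by_backup_vault(
    BackupVaultName="dr_vault",  # placeholder vault name
    ByCreatedAfter=datetime.now(timezone.utc) - timedelta(hours=2),
)["RecoveryPoints"]

completed = [p for p in points if p["Status"] == "COMPLETED"]
print(f"{len(completed)} completed recovery points in the last 2 hours")
assert completed, "hourly backup rule does not appear to be producing recovery points"
```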
Multi-Cloud DR
```hcl
# Terraform multi-cloud DR setup

# Primary in AWS
module "aws_primary" {
  source      = "./modules/aws-infrastructure"
  region      = "us-east-1"
  environment = "production"
  role        = "primary"
}

# DR in Azure
module "azure_dr" {
  source      = "./modules/azure-infrastructure"
  location    = "eastus2"
  environment = "production-dr"
  role        = "standby"
}

# GCP for additional redundancy
module "gcp_backup" {
  source      = "./modules/gcp-infrastructure"
  region      = "us-central1"
  environment = "production-backup"
  role        = "backup"
}

# Cross-cloud data sync
resource "null_resource" "cross_cloud_sync" {
  provisioner "local-exec" {
    command = <<-EOT
      # Sync data between clouds
      rclone sync aws:prod-data azure:dr-data --transfers 32
      rclone sync aws:prod-data gcs:backup-data --transfers 32
    EOT
  }

  triggers = {
    always_run = timestamp()
  }
}
```
Communication Plan
Stakeholder Matrix
| Stakeholder | Notification Time | Method | Information Level |
|---|---|---|---|
| Executive Team | Immediate | Phone, Email | High-level impact |
| Technical Team | Immediate | PagerDuty, Slack | Technical details |
| Customer Support | 15 minutes | Email, Slack | Customer talking points |
| Customers | 30 minutes | Status page, Email | Service impact |
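Notifications on this matrix can be partly automated. A sketch using plain HTTP calls; the Slack webhook URL, status-page endpoint, and token are placeholders, and the status-page API shape is an assumption for illustration:

```python
import requests

# Placeholders: a Slack incoming-webhook URL and a status-page API endpoint/token
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
STATUSPAGE_API = "https://api.statuspage.example/v1/incidents"
STATUSPAGE_TOKEN = "..."

def notify_dr_activation(summary: str, eta: str):
    # Technical team: immediate Slack notification
    requests.post(SLACK_WEBHOOK, json={
        "text": f":rotating_light: DR activation in progress: {summary} (ETA {eta})"
    }, timeout=10)

    # Customers: status page update within 30 minutes
    requests.post(STATUSPAGE_API, json={
        "name": "Service Disruption - DR Activation in Progress",
        "status": "identified",
        "body": f"{summary}. Estimated resolution: {eta}.",
    }, headers={"Authorization": f"Bearer {STATUSPAGE_TOKEN}"}, timeout=10)

notify_dr_activation("Primary region failure; failing over to DR", eta="60 minutes")
```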
Communication Templates
```text
# Initial notification
Subject: [URGENT] Service Disruption - DR Activation in Progress

We are currently experiencing a service disruption affecting [services].
Our team has activated disaster recovery procedures.

Current Status: Failover in progress
Estimated Resolution: [time]
Impact: [description]
Updates: [status page URL]

# Recovery notification
Subject: Service Restored - Post-Incident Report to Follow

Service has been restored as of [time].
All systems are operational.

Duration: [total time]
Impact: [summary]
Next Steps: Post-incident review scheduled for [date]
Full report will be provided within 24 hours.
```
Compliance and Audit
Regulatory Requirements
- HIPAA: Documented DR plan, annual testing
- PCI DSS: Backup testing, incident response
- SOC 2: Business continuity controls
- ISO 22301: Business continuity management
Documentation Requirements
- Business Impact Analysis (BIA)
- Risk Assessment Reports
- DR Procedures and Runbooks
- Test Results and Reports
- Training Records
- Vendor Agreements
- Insurance Policies
Continuous Improvement
Post-Incident Review
- What went well?
- What could be improved?
- Were RTOs/RPOs met?
- Communication effectiveness
- Tool and process gaps
- Training needs
Metrics and KPIs
- Actual vs. target RTO/RPO
- Test success rate
- Time to detect failures
- Backup success rate
- DR drill participation
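A small sketch of computing two of these KPIs from drill records; the record structure and the sample values are assumptions for illustration:

```python
from datetime import timedelta

# Illustrative drill records: measured values vs. the tier targets defined earlier
drills = [
    {"tier": 1, "actual_rto": timedelta(hours=3), "target_rto": timedelta(hours=4), "passed": True},
    {"tier": 0, "actual_rto": timedelta(minutes=75), "target_rto": timedelta(hours=1), "passed": False},
    {"tier": 2, "actual_rto": timedelta(hours=20), "target_rto": timedelta(hours=24), "passed": True},
]

rto_met = sum(1 for d in drills if d["actual_rto"] <= d["target_rto"])
success_rate = sum(1 for d in drills if d["passed"]) / len(drills)

print(f"RTO met in {rto_met}/{len(drills)} drills")
print(f"DR test success rate: {success_rate:.0%}")
```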
Note: This documentation is provided for reference purposes only. It reflects general best practices and industry-aligned guidelines, and any examples, claims, or recommendations are intended as illustrative—not definitive or binding.