Cloud-Native Disaster Recovery Architectures: Enterprise Implementation Guide
Disaster recovery (DR) in cloud-native environments requires fundamentally different approaches compared to traditional infrastructure. With Kubernetes orchestrating containerized workloads across distributed systems, organizations must implement sophisticated DR strategies that account for application state, persistent data, configuration, and infrastructure-as-code. This comprehensive guide covers enterprise-grade disaster recovery architectures, automation patterns, and operational procedures for cloud-native environments.
Disaster Recovery Fundamentals
RTO and RPO Definition
Understanding Recovery Time Objective (RTO) and Recovery Point Objective (RPO) is critical for designing appropriate DR strategies:
# DR tier classification with RTO/RPO requirements
---
apiVersion: v1
kind: ConfigMap
metadata:
name: dr-tier-definitions
namespace: dr-system
data:
tiers.yaml: |
# Tier 1: Mission Critical (Highest priority)
tier1:
rto: "< 15 minutes"
rpo: "< 5 minutes"
replication: "synchronous"
backup_frequency: "continuous"
testing_frequency: "monthly"
cost_multiplier: 5x
examples:
- payment-processing
- authentication-services
- core-api-services
# Tier 2: Business Critical
tier2:
rto: "< 1 hour"
rpo: "< 15 minutes"
replication: "asynchronous"
backup_frequency: "every-15-min"
testing_frequency: "quarterly"
cost_multiplier: 3x
examples:
- customer-portal
- order-management
- inventory-system
# Tier 3: Business Important
tier3:
rto: "< 4 hours"
rpo: "< 1 hour"
replication: "scheduled"
backup_frequency: "hourly"
testing_frequency: "biannually"
cost_multiplier: 2x
examples:
- reporting-services
- analytics-platform
- admin-tools
# Tier 4: Business Operational
tier4:
rto: "< 24 hours"
rpo: "< 4 hours"
replication: "daily"
backup_frequency: "every-4-hours"
testing_frequency: "annually"
cost_multiplier: 1.5x
examples:
- internal-tools
- development-environments
- test-systems
---
# Application DR tier labeling
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-processor
namespace: production
labels:
app: payment-processor
dr-tier: tier1
criticality: mission-critical
annotations:
dr.policy/rto: "15m"
dr.policy/rpo: "5m"
dr.policy/backup-frequency: "continuous"
dr.policy/replication: "synchronous"
spec:
replicas: 10
selector:
matchLabels:
app: payment-processor
template:
metadata:
labels:
app: payment-processor
dr-tier: tier1
spec:
containers:
- name: processor
image: payment-processor:v2.5.1
resources:
requests:
memory: "2Gi"
cpu: "1000m"
DR Architecture Patterns
# Active-Active Multi-Region Architecture
---
apiVersion: v1
kind: ConfigMap
metadata:
name: dr-active-active-config
namespace: dr-system
data:
architecture.yaml: |
pattern: active-active
description: "Traffic distributed across multiple active regions"
regions:
- name: us-east-1
role: primary
traffic_percentage: 40
capacity: 100%
data_replication: bidirectional
- name: us-west-2
role: primary
traffic_percentage: 35
capacity: 100%
data_replication: bidirectional
- name: eu-west-1
role: primary
traffic_percentage: 25
capacity: 100%
data_replication: bidirectional
failover:
type: automatic
detection_window: 30s
traffic_shift_duration: 60s
data_consistency: eventual
benefits:
- Zero RTO for regional failures
- Improved performance through geographic distribution
- Load distribution across regions
- No idle capacity
challenges:
- Data consistency complexity
- Higher operational cost
- Complex conflict resolution
- Increased network traffic
---
# Active-Passive Architecture
apiVersion: v1
kind: ConfigMap
metadata:
name: dr-active-passive-config
namespace: dr-system
data:
architecture.yaml: |
pattern: active-passive
description: "Primary region active, standby region ready for failover"
regions:
- name: us-east-1
role: primary
traffic_percentage: 100
capacity: 100%
data_replication: source
- name: us-west-2
role: standby
traffic_percentage: 0
capacity: 30% # Warm standby
data_replication: destination
failover:
type: manual-with-automation
detection_window: 2m
approval_required: true
traffic_shift_duration: 5m
capacity_scale_duration: 10m
benefits:
- Lower operational cost
- Simpler data consistency
- Clear primary/secondary roles
- Reduced complexity
challenges:
- Higher RTO (5-15 minutes)
- Idle standby capacity cost
- Requires failover procedure
- Data replication lag
Kubernetes Backup Strategies
Velero for Cluster Backup
# Velero installation with multi-region support
---
apiVersion: v1
kind: Namespace
metadata:
name: velero
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: velero
namespace: velero
---
# AWS credentials for S3 backup storage
apiVersion: v1
kind: Secret
metadata:
name: cloud-credentials
namespace: velero
type: Opaque
stringData:
cloud: |
[default]
aws_access_key_id=${AWS_ACCESS_KEY_ID}
aws_secret_access_key=${AWS_SECRET_ACCESS_KEY}
---
# Velero BackupStorageLocation for primary region
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: primary-s3
namespace: velero
spec:
provider: aws
objectStorage:
bucket: velero-backups-us-east-1
prefix: production
config:
region: us-east-1
s3ForcePathStyle: "false"
s3Url: https://s3.us-east-1.amazonaws.com
---
# DR region backup storage
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: dr-s3
namespace: velero
spec:
provider: aws
objectStorage:
bucket: velero-backups-us-west-2
prefix: production
config:
region: us-west-2
s3ForcePathStyle: "false"
s3Url: https://s3.us-west-2.amazonaws.com
---
# Volume snapshot location
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
name: aws-snapshots
namespace: velero
spec:
provider: aws
config:
region: us-east-1
---
# Scheduled backup for Tier 1 applications
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: tier1-backup
namespace: velero
spec:
schedule: "*/5 * * * *" # Every 5 minutes
template:
includedNamespaces:
- production
- payment-system
includedResources:
- '*'
labelSelector:
matchLabels:
dr-tier: tier1
snapshotVolumes: true
storageLocation: primary-s3
volumeSnapshotLocations:
- aws-snapshots
ttl: 168h # 7 days retention
hooks:
resources:
- name: postgres-backup-hook
includedNamespaces:
- production
labelSelector:
matchLabels:
app: postgres
pre:
- exec:
container: postgres
command:
- /bin/bash
- -c
- pg_dump -U $POSTGRES_USER $POSTGRES_DB > /tmp/backup.sql
onError: Fail
timeout: 10m
post:
- exec:
container: postgres
command:
- /bin/bash
- -c
- rm -f /tmp/backup.sql
---
# Tier 2 backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: tier2-backup
namespace: velero
spec:
schedule: "*/15 * * * *" # Every 15 minutes
template:
includedNamespaces:
- production
labelSelector:
matchLabels:
dr-tier: tier2
snapshotVolumes: true
storageLocation: primary-s3
ttl: 336h # 14 days retention
---
# Daily full cluster backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: full-cluster-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- '*'
includedResources:
- '*'
snapshotVolumes: true
storageLocation: primary-s3
volumeSnapshotLocations:
- aws-snapshots
ttl: 720h # 30 days retention
Application-Aware Backup Hooks
# MySQL backup with consistent snapshots
---
apiVersion: v1
kind: ConfigMap
metadata:
name: mysql-backup-hooks
namespace: production
data:
pre-backup.sh: |
#!/bin/bash
set -e
echo "Starting MySQL pre-backup hook"
# Flush tables and acquire read lock
mysql -u root -p${MYSQL_ROOT_PASSWORD} <<EOF
FLUSH TABLES WITH READ LOCK;
SYSTEM /backup/create-marker.sh;
EOF
echo "MySQL pre-backup hook completed"
post-backup.sh: |
#!/bin/bash
set -e
echo "Starting MySQL post-backup hook"
# Release read lock
mysql -u root -p${MYSQL_ROOT_PASSWORD} <<EOF
UNLOCK TABLES;
EOF
echo "MySQL post-backup hook completed"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: mysql
namespace: production
labels:
app: mysql
dr-tier: tier1
annotations:
backup.velero.io/backup-volumes: data
pre.hook.backup.velero.io/container: mysql
pre.hook.backup.velero.io/command: '["/bin/bash", "/backup/pre-backup.sh"]'
pre.hook.backup.velero.io/on-error: Fail
post.hook.backup.velero.io/container: mysql
post.hook.backup.velero.io/command: '["/bin/bash", "/backup/post-backup.sh"]'
spec:
serviceName: mysql
replicas: 3
selector:
matchLabels:
app: mysql
template:
metadata:
labels:
app: mysql
spec:
containers:
- name: mysql
image: mysql:8.0
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: root-password
volumeMounts:
- name: data
mountPath: /var/lib/mysql
- name: backup-hooks
mountPath: /backup
volumes:
- name: backup-hooks
configMap:
name: mysql-backup-hooks
defaultMode: 0755
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
Cross-Region Data Replication
PostgreSQL Streaming Replication
# PostgreSQL with cross-region replication
---
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-primary-config
namespace: database
data:
postgresql.conf: |
# Connection settings
listen_addresses = '*'
max_connections = 500
superuser_reserved_connections = 10
# Memory settings
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 20MB
# WAL settings for replication
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB
hot_standby = on
hot_standby_feedback = on
# Checkpoint settings
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
max_wal_size = 4GB
min_wal_size = 1GB
# Replication settings
synchronous_commit = remote_apply # For synchronous replication
synchronous_standby_names = 'standby1,standby2'
# Archive settings
archive_mode = on
archive_command = 'aws s3 cp %p s3://postgres-wal-archive/%f --region us-east-1'
restore_command = 'aws s3 cp s3://postgres-wal-archive/%f %p --region us-east-1'
pg_hba.conf: |
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
host replication replicator 0.0.0.0/0 md5
host all all 0.0.0.0/0 md5
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres-primary
namespace: database
labels:
app: postgres
role: primary
region: us-east-1
dr-tier: tier1
spec:
serviceName: postgres-primary
replicas: 1
selector:
matchLabels:
app: postgres
role: primary
template:
metadata:
labels:
app: postgres
role: primary
spec:
containers:
- name: postgres
image: postgres:15.3
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
- name: POSTGRES_USER
value: "admin"
- name: POSTGRES_DB
value: "production"
- name: PGDATA
value: "/var/lib/postgresql/data/pgdata"
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
- name: config
mountPath: /etc/postgresql
- name: wal-archive
mountPath: /wal-archive
livenessProbe:
exec:
command:
- /bin/sh
- -c
- pg_isready -U admin
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
exec:
command:
- /bin/sh
- -c
- pg_isready -U admin
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
memory: "8Gi"
cpu: "4000m"
limits:
memory: "16Gi"
cpu: "8000m"
volumes:
- name: config
configMap:
name: postgres-primary-config
- name: wal-archive
emptyDir: {}
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 500Gi
---
# PostgreSQL standby replica in DR region
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres-standby
namespace: database
labels:
app: postgres
role: standby
region: us-west-2
spec:
serviceName: postgres-standby
replicas: 2
selector:
matchLabels:
app: postgres
role: standby
template:
metadata:
labels:
app: postgres
role: standby
spec:
initContainers:
- name: setup-replication
image: postgres:15.3
command:
- bash
- -c
- |
set -e
if [ ! -f /var/lib/postgresql/data/pgdata/PG_VERSION ]; then
echo "Setting up streaming replication..."
# Create base backup from primary
PGPASSWORD=$REPLICATION_PASSWORD pg_basebackup \
-h postgres-primary.database.svc.cluster.local \
-D /var/lib/postgresql/data/pgdata \
-U replicator \
-X stream \
-c fast \
-P \
-R
# Configure standby settings
cat >> /var/lib/postgresql/data/pgdata/postgresql.auto.conf <<EOF
primary_conninfo = 'host=postgres-primary.database.svc.cluster.local port=5432 user=replicator password=$REPLICATION_PASSWORD application_name=standby1'
primary_slot_name = 'standby1_slot'
restore_command = 'aws s3 cp s3://postgres-wal-archive/%f %p --region us-east-1'
EOF
echo "Replication setup complete"
else
echo "Database already initialized"
fi
env:
- name: REPLICATION_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: replication-password
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
containers:
- name: postgres
image: postgres:15.3
ports:
- containerPort: 5432
name: postgres
env:
- name: PGDATA
value: "/var/lib/postgresql/data/pgdata"
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
resources:
requests:
memory: "8Gi"
cpu: "4000m"
limits:
memory: "16Gi"
cpu: "8000m"
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 500Gi
Redis Cross-Region Replication
# Redis with active-passive replication
---
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-primary-config
namespace: cache
data:
redis.conf: |
bind 0.0.0.0
protected-mode yes
requirepass ${REDIS_PASSWORD}
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
# Persistence
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /data
# Replication
min-replicas-to-write 1
min-replicas-max-lag 10
replica-serve-stale-data no
replica-priority 100
# AOF persistence
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Memory management
maxmemory 8gb
maxmemory-policy allkeys-lru
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-primary
namespace: cache
labels:
app: redis
role: primary
region: us-east-1
dr-tier: tier1
spec:
serviceName: redis-primary
replicas: 1
selector:
matchLabels:
app: redis
role: primary
template:
metadata:
labels:
app: redis
role: primary
annotations:
backup.velero.io/backup-volumes: data
spec:
containers:
- name: redis
image: redis:7.2
command:
- redis-server
- /conf/redis.conf
ports:
- containerPort: 6379
name: redis
env:
- name: REDIS_PASSWORD
valueFrom:
secretKeyRef:
name: redis-secret
key: password
volumeMounts:
- name: config
mountPath: /conf
- name: data
mountPath: /data
livenessProbe:
exec:
command:
- redis-cli
- --raw
- incr
- ping
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
exec:
command:
- redis-cli
- --raw
- incr
- ping
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
memory: "8Gi"
cpu: "2000m"
limits:
memory: "16Gi"
cpu: "4000m"
volumes:
- name: config
configMap:
name: redis-primary-config
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
---
# Redis replica in DR region
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-replica
namespace: cache
labels:
app: redis
role: replica
region: us-west-2
spec:
serviceName: redis-replica
replicas: 2
selector:
matchLabels:
app: redis
role: replica
template:
metadata:
labels:
app: redis
role: replica
spec:
containers:
- name: redis
image: redis:7.2
command:
- redis-server
- --replicaof
- redis-primary.cache.svc.cluster.local
- "6379"
- --masterauth
- $(REDIS_PASSWORD)
- --requirepass
- $(REDIS_PASSWORD)
ports:
- containerPort: 6379
name: redis
env:
- name: REDIS_PASSWORD
valueFrom:
secretKeyRef:
name: redis-secret
key: password
volumeMounts:
- name: data
mountPath: /data
resources:
requests:
memory: "8Gi"
cpu: "2000m"
limits:
memory: "16Gi"
cpu: "4000m"
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
Automated DR Testing and Validation
DR Test Orchestration
#!/usr/bin/env python3
"""
Automated disaster recovery testing framework
"""
import time
import logging
from datetime import datetime
from kubernetes import client, config
import boto3
from typing import Dict, List
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DRTestOrchestrator:
def __init__(self, primary_region: str, dr_region: str):
self.primary_region = primary_region
self.dr_region = dr_region
self.test_results = []
def run_dr_test(self, test_type: str = "full"):
"""Execute DR test scenario"""
logger.info(f"Starting DR test: {test_type}")
test_start = datetime.now()
try:
# Phase 1: Pre-test validation
logger.info("Phase 1: Pre-test validation")
self.validate_primary_cluster()
self.validate_dr_cluster()
self.validate_backups()
# Phase 2: Simulate failure
logger.info("Phase 2: Simulating primary region failure")
self.simulate_primary_failure()
# Phase 3: Initiate failover
logger.info("Phase 3: Initiating failover to DR region")
failover_start = datetime.now()
self.perform_failover()
failover_duration = (datetime.now() - failover_start).total_seconds()
# Phase 4: Validate DR environment
logger.info("Phase 4: Validating DR environment")
self.validate_dr_services()
self.validate_data_integrity()
self.run_smoke_tests()
# Phase 5: Measure RTO/RPO
logger.info("Phase 5: Measuring RTO/RPO")
rto = failover_duration / 60 # minutes
rpo = self.measure_data_loss() # minutes
# Phase 6: Failback to primary
logger.info("Phase 6: Failing back to primary region")
self.perform_failback()
# Phase 7: Post-test validation
logger.info("Phase 7: Post-test validation")
self.validate_primary_cluster()
test_duration = (datetime.now() - test_start).total_seconds()
# Record results
result = {
'test_type': test_type,
'start_time': test_start.isoformat(),
'duration_seconds': test_duration,
'rto_minutes': rto,
'rpo_minutes': rpo,
'status': 'passed',
'phases_completed': 7
}
self.test_results.append(result)
logger.info(f"DR test completed successfully: RTO={rto:.2f}m, RPO={rpo:.2f}m")
return result
except Exception as e:
logger.error(f"DR test failed: {str(e)}")
result = {
'test_type': test_type,
'start_time': test_start.isoformat(),
'status': 'failed',
'error': str(e)
}
self.test_results.append(result)
raise
def validate_primary_cluster(self):
"""Validate primary cluster health"""
config.load_kube_config(context=f"eks-{self.primary_region}")
v1 = client.CoreV1Api()
# Check node status
nodes = v1.list_node()
ready_nodes = sum(1 for node in nodes.items
if any(c.type == "Ready" and c.status == "True"
for c in node.status.conditions))
if ready_nodes < len(nodes.items) * 0.9:
raise Exception(f"Primary cluster unhealthy: {ready_nodes}/{len(nodes.items)} nodes ready")
logger.info(f"Primary cluster healthy: {ready_nodes} nodes ready")
def validate_dr_cluster(self):
"""Validate DR cluster readiness"""
config.load_kube_config(context=f"eks-{self.dr_region}")
v1 = client.CoreV1Api()
# Check DR deployments are scaled to minimum
apps_v1 = client.AppsV1Api()
deployments = apps_v1.list_deployment_for_all_namespaces(
label_selector="dr-role=passive"
)
for deployment in deployments.items:
if deployment.status.available_replicas < deployment.spec.replicas:
raise Exception(f"DR deployment {deployment.metadata.name} not ready")
logger.info("DR cluster ready for failover")
def validate_backups(self):
"""Validate backup availability and integrity"""
# Check Velero backups
config.load_kube_config(context=f"eks-{self.primary_region}")
custom_api = client.CustomObjectsApi()
backups = custom_api.list_namespaced_custom_object(
group="velero.io",
version="v1",
namespace="velero",
plural="backups",
label_selector="dr-tier=tier1"
)
recent_backups = [b for b in backups['items']
if (datetime.now() - datetime.fromisoformat(
b['status']['startTimestamp'].replace('Z', '+00:00'))
).total_seconds() < 600] # Last 10 minutes
if not recent_backups:
raise Exception("No recent backups found for tier1 applications")
logger.info(f"Found {len(recent_backups)} recent backups")
def simulate_primary_failure(self):
"""Simulate primary region failure"""
# In test mode, just update DNS weights to simulate failure
# In production, this would be an actual failure scenario
logger.info("Simulating primary region failure (test mode)")
time.sleep(5)
def perform_failover(self):
"""Perform failover to DR region"""
config.load_kube_config(context=f"eks-{self.dr_region}")
apps_v1 = client.AppsV1Api()
# Scale up DR deployments
deployments = apps_v1.list_deployment_for_all_namespaces(
label_selector="dr-role=passive"
)
for deployment in deployments.items:
target_replicas = int(
deployment.metadata.annotations.get('dr-target-replicas', '10')
)
deployment.spec.replicas = target_replicas
apps_v1.patch_namespaced_deployment(
name=deployment.metadata.name,
namespace=deployment.metadata.namespace,
body=deployment
)
logger.info(f"Scaled {deployment.metadata.name} to {target_replicas} replicas")
# Wait for deployments to be ready
time.sleep(60)
# Update DNS to point to DR region
self.update_dns_to_dr()
def update_dns_to_dr(self):
"""Update DNS to DR region"""
route53 = boto3.client('route53')
# Update Route53 weighted routing
route53.change_resource_record_sets(
HostedZoneId='Z1234567890ABC',
ChangeBatch={
'Changes': [
{
'Action': 'UPSERT',
'ResourceRecordSet': {
'Name': 'api.example.com',
'Type': 'CNAME',
'SetIdentifier': self.dr_region,
'Weight': 100,
'TTL': 60,
'ResourceRecords': [
{'Value': f'lb-{self.dr_region}.example.com'}
]
}
}
]
}
)
logger.info("Updated DNS to DR region")
def validate_dr_services(self):
"""Validate services running in DR region"""
config.load_kube_config(context=f"eks-{self.dr_region}")
apps_v1 = client.AppsV1Api()
deployments = apps_v1.list_deployment_for_all_namespaces(
label_selector="dr-tier=tier1"
)
for deployment in deployments.items:
if deployment.status.available_replicas < deployment.spec.replicas * 0.9:
raise Exception(f"Deployment {deployment.metadata.name} not healthy in DR")
logger.info("All critical services healthy in DR region")
def validate_data_integrity(self):
"""Validate data integrity after failover"""
# Check database replication lag
# Verify data consistency
# Compare checksums
logger.info("Data integrity validated")
def run_smoke_tests(self):
"""Run smoke tests against DR environment"""
import requests
# Test critical endpoints
endpoints = [
'https://api.example.com/health',
'https://api.example.com/api/v1/status',
]
for endpoint in endpoints:
response = requests.get(endpoint, timeout=10)
if response.status_code != 200:
raise Exception(f"Smoke test failed for {endpoint}: {response.status_code}")
logger.info("Smoke tests passed")
def measure_data_loss(self) -> float:
"""Measure data loss (RPO) in minutes"""
# Compare last transaction in primary vs DR
# Calculate time difference
# Return in minutes
return 3.5 # Example: 3.5 minutes of data loss
def perform_failback(self):
"""Failback to primary region"""
# Restore primary cluster
# Sync data from DR to primary
# Update DNS back to primary
# Scale down DR deployments
logger.info("Failback to primary region completed")
def generate_report(self) -> str:
"""Generate DR test report"""
report = []
report.append("=" * 80)
report.append("DISASTER RECOVERY TEST REPORT")
report.append("=" * 80)
report.append("")
for result in self.test_results:
report.append(f"Test Type: {result['test_type']}")
report.append(f"Start Time: {result['start_time']}")
report.append(f"Status: {result['status']}")
if result['status'] == 'passed':
report.append(f"Duration: {result['duration_seconds']:.2f} seconds")
report.append(f"RTO Achieved: {result['rto_minutes']:.2f} minutes")
report.append(f"RPO Measured: {result['rpo_minutes']:.2f} minutes")
report.append(f"Phases Completed: {result['phases_completed']}/7")
else:
report.append(f"Error: {result.get('error', 'Unknown error')}")
report.append("")
return "\n".join(report)
if __name__ == '__main__':
orchestrator = DRTestOrchestrator(
primary_region='us-east-1',
dr_region='us-west-2'
)
# Run monthly DR test
result = orchestrator.run_dr_test(test_type='full')
# Generate and save report
report = orchestrator.generate_report()
print(report)
with open(f"dr-test-{datetime.now().strftime('%Y%m%d')}.txt", 'w') as f:
f.write(report)
DR Runbook and Procedures
Failover Runbook
#!/bin/bash
# Disaster Recovery Failover Runbook
# Execute this script to failover to DR region
set -e
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
PRIMARY_CLUSTER="prod-us-east"
DR_CLUSTER="prod-us-west"
echo "========================================="
echo "DISASTER RECOVERY FAILOVER PROCEDURE"
echo "========================================="
echo ""
echo "Primary Region: $PRIMARY_REGION"
echo "DR Region: $DR_REGION"
echo ""
read -p "Are you sure you want to proceed with failover? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
echo "Failover cancelled"
exit 0
fi
echo ""
echo "Step 1: Verify DR cluster health"
kubectl config use-context $DR_CLUSTER
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running || true
read -p "DR cluster appears healthy. Continue? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then exit 1; fi
echo ""
echo "Step 2: Scale up DR deployments"
kubectl config use-context $DR_CLUSTER
# Scale tier1 applications
for ns in $(kubectl get ns -l dr-tier=tier1 -o jsonpath='{.items[*].metadata.name}'); do
echo "Scaling deployments in namespace: $ns"
for deploy in $(kubectl get deploy -n $ns -l dr-role=passive -o jsonpath='{.items[*].metadata.name}'); do
target=$(kubectl get deploy $deploy -n $ns -o jsonpath='{.metadata.annotations.dr-target-replicas}')
echo " Scaling $deploy to $target replicas"
kubectl scale deploy $deploy -n $ns --replicas=$target
done
done
echo ""
echo "Step 3: Wait for deployments to be ready (5 minutes)"
sleep 300
kubectl get deployments --all-namespaces -l dr-tier=tier1
read -p "All deployments ready? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then exit 1; fi
echo ""
echo "Step 4: Update DNS to DR region"
python3 << 'EOF'
import boto3
route53 = boto3.client('route53')
# Update weighted routing to send all traffic to DR
route53.change_resource_record_sets(
HostedZoneId='Z1234567890ABC',
ChangeBatch={
'Comment': 'DR Failover',
'Changes': [
{
'Action': 'UPSERT',
'ResourceRecordSet': {
'Name': 'api.example.com',
'Type': 'CNAME',
'SetIdentifier': 'us-east-1',
'Weight': 0,
'TTL': 60,
'ResourceRecords': [{'Value': 'lb-us-east-1.example.com'}]
}
},
{
'Action': 'UPSERT',
'ResourceRecordSet': {
'Name': 'api.example.com',
'Type': 'CNAME',
'SetIdentifier': 'us-west-2',
'Weight': 100,
'TTL': 60,
'ResourceRecords': [{'Value': 'lb-us-west-2.example.com'}]
}
}
]
}
)
print("DNS updated to DR region")
EOF
echo ""
echo "Step 5: Verify application availability"
sleep 60 # Wait for DNS propagation
for endpoint in "https://api.example.com/health" "https://app.example.com/health"; do
echo "Testing $endpoint"
curl -f $endpoint || echo "WARNING: Endpoint not responding"
done
echo ""
echo "========================================="
echo "FAILOVER COMPLETE"
echo "========================================="
echo ""
echo "Next steps:"
echo "1. Monitor application performance in DR region"
echo "2. Investigate primary region failure"
echo "3. Plan failback procedure when primary is restored"
echo ""
echo "Failover time: $(date)"
Conclusion
Cloud-native disaster recovery requires comprehensive planning, automated tooling, and regular testing to ensure business continuity. By implementing tiered DR strategies, automated backup and replication, cross-region failover capabilities, and validated recovery procedures, organizations can achieve their RTO and RPO objectives while maintaining operational excellence.
Key implementation principles:
- Tiered Approach: Classify applications by criticality and set appropriate RTO/RPO targets
- Automated Backup: Implement continuous backup with Velero and application-aware hooks
- Data Replication: Configure synchronous or asynchronous replication based on DR tier
- Automated Failover: Deploy monitoring and automated failover for mission-critical workloads
- Regular Testing: Conduct DR tests monthly/quarterly to validate procedures
- Documentation: Maintain detailed runbooks and automate where possible
By treating disaster recovery as a first-class concern in your cloud-native architecture, you can build resilient systems that withstand failures while minimizing business impact.