Longhorn High Availability Configuration
This guide covers high availability configuration for Longhorn storage, including replica management, node failure handling, and disaster recovery strategies.
Replica Management
1. Volume Replica Configuration
Basic HA Configuration
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: ha-volume
  namespace: longhorn-system
spec:
  numberOfReplicas: 3
  dataLocality: "best-effort"
  replicaAutoBalance: "least-effort"
  nodeSelector:           # list of Longhorn node tags, not Kubernetes labels
    - "us-east-1a"
Advanced Replica Settings
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ha
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fromBackup: ""
  replicaZoneSoftAntiAffinity: "true"
  replicaAutoBalance: "best-effort"
2. Node Affinity Rules
Zone-based Distribution
apiVersion: longhorn.io/v1beta1
kind: Node
metadata:
  name: storage-node-1
  namespace: longhorn-system
spec:
  tags:
    - "zone-a"
  disks:
    disk1:
      path: /mnt/disk1
      allowScheduling: true
      tags: ["ssd"]
Anti-affinity Configuration
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: replica-zone-soft-anti-affinity
  namespace: longhorn-system
value: "true"
Failure Handling
1. Node Failure Recovery
Automatic Recovery Settings
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: replica-replenishment-wait-interval
  namespace: longhorn-system
value: "600"
---
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: failed-replica-rebuild-timeout
  namespace: longhorn-system
value: "30"
Node Drain Handling
The node-drain-policy setting controls whether a drain may proceed when it would take down the last healthy replica of a volume:

apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: node-drain-policy
  namespace: longhorn-system
value: "block-if-contains-last-replica"
2. Volume Failover
Failover Configuration
Longhorn fails volumes over automatically: when a replica is lost, the engine keeps serving I/O from the remaining healthy replicas and schedules a rebuild, so there is no explicit failover policy on the Volume resource. The volume simply needs enough replicas to survive the loss:

apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: failover-volume
  namespace: longhorn-system
spec:
  numberOfReplicas: 3
  replicaAutoBalance: "best-effort"
Recovery Settings
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: concurrent-replica-rebuild-per-node-limit
  namespace: longhorn-system
value: "5"
Network Resilience
1. Network Redundancy
Multi-network Configuration
Replica traffic can be segregated onto a dedicated secondary network through the storage-network setting, which references a Multus NetworkAttachmentDefinition in the form <namespace>/<name>:

apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: storage-network
  namespace: longhorn-system
value: "kube-system/storage-network"
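A minimal macvlan NetworkAttachmentDefinition the setting above could point at; the interface name and address range are assumptions for illustration:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-network
  namespace: kube-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.10.0/24"
      }
    }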
Network Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: longhorn-network-policy
  namespace: longhorn-system
spec:
  podSelector:
    matchLabels:
      app: longhorn-manager
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: longhorn-manager
      ports:
        - protocol: TCP
          port: 9500
2. Load Balancing
Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: longhorn-frontend
  namespace: longhorn-system
spec:
  type: LoadBalancer
  selector:
    app: longhorn-ui
  ports:
    - port: 80
      targetPort: 8000
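Where a LoadBalancer is not available, an Ingress in front of the UI service works equally well (the host name is an assumption):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: longhorn-ui
  namespace: longhorn-system
spec:
  rules:
    - host: longhorn.example.com      # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: longhorn-frontend
                port:
                  number: 80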
Data Protection
1. Volume Snapshot Configuration
Automated Snapshots
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: snapshot-hourly
  namespace: longhorn-system
spec:
  cron: "0 * * * *"
  task: "snapshot"
  groups:
    - default
  retain: 24
  concurrency: 2
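The default group applies automatically to volumes with no other recurring job; other groups can be attached through PVC labels. A sketch, assuming the label conventions from the Longhorn recurring-job docs (claim and group names are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                                          # hypothetical claim
  labels:
    recurringjob.longhorn.io/source: enabled
    recurringjob-group.longhorn.io/backup-daily: enabled  # hypothetical group
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-ha
  resources:
    requests:
      storage: 5Gi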
Snapshot Class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snapshot-class
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap
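A CSI snapshot of a Longhorn-backed claim then references this class (object names are illustrative):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ha-data-snap                     # hypothetical snapshot name
spec:
  volumeSnapshotClassName: longhorn-snapshot-class
  source:
    persistentVolumeClaimName: ha-data   # hypothetical source PVC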
2. Data Consistency
Consistent Snapshot Settings
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: snapshot-data-integrity
  namespace: longhorn-system
value: "enabled"
Monitoring and Alerts
1. Health Checks
Liveness Probe
apiVersion: v1
kind: Pod
metadata:
  name: longhorn-manager
  namespace: longhorn-system
spec:
  containers:
    - name: longhorn-manager
      image: longhornio/longhorn-manager:v1.5.1   # pin to your installed release
      livenessProbe:
        httpGet:
          path: /healthz
          port: 9500
        initialDelaySeconds: 15
        periodSeconds: 10
Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-ha-alerts
  namespace: longhorn-system
spec:
  groups:
    - name: longhorn.rules
      rules:
        - alert: LonghornVolumeDegraded
          # robustness 2 = degraded (fewer healthy replicas than requested)
          expr: longhorn_volume_robustness == 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
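For the rule to fire, Prometheus has to scrape longhorn-manager; a ServiceMonitor along the lines of the one in the Longhorn monitoring docs:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
      - longhorn-system
  endpoints:
    - port: manager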
Best Practices
Replica Management
- Maintain at least 3 replicas for critical volumes
- Enable replica auto-balancing
- Configure zone-based anti-affinity
- Run regular replica health checks
Failure Recovery
- Set appropriate timeout values
- Configure automatic recovery
- Implement proper drain procedures
- Monitor recovery progress
Network Configuration
- Use redundant networks
- Implement proper security policies
- Configure load balancing
- Monitor network health
Data Protection
- Schedule snapshots regularly
- Validate snapshot consistency
- Monitor snapshot status
- Test recovery procedures
Conclusion
Implementing high availability in Longhorn requires:
- Proper replica configuration
- Effective failure handling
- Network resilience
- Robust monitoring
For more information, see the official Longhorn documentation at https://longhorn.io/docs/.