Longhorn Troubleshooting Guide
This guide provides detailed troubleshooting steps for common Longhorn issues, including diagnostics, solutions, and preventive measures.
Common Issues
1. Volume Creation Failures
Symptoms
- PVC remains in Pending state
- Volume creation times out
- Error in volume creation events
Diagnosis
# Check PVC status
kubectl get pvc
# Check volume events
kubectl -n longhorn-system get events | grep volume-name
# Check Longhorn manager logs
kubectl -n longhorn-system logs -l app=longhorn-manager
Solutions
# Verify storage class
kubectl get storageclass longhorn -o yaml
# Check node capacity
kubectl -n longhorn-system get nodes.longhorn.io -o yaml
# Force delete a stuck volume (last resort: this destroys the volume's data)
kubectl -n longhorn-system delete volume stuck-volume --force --grace-period=0
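If the storage class is missing or misconfigured, recreating it usually unblocks pending PVCs. A minimal reference definition, with illustrative parameter values to adjust for your cluster:
# Reference Longhorn storage class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fsType: "ext4"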
2. Volume Attachment Issues
Symptoms
- Pod stuck in ContainerCreating
- Volume fails to mount
- IO errors in application logs
Diagnosis
# Check pod events
kubectl describe pod pod-name
# Check volume status
kubectl -n longhorn-system get volume volume-name -o yaml
# Check instance manager logs
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager
Solutions
# Force detach volume (make sure no workload is still using it)
kubectl -n longhorn-system patch volume volume-name \
--type='json' -p='[{"op": "replace", "path": "/spec/nodeID", "value":""}]'
# Restart instance managers (disruptive: engines and replicas on the
# affected nodes restart with them, so use only when volumes are detached)
kubectl -n longhorn-system delete pod -l longhorn.io/component=instance-manager
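A frequent root cause of mount failures is a missing or stopped open-iscsi service on the worker node, which the Longhorn engine depends on. A quick check on the affected host (the install commands assume a Debian/Ubuntu node):
# On the node where the pod is scheduled
systemctl status iscsid
# Install and start it if missing
sudo apt-get install -y open-iscsi
sudo systemctl enable --now iscsid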
Replica Issues
1. Replica Rebuilding Problems
Diagnosis
# Check replica status
kubectl -n longhorn-system get replicas
# Monitor rebuild progress
kubectl -n longhorn-system logs \
$(kubectl -n longhorn-system get pod -l app=longhorn-manager -o jsonpath='{.items[0].metadata.name}') \
| grep rebuild
Solutions
# Trigger a rebuild by removing the failed replica; Longhorn detects the
# missing replica and schedules a rebuild automatically
kubectl -n longhorn-system delete replica failed-replica
# If every replica is marked failed, salvage the volume from the
# Longhorn UI (Volume -> Salvage) instead of deleting
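Rebuild progress can also be read from the volume's engine object, whose status reports per-replica rebuild state; the longhornvolume label below matches current Longhorn CRDs but may differ between releases:
kubectl -n longhorn-system get engines.longhorn.io \
-l longhornvolume=volume-name -o yaml | grep -A 5 -i rebuild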
2. Replica Scheduling Issues
Diagnosis
# Check Longhorn node conditions
kubectl -n longhorn-system get nodes.longhorn.io
# Verify disk status on the affected node
kubectl -n longhorn-system get nodes.longhorn.io node-name -o yaml
Solutions
# Update node disk configuration
apiVersion: longhorn.io/v1beta1
kind: Node
metadata:
  name: worker-1
spec:
  disks:
    default-disk:
      path: /var/lib/longhorn
      allowScheduling: true
      storageReserved: 10737418240  # integer bytes (10 GiB); "10Gi" is not accepted
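The same change can be applied without a full manifest by patching the Longhorn Node object in place; the disk key (default-disk here) must match the one already present in the node's spec:
kubectl -n longhorn-system patch nodes.longhorn.io worker-1 --type merge \
-p '{"spec":{"disks":{"default-disk":{"allowScheduling":true}}}}'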
Performance Issues
1. Slow IO Performance
Diagnosis
# Check the volume for degraded replicas or an in-progress rebuild,
# both of which commonly slow IO
kubectl -n longhorn-system get volume volume-name -o yaml
# Monitor IO metrics from the manager's Prometheus endpoint
kubectl -n longhorn-system port-forward svc/longhorn-backend 9500:9500 &
curl -s http://localhost:9500/metrics | grep longhorn_volume
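To separate Longhorn overhead from application behavior, take a raw throughput baseline inside a pod that mounts the volume. This sketch assumes a hypothetical test pod named perf-test with the volume mounted at /data and GNU dd in the image:
kubectl exec -it perf-test -- dd if=/dev/zero of=/data/testfile bs=1M count=1024 oflag=direct
kubectl exec -it perf-test -- rm /data/testfile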
Solutions
# Optimize volume settings
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: volume-name
spec:
  numberOfReplicas: 2
  dataLocality: "best-effort"
  replicaAutoBalance: "best-effort"
2. Network Performance
Diagnosis
# Find which nodes host the volume's replicas
kubectl -n longhorn-system get replicas -o wide | grep volume-name
# Measure latency between those nodes (run from one of them)
ping -c 5 other-node-ip
Solutions
# Longhorn has no network QoS setting; to isolate replication traffic,
# use a dedicated storage network (requires Multus, recent Longhorn releases)
kubectl -n longhorn-system get settings.longhorn.io storage-network
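To quantify raw inter-node bandwidth, a throwaway iperf3 pair can be pinned to the two nodes in question; the image and node names below are illustrative:
# Server pinned to node-a
kubectl run iperf-server --image=networkstatic/iperf3 \
--overrides='{"spec":{"nodeName":"node-a"}}' -- -s
# Client pinned to node-b; replace SERVER_POD_IP with the server pod's IP
kubectl run iperf-client --rm -it --image=networkstatic/iperf3 \
--overrides='{"spec":{"nodeName":"node-b"}}' -- -c SERVER_POD_IP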
Backup and Restore Issues
1. Backup Failures
Diagnosis
# Check backup status
kubectl -n longhorn-system get backupvolume
# Verify backup target
kubectl -n longhorn-system get backuptarget
# Check backup logs (backup operations run inside the manager)
kubectl -n longhorn-system logs -l app=longhorn-manager | grep -i backup
Solutions
# Verify backup credentials
kubectl -n longhorn-system get secret aws-credentials -o yaml
# Force backup cleanup
kubectl -n longhorn-system delete backup failed-backup
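Backup failures often trace back to the backup-target setting itself, which can be inspected and updated through the Longhorn settings API; the S3 URL and secret name below are placeholders:
kubectl -n longhorn-system get settings.longhorn.io backup-target -o yaml
kubectl -n longhorn-system patch settings.longhorn.io backup-target \
--type merge -p '{"value":"s3://backup-bucket@us-east-1/"}'
kubectl -n longhorn-system patch settings.longhorn.io backup-target-credential-secret \
--type merge -p '{"value":"aws-credentials"}'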
2. Restore Failures
Diagnosis
# Check restore status; there is no separate restore resource, the
# target volume itself carries the restore state
kubectl -n longhorn-system get volume restored-volume -o yaml
# Monitor restore progress
kubectl -n longhorn-system logs \
$(kubectl -n longhorn-system get pod -l app=longhorn-manager -o jsonpath='{.items[0].metadata.name}') \
| grep restore
Solutions
# Clean up failed restore
kubectl -n longhorn-system delete volume failed-restore
# Retry restore with different options
kubectl -n longhorn-system create -f - <<EOF
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: restored-volume
spec:
  # Use the full backup URL from the backup's status.url, e.g.
  # s3://bucket@region/path?backup=backup-xxx&volume=source-volume
  fromBackup: "backupstore:///backup-name"
  numberOfReplicas: 2
  staleReplicaTimeout: 60
EOF
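Once the new volume exists, restore progress can be followed by watching the object until it reaches a detached, healthy state:
kubectl -n longhorn-system get volume restored-volume -w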
System Recovery
1. Node Recovery
Steps
# 1. Cordon node
kubectl cordon node-name
# 2. Evacuate volumes: disable scheduling and request eviction
# on the Longhorn node object
kubectl -n longhorn-system patch nodes.longhorn.io node-name --type merge \
-p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
# 3. Wait for volume migration
kubectl -n longhorn-system get volumes
# 4. Maintenance work
systemctl restart iscsid
# 5. Re-enable scheduling and uncordon the node
kubectl -n longhorn-system patch nodes.longhorn.io node-name --type merge \
-p '{"spec":{"allowScheduling":true,"evictionRequested":false}}'
kubectl uncordon node-name
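Before starting maintenance in step 4, confirm the evacuation actually finished; no replicas should remain on the node:
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep node-name || echo "node evacuated"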
2. Volume Recovery
Steps
# 1. Force detach volume
kubectl -n longhorn-system patch volume volume-name \
--type='json' -p='[{"op": "replace", "path": "/spec/nodeID", "value":""}]'
# 2. Delete failed replicas
kubectl -n longhorn-system delete replica failed-replica
# 3. Recreate replicas
kubectl -n longhorn-system patch volume volume-name \
--type='json' -p='[{"op": "replace", "path": "/spec/numberOfReplicas", "value":3}]'
# 4. Verify recovery
kubectl -n longhorn-system get volume volume-name -o yaml
Preventive Measures
- Regular Health Checks
#!/bin/bash
# Daily health check script
set -euo pipefail
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get replicas.longhorn.io
kubectl -n longhorn-system get events --sort-by=.lastTimestamp
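To run the check unattended, schedule it with cron on an admin host; the script path, user, and log file below are illustrative, and kubectl plus a kubeconfig are assumed for that user:
# /etc/cron.d/longhorn-health
0 6 * * * admin /usr/local/bin/longhorn-health-check.sh >> /var/log/longhorn-health.log 2>&1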
- Monitoring Setup
# Prometheus alert rule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
  - name: longhorn
    rules:
    - alert: LonghornVolumeActualSizeHigh
      expr: longhorn_volume_actual_size_bytes > 0.9 * longhorn_volume_capacity_bytes
      for: 5m
      labels:
        severity: warning
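A second rule worth adding under the same group alerts on degraded volumes; longhorn_volume_robustness is exported by the manager (1 = healthy, 2 = degraded, 3 = faulted):
    - alert: LonghornVolumeDegraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        severity: warning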
- Backup Verification
# List backups and check their state; Longhorn has no built-in integrity
# check, so restore a backup to a scratch volume periodically as an
# end-to-end test
kubectl -n longhorn-system get backups.longhorn.io
Best Practices
Regular Maintenance
- Schedule regular backups (see the RecurringJob sketch after this list)
- Monitor system resources
- Keep Longhorn updated
- Regular health checks
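Scheduled backups are covered natively by Longhorn's RecurringJob resource; a minimal sketch, with illustrative schedule, retention, and group values (volumes in the default group pick the job up automatically):
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"
  task: "backup"
  groups: ["default"]
  retain: 7
  concurrency: 2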
Configuration Management
- Version control settings
- Document changes
- Regular audits
Monitoring
- Set up alerts
- Monitor metrics (see the ServiceMonitor sketch after this list)
- Regular log review
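For clusters running the Prometheus Operator, scraping Longhorn's metrics takes a single ServiceMonitor against the manager pods, mirroring the example in the Longhorn docs:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager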
For more information, see the official Longhorn documentation at https://longhorn.io/docs/ and the project's issue tracker at https://github.com/longhorn/longhorn/issues.