Longhorn Monitoring and Maintenance
This guide covers monitoring and maintenance strategies for Longhorn storage, including metrics collection, alerting, and routine maintenance procedures.
Monitoring Setup
1. Prometheus Integration
ServiceMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: longhorn-monitoring
namespace: monitoring
spec:
selector:
matchLabels:
app: longhorn-manager
endpoints:
- port: manager
path: /metrics
interval: 10s
Prometheus Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: longhorn-alerts
spec:
groups:
- name: longhorn.rules
rules:
- alert: LonghornVolumeUsageHigh
expr: longhorn_volume_usage_percentage > 80
for: 5m
labels:
severity: warning
annotations:
description: "Volume {{ $labels.volume }} usage is {{ $value }}%"
- alert: LonghornDiskSpaceLow
expr: longhorn_node_storage_usage_percentage > 85
for: 5m
labels:
severity: critical
2. Grafana Dashboards
Dashboard Configuration
{
"dashboard": {
"id": null,
"title": "Longhorn Storage Overview",
"panels": [
{
"title": "Volume Status",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "longhorn_volume_robustness",
"legendFormat": "{{volume}}"
}
]
},
{
"title": "Disk Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "longhorn_node_storage_usage_percentage",
"legendFormat": "{{node}}"
}
]
}
]
}
}
Performance Monitoring
1. Key Metrics
Volume Metrics
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: longhorn-volume-metrics
spec:
selector:
matchLabels:
app: longhorn-engine
podMetricsEndpoints:
- port: metrics
path: /metrics
interval: 30s
metricRelabelings:
- sourceLabels: [__name__]
regex: 'longhorn_volume_(read|write)_.*'
action: keep
Performance Dashboard
{
"panels": [
{
"title": "IOPS by Volume",
"type": "graph",
"targets": [
{
"expr": "rate(longhorn_volume_read_ops_total[5m])",
"legendFormat": "Read - {{volume}}"
},
{
"expr": "rate(longhorn_volume_write_ops_total[5m])",
"legendFormat": "Write - {{volume}}"
}
]
}
]
}
2. Latency Monitoring
Latency Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: longhorn-latency-alerts
spec:
groups:
- name: longhorn.latency
rules:
- alert: LonghornHighLatency
expr: rate(longhorn_volume_read_latency_microseconds[5m]) > 1000000
for: 5m
labels:
severity: warning
Maintenance Procedures
1. Volume Maintenance
Volume Health Check
#!/bin/bash
# Volume health check script
# Check volume status
kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(.status.state != "healthy") | .metadata.name'
# Check replica count
kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(.status.robustness != "healthy") | .metadata.name'
# Check for degraded volumes
kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(.status.actualSize != .spec.size) | .metadata.name'
Volume Cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
name: volume-cleanup
spec:
schedule: "0 0 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: cleanup
image: longhornio/longhorn-manager:v1.5.1
command:
- /bin/sh
- -c
- longhorn volume cleanup
2. Node Maintenance
Node Drain Procedure
#!/bin/bash
# Node maintenance script
NODE_NAME=$1
# Cordon the node
kubectl cordon $NODE_NAME
# Evict Longhorn volumes
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.volumes[].persistentVolumeClaim != null) | .metadata.namespace + "/" + .metadata.name' | \
while read pod; do
kubectl delete pod -n ${pod%/*} ${pod#*/}
done
# Wait for volume migration
sleep 300
# Drain the node
kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data
Storage Health Check
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: storage-health-check
spec:
schedule: "*/30 * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: health-check
image: longhornio/longhorn-manager:v1.5.1
command:
- /bin/sh
- -c
- |
longhorn node ls --format json | \
jq -r '.[] | select(.conditions.Ready.status != "True") | .name'
System Maintenance
1. Backup Verification
Backup Health Check
apiVersion: batch/v1
kind: CronJob
metadata:
name: backup-verification
spec:
schedule: "0 0 * * 0"
jobTemplate:
spec:
template:
spec:
containers:
- name: verify
image: longhornio/longhorn-manager:v1.5.1
command:
- /bin/sh
- -c
- |
longhorn backup ls | \
while read backup; do
longhorn backup verify $backup
done
2. System Updates
Update Procedure
#!/bin/bash
# Longhorn update script
# Check current version
current_version=$(kubectl get deployment -n longhorn-system longhorn-manager -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2)
# Update Longhorn
helm upgrade longhorn longhorn/longhorn \
--namespace longhorn-system \
--version ${NEW_VERSION} \
--set defaultSettings.backupstorePollInterval=300
# Wait for rollout
kubectl -n longhorn-system rollout status deployment/longhorn-manager
Best Practices
Monitoring Strategy
- Implement comprehensive metrics collection
- Set up appropriate alerting thresholds
- Monitor system and volume health
- Track performance trends
Maintenance Schedule
- Regular health checks
- Proactive volume maintenance
- System updates and patches
- Backup verification
Performance Optimization
- Regular performance audits
- Resource utilization monitoring
- Capacity planning
- Bottleneck identification
Documentation
- Maintenance procedures
- Troubleshooting guides
- Performance baselines
- Update history
Conclusion
Effective monitoring and maintenance requires:
- Comprehensive monitoring setup
- Regular maintenance procedures
- Performance optimization
- Proper documentation
For more information, check out: