Advanced Cilium Troubleshooting Guide for 2025
Troubleshooting Cilium in Kubernetes environments remains challenging, especially as Cilium continues to evolve with more features and capabilities. This advanced guide expands on our previous troubleshooting guide with deeper insights, more detailed diagnostic approaches, and comprehensive resolution paths for complex Cilium issues in 2025.
Introduction and Cilium Architecture Review
Before diving into specific troubleshooting techniques, it’s essential to understand Cilium’s architecture and how its components interact. Cilium operates as a CNI plugin for Kubernetes and leverages eBPF technology to provide networking, security, and observability functions.
Key Components
- Cilium Agent: Runs on each node, responsible for programming eBPF programs
- Cilium Operator: Cluster-scoped component that handles tasks requiring cluster-wide knowledge
- eBPF Programs: The core of Cilium’s functionality, injected into the kernel
- Hubble: Optional observability layer for network traffic visualization
- Cilium CLI: Command-line interface for interacting with Cilium components
Network Flow Paths
Understanding how packets flow through a Cilium-enabled cluster is crucial for troubleshooting:
- Pod Egress: Pod → veth pair → eBPF TC hook → routing decision → destination
- Pod Ingress: Source → node network interface → eBPF TC hook → veth pair → pod
- Service Access: Pod → eBPF service lookup → endpoint selection → destination pod
This knowledge helps identify where in the path issues might be occurring.
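To see this in practice, follow a single endpoint’s traffic in the monitor (a quick sketch; the <endpoint-id> placeholder comes from cilium endpoint list):
# Follow all events related to one endpoint, in both directions
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --related-to <endpoint-id>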
Diagnostic Foundations
Setting Proper Logging Levels
Cilium’s default logging level often doesn’t provide enough information for advanced troubleshooting. Adjust it at runtime:
# Check the current debug setting
kubectl -n kube-system exec -it ds/cilium -- cilium config | grep Debug
# Enable more verbose logging
kubectl -n kube-system exec -it ds/cilium -- cilium config Debug=true
For subsystem-level verbosity, set the chart’s debug.verbose value (accepted subsystems include datapath, policy, flow, kvstore, and envoy); note that this restarts the agents:
# Enable verbose datapath debugging via Helm
helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values --set debug.enabled=true --set debug.verbose="datapath"
Understanding Cilium Logs
Cilium logs contain valuable diagnostic information. Key patterns to look for:
- level=info msg="Initializing Cilium": startup sequence
- level=debug msg="Endpoint starting regeneration": endpoint configuration changes
- level=warning msg="Failed to": indication of issues
- level=error: critical issues requiring attention
When analyzing logs, use a structured approach:
# Extract error messages
kubectl -n kube-system logs ds/cilium | grep "level=error"
# Look for policy-related messages
kubectl -n kube-system logs ds/cilium | grep "level=debug.*policy"
# Examine startup sequence
kubectl -n kube-system logs ds/cilium | grep "Initializing"
Establishing Baseline Metrics
Before troubleshooting, establish performance baselines:
# Get current Cilium agent metrics
kubectl -n kube-system exec -it ds/cilium -- cilium metrics list
# Focus on key performance indicators
kubectl -n kube-system exec -it ds/cilium -- cilium metrics list | grep "datapath\|endpoint\|policy"
Important metrics to monitor include:
- cilium_endpoint_regenerations_total: frequency of endpoint regenerations
- cilium_datapath_errors_total: datapath programming failures
- cilium_policy_import_errors_total: policy import issues
- cilium_datapath_conntrack_gc_duration_seconds: connection tracking garbage collection time
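A simple before/after snapshot makes counter regressions easy to spot; this sketch greps drop- and error-related counters around a reproduction attempt:
# Snapshot counters, reproduce the issue, then compare
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep -E "drop|error" > metrics-before.txt
# ...reproduce the problem...
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep -E "drop|error" > metrics-after.txt
diff metrics-before.txt metrics-after.txt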
Cluster-Level Diagnostics
Agent Health Verification
A thorough health check should be your first step:
# Get all Cilium pods and their status
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# Check Cilium status on each node
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
echo "Checking $pod..."
kubectl -n kube-system exec -it $pod -- cilium status --verbose
done
The status output reveals critical information:
- KVStore connectivity
- Kubernetes API server connectivity
- Controller status (all controllers should be healthy, e.g. “45/45 healthy” with none failing)
- Proxy status
- Cluster health
Any failing controllers warrant investigation:
# List failing controllers with details
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose | grep -A 5 "Controller Status"
Cilium Operator Troubleshooting
The Cilium Operator handles cluster-wide resources like CiliumNetworkPolicies:
# Check operator logs
kubectl -n kube-system logs -l name=cilium-operator
# Check operator status
kubectl -n kube-system get deployment cilium-operator
Common operator issues include:
- CRD synchronization failures
- Webhook configuration problems
- Resource exhaustion (check for OOMKilled in pod events)
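A quick way to surface the last point (a sketch; the jsonpath assumes a single container per operator pod):
# Print each operator pod with its last container termination reason, if any
kubectl -n kube-system get pods -l name=cilium-operator -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'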
Control Plane to Data Plane Synchronization
Verify synchronization between Kubernetes and Cilium:
# Get all Kubernetes services
kubectl get svc --all-namespaces
# Compare with Cilium's service list on a node
kubectl -n kube-system exec -it ds/cilium -- cilium service list
Any discrepancies indicate synchronization issues. Check for:
- Services in Kubernetes not appearing in Cilium
- Endpoints missing in Cilium’s service backends
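A rough cross-check can be scripted (a sketch: the awk field assumes the frontend address is the second column of cilium service list, which varies between versions):
# ClusterIPs Kubernetes knows about
kubectl get svc -A -o jsonpath='{range .items[*]}{.spec.clusterIP}{"\n"}{end}' | grep -v None | sort -u > k8s-svc.txt
# Frontend IPs programmed by Cilium on this node
kubectl -n kube-system exec ds/cilium -- cilium service list | awk 'NR>1 {split($2, a, ":"); print a[1]}' | sort -u > cilium-svc.txt
# Anything printed here exists in Kubernetes but not in Cilium
comm -23 k8s-svc.txt cilium-svc.txt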
Kernel Compatibility Verification
Cilium requires specific kernel features. Verify compatibility:
# Cilium probes required kernel features at agent startup; review those log lines
kubectl -n kube-system logs ds/cilium | grep -i "kernel"
# Confirm the node's running kernel version
kubectl -n kube-system exec -it ds/cilium -- uname -r
Issues like missing eBPF maps or programs may indicate kernel incompatibility.
eBPF Map Diagnostics
eBPF maps are key-value stores used by Cilium. Examine them:
# List open BPF maps known to the agent
kubectl -n kube-system exec -it ds/cilium -- cilium map list
# Check specific map details
kubectl -n kube-system exec -it ds/cilium -- cilium bpf ct list global
Watch for these issues:
- Maps at capacity (size vs max entries)
- Stale entries
- Unexpected entries
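Recent Cilium releases also export map fill levels as a metric; values approaching 1.0 mean a map is nearly full:
# Check BPF map pressure (requires a Cilium version that exports this metric)
kubectl -n kube-system exec -it ds/cilium -- cilium metrics list | grep map_pressure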
Pod Connectivity Troubleshooting
Pod-to-Pod Communication Path Analysis
When pods can’t communicate, trace the path:
# Identify the Cilium endpoint IDs
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint list
# Examine specific endpoints
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint get <id>
# Trace traffic between endpoints
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --from <source-id> --to <destination-id>
This traces live traffic between endpoints. Look for:
- DROP_POLICY: policy violations
- DROP_CT_INVALID: connection tracking issues
- Packet dropped: generic drops with reasons
Advanced Connectivity Testing
For more structured testing:
# Deploy the connectivity test (use the branch matching your installed Cilium version)
kubectl create ns cilium-test
kubectl apply -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.15/examples/kubernetes/connectivity-check/connectivity-check.yaml
# Verify all pods are running
kubectl -n cilium-test get pods
Analyze any failing pods:
# Get details on failing pods
kubectl -n cilium-test describe pod <failing-pod>
# Check logs
kubectl -n cilium-test logs <failing-pod>
DNS Resolution Issues
Cilium can act as a DNS proxy. Check for DNS issues:
# Verify DNS proxy status
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose | grep -i dns
# Check DNS cache entries
kubectl -n kube-system exec -it ds/cilium -- cilium fqdn cache list
Common DNS troubleshooting steps:
- Verify CoreDNS/kube-dns is functioning
- Check Cilium’s DNS proxy configuration
- Examine DNS policies if FQDN policies are in use
- Monitor DNS traffic:
# Monitor DNS traffic through the proxy
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type l7 | grep -i dns
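It also helps to confirm resolution end to end from an affected workload (a sketch; the pod name is a placeholder and its image must include nslookup or dig):
# Test in-cluster DNS from a client pod
kubectl exec -it <client-pod> -- nslookup kubernetes.default.svc.cluster.local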
Packet Capture and Analysis
For deeper analysis, capture packets at the endpoint:
# Find the endpoint ID and security ID
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint list
# Capture drop events with verbose detail
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop -v
# For Wireshark analysis, run tcpdump from the agent pod (included in most Cilium images)
kubectl -n kube-system exec -it <cilium-pod> -- tcpdump -i any -w /tmp/capture.pcap host <pod-ip>
kubectl cp kube-system/<cilium-pod>:/tmp/capture.pcap ./capture.pcap
This provides visibility into:
- Exactly which packets are being dropped
- The drop reason coded in the eBPF programs
- Packet contents for deeper inspection
Traffic Flow Visualization with Hubble
Hubble provides powerful visualization for Cilium traffic:
# Check if Hubble is enabled on the agent
kubectl -n kube-system exec -it ds/cilium -- cilium status | grep Hubble
kubectl -n kube-system get pods -l k8s-app=hubble-relay
# Enable Hubble with the relay and UI if needed (example with Helm)
helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values --set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"
# Access Hubble UI or CLI
export POD_NAME=$(kubectl get pods -n kube-system -l k8s-app=hubble-relay -o name | head -n 1)
kubectl -n kube-system port-forward $POD_NAME 4245:4245 &
hubble status
hubble observe
Hubble provides key insights:
- Visual traffic flow between services and pods
- Policy decision points
- Drop reasons with rich context
- L7 protocol insights (HTTP, gRPC, etc.)
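With the relay port-forwarded as above, hubble observe supports server-side filters that replace most grep pipelines:
# Last 100 dropped flows, cluster-wide
hubble observe --verdict DROPPED --last 100
# DNS traffic in a single namespace
hubble observe --namespace <namespace> --protocol dns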
Advanced Policy Troubleshooting
Policy Audit Mode Usage
Audit mode allows testing policies without enforcing them:
# Enable policy audit mode at runtime
kubectl -n kube-system exec -it ds/cilium -- cilium config PolicyAuditMode=Enabled
# Watch policy verdicts; traffic that would be denied shows the action "audit"
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
This helps debug policy issues before enforcement.
Analyzing Policy Enforcement Logs
When policies block traffic, analyze the logs:
# Monitor policy drops
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop | grep "Policy denied"
# Check specific policy status
kubectl -n kube-system exec -it ds/cilium -- cilium policy get
Important details to extract:
- Source and destination identity
- Layer 3/4 information (IP, port, protocol)
- Missing policy rule patterns
Step-by-Step Policy Debugging Workflow
- Identify affected pods:
# Get labels for source and destination pods
kubectl get pod <source-pod> -o json | jq .metadata.labels
kubectl get pod <destination-pod> -o json | jq .metadata.labels
- Check identity mapping:
# Map labels to Cilium identities
kubectl -n kube-system exec -it ds/cilium -- cilium identity list
- Verify policy rules:
# Get namespaced and clusterwide Cilium policies (clusterwide policies are not namespaced)
kubectl get cnp -n <namespace> -o yaml
kubectl get ccnp -o yaml
- Test with temporary allow policy:
# Create a temporary allow policy
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: temp-allow
  namespace: <namespace>
spec:
  endpointSelector:
    matchLabels:
      app: <destination-app>
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: <source-app>
EOF
- Monitor policy resolution:
# Watch policy verdicts as traffic flows
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
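Once diagnosis is complete, remove the temporary policy so it does not mask the original problem:
# Clean up the temporary allow policy
kubectl delete cnp temp-allow -n <namespace>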
Policy Test Frameworks
For automated policy testing, use the connectivity suite built into the Cilium CLI, which includes policy enforcement scenarios:
# Run the full connectivity suite (includes policy tests)
cilium connectivity test
# Restrict the run to tests matching a name pattern
cilium connectivity test --test "client-egress"
Performance Diagnostics
Identifying eBPF Program Bottlenecks
Performance issues might stem from eBPF programs:
# List loaded eBPF programs (bpftool is included in most Cilium images)
kubectl -n kube-system exec -it ds/cilium -- bpftool prog show
# Get detailed map information
kubectl -n kube-system exec -it ds/cilium -- cilium map list
Look for:
- High iteration counts in maps
- Programs with significant runtime
- Map sizes approaching limits
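To attribute CPU time to individual programs, kernels 5.1+ can account per-program runtimes; a sketch, assuming bpftool is available on the node:
# On the node: enable BPF runtime accounting (adds small overhead)
sysctl -w kernel.bpf_stats_enabled=1
# run_time_ns and run_cnt now appear in the program listing
bpftool prog show | grep run_time_ns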
Connection Tracking Table Troubleshooting
Connection tracking (CT) tables can cause performance issues:
# Check CT table statistics
kubectl -n kube-system exec -it ds/cilium -- cilium bpf ct list global | wc -l
# Get CT-related configuration (timeouts, GC interval)
kubectl -n kube-system exec -it ds/cilium -- cilium config | grep -i conntrack
Common issues:
- CT table at max capacity
- Stale connections not being pruned
- Aggressive timeout settings
# Flush the CT table (disruptive: existing connections lose their state)
kubectl -n kube-system exec -it ds/cilium -- cilium bpf ct flush global
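If the table is genuinely undersized for the workload, raise the limits through the chart’s CT sizing values (illustrative numbers; this restarts the agents):
# Example: increase CT table sizes via Helm
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set bpf.ctTcpMax=1048576 --set bpf.ctAnyMax=524288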
CPU and Memory Resource Constraints Analysis
Resource constraints often manifest as performance issues:
# Check Cilium pod resource usage
kubectl -n kube-system top pod -l k8s-app=cilium
# Examine node capacity and current allocations
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
Look for:
- CPU throttling
- Memory pressure
- OOMKilled events in pod logs
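If the agent itself is being throttled or OOMKilled, give it more headroom through the chart’s resources value (illustrative numbers):
# Example: raise agent resource requests and limits via Helm
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set resources.requests.cpu=500m --set resources.requests.memory=512Mi --set resources.limits.memory=1Gi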
Service Mesh Integration Issues
Envoy Proxy Configuration Troubleshooting
When using Cilium’s Envoy integration:
# Check Envoy status
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose | grep -A 10 Proxy
# When Envoy runs as its own DaemonSet, check its pods
kubectl -n kube-system get pods -l k8s-app=cilium-envoy
# Envoy access logs appear as L7 events in the monitor
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type l7
Common issues:
- Certificate problems
- XDS configuration failures
- Listener configuration errors
L7 Policy Enforcement Debugging
L7 policies require special troubleshooting:
# Monitor L7 traffic
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type l7
# Check specific L7 policy
kubectl -n kube-system exec -it ds/cilium -- cilium policy selectors | grep "L7"
For HTTP policy issues:
- Check if traffic is reaching Envoy
- Verify HTTP headers match policy requirements
- Examine Envoy logs for policy decisions
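For reference, a minimal HTTP-aware policy sketch (names, port, and path are placeholders); traffic matched by the toPorts rule is redirected through Envoy, which is where method and path matching happens:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-healthz-get
  namespace: <namespace>
spec:
  endpointSelector:
    matchLabels:
      app: <destination-app>
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: <source-app>
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: /healthz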
Upgrade and Migration Troubles
Pre-flight Checks for Upgrades
Before upgrading Cilium:
# Verify cluster health
kubectl -n kube-system exec -it ds/cilium -- cilium status
# Run the connectivity suite from a workstation with the Cilium CLI
cilium connectivity test
# Validate existing CiliumNetworkPolicies against the target version
kubectl -n kube-system exec -it ds/cilium -- cilium preflight validate-cnp
Common Upgrade Failure Patterns
During upgrades, watch for:
- Configuration option deprecations
- CRD version incompatibilities
- Kernel feature requirements
- IPAM mode migration issues
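The upgrade documentation also recommends deploying the pre-flight DaemonSet first, which pulls the new images and validates state without touching the running agents; a sketch:
# Deploy the pre-flight check before the real upgrade
helm install cilium-preflight cilium/cilium --namespace kube-system --set preflight.enabled=true --set agent=false --set operator.enabled=false
kubectl -n kube-system get daemonset cilium-pre-flight-check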
If an upgrade fails:
# Check failed pods
kubectl -n kube-system describe pod -l k8s-app=cilium | grep -A 10 "Events:"
# Look for compatibility issues
kubectl -n kube-system logs -l k8s-app=cilium | grep "incompatible"
Rollback Procedures
If rollback is necessary:
- Restore the previous Cilium version:
# Example with Helm
helm rollback cilium <previous-revision> -n kube-system
- Verify post-rollback:
kubectl -n kube-system wait --for=condition=ready pods -l k8s-app=cilium
kubectl -n kube-system exec -it ds/cilium -- cilium status
Advanced CLI Techniques
Deep Dive into Cilium Status Output
The cilium status command provides extensive diagnostic information:
kubectl -n kube-system exec -it ds/cilium -- cilium status --verbose
Key sections to analyze:
- KVStore: must show “Connected” when an external kvstore is used (the default CRD mode reports “Disabled”, which is normal)
- Controllers: Should show 0 failing controllers
- Proxy Status: Should be “OK” if L7 policies are used
- IPAM: Shows IP address allocation stats
- Encryption: Shows encryption status if enabled
Advanced Cilium Monitor Usage with Filters
cilium monitor is a powerful tool with filtering capabilities:
# Monitor traffic from a specific endpoint
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --from <id>
# Filter by verdict
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop
# Filter by policy verdict
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
# Filter by protocol (monitor has no protocol flag, so grep the output)
kubectl -n kube-system exec -it ds/cilium -- cilium monitor | grep -i tcp
For complex filtering, pipe the JSON output through jq (a sketch: field names vary between Cilium versions, so inspect the raw JSON first):
kubectl -n kube-system exec -it ds/cilium -- cilium monitor -o json | jq 'select(.summary.verdict == "DROPPED") | {source: .source, destination: .destination, drop_reason: .summary.reason}'
BPF Map Inspection and Manipulation
Advanced map troubleshooting techniques:
# List all open maps
kubectl -n kube-system exec -it ds/cilium -- cilium map list
# Examine neighbors table
kubectl -n kube-system exec -it ds/cilium -- cilium bpf neighbor list
# Check policy maps
kubectl -n kube-system exec -it ds/cilium -- cilium bpf policy get <endpoint-id>
# Inspect routing table
kubectl -n kube-system exec -it ds/cilium -- cilium bpf lb list
For detailed policy analysis:
# Dump the BPF policy maps for all endpoints
kubectl -n kube-system exec -it ds/cilium -- cilium bpf policy get --all
Automated Troubleshooting
Building Diagnostic Automation
Create scripts for common diagnostic tasks:
# Example bash script for comprehensive Cilium diagnostics
cat > cilium-diag.sh << 'EOF'
#!/bin/bash
# Non-interactive exec (no -it) so the script also works from cron/CI
echo "===== Cilium Agent Status ====="
kubectl -n kube-system exec ds/cilium -- cilium status --verbose
echo "===== Controller Status ====="
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A 20 "Controller Status"
echo "===== Policy Status ====="
kubectl -n kube-system exec ds/cilium -- cilium policy get
echo "===== Endpoint Status ====="
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
echo "===== Service Status ====="
kubectl -n kube-system exec ds/cilium -- cilium service list
echo "===== BPF Maps ====="
kubectl -n kube-system exec ds/cilium -- cilium map list
echo "===== Recent Policy Drops ====="
kubectl -n kube-system logs -l k8s-app=cilium --tail=50 | grep "Policy denied"
EOF
chmod +x cilium-diag.sh
Continuous Verification Techniques
Implement regular health checks:
# Kubernetes CronJob for regular Cilium health checks
# (sketch: the cilium CLI needs the agent's unix socket, mounted from the host)
cat << EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cilium-health-check
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cilium
          containers:
          - name: cilium-health
            image: cilium/cilium:v1.15.0  # pin to your deployed version
            command:
            - /bin/sh
            - -c
            - |
              cilium status --verbose
              cilium-health status
            volumeMounts:
            - name: cilium-run
              mountPath: /var/run/cilium
          volumes:
          - name: cilium-run
            hostPath:
              path: /var/run/cilium
              type: Directory
          restartPolicy: OnFailure
EOF
Case Studies with Resolution Paths
Case 1: Intermittent Pod Connectivity
Symptoms: Pods occasionally unable to communicate, with intermittent timeouts
Diagnostic Steps:
- Check endpoint status:
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint list
- Look for “not-ready” or unhealthy endpoints:
kubectl -n kube-system exec -it ds/cilium -- cilium endpoint list | grep -E "not-ready|waiting|regenerating"
- Monitor for drops during connectivity attempts:
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop
- Check for controller failures:
kubectl -n kube-system exec -it ds/cilium -- cilium status | grep -i controller
Common Causes and Resolutions:
- Connection tracking table full: Increase CT table size or adjust timeouts
- Identity resolution failures: Check kvstore connectivity
- Endpoint regeneration failures: Examine agent logs for specific errors
- Network policy conflicts: Audit policies and simplify where possible
Case 2: Service Load Balancing Failures
Symptoms: Service backends unreachable or load balancing uneven
Diagnostic Steps:
- Verify service is properly defined in Cilium:
kubectl -n kube-system exec -it ds/cilium -- cilium service list
- Check backend selection:
kubectl -n kube-system exec -it ds/cilium -- cilium bpf lb list
- Monitor service access attempts:
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type trace | grep <service-ip>
Common Causes and Resolutions:
- Backend sync issues: Restart Cilium agent to force resync
- Selector issues: Verify pod labels match service selectors
- BPF map limits: Check if BPF LB map is at capacity
- Health checks failing: Check readiness probe configuration
Case 3: Policy Enforcement Inconsistencies
Symptoms: Unexpected traffic blocks or allowances across the cluster
Diagnostic Steps:
- Check policy status and compute resources:
kubectl -n kube-system exec -it ds/cilium -- cilium policy get
- Verify policy selectors match intended pods:
kubectl -n kube-system exec -it ds/cilium -- cilium policy selectors
- Monitor policy verdicts during traffic:
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type policy-verdict
Common Causes and Resolutions:
- Policy ordering issues: Check policy priorities and CRD evaluation order
- Selector mismatches: Verify label selectors match intended pods
- Reserved labels conflicts: Avoid conflicts with Cilium’s reserved labels
- Stale identities: Restart affected endpoints to refresh identity
Toolbox
Enhanced Collection Scripts
For comprehensive diagnostics, use the sysdump command built into the Cilium CLI:
# Collect a complete diagnostic archive from the cluster
cilium sysdump
This collects:
- Pod logs
- Endpoint and service information
- Controller status
- BPF maps and programs
- Kubernetes resources
Custom Debugging Environments
For complex issues, create a debug pod with advanced tools:
# Deploy a debug pod (the agent socket is mounted so the cilium CLI can talk to the local agent)
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cilium-debug
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cilium-debug
  template:
    metadata:
      labels:
        app: cilium-debug
    spec:
      hostNetwork: true
      containers:
      - name: debug
        image: cilium/cilium:v1.15.0  # pin to your deployed version
        command: ["sleep", "1000000"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: cilium-run
          mountPath: /var/run/cilium
      volumes:
      - name: cilium-run
        hostPath:
          path: /var/run/cilium
          type: Directory
      nodeSelector:
        kubernetes.io/hostname: <problematic-node>
EOF
# Access the debug pod
kubectl -n kube-system exec -it $(kubectl -n kube-system get pod -l app=cilium-debug -o name) -- bash
This provides a full Cilium environment for debugging directly on the problematic node.
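Typical first commands inside the debug pod (tc ships with the image; bpftool is present in most builds):
# Routing state as the node sees it
ip route show table all | head -20
# eBPF programs Cilium attached to an interface
tc filter show dev <interface> ingress
# All BPF programs attached to network hooks
bpftool net show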
Conclusion
Cilium troubleshooting in 2025 requires a methodical approach and understanding of its internal architecture. By following the diagnostic techniques in this guide, you can effectively identify and resolve even the most complex Cilium issues.
Remember these key principles:
- Start with a systematic approach to narrow down the problem area
- Use the right diagnostic tools for each layer of the stack
- When in doubt, increase logging verbosity and use packet captures
- Leverage Hubble for visual traffic analysis when available
- For persistent issues, collect comprehensive diagnostics with sysdump
For the latest updates and detailed documentation, always refer to the official Cilium documentation.