CKA Prep: Part 7 – Cluster Maintenance
Cluster Maintenance Overview
Cluster maintenance is a critical aspect of Kubernetes administration and a significant part of the CKA exam. This section covers key maintenance activities including node management, upgrades, and backup/restore procedures.
Node Maintenance
Cordoning and Draining Nodes
When performing maintenance on a node, you need to safely move workloads away from it:
Cordoning a Node
Cordoning marks a node as unschedulable, preventing new pods from being scheduled on it.
# Mark a node as unschedulable
kubectl cordon node01
Draining a Node
Draining first cordons the node, then evicts all pods from it. Static (mirror) pods are never evicted, DaemonSet pods require --ignore-daemonsets, and pods not managed by a controller block the drain unless --force is used.
# Drain a node
kubectl drain node01 --ignore-daemonsets
# Drain a node forcefully (even if it has pods not managed by a controller)
kubectl drain node01 --ignore-daemonsets --force
Uncordoning a Node
After maintenance is complete, make the node schedulable again.
# Mark a node as schedulable
kubectl uncordon node01
Key Flags for kubectl drain
--ignore-daemonsets: Ignores DaemonSet-managed pods (they are not evicted)
--delete-emptydir-data: Allows deletion of pods using emptyDir volumes (their local data is lost)
--force: Continues even if there are pods not managed by a controller
--grace-period=<seconds>: Period of time to wait before force terminating pods
--timeout=<duration>: The length of time to wait before giving up
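In practice these flags are often combined. A typical invocation (the node name and durations are illustrative; tune them to your cluster) looks like:
# Drain with the most common flags
kubectl drain node01 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=120s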
Kubernetes Upgrades
The CKA exam often includes tasks related to cluster upgrades. Understanding the upgrade process is crucial.
Upgrade Process Overview
- Upgrade the control plane components
- Upgrade worker nodes
- Upgrade kubectl on admin workstations
- Verify the upgrade
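Before starting, record which version each node is currently running; plain kubectl can show this per node:
# List each node with its kubelet version
kubectl get nodes -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion'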
kubeadm Upgrade Workflow
1. Pre-upgrade Checks
# Check the current version
kubectl version
# Check the upgrade plan
kubeadm upgrade plan
2. Upgrading the Control Plane
# Update package lists
apt update
# Upgrade kubeadm (if the package is held, run apt-mark unhold kubeadm first)
apt-get install -y kubeadm=1.26.0-00 # Replace with target version
# Plan the upgrade
kubeadm upgrade plan
# Apply the upgrade (on control-plane node)
kubeadm upgrade apply v1.26.0 # Replace with target version
# Upgrade kubelet and kubectl
apt-get install -y kubelet=1.26.0-00 kubectl=1.26.0-00 # Replace with target version
# Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
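After applying the upgrade, it is worth confirming the control plane static pods came back on the new version. A quick check, assuming a kubeadm setup (kubeadm labels its control plane pods with component=<name>):
# List control plane pods and print the API server image tag
kubectl get pods -n kube-system
kubectl get pods -n kube-system -l component=kube-apiserver \
-o jsonpath='{.items[*].spec.containers[*].image}'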
3. Upgrading Worker Nodes
For each worker node:
# (From the control plane, or any machine with kubectl access) Drain the node
kubectl drain node01 --ignore-daemonsets
# (On the worker node) Update kubeadm
apt-get update
apt-get install -y kubeadm=1.26.0-00 # Replace with target version
# (On the worker node) Upgrade node configuration
kubeadm upgrade node
# (On the worker node) Upgrade kubelet
apt-get install -y kubelet=1.26.0-00 kubectl=1.26.0-00 # Replace with target version
# (On the worker node) Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
# (On the control plane) Make the node schedulable again
kubectl uncordon node01
4. Verifying the Upgrade
# Check the status of all nodes
kubectl get nodes
# Verify component versions
kubectl version
kubectl get nodes -o wide
Cluster Backup and Restore
Backing up and restoring a Kubernetes cluster is another important topic for the CKA exam.
Key Components to Back Up
- etcd data: Contains all cluster state
- Application data: Persistent volumes used by applications
- Cluster configuration: Certificates, kubeconfig files, etc.
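etcd backup is covered in detail below. For certificates and kubeconfig files, a simple archive is often enough; a minimal sketch assuming the kubeadm default layout under /etc/kubernetes (the archive path is illustrative):
# Archive certificates and kubeconfig files (kubeadm default paths)
tar czf /tmp/k8s-config-backup.tar.gz /etc/kubernetes/pki /etc/kubernetes/*.conf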
Backup and Restore etcd
etcd is the most critical component to back up as it contains all the cluster state information.
Backing Up etcd
Using etcdctl:
# Using etcdctl with ETCDCTL_API=3
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/etcd-backup.db
Verify the backup:
ETCDCTL_API=3 etcdctl --write-out=table snapshot status /tmp/etcd-backup.db
Restoring etcd from Backup
# Stop the API server. In kubeadm clusters it runs as a static pod,
# so move its manifest out of the directory watched by kubelet:
mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
# Restore the snapshot
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--data-dir=/var/lib/etcd-restored \
snapshot restore /tmp/etcd-backup.db
# Update etcd configuration to use the restored data directory
# Edit /etc/kubernetes/manifests/etcd.yaml to point to the new data directory
# Restart kubelet to pick up the changes
systemctl restart kubelet
# Start the API server again by moving its manifest back
mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
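Once the API server is back, confirm that etcd itself is healthy. Assuming the same certificate paths as above:
# Check etcd health after the restore
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health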
Backing Up Kubernetes Resources
You can also back up Kubernetes resources using kubectl:
# Back up common resources in all namespaces ("all" covers only a subset of resource types)
kubectl get all --all-namespaces -o yaml > all-resources.yaml
# Back up specific resource types
kubectl get deployments -A -o yaml > deployments.yaml
kubectl get services -A -o yaml > services.yaml
kubectl get configmaps -A -o yaml > configmaps.yaml
kubectl get secrets -A -o yaml > secrets.yaml
kubectl get pv -o yaml > persistent-volumes.yaml
kubectl get pvc -A -o yaml > persistent-volume-claims.yaml
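Note that kubectl get all misses many resource types (CRDs, RBAC objects, Ingresses, and so on). For a fuller export, one approach (a sketch; the backup- file naming is illustrative) is to iterate over every listable API resource:
# Export every namespaced, listable resource type to one file per type
kubectl api-resources --verbs=list --namespaced -o name | while read -r kind; do
  kubectl get "$kind" -A -o yaml > "backup-$kind.yaml"
done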
OS Upgrades
Sometimes you need to perform OS-level upgrades on cluster nodes.
General Procedure
Prepare the node:
- Drain the node to evacuate workloads
- Mark it as unschedulable
Perform the upgrade:
- Execute OS upgrade procedure
- Reboot if necessary
Return the node to service:
- Verify node health
- Mark it as schedulable
# Drain the node
kubectl drain node01 --ignore-daemonsets
# Perform OS upgrade (example for Ubuntu)
ssh node01 "sudo apt update && sudo apt upgrade -y"
# Reboot if necessary
ssh node01 "sudo reboot"
# Wait for node to be ready
kubectl get nodes node01 -w
# Uncordon the node
kubectl uncordon node01
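Rather than watching the node list by hand, kubectl wait can block until the node reports Ready again (the timeout value is illustrative):
# Block until node01 reports the Ready condition, up to 5 minutes
kubectl wait --for=condition=Ready node/node01 --timeout=300s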
Monitoring and Resource Management
Understanding how to monitor cluster resources is also important for the CKA exam.
Node and Pod Metrics
You can use the Metrics Server to collect resource utilization data.
# Deploy Metrics Server (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# View node metrics
kubectl top nodes
# View pod metrics
kubectl top pods -A
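Both commands accept --sort-by, which is handy for spotting the heaviest consumers:
# Sort pods by memory and nodes by CPU consumption
kubectl top pods -A --sort-by=memory
kubectl top nodes --sort-by=cpu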
Resource Quotas and Limits
You can use ResourceQuotas to limit resource consumption at the namespace level:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 5Gi
    limits.cpu: "8"
    limits.memory: 10Gi
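Assuming the manifest above is saved as compute-quota.yaml (an illustrative filename), apply it and check current consumption against the hard limits:
# Apply the quota and inspect usage vs. limits
kubectl apply -f compute-quota.yaml
kubectl describe resourcequota compute-quota -n development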
LimitRanges
LimitRanges set default, minimum, and maximum resource constraints for pods:
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: development
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 50Mi
    max:
      cpu: 2
      memory: 2Gi
    min:
      cpu: 50m
      memory: 10Mi
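To see the defaults in action, create a pod without a resources block and inspect what was injected at admission time (the pod name and image are illustrative):
# Create a pod with no explicit resources, then inspect the injected defaults
kubectl run limits-test --image=nginx -n development
kubectl get pod limits-test -n development -o jsonpath='{.spec.containers[0].resources}'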
Sample Exam Questions
Question 1: Drain a Node
Task: The node worker01 needs to undergo maintenance. Safely evict all the pods from it, ensuring the workloads are moved to other nodes in the cluster. Ignore DaemonSets during the eviction.
Solution:
# Drain the node
kubectl drain worker01 --ignore-daemonsets
# Verify the node is marked as SchedulingDisabled
kubectl get nodes worker01
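To confirm that no regular workloads remain on the node, filter pods by node name (any remaining entries should be DaemonSet or mirror pods):
# List any pods still scheduled on worker01
kubectl get pods -A -o wide --field-selector spec.nodeName=worker01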
Question 2: Upgrade Kubernetes
Task: The cluster is currently running Kubernetes v1.25.0, and you need to upgrade it to v1.26.0. Start by upgrading the control plane components on the master node.
Solution:
# First, check the current version
kubectl version --short
# Update package lists
ssh master "sudo apt update"
# Upgrade kubeadm on the master node
ssh master "sudo apt-get install -y kubeadm=1.26.0-00"
# Check the upgrade plan
ssh master "sudo kubeadm upgrade plan"
# Apply the upgrade
ssh master "sudo kubeadm upgrade apply v1.26.0"
# Upgrade kubelet and kubectl
ssh master "sudo apt-get install -y kubelet=1.26.0-00 kubectl=1.26.0-00"
# Restart kubelet
ssh master "sudo systemctl daemon-reload && sudo systemctl restart kubelet"
# Verify the upgrade
kubectl get nodes
Question 3: Backup etcd
Task: Create a snapshot backup of the etcd database on the control plane node. Save the backup to /tmp/etcd-backup.db.
Solution:
# Get the etcd pod name
kubectl get pods -n kube-system | grep etcd
# The endpoints and certificate paths below can be read from the etcd static pod
# manifest at /etc/kubernetes/manifests/etcd.yaml (common in kubeadm setups)
ssh master "sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/etcd-backup.db"
# Verify the backup
ssh master "sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status /tmp/etcd-backup.db"
Question 4: Restore etcd
Task: The etcd database is corrupted. Restore it from a backup file located at /tmp/etcd-backup.db.
Solution:
# Stop the API server (a static pod in kubeadm clusters) by moving its manifest aside
ssh master "sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/"
# Restore from the backup
ssh master "sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--data-dir=/var/lib/etcd-restored \
snapshot restore /tmp/etcd-backup.db"
# Update etcd pod manifest to use the new data directory
ssh master "sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|g' /etc/kubernetes/manifests/etcd.yaml"
# Start the API server again by moving its manifest back
ssh master "sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/"
# Wait for the etcd and kube-apiserver pods to come back up
kubectl get pods -n kube-system -w
# Verify cluster functionality
kubectl get nodes
Key Tips for Cluster Maintenance
Master node operations:
- Always be extra careful when working on control plane nodes
- Understand the implications of each change
- Have a rollback plan ready
Node draining best practices:
- Always use --ignore-daemonsets when draining
- Consider using --timeout to avoid hanging operations
- If needed, use --force with caution
Upgrade sequence:
- Always upgrade components in the right order (kubeadm, control plane, kubelet)
- Upgrade one node at a time
- Test thoroughly after each step
etcd backup procedures:
- Take regular backups of etcd
- Store backups in a safe location
- Practice restoration procedures
Resource management:
- Implement ResourceQuotas and LimitRanges
- Monitor resource usage regularly
- Set appropriate requests and limits on workloads
Practice Exercises
To reinforce your understanding, try these exercises in your practice environment:
- Perform a full cluster upgrade from one minor version to the next
- Create and restore an etcd backup
- Drain nodes for maintenance and then return them to service
- Implement resource quotas in a namespace and test their enforcement
- Set up a LimitRange and observe its effects on new pods
- Practice recovering from various failure scenarios (node failure, etcd failure, etc.)
- Simulate an OS upgrade procedure on a worker node
What’s Next
In the next part, we’ll explore Kubernetes Troubleshooting techniques, covering:
- Application Failure
- Control Plane Failures
- Worker Node Failures
- Networking Issues
- Storage Problems
- Resource Constraints
👉 Continue to Part 8: Troubleshooting