Deep Dive: Kubernetes Scheduler
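The Kubernetes scheduler (kube-scheduler) assigns each unscheduled pod to a node. It first filters out the nodes that cannot run the pod, then scores the remaining nodes and binds the pod to the best one. This deep dive covers its architecture, the filtering and scoring phases, and the main configuration options.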
Architecture Overview
Component Architecture
API Server -> Scheduler -> Scheduling Queue
                        -> Scheduling Cycle
                        -> Binding Cycle
Key Components
Scheduling Queue
- Priority Queue
- Active/Backoff Queues
- Event Handlers
Scheduling Cycle
- Node Filtering
- Node Scoring
- Node Selection
Binding Cycle
- Volume Binding
- Pod Binding
- Post-Binding
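The scheduling cycle listed above (filter, score, select) can be sketched in a few lines of Go. This is a minimal illustration, not the scheduler's actual code: the Node and Pod types, the single CPU dimension, and the "most free CPU wins" scoring rule are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Node and Pod are simplified, hypothetical stand-ins for the real API objects.
type Node struct {
	Name          string
	FreeMilliCPU  int64
	Unschedulable bool
}

type Pod struct {
	Name            string
	RequestMilliCPU int64
}

// filter keeps only the nodes that are able to run the pod at all.
func filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if !n.Unschedulable && n.FreeMilliCPU >= pod.RequestMilliCPU {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node. Assumption: more CPU left after placement is better.
func score(pod Pod, n Node) int64 {
	return n.FreeMilliCPU - pod.RequestMilliCPU
}

// selectNode runs filter -> score -> select and reports false if no node is feasible.
func selectNode(pod Pod, nodes []Node) (string, bool) {
	feasible := filter(pod, nodes)
	if len(feasible) == 0 {
		return "", false
	}
	sort.SliceStable(feasible, func(i, j int) bool {
		return score(pod, feasible[i]) > score(pod, feasible[j])
	})
	return feasible[0].Name, true
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeMilliCPU: 500},
		{Name: "node-b", FreeMilliCPU: 2000},
		{Name: "node-c", FreeMilliCPU: 4000, Unschedulable: true},
	}
	name, ok := selectNode(Pod{Name: "web", RequestMilliCPU: 1000}, nodes)
	fmt.Println(name, ok) // prints "node-b true"
}
```

In the real scheduler the chosen node is then handed to the binding cycle, which runs asynchronously so the next pod can be scheduled in the meantime.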
Scheduling Process
1. Filtering Phase
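// Node filtering example (interface shown as declared inside the framework package)
type FilterPlugin interface {
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

// Example filter plugin, written as it would appear outside the framework package.
// nodeHasEnoughResources is an illustrative helper, not a framework function.
type NodeResourcesFit struct{ /* ... */ }

func (pl *NodeResourcesFit) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeHasEnoughResources(nodeInfo, pod) {
		return nil // a nil Status is treated as Success
	}
	return framework.NewStatus(framework.Unschedulable, "Insufficient resources")
}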
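The nodeHasEnoughResources helper is left undefined above. One way such a check could work is sketched below with plain values instead of the framework types. The Resources type and the fitsNode function are hypothetical names for this example; the point it illustrates is that the comparison uses the sum of pod requests already placed on the node, not live usage.

```go
package main

import "fmt"

// Resources holds quantities in base units (milli-CPU and bytes).
// It is a hypothetical type for this sketch, not a Kubernetes type.
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// fitsNode reports whether podRequest fits into what remains of the node's
// allocatable capacity after subtracting the requests of pods already placed.
// It also returns one reason per resource that does not fit.
func fitsNode(allocatable, requested, podRequest Resources) (bool, []string) {
	var reasons []string
	if podRequest.MilliCPU > allocatable.MilliCPU-requested.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if podRequest.Memory > allocatable.Memory-requested.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	return len(reasons) == 0, reasons
}

func main() {
	allocatable := Resources{MilliCPU: 4000, Memory: 8 << 30}
	requested := Resources{MilliCPU: 3800, Memory: 2 << 30}
	pod := Resources{MilliCPU: 250, Memory: 64 << 20}

	ok, reasons := fitsNode(allocatable, requested, pod)
	fmt.Println(ok, reasons) // prints "false [Insufficient cpu]"
}
```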
2. Scoring Phase
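// Node scoring example (interface shown as declared inside the framework package)
type ScorePlugin interface {
	Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
}

// Example score plugin; handle is the framework.Handle given to the plugin factory.
type NodeResourcesBalancedAllocation struct {
	handle framework.Handle // other fields elided
}

func (pl *NodeResourcesBalancedAllocation) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	// Score receives only a node name, so fetch the NodeInfo from the scheduler snapshot.
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	// Calculate resource balance score (calculateBalanceScore is an illustrative helper)
	return calculateBalanceScore(nodeInfo, pod), nil
}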
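To make the idea behind calculateBalanceScore concrete, here is a simplified sketch. It assumes only two resources (CPU and memory) and scores a node higher when their utilization fractions are close to each other. The function names and the exact formula are assumptions for illustration; the upstream plugin handles more resources and edge cases.

```go
package main

import (
	"fmt"
	"math"
)

// maxNodeScore mirrors the 0-100 range used for node scores.
const maxNodeScore = 100

// fraction returns requested/allocatable, capped at 1.
// A node with no allocatable capacity is treated as fully used.
func fraction(requested, allocatable int64) float64 {
	if allocatable <= 0 {
		return 1
	}
	return math.Min(1, float64(requested)/float64(allocatable))
}

// balanceScore takes the requests the node would carry after placing the pod
// and returns a higher score when CPU and memory utilization are closer.
func balanceScore(cpuReq, cpuAlloc, memReq, memAlloc int64) int64 {
	cpuFrac := fraction(cpuReq, cpuAlloc)
	memFrac := fraction(memReq, memAlloc)
	// For two values, the standard deviation is half their absolute difference.
	std := math.Abs(cpuFrac-memFrac) / 2
	return int64((1 - std) * maxNodeScore)
}

func main() {
	// CPU at 50% and memory at 25%: std = 0.125, so the score is 87.
	fmt.Println(balanceScore(2000, 4000, 2000, 8000)) // prints "87"
	// CPU and memory both at 50%: perfectly balanced.
	fmt.Println(balanceScore(2000, 4000, 4000, 8000)) // prints "100"
}
```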
Scheduler Configuration
1. Scheduler Profiles
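# In the v1 config API, least/most-allocated scoring is a scoringStrategy
# of the NodeResourcesFit plugin rather than a separate plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
            weight: 1
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated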
2. Custom Scheduler
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler
  containers:
    - name: container
      image: nginx
Scheduling Policies
1. Node Affinity
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/e2e-az-name
                operator: In
                values:
                  - e2e-az1
                  - e2e-az2
  containers:
    - name: app
      image: nginx
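How the scheduler evaluates these selector terms can be sketched as follows: the expressions inside one term are ANDed, and the terms themselves are ORed. The Requirement type and function names are hypothetical, and the Gt and Lt operators are left out of this sketch.

```go
package main

import "fmt"

// Requirement is a simplified, hypothetical form of one matchExpressions entry.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists" or "DoesNotExist"
	Values   []string
}

func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matchesRequirement checks a single expression against a node's labels.
func matchesRequirement(labels map[string]string, r Requirement) bool {
	value, present := labels[r.Key]
	switch r.Operator {
	case "In":
		return present && contains(r.Values, value)
	case "NotIn":
		return !present || !contains(r.Values, value)
	case "Exists":
		return present
	case "DoesNotExist":
		return !present
	default:
		return false // Gt and Lt are omitted from this sketch
	}
}

// matchesTerms returns true if any term matches; a term matches only if
// all of its expressions match. An empty term matches nothing.
func matchesTerms(labels map[string]string, terms [][]Requirement) bool {
	for _, term := range terms {
		all := len(term) > 0
		for _, r := range term {
			if !matchesRequirement(labels, r) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	terms := [][]Requirement{{
		{Key: "kubernetes.io/e2e-az-name", Operator: "In", Values: []string{"e2e-az1", "e2e-az2"}},
	}}
	fmt.Println(matchesTerms(map[string]string{"kubernetes.io/e2e-az-name": "e2e-az1"}, terms)) // prints "true"
	fmt.Println(matchesTerms(map[string]string{"kubernetes.io/e2e-az-name": "e2e-az3"}, terms)) // prints "false"
}
```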
2. Pod Affinity/Anti-Affinity
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: app
      image: nginx
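The role of topologyKey can be illustrated with a small sketch: a candidate node satisfies the affinity rule if it shares its topology value (here, its zone) with a node that already runs a matching pod. The types and function names are hypothetical, the selector is reduced to a single key/value pair, and special cases handled by the real scheduler (such as the first pod of a self-matching group) are not modeled.

```go
package main

import "fmt"

// Node and RunningPod are simplified, hypothetical stand-ins for API objects.
type Node struct {
	Name   string
	Labels map[string]string
}

type RunningPod struct {
	Labels   map[string]string
	NodeName string
}

// matchingDomains collects the topologyKey values of all nodes that run
// at least one pod carrying the label key=value.
func matchingDomains(pods []RunningPod, nodes map[string]Node, key, value, topologyKey string) map[string]bool {
	domains := map[string]bool{}
	for _, p := range pods {
		if p.Labels[key] != value {
			continue
		}
		if domain, ok := nodes[p.NodeName].Labels[topologyKey]; ok {
			domains[domain] = true
		}
	}
	return domains
}

// satisfiesAffinity reports whether the candidate node lies in one of the domains.
func satisfiesAffinity(candidate Node, domains map[string]bool, topologyKey string) bool {
	domain, ok := candidate.Labels[topologyKey]
	return ok && domains[domain]
}

func main() {
	const zoneKey = "topology.kubernetes.io/zone"
	nodes := map[string]Node{
		"n1": {Name: "n1", Labels: map[string]string{zoneKey: "zone-a"}},
		"n2": {Name: "n2", Labels: map[string]string{zoneKey: "zone-a"}},
		"n3": {Name: "n3", Labels: map[string]string{zoneKey: "zone-b"}},
	}
	pods := []RunningPod{{Labels: map[string]string{"security": "S1"}, NodeName: "n1"}}

	domains := matchingDomains(pods, nodes, "security", "S1", zoneKey)
	fmt.Println(satisfiesAffinity(nodes["n2"], domains, zoneKey)) // prints "true"
	fmt.Println(satisfiesAffinity(nodes["n3"], domains, zoneKey)) // prints "false"
}
```

Note that n2 qualifies even though the matching pod runs on n1, because both nodes are in the same zone. Anti-affinity uses the same domain computation but rejects nodes inside the matching domains.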
Resource Management
1. Resource Requests and Limits
apiVersion: v1
kind: Pod
metadata:
  name: resource-pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
2. Priority and Preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority pods"
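The effect of a PriorityClass on queue ordering can be sketched as a comparison function: pods with a higher priority value are scheduled first, and pods with equal priority keep their arrival order. The QueuedPod type and the integer arrival counter are assumptions for this example; preemption itself (evicting lower-priority pods to make room) is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// QueuedPod is a hypothetical, simplified view of a pod waiting in the active queue.
type QueuedPod struct {
	Name     string
	Priority int32 // the value of the pod's PriorityClass
	Enqueued int64 // arrival order; smaller means earlier
}

// less reports whether a should be scheduled before b.
func less(a, b QueuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Enqueued < b.Enqueued
}

// sortQueue orders the pods in place, highest priority first.
func sortQueue(pods []QueuedPod) {
	sort.SliceStable(pods, func(i, j int) bool { return less(pods[i], pods[j]) })
}

func main() {
	queue := []QueuedPod{
		{Name: "batch", Priority: 0, Enqueued: 1},
		{Name: "web", Priority: 0, Enqueued: 2},
		{Name: "critical", Priority: 1000000, Enqueued: 3},
	}
	sortQueue(queue)
	for _, p := range queue {
		fmt.Println(p.Name)
	}
	// prints "critical", then "batch", then "web"
}
```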
Advanced Scheduling
1. Taints and Tolerations
# Node taint
kubectl taint nodes node1 key=value:NoSchedule
# Pod toleration
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod
spec:
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx
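The matching rule between a toleration and a taint can be sketched as below. An empty effect or key on the toleration acts as a wildcard, the Exists operator ignores the value, and Equal (the default) requires the values to match. The struct and function names are simplified stand-ins chosen for this example.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical versions of the API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration covers a single taint.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal": // Equal is the default operator
		return t.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// schedulable reports whether every hard taint on the node is tolerated.
// PreferNoSchedule taints only influence scoring, so they are skipped here.
func schedulable(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect == "PreferNoSchedule" {
			continue
		}
		tolerated := false
		for _, t := range tolerations {
			if tolerates(t, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "key", Value: "value", Effect: "NoSchedule"}}
	matching := []Toleration{{Key: "key", Operator: "Equal", Value: "value", Effect: "NoSchedule"}}

	fmt.Println(schedulable(taints, matching)) // prints "true"
	fmt.Println(schedulable(taints, nil))      // prints "false"
}
```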
2. Custom Scheduler Extenders
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "https://extender.example.com"
    filterVerb: "filter"
    prioritizeVerb: "prioritize"
    weight: 1   # multiplier applied to the scores returned by prioritizeVerb
    bindVerb: "bind"
    enableHTTPS: true
Performance Tuning
1. Scheduler Settings
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50
profiles:
  - schedulerName: default-scheduler
    plugins:
      preFilter:
        enabled:
          - name: NodeResourcesFit
      filter:
        enabled:
          - name: NodeUnschedulable
          - name: NodeResourcesFit
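To show what percentageOfNodesToScore changes in practice, the sketch below computes how many feasible nodes the scheduler would look for before it stops filtering and starts scoring. It is modeled on my understanding of the upstream default behavior; the constants (a floor of 100 nodes, an adaptive default starting at 50% and shrinking to 5% on large clusters) are assumptions that may differ between Kubernetes versions.

```go
package main

import "fmt"

// Assumed constants, modeled on upstream defaults; verify against your version.
const (
	minFeasibleNodesToFind = 100 // never stop before this many feasible nodes
	minPercentageToFind    = 5   // lower bound for the adaptive percentage
	defaultPercentage      = 50  // starting point for the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes to collect before
// scoring. A percentage of 0 selects the adaptive default.
func numFeasibleNodesToFind(percentage, numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentage >= 100 {
		return numAllNodes
	}
	if percentage <= 0 {
		// Adaptive default: shrink the percentage as the cluster grows.
		percentage = defaultPercentage - numAllNodes/125
		if percentage < minPercentageToFind {
			percentage = minPercentageToFind
		}
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return n
}

func main() {
	fmt.Println(numFeasibleNodesToFind(50, 50))   // small cluster, all nodes: prints "50"
	fmt.Println(numFeasibleNodesToFind(50, 5000)) // prints "2500"
	fmt.Println(numFeasibleNodesToFind(0, 5000))  // adaptive 10%: prints "500"
}
```

Lower values reduce scheduling latency on large clusters at the cost of possibly missing the best-scoring node, since nodes that were never examined cannot be chosen.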
2. Optimization Techniques
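# Bin packing: prefer nodes that are already heavily allocated.
# Plugin arguments go under pluginConfig; weight applies only to score plugins.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1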
Monitoring and Debugging
1. Metrics Collection
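apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scheduler
spec:
  # selector is required; the label below is an assumption and must match
  # the labels on the Service that exposes your scheduler metrics.
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
  endpoints:
    - interval: 30s
      port: https-metrics
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true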
2. Debugging Tools
# View scheduler logs
kubectl logs -n kube-system kube-scheduler-master
# Check scheduler events
kubectl get events --field-selector reason=FailedScheduling
# Inspect the scheduling events recorded for a specific pod
kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=pod-name
Troubleshooting
Common Issues
- Pod Pending State
# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master
# View pod events
kubectl describe pod pending-pod
- Resource Constraints
# Check node resources
kubectl describe node node-name
# View resource quotas
kubectl describe quota
- Affinity/Anti-affinity Issues
# Verify node labels
kubectl get nodes --show-labels
# Check pod placement
kubectl get pods -o wide
Best Practices
Configuration
- Use appropriate scheduler profiles
- Configure resource quotas
- Set proper node affinities
Performance
- Optimize percentage of nodes to score
- Use efficient filtering plugins
- Configure proper priorities
Monitoring
- Track scheduling latency
- Monitor queue depth
- Set up alerts for failures