The Kubernetes scheduler (kube-scheduler) assigns each unscheduled pod to a node, taking into account resource requests, affinity rules, taints, and other constraints and policies. This deep dive explores its architecture, scheduling algorithms, and configuration options.

Architecture Overview

Component Architecture

API Server -> Scheduler -> Scheduling Queue
                      -> Scheduling Cycle
                      -> Binding Cycle

Key Components

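Scheduling Queue
- Priority Queue: the active queue orders pending pods by priority, then by arrival time
- Active/Backoff Queues: pods that failed scheduling wait in a backoff queue before they are retried
- Event Handlers: cluster events (for example, a node being added) move waiting pods back for another attempt

Scheduling Cycle
- Node Filtering: drop the nodes that cannot run the pod
- Node Scoring: rank the remaining nodes
- Node Selection: pick the highest-scoring node

Binding Cycle
- Volume Binding: bind or provision the volumes the pod needs
- Pod Binding: record the chosen node on the pod through the API server
- Post-Binding: run informational plugins once binding has succeeded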
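
The queue behavior listed above can be sketched in a few lines of Go. This is a simplified illustration, not the scheduler's code: the queuedPod type, the activeQueue heap, and the backoffDuration helper are hypothetical names, and the doubling backoff with a cap is an assumption modeled on the configurable initial and maximum pod backoff.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// queuedPod is a hypothetical, simplified stand-in for a pending pod.
type queuedPod struct {
	name     string
	priority int32
	seq      int // arrival order, used as a tie-breaker
}

// activeQueue orders pods by priority (highest first), then by arrival.
type activeQueue []*queuedPod

func (q activeQueue) Len() int { return len(q) }
func (q activeQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}
func (q activeQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *activeQueue) Push(x any)   { *q = append(*q, x.(*queuedPod)) }
func (q *activeQueue) Pop() any {
	old := *q
	n := len(old)
	p := old[n-1]
	*q = old[:n-1]
	return p
}

// backoffDuration doubles the wait for every failed attempt, capped at max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

// popOrder pushes all pods and returns their names in scheduling order.
func popOrder(pods []*queuedPod) []string {
	q := &activeQueue{}
	for i, p := range pods {
		p.seq = i
		heap.Push(q, p)
	}
	var order []string
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(*queuedPod).name)
	}
	return order
}

func main() {
	order := popOrder([]*queuedPod{
		{name: "batch", priority: 0},
		{name: "critical", priority: 1000000},
		{name: "web", priority: 0},
	})
	fmt.Println(order) // [critical batch web]
	fmt.Println(backoffDuration(3, time.Second, 10*time.Second)) // 4s
}
```

The high-priority pod jumps ahead of pods that arrived earlier, while pods of equal priority keep their arrival order.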

Scheduling Process

1. Filtering Phase

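// Node filtering example: the FilterPlugin interface as declared inside the
// scheduler framework package (the embedded Plugin supplies Name()).
type FilterPlugin interface {
    Plugin
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

// Example filter plugin. It lives outside the framework package, so the
// framework types are qualified.
type NodeResourcesFit struct{}

func (pl *NodeResourcesFit) Name() string { return "NodeResourcesFit" }

func (pl *NodeResourcesFit) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // nodeHasEnoughResources is a placeholder helper, not a framework function.
    if nodeHasEnoughResources(nodeInfo, pod) {
        return nil // a nil status means success
    }
    return framework.NewStatus(framework.Unschedulable, "Insufficient resources")
}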
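
To show how the filtering phase uses such plugins, here is a self-contained sketch that runs a list of filters over every node and keeps the nodes that pass all of them. The node, pod, and filterFunc types are hypothetical simplifications of the framework types, and the two filters only approximate what NodeUnschedulable and NodeResourcesFit check.

```go
package main

import "fmt"

// node and pod are simplified, hypothetical stand-ins for the scheduler's
// NodeInfo and v1.Pod types.
type node struct {
	name              string
	allocCPU, usedCPU int64 // millicores
	allocMem, usedMem int64 // bytes
	unschedulable     bool
}

type pod struct {
	cpu, mem int64 // requested millicores and bytes
}

// filterFunc returns "" when the node passes, or a failure reason.
type filterFunc func(p pod, n node) string

func nodeSchedulable(_ pod, n node) string {
	if n.unschedulable {
		return "node is unschedulable"
	}
	return ""
}

func resourcesFit(p pod, n node) string {
	if n.usedCPU+p.cpu > n.allocCPU {
		return "insufficient cpu"
	}
	if n.usedMem+p.mem > n.allocMem {
		return "insufficient memory"
	}
	return ""
}

// feasibleNodes runs the filters in order on every node and stops at the
// first failure for that node.
func feasibleNodes(p pod, nodes []node, filters []filterFunc) ([]string, map[string]string) {
	var fit []string
	rejected := map[string]string{}
	for _, n := range nodes {
		reason := ""
		for _, f := range filters {
			if reason = f(p, n); reason != "" {
				break
			}
		}
		if reason == "" {
			fit = append(fit, n.name)
		} else {
			rejected[n.name] = reason
		}
	}
	return fit, rejected
}

func main() {
	nodes := []node{
		{name: "node1", allocCPU: 2000, usedCPU: 1900, allocMem: 4 << 30},
		{name: "node2", allocCPU: 4000, usedCPU: 1000, allocMem: 8 << 30},
		{name: "node3", allocCPU: 4000, allocMem: 8 << 30, unschedulable: true},
	}
	fit, rejected := feasibleNodes(pod{cpu: 250, mem: 64 << 20}, nodes,
		[]filterFunc{nodeSchedulable, resourcesFit})
	fmt.Println(fit)      // [node2]
	fmt.Println(rejected) // map[node1:insufficient cpu node3:node is unschedulable]
}
```

The rejection reasons collected per node are the kind of information that surfaces in a FailedScheduling event.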

2. Scoring Phase

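// Node scoring example: the ScorePlugin interface as declared inside the
// scheduler framework package.
type ScorePlugin interface {
    Plugin
    Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
    ScoreExtensions() ScoreExtensions
}

// Example score plugin
type NodeResourcesBalancedAllocation struct {
    handle framework.Handle // gives access to the node snapshot
}

func (pl *NodeResourcesBalancedAllocation) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Score receives only the node name, so look the node up in the snapshot.
    nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.AsStatus(err)
    }
    // calculateBalanceScore is a placeholder helper; it should return a value
    // between 0 and framework.MaxNodeScore.
    return calculateBalanceScore(nodeInfo, pod), nil
}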
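
The sketch below illustrates how per-plugin scores can be combined into a node choice. It assumes two resources only, a balance score based on the gap between CPU and memory utilization, and a plain weighted sum; the nodeUsage and weightedScorer types are hypothetical, and the formulas are simplified approximations rather than the upstream implementations.

```go
package main

import (
	"fmt"
	"math"
)

const maxNodeScore = 100 // framework scores range from 0 to 100

// nodeUsage holds the CPU and memory fractions (0..1) a node would have in
// use after placing the pod. The type is a hypothetical simplification.
type nodeUsage struct {
	name             string
	cpuFrac, memFrac float64
}

// balancedScore rewards nodes whose CPU and memory usage stay close together.
func balancedScore(n nodeUsage) int64 {
	return int64((1 - math.Abs(n.cpuFrac-n.memFrac)/2) * maxNodeScore)
}

// leastAllocatedScore rewards nodes with more free capacity.
func leastAllocatedScore(n nodeUsage) int64 {
	return int64(((1 - n.cpuFrac) + (1 - n.memFrac)) / 2 * maxNodeScore)
}

type weightedScorer struct {
	score  func(nodeUsage) int64
	weight int64
}

// selectNode sums weight*score over all scorers and returns the best node.
// Ties go to the first node seen; the real scheduler picks among ties randomly.
func selectNode(nodes []nodeUsage, scorers []weightedScorer) (string, int64) {
	best, bestTotal := "", int64(-1)
	for _, n := range nodes {
		var total int64
		for _, s := range scorers {
			total += s.weight * s.score(n)
		}
		if total > bestTotal {
			best, bestTotal = n.name, total
		}
	}
	return best, bestTotal
}

func main() {
	nodes := []nodeUsage{
		{name: "node-a", cpuFrac: 0.5, memFrac: 0.5},
		{name: "node-b", cpuFrac: 1.0, memFrac: 0.5},
	}
	scorers := []weightedScorer{
		{score: balancedScore, weight: 1},
		{score: leastAllocatedScore, weight: 1},
	}
	name, total := selectNode(nodes, scorers)
	fmt.Println(name, total) // node-a 150
}
```

Raising a scorer's weight is the same lever that the weight field of a score plugin exposes in the scheduler configuration.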

Scheduler Configuration

1. Scheduler Profiles

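apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  # In the v1 API, the former NodeResourcesLeastAllocated and
  # NodeResourcesMostAllocated plugins are scoring strategies of NodeResourcesFit.
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1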

2. Custom Scheduler

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler
  containers:
  - name: container
    image: nginx

Scheduling Policies

1. Node Affinity

apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  containers:
  - name: app
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
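
The matching rule behind this manifest is that nodeSelectorTerms are ORed, while the matchExpressions inside one term are ANDed. A minimal sketch of that evaluation follows; the requirement and term types are hypothetical simplifications, and the Gt and Lt operators are left out.

```go
package main

import "fmt"

// requirement and term are hypothetical, simplified mirrors of a
// matchExpressions entry and a nodeSelectorTerms entry.
type requirement struct {
	key      string
	operator string // "In", "NotIn", "Exists" or "DoesNotExist"
	values   []string
}

type term []requirement

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

func (r requirement) matches(labels map[string]string) bool {
	v, ok := labels[r.key]
	switch r.operator {
	case "In":
		return ok && contains(r.values, v)
	case "NotIn":
		return !ok || !contains(r.values, v)
	case "Exists":
		return ok
	case "DoesNotExist":
		return !ok
	}
	return false // Gt and Lt are not handled in this sketch
}

// nodeMatches reports whether any term matches; a term matches only if all
// of its requirements match. An empty term matches nothing.
func nodeMatches(terms []term, labels map[string]string) bool {
	for _, t := range terms {
		if len(t) == 0 {
			continue
		}
		all := true
		for _, r := range t {
			if !r.matches(labels) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	terms := []term{{
		{key: "kubernetes.io/e2e-az-name", operator: "In", values: []string{"e2e-az1", "e2e-az2"}},
	}}
	fmt.Println(nodeMatches(terms, map[string]string{"kubernetes.io/e2e-az-name": "e2e-az1"})) // true
	fmt.Println(nodeMatches(terms, map[string]string{"kubernetes.io/e2e-az-name": "e2e-az3"})) // false
}
```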

2. Pod Affinity/Anti-Affinity

apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
spec:
  containers:
  - name: app
    image: nginx
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
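
For required pod affinity, a node qualifies when some pod matching the label selector already runs in the same topology domain, that is, on a node with the same value for topologyKey. The sketch below shows that check under simplifying assumptions: it ignores namespaces, supports only the In operator, and uses hypothetical clusterNode and existingPod types.

```go
package main

import "fmt"

// clusterNode and existingPod are hypothetical, simplified types.
type clusterNode struct {
	name   string
	labels map[string]string
}

type existingPod struct {
	nodeName string
	labels   map[string]string
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// nodesSatisfyingAffinity returns the nodes whose topology domain already
// hosts a pod labeled key=<one of values>.
func nodesSatisfyingAffinity(nodes []clusterNode, pods []existingPod, key string, values []string, topologyKey string) []string {
	byName := map[string]clusterNode{}
	for _, n := range nodes {
		byName[n.name] = n
	}
	// Collect the topology domains that contain a matching pod.
	domains := map[string]bool{}
	for _, p := range pods {
		v, ok := p.labels[key]
		if !ok || !contains(values, v) {
			continue
		}
		if d, ok := byName[p.nodeName].labels[topologyKey]; ok {
			domains[d] = true
		}
	}
	var out []string
	for _, n := range nodes {
		if d, ok := n.labels[topologyKey]; ok && domains[d] {
			out = append(out, n.name)
		}
	}
	return out
}

func main() {
	const zone = "topology.kubernetes.io/zone"
	nodes := []clusterNode{
		{name: "node1", labels: map[string]string{zone: "zone-a"}},
		{name: "node2", labels: map[string]string{zone: "zone-a"}},
		{name: "node3", labels: map[string]string{zone: "zone-b"}},
	}
	pods := []existingPod{{nodeName: "node1", labels: map[string]string{"security": "S1"}}}
	fmt.Println(nodesSatisfyingAffinity(nodes, pods, "security", []string{"S1"}, zone)) // [node1 node2]
}
```

Anti-affinity is the inverse check: nodes in a domain that already hosts a matching pod are excluded.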

Resource Management

1. Resource Requests and Limits

apiVersion: v1
kind: Pod
metadata:
  name: resource-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
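
The scheduler places pods by their requests, not their limits. The effective request of a pod is the sum over its regular containers, raised to the largest single init container request if that is higher, because init containers run one at a time before the regular containers start. The sketch below assumes plain init containers only (no restartable sidecar init containers, no pod overhead), and the resources type is hypothetical.

```go
package main

import "fmt"

// resources is a hypothetical, simplified request: millicores and bytes.
type resources struct {
	milliCPU int64
	memBytes int64
}

// effectiveRequest returns max(sum(containers), max(initContainers)) per resource.
func effectiveRequest(containers, initContainers []resources) resources {
	var total resources
	for _, c := range containers {
		total.milliCPU += c.milliCPU
		total.memBytes += c.memBytes
	}
	for _, ic := range initContainers {
		if ic.milliCPU > total.milliCPU {
			total.milliCPU = ic.milliCPU
		}
		if ic.memBytes > total.memBytes {
			total.memBytes = ic.memBytes
		}
	}
	return total
}

func main() {
	const mi = 1 << 20
	containers := []resources{
		{milliCPU: 250, memBytes: 64 * mi}, // app
		{milliCPU: 100, memBytes: 32 * mi}, // sidecar
	}
	initContainers := []resources{
		{milliCPU: 500, memBytes: 16 * mi}, // migration job
	}
	r := effectiveRequest(containers, initContainers)
	fmt.Println(r.milliCPU, r.memBytes/mi) // 500 96
}
```

Here the init container dominates CPU (500m over 350m), while the regular containers dominate memory (96Mi over 16Mi).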

2. Priority and Preemption

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority pods"
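
When a pod with a higher priority cannot be placed, the scheduler may evict lower-priority pods to make room. The following is a rough sketch of victim selection on a single node, considering CPU only: remove every lower-priority pod, check that the pending pod then fits, and add back ("reprieve") as many of the removed pods as possible, highest priority first. The runningPod type and selectVictims function are hypothetical, and the real logic also weighs PodDisruptionBudgets, affinity, and other factors.

```go
package main

import (
	"fmt"
	"sort"
)

// runningPod is a hypothetical, simplified view of a pod already on the node.
type runningPod struct {
	name     string
	priority int32
	cpu      int64 // requested millicores
}

// selectVictims returns the pods to evict so that a pending pod with the
// given priority and CPU request fits, or false if preemption cannot help.
func selectVictims(allocCPU int64, running []runningPod, priority int32, cpu int64) ([]string, bool) {
	var used int64
	var candidates []runningPod
	for _, p := range running {
		if p.priority < priority {
			candidates = append(candidates, p) // may be evicted
		} else {
			used += p.cpu // must stay
		}
	}
	if used+cpu > allocCPU {
		return nil, false // does not fit even with all candidates gone
	}
	// Reprieve candidates from highest to lowest priority while the pod still fits.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].priority > candidates[j].priority
	})
	var victims []string
	for _, c := range candidates {
		if used+c.cpu+cpu <= allocCPU {
			used += c.cpu
		} else {
			victims = append(victims, c.name)
		}
	}
	return victims, true
}

func main() {
	running := []runningPod{
		{name: "db", priority: 1000, cpu: 2000},
		{name: "cache", priority: 100, cpu: 1000},
		{name: "batch", priority: 0, cpu: 1000},
	}
	victims, ok := selectVictims(4000, running, 500, 1000)
	fmt.Println(victims, ok) // [batch] true
}
```

Only the lowest-priority pod is evicted here, because the pending pod fits once batch is gone and cache can stay.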

Advanced Scheduling

1. Taints and Tolerations

# Node taint
kubectl taint nodes node1 key=value:NoSchedule

# Pod toleration
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
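
The matching rules between a toleration and a taint can be sketched as follows: an empty effect on the toleration matches every effect, an empty key (with the Exists operator) matches every key, Exists ignores the value, and Equal (the default) compares values. The taint and toleration types below are hypothetical simplifications of the API types, and firstUntolerated only considers the effects that block scheduling.

```go
package main

import "fmt"

// taint and toleration are hypothetical, simplified versions of the API types.
type taint struct{ key, value, effect string }

type toleration struct{ key, operator, value, effect string }

// tolerates reports whether the toleration matches the taint.
func (t toleration) tolerates(tn taint) bool {
	if t.effect != "" && t.effect != tn.effect {
		return false
	}
	if t.key != "" && t.key != tn.key {
		return false
	}
	switch t.operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return t.value == tn.value
	}
	return false
}

// firstUntolerated returns the first NoSchedule or NoExecute taint that no
// toleration matches. PreferNoSchedule taints never block scheduling.
func firstUntolerated(taints []taint, tolerations []toleration) (taint, bool) {
	for _, tn := range taints {
		if tn.effect != "NoSchedule" && tn.effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, t := range tolerations {
			if t.tolerates(tn) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return tn, true
		}
	}
	return taint{}, false
}

func main() {
	taints := []taint{{key: "key", value: "value", effect: "NoSchedule"}}
	tolerations := []toleration{{key: "key", operator: "Equal", value: "value", effect: "NoSchedule"}}
	_, blocked := firstUntolerated(taints, tolerations)
	fmt.Println(blocked) // false

	_, blocked = firstUntolerated(taints, nil)
	fmt.Println(blocked) // true
}
```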

2. Custom Scheduler Extenders

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "https://extender.example.com"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"
  weight: 1
  bindVerb: "bind"
  enableHTTPS: true

Performance Tuning

1. Scheduler Settings

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50
profiles:
- schedulerName: default-scheduler
  plugins:
    preFilter:
      enabled:
      - name: NodeResourcesFit
    filter:
      enabled:
      - name: NodeUnschedulable
      - name: NodeResourcesFit
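
To make the effect of percentageOfNodesToScore concrete, the sketch below estimates how many feasible nodes the scheduler looks for before it stops filtering. It is modeled on my understanding of the upstream behavior and should be read as an approximation: the floor of 100 nodes, the 5% lower bound, and the adaptive default (50% for a 100-node cluster, shrinking as the cluster grows) are assumptions taken from the upstream documentation and source, and the function name is hypothetical.

```go
package main

import "fmt"

const (
	minFeasibleNodes      = 100 // assumed floor on the number of nodes to find
	minFeasiblePercentage = 5   // assumed lower bound for the adaptive default
)

// feasibleNodesToFind estimates how many feasible nodes are searched for.
// A percentage of 0 means "not set", which triggers the adaptive default.
func feasibleNodesToFind(numAllNodes, percentage int32) int32 {
	if numAllNodes < minFeasibleNodes {
		return numAllNodes // small clusters are always fully evaluated
	}
	if percentage == 0 {
		percentage = 50 - numAllNodes/125
		if percentage < minFeasiblePercentage {
			percentage = minFeasiblePercentage
		}
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	fmt.Println(feasibleNodesToFind(80, 50))   // 80
	fmt.Println(feasibleNodesToFind(150, 50))  // 100
	fmt.Println(feasibleNodesToFind(500, 50))  // 250
	fmt.Println(feasibleNodesToFind(5000, 0))  // 500
}
```

Under these assumptions, the setting only changes behavior on clusters with a few hundred nodes or more.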

2. Optimization Techniques

# Bin packing: score nodes with the MostAllocated strategy.
# Plugin arguments belong under pluginConfig, and weight applies only to score plugins.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated

Monitoring and Debugging

1. Metrics Collection

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scheduler
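spec:
  # A ServiceMonitor needs a selector. The label below is an assumption;
  # use the labels your kube-scheduler metrics Service actually carries.
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler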
  endpoints:
  - interval: 30s
    port: https-metrics
    scheme: https
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true

2. Debugging Tools

# View scheduler logs
kubectl logs -n kube-system kube-scheduler-master

# Check scheduler events
kubectl get events --field-selector reason=FailedScheduling

# Debug scheduling decisions for a single pod
kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=pod-name

Troubleshooting

Common Issues

Pod Pending State

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master

# View pod events
kubectl describe pod pending-pod

Resource Constraints

# Check node resources
kubectl describe node node-name

# View resource quotas
kubectl describe quota

Affinity/Anti-affinity Issues

# Verify node labels
kubectl get nodes --show-labels

# Check pod placement
kubectl get pods -o wide

Best Practices

Configuration
- Use appropriate scheduler profiles
- Configure resource quotas
- Set proper node affinities

Performance
- Optimize percentage of nodes to score
- Use efficient filtering plugins
- Configure proper priorities

Monitoring
- Track scheduling latency
- Monitor queue depth
- Set up alerts for failures
