Chaos Engineering has emerged as a critical discipline for building resilient distributed systems. As organizations increasingly rely on Kubernetes for production workloads, the need for systematic failure testing becomes paramount. Litmus, a Cloud Native Computing Foundation (CNCF) project, provides a comprehensive chaos engineering platform specifically designed for Kubernetes environments.

This comprehensive guide explores implementing chaos engineering with Litmus, covering everything from basic setup to advanced production scenarios, automated testing pipelines, and observability integration.

Understanding Chaos Engineering Principles

The Foundation of Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. The practice involves four fundamental principles:

Define “steady state” as some measurable output - Establish baseline metrics that indicate normal system behavior
Hypothesize that this steady state will continue - Form testable hypotheses about system behavior under failure conditions
Introduce variables that reflect real-world events - Simulate realistic failure scenarios
Try to disprove the hypothesis - Design experiments to challenge system assumptions

Why Kubernetes Needs Chaos Engineering

Kubernetes environments introduce unique complexities:

Dynamic scheduling - Pods can be scheduled on any node
Network complexity - Multiple networking layers and service meshes
Storage dependencies - Persistent volumes and storage classes
Distributed state - etcd clusters and control plane components
Resource constraints - CPU, memory, and storage limitations

These complexities create numerous failure scenarios that traditional testing methods cannot adequately cover.

Litmus Architecture and Components

Core Components Overview

Litmus consists of several key components that work together to orchestrate chaos experiments:

Litmus Portal

The central management interface providing:

Experiment workflow management
Real-time monitoring and analytics
Team collaboration features
Integration with observability tools

Chaos Operator

A Kubernetes operator that:

Manages the lifecycle of chaos experiments
Handles experiment scheduling and execution
Provides resource management and cleanup

Chaos Exporter

Metrics collection component that:

Exports experiment metrics to Prometheus
Provides observability into chaos experiment results
Enables integration with monitoring dashboards

Chaos Runner

The execution engine that:

Runs individual chaos experiments
Manages experiment state and progression
Handles failure injection and recovery

Litmus CRDs (Custom Resource Definitions)

Litmus introduces several custom resources:

# ChaosEngine - Defines the chaos experiment configuration
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'false'

# ChaosExperiment - Defines the chaos experiment template
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  labels:
    name: pod-delete
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/component: chaosexperiment
    app.kubernetes.io/version: latest
spec:
  definition:
    scope: Namespaced
    permissions:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["create","delete","get","list","patch","update","deletecollection"]
    - apiGroups: [""]
      resources: ["events"]
      verbs: ["create","get","list","patch","update"]
    - apiGroups: [""]
      resources: ["configmaps"]
      verbs: ["get","list"]
    - apiGroups: [""]
      resources: ["pods/log"]
      verbs: ["get","list","watch"]
    - apiGroups: [""]
      resources: ["pods/exec"]
      verbs: ["get","list","create"]
    - apiGroups: ["apps"]
      resources: ["deployments","statefulsets","replicasets","daemonsets"]
      verbs: ["list","get"]
    - apiGroups: ["apps"]
      resources: ["deployments/scale","statefulsets/scale"]
      verbs: ["patch"]
    - apiGroups: [""]
      resources: ["replicationcontrollers"]
      verbs: ["get","list"]
    - apiGroups: ["argoproj.io"]
      resources: ["rollouts"]
      verbs: ["list","get"]
    - apiGroups: ["batch"]
      resources: ["jobs"]
      verbs: ["create","list","get","delete","deletecollection"]
    - apiGroups: ["litmuschaos.io"]
      resources: ["chaosengines","chaosexperiments","chaosresults"]
      verbs: ["create","list","get","patch","update","delete"]
    image: "litmuschaos/go-runner:latest"
    imagePullPolicy: Always
    args:
    - -c
    - ./experiments -name pod-delete
    command:
    - /bin/bash
    env:
    - name: TOTAL_CHAOS_DURATION
      value: '15'
    - name: RAMP_TIME
      value: ''
    - name: FORCE
      value: 'true'
    - name: CHAOS_INTERVAL
      value: '5'
    - name: PODS_AFFECTED_PERC
      value: ''
    - name: LIB
      value: 'litmus'
    - name: TARGET_PODS
      value: ''
    - name: SEQUENCE
      value: 'parallel'
    labels:
      name: pod-delete
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/version: latest

# ChaosResult - Stores the experiment execution results
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-chaos-pod-delete
  namespace: default
spec:
  engine: nginx-chaos
  experiment: pod-delete
status:
  experimentStatus:
    phase: Completed
    verdict: Pass
  history:
    targets:
    - name: nginx-deployment-7d8f7b6c4f-xyz123
      kind: Pod

Litmus Installation and Setup

Prerequisites

Before installing Litmus, ensure your Kubernetes cluster meets these requirements:

# Verify Kubernetes version (1.17+)
kubectl version --short

# Check cluster permissions
kubectl auth can-i create customresourcedefinitions --all-namespaces
kubectl auth can-i create clusterroles
kubectl auth can-i create clusterrolebindings

# Verify storage class availability
kubectl get storageclass

Installing Litmus Using Helm

The recommended installation method uses Helm charts:

# Add Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Create namespace for Litmus
kubectl create namespace litmus

# Install Litmus with custom values
cat > litmus-values.yaml <<EOF
portal:
  frontend:
    service:
      type: LoadBalancer
    resources:
      requests:
        memory: "250Mi"
        cpu: "125m"
      limits:
        memory: "512Mi"
        cpu: "250m"
  
  server:
    resources:
      requests:
        memory: "250Mi"
        cpu: "125m"
      limits:
        memory: "512Mi"
        cpu: "250m"
    
    authServer:
      resources:
        requests:
          memory: "250Mi"
          cpu: "125m"
        limits:
          memory: "512Mi"
          cpu: "250m"

mongodb:
  resources:
    requests:
      memory: "250Mi"
      cpu: "125m"
    limits:
      memory: "512Mi"
      cpu: "250m"
  persistence:
    size: 20Gi

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: litmus.your-domain.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: litmus-tls
      hosts:
        - litmus.your-domain.com
EOF

# Install Litmus
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --values litmus-values.yaml

Verifying the Installation

# Check all Litmus components are running
kubectl get pods -n litmus

# Verify custom resource definitions
kubectl get crd | grep chaos

# Check Litmus operator logs
kubectl logs -n litmus -l app.kubernetes.io/name=litmus-portal-server

# Access Litmus Portal (if using LoadBalancer)
kubectl get svc -n litmus litmus-portal-frontend-service

Initial Configuration

After installation, configure the initial setup:

# Get the default admin credentials
kubectl get secret litmus-portal-admin-secret -n litmus -o jsonpath='{.data.JWE_PASSWORD}' | base64 -d

# Create additional user accounts via the portal
# Or use the API:
curl -X POST \
  http://litmus.your-domain.com/auth/create_user \
  -H 'Content-Type: application/json' \
  -d '{
    "email": "user@example.com",
    "password": "secure-password",
    "username": "chaos-engineer",
    "role": "admin"
  }'

Setting Up Chaos Experiments

Experiment Categories

Litmus provides experiments across several categories:

Pod-Level Experiments

pod-delete: Kills one or more pods
pod-cpu-hog: Consumes CPU resources
pod-memory-hog: Consumes memory resources
pod-network-latency: Introduces network latency
pod-network-loss: Simulates packet loss
pod-network-corruption: Corrupts network packets

Node-Level Experiments

node-drain: Drains a Kubernetes node
node-cpu-hog: Consumes node CPU resources
node-memory-hog: Consumes node memory
node-io-stress: Creates I/O stress on nodes
kubelet-service-kill: Kills the kubelet service

Platform-Specific Experiments

aws-ec2-terminate: Terminates EC2 instances
gcp-vm-instance-stop: Stops GCP VM instances
azure-instance-stop: Stops Azure VM instances
aws-ebs-loss: Detaches EBS volumes

Creating Your First Chaos Experiment

Let’s create a comprehensive pod deletion experiment:

# chaos-experiment-pod-delete.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: pod-delete-chaos-workflow
  namespace: litmus
spec:
  entrypoint: chaos-workflow
  serviceAccountName: argo-chaos
  templates:
  - name: chaos-workflow
    steps:
    - - name: install-chaos-experiments
        template: install-chaos-experiments
    - - name: pod-delete
        template: pod-delete
    - - name: revert-chaos
        template: revert-chaos
  
  - name: install-chaos-experiments
    container:
      image: litmuschaos/k8s:latest
      command: [sh, -c]
      args:
        - "kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml -n {{workflow.parameters.adminModeNamespace}} | sleep 30"

  - name: pod-delete
    inputs:
      artifacts:
      - name: pod-delete
        path: /tmp/chaosengine-pod-delete.yaml
        raw:
          data: |
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: pod-delete-chaos
              namespace: default
            spec:
              engineState: 'active'
              appinfo:
                appns: 'default'
                applabel: 'app=nginx'
                appkind: 'deployment'
              chaosServiceAccount: litmus-admin
              experiments:
              - name: pod-delete
                spec:
                  components:
                    env:
                    - name: TOTAL_CHAOS_DURATION
                      value: '30'
                    - name: CHAOS_INTERVAL
                      value: '10'
                    - name: FORCE
                      value: 'false'
                    - name: PODS_AFFECTED_PERC
                      value: '50'
                  probe:
                  - name: nginx-probe
                    type: httpProbe
                    mode: Continuous
                    runProperties:
                      probeTimeout: 10s
                      retry: 3
                      interval: 5s
                      probePollingInterval: 2s
                    httpProbe/inputs:
                      url: http://nginx-service.default.svc.cluster.local
                      insecureSkipTLS: true
                      method:
                        get:
                          criteria: ==
                          responseCode: "200"
    container:
      image: litmuschaos/litmus-checker:latest
      args: ["-file=/tmp/chaosengine-pod-delete.yaml","-saveName=/tmp/engine-name"]

  - name: revert-chaos
    container:
      image: litmuschaos/k8s:latest
      command: [sh, -c]
      args:
        - "kubectl delete chaosengine pod-delete-chaos -n default"

Advanced Experiment Configuration

Multi-Step Experiment with Probes

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: comprehensive-chaos-test
  namespace: production
spec:
  engineState: 'active'
  appinfo:
    appns: 'production'
    applabel: 'tier=frontend'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: CPU_CORES
          value: '2'
        - name: PODS_AFFECTED_PERC
          value: '25'
      probe:
      - name: "cpu-probe"
        type: "cmdProbe"
        mode: "Edge"
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 5s
        cmdProbe/inputs:
          command: "kubectl"
          args: 
            - "top"
            - "pods"
            - "-n"
            - "production"
            - "--sort-by=cpu"
          source:
            image: "bitnami/kubectl:latest"
            hostNetwork: false
          comparator:
            type: "int"
            criteria: "<"
            value: "80"
      
      - name: "availability-probe"
        type: "httpProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 10s
          probePollingInterval: 2s
        httpProbe/inputs:
          url: "http://frontend-service.production.svc.cluster.local/health"
          insecureSkipTLS: true
          method:
            get:
              criteria: "=="
              responseCode: "200"
      
      - name: "latency-probe"
        type: "promProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 5s
        promProbe/inputs:
          endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
          query: "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          comparator:
            criteria: "<"
            value: "0.5"

Resource Stress Testing

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: resource-stress-test
  namespace: testing
spec:
  engineState: 'active'
  appinfo:
    appns: 'testing'
    applabel: 'app=microservice'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        - name: MEMORY_CONSUMPTION
          value: '500'
        - name: NUMBER_OF_WORKERS
          value: '4'
        - name: PODS_AFFECTED_PERC
          value: '50'
        - name: SEQUENCE
          value: 'parallel'
      probe:
      - name: "memory-usage-probe"
        type: "k8sProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 10s
        k8sProbe/inputs:
          group: ""
          version: "v1"
          resource: "pods"
          namespace: "testing"
          fieldSelector: "status.phase=Running"
          labelSelector: "app=microservice"
          operation: "present"
      
      - name: "oom-killer-probe"
        type: "cmdProbe"
        mode: "OnChaos"
        runProperties:
          probeTimeout: 30s
          retry: 1
          interval: 30s
        cmdProbe/inputs:
          command: "sh"
          args:
            - "-c"
            - "dmesg | grep -i 'killed process' | wc -l"
          source:
            image: "alpine:latest"
            hostNetwork: true
            inheritInputs: true
          comparator:
            type: "int"
            criteria: "=="
            value: "0"

Network Chaos Testing

Network Latency Experiments

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-test
  namespace: ecommerce
spec:
  engineState: 'active'
  appinfo:
    appns: 'ecommerce'
    applabel: 'service=payment'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: NETWORK_LATENCY
          value: '2000'
        - name: JITTER
          value: '200'
        - name: CONTAINER_RUNTIME
          value: 'containerd'
        - name: SOCKET_PATH
          value: '/run/containerd/containerd.sock'
        - name: DESTINATION_IPS
          value: 'database-service.ecommerce.svc.cluster.local'
        - name: DESTINATION_HOSTS
          value: 'external-payment-api.com'
      probe:
      - name: "payment-latency-probe"
        type: "httpProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 15s
          retry: 3
          interval: 5s
          probePollingInterval: 2s
        httpProbe/inputs:
          url: "http://payment-service.ecommerce.svc.cluster.local/api/process"
          insecureSkipTLS: true
          method:
            post:
              contentType: "application/json"
              body: '{"amount":100,"currency":"USD","test":true}'
              criteria: "<"
              responseTimeout: "10s"
      
      - name: "database-connectivity-probe"
        type: "cmdProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 15s
        cmdProbe/inputs:
          command: "nc"
          args:
            - "-zv"
            - "database-service.ecommerce.svc.cluster.local"
            - "5432"
          source:
            image: "busybox:latest"
          comparator:
            type: "string"
            criteria: "contains"
            value: "succeeded"

Packet Loss Simulation

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: packet-loss-simulation
  namespace: microservices
spec:
  engineState: 'active'
  appinfo:
    appns: 'microservices'
    applabel: 'tier=api-gateway'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '90'
        - name: NETWORK_PACKET_LOSS_PERCENTAGE
          value: '10'
        - name: CONTAINER_RUNTIME
          value: 'containerd'
        - name: SOCKET_PATH
          value: '/run/containerd/containerd.sock'
        - name: DESTINATION_IPS
          value: 'user-service.microservices.svc.cluster.local,order-service.microservices.svc.cluster.local'
      probe:
      - name: "service-mesh-probe"
        type: "promProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 15s
        promProbe/inputs:
          endpoint: "http://prometheus.istio-system.svc.cluster.local:9090"
          query: "istio_request_total{destination_service_name=\"user-service\",response_code!=\"200\"}"
          comparator:
            criteria: "<"
            value: "5"
      
      - name: "circuit-breaker-probe"
        type: "k8sProbe"
        mode: "Edge"
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 10s
        k8sProbe/inputs:
          group: "networking.istio.io"
          version: "v1beta1"
          resource: "destinationrules"
          namespace: "microservices"
          fieldSelector: "metadata.name=user-service-circuit-breaker"
          operation: "present"

Storage and Persistent Volume Chaos

Disk Fill Experiments

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: storage-chaos-test
  namespace: database
spec:
  engineState: 'active'
  appinfo:
    appns: 'database'
    applabel: 'app=postgresql'
    appkind: 'statefulset'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: disk-fill
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '180'
        - name: FILL_PERCENTAGE
          value: '80'
        - name: EPHEMERAL_STORAGE_MEBIBYTES
          value: '1000'
        - name: CONTAINER_PATH
          value: '/var/lib/postgresql/data'
      probe:
      - name: "database-health-probe"
        type: "cmdProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 15s
          retry: 3
          interval: 30s
        cmdProbe/inputs:
          command: "pg_isready"
          args:
            - "-h"
            - "postgresql.database.svc.cluster.local"
            - "-p"
            - "5432"
            - "-U"
            - "postgres"
          source:
            image: "postgres:13"
          comparator:
            type: "string"
            criteria: "contains"
            value: "accepting connections"
      
      - name: "disk-usage-probe"
        type: "cmdProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          retry: 2
          interval: 20s
        cmdProbe/inputs:
          command: "df"
          args:
            - "-h"
            - "/var/lib/postgresql/data"
          source:
            image: "busybox:latest"
          comparator:
            type: "string"
            criteria: "!contains"
            value: "100%"
      
      - name: "backup-integrity-probe"
        type: "httpProbe"
        mode: "Edge"
        runProperties:
          probeTimeout: 30s
          retry: 3
          interval: 60s
        httpProbe/inputs:
          url: "http://backup-service.database.svc.cluster.local/health"
          insecureSkipTLS: true
          method:
            get:
              criteria: "=="
              responseCode: "200"

PVC Detachment Simulation

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pvc-detachment-test
  namespace: storage-test
spec:
  engineState: 'active'
  appinfo:
    appns: 'storage-test'
    applabel: 'app=file-processor'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: disk-loss
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        - name: APP_NAMESPACE
          value: 'storage-test'
        - name: APP_LABEL
          value: 'app=file-processor'
        - name: APP_KIND
          value: 'deployment'
      probe:
      - name: "pvc-status-probe"
        type: "k8sProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 15s
        k8sProbe/inputs:
          group: ""
          version: "v1"
          resource: "persistentvolumeclaims"
          namespace: "storage-test"
          labelSelector: "app=file-processor"
          operation: "present"
      
      - name: "pod-restart-probe"
        type: "k8sProbe"
        mode: "OnChaos"
        runProperties:
          probeTimeout: 30s
          retry: 5
          interval: 10s
        k8sProbe/inputs:
          group: ""
          version: "v1"
          resource: "pods"
          namespace: "storage-test"
          labelSelector: "app=file-processor"
          fieldSelector: "status.phase=Running"
          operation: "present"

Observability Integration

Prometheus Metrics Integration

Configure Prometheus to collect Litmus metrics:

# prometheus-litmus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-litmus-config
  namespace: monitoring
data:
  litmus-metrics.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "/etc/prometheus/rules/*.yml"
    
    scrape_configs:
    - job_name: 'litmus-metrics'
      static_configs:
      - targets: ['chaos-exporter.litmus.svc.cluster.local:8080']
      scrape_interval: 5s
      metrics_path: /metrics
    
    - job_name: 'chaos-operator-metrics'
      static_configs:
      - targets: ['chaos-operator-metrics.litmus.svc.cluster.local:8383']
      scrape_interval: 10s
      metrics_path: /metrics

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: litmus-chaos-exporter
  namespace: monitoring
  labels:
    app: litmus-chaos-exporter
spec:
  selector:
    matchLabels:
      app: chaos-exporter
  endpoints:
  - port: tcp-8080-8080
    interval: 5s
    path: /metrics
  namespaceSelector:
    matchNames:
    - litmus

Grafana Dashboard Configuration

Create comprehensive Grafana dashboards for chaos engineering metrics:

{
  "dashboard": {
    "id": null,
    "title": "Litmus Chaos Engineering Dashboard",
    "tags": ["chaos", "litmus", "sre"],
    "style": "dark",
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Chaos Experiment Status",
        "type": "stat",
        "targets": [
          {
            "expr": "litmuschaos_experiment_status",
            "legendFormat": "{{experiment_name}} - {{status}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "palette-classic"
            },
            "custom": {
              "displayMode": "list",
              "orientation": "horizontal"
            },
            "mappings": [
              {
                "options": {
                  "0": {
                    "text": "Not Started"
                  },
                  "1": {
                    "text": "Running"
                  },
                  "2": {
                    "text": "Completed"
                  },
                  "3": {
                    "text": "Failed"
                  }
                },
                "type": "value"
              }
            ]
          }
        }
      },
      {
        "id": 2,
        "title": "Experiment Success Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(litmuschaos_experiment_passed_total[5m]) / rate(litmuschaos_experiment_total[5m]) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      },
      {
        "id": 3,
        "title": "Application SLA During Chaos",
        "type": "timeseries",
        "targets": [
          {
            "expr": "probe_success",
            "legendFormat": "{{probe_name}}"
          }
        ]
      },
      {
        "id": 4,
        "title": "Resource Utilization During Chaos",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total[1m]) * 100",
            "legendFormat": "CPU - {{pod}}"
          },
          {
            "expr": "container_memory_usage_bytes / container_spec_memory_limit_bytes * 100",
            "legendFormat": "Memory - {{pod}}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}

Alert Rules for Chaos Experiments

# chaos-alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: litmus-chaos-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: chaos.rules
    rules:
    - alert: ChaosExperimentFailed
      expr: litmuschaos_experiment_status == 3
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Chaos experiment {{ $labels.experiment_name }} failed"
        description: "Chaos experiment {{ $labels.experiment_name }} in namespace {{ $labels.namespace }} has failed. This indicates a potential reliability issue."
    
    - alert: ApplicationSLABreach
      expr: probe_success == 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Application SLA breach detected during chaos experiment"
        description: "Probe {{ $labels.probe_name }} is failing for more than 2 minutes during chaos testing."
    
    - alert: HighChaosExperimentFailureRate
      expr: (rate(litmuschaos_experiment_failed_total[10m]) / rate(litmuschaos_experiment_total[10m])) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High chaos experiment failure rate detected"
        description: "More than 50% of chaos experiments are failing in the last 10 minutes."
    
    - alert: LongRunningChaosExperiment
      expr: time() - litmuschaos_experiment_start_time > 3600
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Chaos experiment running for more than 1 hour"
        description: "Experiment {{ $labels.experiment_name }} has been running for more than 1 hour, which may indicate it's stuck."

Automated Chaos Testing in CI/CD

GitLab CI Integration

# .gitlab-ci.yml
stages:
  - build
  - test
  - chaos-test
  - deploy

variables:
  KUBECONFIG: /tmp/kubeconfig
  CHAOS_NAMESPACE: "chaos-testing"

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

unit-tests:
  stage: test
  script:
    - go test ./...

chaos-test:
  stage: chaos-test
  image: litmuschaos/litmus-e2e:latest
  before_script:
    - echo "$KUBE_CONFIG" | base64 -d > $KUBECONFIG
    - kubectl config set-context --current --namespace=$CHAOS_NAMESPACE
  script:
    - |
      # Deploy application to chaos testing environment
      helm upgrade --install test-app ./helm-chart \
        --namespace $CHAOS_NAMESPACE \
        --set image.tag=$CI_COMMIT_SHA \
        --wait --timeout=300s
      
      # Run pod deletion chaos test
      kubectl apply -f - <<EOF
      apiVersion: litmuschaos.io/v1alpha1
      kind: ChaosEngine
      metadata:
        name: ci-pod-delete-$CI_PIPELINE_ID
        namespace: $CHAOS_NAMESPACE
      spec:
        engineState: 'active'
        appinfo:
          appns: '$CHAOS_NAMESPACE'
          applabel: 'app=test-app'
          appkind: 'deployment'
        chaosServiceAccount: litmus-admin
        experiments:
        - name: pod-delete
          spec:
            components:
              env:
              - name: TOTAL_CHAOS_DURATION
                value: '60'
              - name: CHAOS_INTERVAL
                value: '15'
              - name: FORCE
                value: 'false'
            probe:
            - name: availability-probe
              type: httpProbe
              mode: Continuous
              runProperties:
                probeTimeout: 10s
                retry: 3
                interval: 5s
              httpProbe/inputs:
                url: http://test-app.$CHAOS_NAMESPACE.svc.cluster.local/health
                insecureSkipTLS: true
                method:
                  get:
                    criteria: "=="
                    responseCode: "200"
      EOF
      
      # Wait for chaos experiment to complete
      kubectl wait --for=condition=complete \
        chaosresult/ci-pod-delete-$CI_PIPELINE_ID-pod-delete \
        --namespace=$CHAOS_NAMESPACE \
        --timeout=300s
      
      # Check experiment result
      VERDICT=$(kubectl get chaosresult ci-pod-delete-$CI_PIPELINE_ID-pod-delete \
        -n $CHAOS_NAMESPACE -o jsonpath='{.status.experimentStatus.verdict}')
      
      if [ "$VERDICT" != "Pass" ]; then
        echo "Chaos experiment failed with verdict: $VERDICT"
        exit 1
      fi
      
      echo "Chaos experiment passed successfully"
  after_script:
    - kubectl delete chaosengine ci-pod-delete-$CI_PIPELINE_ID -n $CHAOS_NAMESPACE --ignore-not-found
  only:
    - merge_requests
    - main

deploy-production:
  stage: deploy
  script:
    - helm upgrade --install prod-app ./helm-chart --namespace production
  only:
    - main
  when: manual

GitHub Actions Workflow

# .github/workflows/chaos-testing.yml
name: Chaos Engineering Tests

on:
  pull_request:
    branches: [ main ]
  push:
    branches: [ main ]

env:
  CHAOS_NAMESPACE: chaos-testing
  KUBECTL_VERSION: v1.28.0

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v4
    
    - name: Set up kubectl
      uses: azure/setup-kubectl@v3
      with:
        version: ${{ env.KUBECTL_VERSION }}
    
    - name: Configure kubeconfig
      run: |
        mkdir -p ~/.kube
        echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
    
    - name: Install Litmus ChaosCenter CLI
      run: |
        curl -O https://litmusctl-production-bucket.s3.amazonaws.com/litmusctl-linux-amd64-master.tar.gz
        tar -zxvf litmusctl-linux-amd64-master.tar.gz
        chmod +x litmusctl
        sudo mv litmusctl /usr/local/bin/
    
    - name: Deploy test application
      run: |
        kubectl create namespace $CHAOS_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
        helm upgrade --install test-app ./charts/app \
          --namespace $CHAOS_NAMESPACE \
          --set image.tag=${{ github.sha }} \
          --wait --timeout=300s
    
    - name: Run CPU stress test
      id: cpu-stress
      run: |
        cat > cpu-stress-test.yaml <<EOF
        apiVersion: litmuschaos.io/v1alpha1
        kind: ChaosEngine
        metadata:
          name: cpu-stress-${{ github.run_id }}
          namespace: $CHAOS_NAMESPACE
        spec:
          engineState: 'active'
          appinfo:
            appns: '$CHAOS_NAMESPACE'
            applabel: 'app=test-app'
            appkind: 'deployment'
          chaosServiceAccount: litmus-admin
          experiments:
          - name: pod-cpu-hog
            spec:
              components:
                env:
                - name: TOTAL_CHAOS_DURATION
                  value: '120'
                - name: CPU_CORES
                  value: '1'
                - name: PODS_AFFECTED_PERC
                  value: '50'
              probe:
              - name: response-time-probe
                type: httpProbe
                mode: Continuous
                runProperties:
                  probeTimeout: 5s
                  retry: 3
                  interval: 10s
                httpProbe/inputs:
                  url: http://test-app.$CHAOS_NAMESPACE.svc.cluster.local/api/health
                  insecureSkipTLS: true
                  method:
                    get:
                      criteria: "<"
                      responseTimeout: "3s"
        EOF
        
        kubectl apply -f cpu-stress-test.yaml
        
        # Wait for experiment completion
        timeout 300s bash -c 'until kubectl get chaosresult cpu-stress-${{ github.run_id }}-pod-cpu-hog -n $CHAOS_NAMESPACE; do sleep 10; done'
        
        # Get experiment verdict
        VERDICT=$(kubectl get chaosresult cpu-stress-${{ github.run_id }}-pod-cpu-hog \
          -n $CHAOS_NAMESPACE -o jsonpath='{.status.experimentStatus.verdict}')
        
        echo "cpu_stress_verdict=$VERDICT" >> $GITHUB_OUTPUT
        
        if [ "$VERDICT" != "Pass" ]; then
          echo "CPU stress test failed"
          exit 1
        fi
    
    - name: Run network latency test
      id: network-latency
      run: |
        cat > network-latency-test.yaml <<EOF
        apiVersion: litmuschaos.io/v1alpha1
        kind: ChaosEngine
        metadata:
          name: network-latency-${{ github.run_id }}
          namespace: $CHAOS_NAMESPACE
        spec:
          engineState: 'active'
          appinfo:
            appns: '$CHAOS_NAMESPACE'
            applabel: 'app=test-app'
            appkind: 'deployment'
          chaosServiceAccount: litmus-admin
          experiments:
          - name: pod-network-latency
            spec:
              components:
                env:
                - name: TOTAL_CHAOS_DURATION
                  value: '60'
                - name: NETWORK_LATENCY
                  value: '2000'
                - name: JITTER
                  value: '200'
                - name: CONTAINER_RUNTIME
                  value: 'containerd'
                - name: SOCKET_PATH
                  value: '/run/containerd/containerd.sock'
              probe:
              - name: latency-tolerance-probe
                type: httpProbe
                mode: Continuous
                runProperties:
                  probeTimeout: 15s
                  retry: 3
                  interval: 5s
                httpProbe/inputs:
                  url: http://test-app.$CHAOS_NAMESPACE.svc.cluster.local/api/ping
                  insecureSkipTLS: true
                  method:
                    get:
                      criteria: "=="
                      responseCode: "200"
        EOF
        
        kubectl apply -f network-latency-test.yaml
        
        # Wait for experiment completion
        timeout 300s bash -c 'until kubectl get chaosresult network-latency-${{ github.run_id }}-pod-network-latency -n $CHAOS_NAMESPACE; do sleep 10; done'
        
        # Get experiment verdict
        VERDICT=$(kubectl get chaosresult network-latency-${{ github.run_id }}-pod-network-latency \
          -n $CHAOS_NAMESPACE -o jsonpath='{.status.experimentStatus.verdict}')
        
        echo "network_latency_verdict=$VERDICT" >> $GITHUB_OUTPUT
        
        if [ "$VERDICT" != "Pass" ]; then
          echo "Network latency test failed"
          exit 1
        fi
    
    - name: Generate chaos test report
      run: |
        cat > chaos-test-report.md <<EOF
        # Chaos Engineering Test Report
        
        **Pipeline:** ${{ github.workflow }}
        **Commit:** ${{ github.sha }}
        **Branch:** ${{ github.ref_name }}
        **Run ID:** ${{ github.run_id }}
        
        ## Test Results
        
        | Test | Result |
        |------|--------|
        | CPU Stress Test | ${{ steps.cpu-stress.outputs.cpu_stress_verdict }} |
        | Network Latency Test | ${{ steps.network-latency.outputs.network_latency_verdict }} |
        
        ## Experiment Details
        
        - **CPU Stress Duration:** 120 seconds
        - **Network Latency:** 2000ms ± 200ms jitter
        - **Affected Pods:** 50%
        - **Probe Failures:** 0
        
        EOF
        
        cat chaos-test-report.md
    
    - name: Cleanup
      if: always()
      run: |
        kubectl delete chaosengine --all -n $CHAOS_NAMESPACE --ignore-not-found
        kubectl delete namespace $CHAOS_NAMESPACE --ignore-not-found

Jenkins Pipeline Integration

// Jenkinsfile
pipeline {
    agent any
    
    environment {
        KUBECONFIG = credentials('kubeconfig')
        CHAOS_NAMESPACE = 'chaos-testing'
        DOCKER_REGISTRY = 'your-registry.com'
    }
    
    stages {
        stage('Build and Test') {
            parallel {
                stage('Build') {
                    steps {
                        script {
                            def image = docker.build("${DOCKER_REGISTRY}/test-app:${BUILD_NUMBER}")
                            image.push()
                        }
                    }
                }
                
                stage('Unit Tests') {
                    steps {
                        sh 'go test ./...'
                    }
                }
            }
        }
        
        stage('Deploy to Chaos Environment') {
            steps {
                sh """
                    kubectl create namespace ${CHAOS_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
                    helm upgrade --install test-app ./helm-chart \\
                      --namespace ${CHAOS_NAMESPACE} \\
                      --set image.tag=${BUILD_NUMBER} \\
                      --wait --timeout=300s
                """
            }
        }
        
        stage('Chaos Engineering Tests') {
            parallel {
                stage('Pod Deletion Test') {
                    steps {
                        script {
                            sh """
                                cat > pod-deletion-test.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-deletion-${BUILD_NUMBER}
  namespace: ${CHAOS_NAMESPACE}
spec:
  engineState: 'active'
  appinfo:
    appns: '${CHAOS_NAMESPACE}'
    applabel: 'app=test-app'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'false'
      probe:
      - name: availability-probe
        type: httpProbe
        mode: Continuous
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 5s
        httpProbe/inputs:
          url: http://test-app.${CHAOS_NAMESPACE}.svc.cluster.local/health
          insecureSkipTLS: true
          method:
            get:
              criteria: "=="
              responseCode: "200"
EOF
                                kubectl apply -f pod-deletion-test.yaml
                                
                                timeout 300 bash -c 'until kubectl get chaosresult pod-deletion-${BUILD_NUMBER}-pod-delete -n ${CHAOS_NAMESPACE}; do sleep 10; done'
                                
                                VERDICT=\$(kubectl get chaosresult pod-deletion-${BUILD_NUMBER}-pod-delete -n ${CHAOS_NAMESPACE} -o jsonpath='{.status.experimentStatus.verdict}')
                                
                                if [ "\$VERDICT" != "Pass" ]; then
                                    echo "Pod deletion test failed with verdict: \$VERDICT"
                                    exit 1
                                fi
                            """
                        }
                    }
                }
                
                stage('Memory Stress Test') {
                    steps {
                        script {
                            sh """
                                cat > memory-stress-test.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-${BUILD_NUMBER}
  namespace: ${CHAOS_NAMESPACE}
spec:
  engineState: 'active'
  appinfo:
    appns: '${CHAOS_NAMESPACE}'
    applabel: 'app=test-app'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '90'
        - name: MEMORY_CONSUMPTION
          value: '500'
        - name: NUMBER_OF_WORKERS
          value: '2'
      probe:
      - name: memory-probe
        type: k8sProbe
        mode: Continuous
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 10s
        k8sProbe/inputs:
          group: ""
          version: "v1"
          resource: "pods"
          namespace: "${CHAOS_NAMESPACE}"
          fieldSelector: "status.phase=Running"
          labelSelector: "app=test-app"
          operation: "present"
EOF
                                kubectl apply -f memory-stress-test.yaml
                                
                                timeout 300 bash -c 'until kubectl get chaosresult memory-stress-${BUILD_NUMBER}-pod-memory-hog -n ${CHAOS_NAMESPACE}; do sleep 10; done'
                                
                                VERDICT=\$(kubectl get chaosresult memory-stress-${BUILD_NUMBER}-pod-memory-hog -n ${CHAOS_NAMESPACE} -o jsonpath='{.status.experimentStatus.verdict}')
                                
                                if [ "\$VERDICT" != "Pass" ]; then
                                    echo "Memory stress test failed with verdict: \$VERDICT"
                                    exit 1
                                fi
                            """
                        }
                    }
                }
            }
        }
        
        stage('Collect Chaos Results') {
            steps {
                script {
                    sh """
                        echo "Collecting chaos engineering test results..."
                        kubectl get chaosresults -n ${CHAOS_NAMESPACE} -o yaml > chaos-results-${BUILD_NUMBER}.yaml
                        
                        # Generate summary report
                        echo "# Chaos Engineering Results - Build ${BUILD_NUMBER}" > chaos-summary.md
                        echo "" >> chaos-summary.md
                        echo "## Experiments Executed:" >> chaos-summary.md
                        kubectl get chaosresults -n ${CHAOS_NAMESPACE} --no-headers | while read result; do
                            name=\$(echo \$result | awk '{print \$1}')
                            verdict=\$(kubectl get chaosresult \$name -n ${CHAOS_NAMESPACE} -o jsonpath='{.status.experimentStatus.verdict}')
                            echo "- \$name: \$verdict" >> chaos-summary.md
                        done
                    """
                    
                    archiveArtifacts artifacts: 'chaos-results-*.yaml, chaos-summary.md', fingerprint: true
                }
            }
        }
    }
    
    post {
        always {
            sh """
                kubectl delete chaosengine --all -n ${CHAOS_NAMESPACE} --ignore-not-found || true
                kubectl delete namespace ${CHAOS_NAMESPACE} --ignore-not-found || true
            """
        }
        
        failure {
            emailext (
                subject: "Chaos Engineering Tests Failed - Build ${BUILD_NUMBER}",
                body: "The chaos engineering tests have failed for build ${BUILD_NUMBER}. Please check the Jenkins console output for details.",
                to: "${env.CHANGE_AUTHOR_EMAIL}"
            )
        }
        
        success {
            slackSend (
                color: 'good',
                message: "Chaos engineering tests passed for build ${BUILD_NUMBER}! 🎉"
            )
        }
    }
}

Production Readiness and Best Practices

Security Considerations

RBAC Configuration

# litmus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: litmus
  labels:
    name: litmus-admin

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
  labels:
    name: litmus-admin
rules:
- apiGroups: [""]
  resources: ["pods","events","configmaps","secrets","pods/log","pods/exec"]
  verbs: ["create","delete","get","list","patch","update","deletecollection"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["patch","get","list"]
- apiGroups: ["apps"]
  resources: ["deployments","statefulsets","replicasets","daemonsets"]
  verbs: ["list","get","patch","create","delete"]
- apiGroups: ["apps"]
  resources: ["deployments/scale","statefulsets/scale"]
  verbs: ["patch"]
- apiGroups: [""]
  resources: ["replicationcontrollers"]
  verbs: ["get","list"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["list","get","patch"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update","delete"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["list","get"]
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
  labels:
    name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
- kind: ServiceAccount
  name: litmus-admin
  namespace: litmus

---
# Namespace-scoped service account for production experiments
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-executor
  namespace: production

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-executor
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete","get","list"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create","get","list"]
- apiGroups: ["apps"]
  resources: ["deployments","replicasets"]
  verbs: ["get","list","patch"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-executor
  namespace: production
subjects:
- kind: ServiceAccount
  name: chaos-executor
  namespace: production
roleRef:
  kind: Role
  name: chaos-executor
  apiGroup: rbac.authorization.k8s.io

Network Policies

# litmus-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litmus-ingress-policy
  namespace: litmus
spec:
  podSelector:
    matchLabels:
      app: litmus-portal-frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litmus-egress-policy
  namespace: litmus
spec:
  podSelector:
    matchLabels:
      app: chaos-operator
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 6443
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53

Disaster Recovery Planning

Backup Strategy

#!/bin/bash
# backup-litmus.sh

BACKUP_DIR="/backup/litmus/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

echo "Starting Litmus backup..."

# Backup CRDs
kubectl get crd -o yaml > "$BACKUP_DIR/crds.yaml"

# Backup Litmus resources
kubectl get chaosengines --all-namespaces -o yaml > "$BACKUP_DIR/chaosengines.yaml"
kubectl get chaosexperiments --all-namespaces -o yaml > "$BACKUP_DIR/chaosexperiments.yaml"
kubectl get chaosresults --all-namespaces -o yaml > "$BACKUP_DIR/chaosresults.yaml"

# Backup Litmus configuration
kubectl get configmaps -n litmus -o yaml > "$BACKUP_DIR/configmaps.yaml"
kubectl get secrets -n litmus -o yaml > "$BACKUP_DIR/secrets.yaml"

# Backup MongoDB data if using internal MongoDB
if kubectl get pods -n litmus | grep -q mongodb; then
    kubectl exec -n litmus deployment/mongodb -- mongodump --archive | gzip > "$BACKUP_DIR/mongodb.gz"
fi

# Create archive
tar -czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

echo "Backup completed: $BACKUP_DIR.tar.gz"

Recovery Procedures

#!/bin/bash
# restore-litmus.sh

BACKUP_FILE="$1"

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup-file.tar.gz>"
    exit 1
fi

RESTORE_DIR="/tmp/litmus-restore/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESTORE_DIR"

echo "Extracting backup..."
tar -xzf "$BACKUP_FILE" -C "$RESTORE_DIR" --strip-components=1

echo "Restoring Litmus..."

# Restore CRDs first
kubectl apply -f "$RESTORE_DIR/crds.yaml"

# Wait for CRDs to be ready
sleep 30

# Restore Litmus namespace and basic resources
kubectl apply -f "$RESTORE_DIR/configmaps.yaml"
kubectl apply -f "$RESTORE_DIR/secrets.yaml"

# Restore chaos experiments (templates)
kubectl apply -f "$RESTORE_DIR/chaosexperiments.yaml"

# Restore MongoDB data if backup exists
if [ -f "$RESTORE_DIR/mongodb.gz" ]; then
    kubectl exec -n litmus deployment/mongodb -- sh -c 'mongorestore --archive --gzip' < "$RESTORE_DIR/mongodb.gz"
fi

echo "Restore completed from: $BACKUP_FILE"
rm -rf "$RESTORE_DIR"

Scaling and Performance Optimization

Resource Optimization

# litmus-performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-operator-config
  namespace: litmus
data:
  CHAOS_OPERATOR_LOG_LEVEL: "INFO"
  CHAOS_OPERATOR_WATCH_NAMESPACE: ""
  REQUEUE_TIME: "2"
  OPERATOR_SCOPE: "cluster"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-operator
  namespace: litmus
spec:
  replicas: 2
  selector:
    matchLabels:
      name: chaos-operator
  template:
    metadata:
      labels:
        name: chaos-operator
    spec:
      serviceAccountName: litmus-admin
      containers:
      - name: chaos-operator
        image: litmuschaos/chaos-operator:latest
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: CHAOS_RUNNER_IMAGE
          value: "litmuschaos/chaos-runner:latest"
        - name: WATCH_NAMESPACE
          value: ""
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: "chaos-operator"
        - name: REQUEUE_TIME
          value: "2"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 10

---
apiVersion: v1
kind: Service
metadata:
  name: chaos-operator-metrics
  namespace: litmus
  labels:
    name: chaos-operator
spec:
  ports:
  - name: metrics
    port: 8383
    protocol: TCP
    targetPort: 8383
  selector:
    name: chaos-operator

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litmus-portal-server-hpa
  namespace: litmus
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litmusportal-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

Incident Response Automation

Automated Incident Detection

# incident-detection-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: incident-response-workflow
  namespace: sre
spec:
  entrypoint: incident-response
  serviceAccountName: incident-responder
  templates:
  - name: incident-response
    steps:
    - - name: detect-anomaly
        template: detect-anomaly
    - - name: trigger-chaos-validation
        template: chaos-validation
        when: "{{steps.detect-anomaly.outputs.parameters.anomaly-detected}} == true"
    - - name: incident-mitigation
        template: incident-mitigation
        when: "{{steps.chaos-validation.outputs.parameters.validation-failed}} == true"

  - name: detect-anomaly
    script:
      image: prom/prometheus:latest
      command: [sh]
      source: |
        # Query Prometheus for anomalies
        RESPONSE=$(curl -s "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=up{job=\"kubernetes-nodes\"}")
        
        # Check if any nodes are down
        DOWN_NODES=$(echo $RESPONSE | jq '.data.result[] | select(.value[1] == "0") | length')
        
        if [ "$DOWN_NODES" -gt 0 ]; then
          echo "true" > /tmp/anomaly-detected
          echo "Node failure detected" > /tmp/anomaly-reason
        else
          echo "false" > /tmp/anomaly-detected
          echo "No anomalies detected" > /tmp/anomaly-reason
        fi
    outputs:
      parameters:
      - name: anomaly-detected
        valueFrom:
          path: /tmp/anomaly-detected
      - name: anomaly-reason
        valueFrom:
          path: /tmp/anomaly-reason

  - name: chaos-validation
    script:
      image: litmuschaos/k8s:latest
      command: [sh]
      source: |
        # Create validation chaos experiment
        cat > /tmp/validation-experiment.yaml <<EOF
        apiVersion: litmuschaos.io/v1alpha1
        kind: ChaosEngine
        metadata:
          name: incident-validation
          namespace: default
        spec:
          engineState: 'active'
          appinfo:
            appns: 'default'
            applabel: 'app=critical-service'
            appkind: 'deployment'
          chaosServiceAccount: litmus-admin
          experiments:
          - name: pod-delete
            spec:
              components:
                env:
                - name: TOTAL_CHAOS_DURATION
                  value: '30'
                - name: FORCE
                  value: 'false'
              probe:
              - name: service-availability
                type: httpProbe
                mode: Continuous
                runProperties:
                  probeTimeout: 10s
                  retry: 3
                  interval: 5s
                httpProbe/inputs:
                  url: http://critical-service.default.svc.cluster.local/health
                  insecureSkipTLS: true
                  method:
                    get:
                      criteria: "=="
                      responseCode: "200"
        EOF
        
        kubectl apply -f /tmp/validation-experiment.yaml
        
        # Wait for result
        sleep 60
        
        VERDICT=$(kubectl get chaosresult incident-validation-pod-delete -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || echo "Failed")
        
        if [ "$VERDICT" != "Pass" ]; then
          echo "true" > /tmp/validation-failed
        else
          echo "false" > /tmp/validation-failed
        fi
        
        kubectl delete chaosengine incident-validation --ignore-not-found
    outputs:
      parameters:
      - name: validation-failed
        valueFrom:
          path: /tmp/validation-failed

  - name: incident-mitigation
    script:
      image: alpine/helm:latest
      command: [sh]
      source: |
        # Implement automated mitigation strategies
        echo "Implementing incident mitigation..."
        
        # Scale up replicas
        kubectl scale deployment critical-service --replicas=10 -n default
        
        # Update service mesh configuration for circuit breaking
        kubectl patch destinationrule critical-service -n default --type='merge' -p='
        {
          "spec": {
            "trafficPolicy": {
              "outlierDetection": {
                "consecutiveErrors": 3,
                "interval": "30s",
                "baseEjectionTime": "30s",
                "maxEjectionPercent": 50
              }
            }
          }
        }'
        
        # Send alert to incident response team
        curl -X POST "$SLACK_WEBHOOK_URL" \
          -H 'Content-type: application/json' \
          --data '{"text":"🚨 Automated incident mitigation triggered. Critical service scaled up and circuit breaker activated."}'

Runbook Automation

# runbook-automation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-runbooks
  namespace: sre
data:
  database-failure-runbook.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: database-failure-runbook
    spec:
      entrypoint: database-recovery
      templates:
      - name: database-recovery
        steps:
        - - name: verify-failure
            template: verify-database-failure
        - - name: failover-database
            template: failover-to-replica
            when: "{{steps.verify-failure.outputs.parameters.database-down}} == true"
        - - name: notify-team
            template: send-notification
      
      - name: verify-database-failure
        script:
          image: postgres:13
          command: [sh]
          source: |
            pg_isready -h database.production.svc.cluster.local -p 5432 -U postgres
            if [ $? -eq 0 ]; then
              echo "false" > /tmp/database-down
            else
              echo "true" > /tmp/database-down
            fi
        outputs:
          parameters:
          - name: database-down
            valueFrom:
              path: /tmp/database-down
      
      - name: failover-to-replica
        script:
          image: bitnami/kubectl:latest
          command: [sh]
          source: |
            # Update service to point to replica
            kubectl patch service database -n production -p '{"spec":{"selector":{"app":"database-replica"}}}'
            
            # Scale down failed primary
            kubectl scale statefulset database-primary --replicas=0 -n production
            
            # Promote replica to primary
            kubectl patch statefulset database-replica -n production -p '{"spec":{"template":{"metadata":{"labels":{"role":"primary"}}}}}'
      
      - name: send-notification
        script:
          image: curlimages/curl:latest
          command: [sh]
          source: |
            curl -X POST "$TEAMS_WEBHOOK_URL" \
              -H 'Content-Type: application/json' \
              -d '{
                "title": "Database Failover Completed",
                "text": "Automated database failover has been executed. Primary database failed and traffic has been redirected to replica.",
                "potentialAction": [{
                  "@type": "OpenUri",
                  "name": "View Grafana Dashboard",
                  "targets": [{"os": "default", "uri": "https://grafana.company.com/d/database"}]
                }]
              }'

  network-partition-runbook.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: network-partition-runbook
    spec:
      entrypoint: network-recovery
      templates:
      - name: network-recovery
        steps:
        - - name: detect-partition
            template: detect-network-partition
        - - name: activate-circuit-breaker
            template: enable-circuit-breaker
            when: "{{steps.detect-partition.outputs.parameters.partition-detected}} == true"
        - - name: reroute-traffic
            template: traffic-rerouting
            when: "{{steps.detect-partition.outputs.parameters.partition-detected}} == true"
      
      - name: detect-network-partition
        script:
          image: nicolaka/netshoot:latest
          command: [sh]
          source: |
            # Test connectivity between microservices
            SERVICES=("user-service" "order-service" "payment-service")
            FAILURES=0
            
            for service in "${SERVICES[@]}"; do
              if ! nc -zv $service.microservices.svc.cluster.local 80; then
                FAILURES=$((FAILURES + 1))
              fi
            done
            
            if [ $FAILURES -gt 1 ]; then
              echo "true" > /tmp/partition-detected
            else
              echo "false" > /tmp/partition-detected
            fi
        outputs:
          parameters:
          - name: partition-detected
            valueFrom:
              path: /tmp/partition-detected
      
      - name: enable-circuit-breaker
        script:
          image: istio/pilot:latest
          command: [sh]
          source: |
            # Apply emergency circuit breaker configuration
            kubectl apply -f - <<EOF
            apiVersion: networking.istio.io/v1beta1
            kind: DestinationRule
            metadata:
              name: emergency-circuit-breaker
              namespace: microservices
            spec:
              host: "*.microservices.svc.cluster.local"
              trafficPolicy:
                outlierDetection:
                  consecutiveErrors: 1
                  interval: 10s
                  baseEjectionTime: 30s
                  maxEjectionPercent: 100
                connectionPool:
                  tcp:
                    maxConnections: 1
                  http:
                    http1MaxPendingRequests: 1
                    maxRequestsPerConnection: 1
            EOF
      
      - name: traffic-rerouting
        script:
          image: bitnami/kubectl:latest
          command: [sh]
          source: |
            # Route traffic to backup region
            kubectl patch virtualservice api-gateway -n microservices --type='merge' -p='
            {
              "spec": {
                "http": [{
                  "route": [{
                    "destination": {
                      "host": "api-gateway.backup-region.svc.cluster.local"
                    },
                    "weight": 100
                  }]
                }]
              }
            }'

Advanced Chaos Scenarios

Multi-Region Chaos Testing

# multi-region-chaos.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: multi-region-chaos-test
  namespace: chaos-engineering
spec:
  entrypoint: multi-region-test
  serviceAccountName: chaos-engineer
  arguments:
    parameters:
    - name: primary-region
      value: "us-east-1"
    - name: backup-region
      value: "us-west-2"
  
  templates:
  - name: multi-region-test
    steps:
    - - name: baseline-metrics
        template: collect-baseline
    - - name: region-failure-simulation
        template: simulate-region-failure
        arguments:
          parameters:
          - name: failed-region
            value: "{{workflow.parameters.primary-region}}"
    - - name: validate-failover
        template: validate-regional-failover
    - - name: recovery-test
        template: test-recovery
    - - name: final-validation
        template: validate-full-recovery

  - name: collect-baseline
    script:
      image: prom/prometheus:latest
      command: [sh]
      source: |
        # Collect baseline metrics across regions
        echo "Collecting baseline metrics..."
        
        METRICS=(
          "http_requests_total"
          "http_request_duration_seconds"
          "up{job='kubernetes-nodes'}"
          "cluster_health_score"
        )
        
        for metric in "${METRICS[@]}"; do
          curl -s "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=$metric" \
            > "/tmp/baseline_$(echo $metric | tr '{}=' '___').json"
        done

  - name: simulate-region-failure
    inputs:
      parameters:
      - name: failed-region
    script:
      image: litmuschaos/k8s:latest
      command: [sh]
      source: |
        FAILED_REGION="{{inputs.parameters.failed-region}}"
        
        # Create comprehensive region failure scenario
        cat > /tmp/region-failure-experiment.yaml <<EOF
        apiVersion: litmuschaos.io/v1alpha1
        kind: ChaosEngine
        metadata:
          name: region-failure-simulation
          namespace: production
        spec:
          engineState: 'active'
          appinfo:
            appns: 'production'
            applabel: "region=$FAILED_REGION"
            appkind: 'deployment'
          chaosServiceAccount: litmus-admin
          experiments:
          - name: node-drain
            spec:
              components:
                env:
                - name: TARGET_NODE
                  value: ""
                - name: NODE_LABEL
                  value: "topology.kubernetes.io/zone=$FAILED_REGION"
          - name: pod-network-partition
            spec:
              components:
                env:
                - name: TOTAL_CHAOS_DURATION
                  value: '300'
                - name: DESTINATION_IPS
                  value: "*.us-west-2.compute.amazonaws.com"
          - name: aws-ec2-terminate-by-tag
            spec:
              components:
                env:
                - name: TOTAL_CHAOS_DURATION
                  value: '300'
                - name: EC2_INSTANCE_TAG
                  value: "Region=$FAILED_REGION"
                - name: MANAGED_NODEGROUP
                  value: "enabled"
        EOF
        
        kubectl apply -f /tmp/region-failure-experiment.yaml

  - name: validate-regional-failover
    script:
      image: curlimages/curl:latest
      command: [sh]
      source: |
        echo "Validating regional failover..."
        
        # Test API endpoints
        ENDPOINTS=(
          "https://api.company.com/health"
          "https://api.company.com/users"
          "https://api.company.com/orders"
        )
        
        for endpoint in "${ENDPOINTS[@]}"; do
          response=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint")
          if [ "$response" != "200" ]; then
            echo "FAIL: $endpoint returned $response"
            exit 1
          else
            echo "PASS: $endpoint accessible"
          fi
        done
        
        # Validate traffic is routed to backup region
        trace_output=$(curl -s -H "X-Trace-Region: true" https://api.company.com/health)
        if echo "$trace_output" | grep -q "us-west-2"; then
          echo "PASS: Traffic routed to backup region"
        else
          echo "FAIL: Traffic not properly routed"
          exit 1
        fi

  - name: test-recovery
    script:
      image: litmuschaos/k8s:latest
      command: [sh]
      source: |
        echo "Testing recovery procedures..."
        
        # Simulate recovery by stopping chaos experiments
        kubectl delete chaosengine region-failure-simulation -n production
        
        # Wait for nodes to recover
        sleep 120
        
        # Gradually restore traffic to primary region
        for weight in 25 50 75 100; do
          kubectl patch virtualservice api-gateway -n production --type='merge' -p="
          {
            \"spec\": {
              \"http\": [{
                \"route\": [
                  {
                    \"destination\": {\"host\": \"api-gateway.us-east-1.local\"},
                    \"weight\": $weight
                  },
                  {
                    \"destination\": {\"host\": \"api-gateway.us-west-2.local\"},
                    \"weight\": $((100 - weight))
                  }
                ]
              }]
            }
          }"
          
          echo "Traffic weight adjusted: Primary $weight%, Backup $((100 - weight))%"
          sleep 60
        done

  - name: validate-full-recovery
    script:
      image: prom/prometheus:latest
      command: [sh]
      source: |
        echo "Validating full recovery..."
        
        # Check all nodes are ready
        ready_nodes=$(kubectl get nodes --no-headers | grep Ready | wc -l)
        total_nodes=$(kubectl get nodes --no-headers | wc -l)
        
        if [ "$ready_nodes" -eq "$total_nodes" ]; then
          echo "PASS: All nodes are ready ($ready_nodes/$total_nodes)"
        else
          echo "FAIL: Not all nodes are ready ($ready_nodes/$total_nodes)"
          exit 1
        fi
        
        # Validate metrics have returned to baseline
        current_rps=$(curl -s "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=rate(http_requests_total[5m])" | jq '.data.result[0].value[1]' | sed 's/"//g')
        baseline_rps=$(cat /tmp/baseline_http_requests_total.json | jq '.data.result[0].value[1]' | sed 's/"//g')
        
        variance=$(echo "scale=2; ($current_rps - $baseline_rps) / $baseline_rps * 100" | bc)
        
        if (( $(echo "$variance < 10" | bc -l) )); then
          echo "PASS: RPS within 10% of baseline (variance: ${variance}%)"
        else
          echo "FAIL: RPS significantly different from baseline (variance: ${variance}%)"
          exit 1
        fi

Service Mesh Chaos Testing

# service-mesh-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: service-mesh-chaos
  namespace: microservices
spec:
  engineState: 'active'
  appinfo:
    appns: 'microservices'
    applabel: 'version=v1'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: istio-proxy-kill
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: CHAOS_INTERVAL
          value: '15'
        - name: FORCE
          value: 'false'
        - name: PROXY_CONTAINER
          value: 'istio-proxy'
      probe:
      - name: service-mesh-metrics
        type: promProbe
        mode: Continuous
        runProperties:
          probeTimeout: 10s
          retry: 3
          interval: 15s
        promProbe/inputs:
          endpoint: "http://prometheus.istio-system.svc.cluster.local:9090"
          query: "istio_requests_total{destination_service_name='user-service',response_code='200'}"
          comparator:
            criteria: ">"
            value: "0"
      
      - name: circuit-breaker-status
        type: k8sProbe
        mode: Edge
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 30s
        k8sProbe/inputs:
          group: "networking.istio.io"
          version: "v1beta1"
          resource: "destinationrules"
          namespace: "microservices"
          fieldSelector: "metadata.name=user-service-circuit-breaker"
          operation: "present"

  - name: envoy-config-corruption
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        - name: TARGET_CONTAINER
          value: 'istio-proxy'
        - name: CONFIG_MAP_NAME
          value: 'istio-envoy-config'
      probe:
      - name: envoy-health-check
        type: httpProbe
        mode: Continuous
        runProperties:
          probeTimeout: 5s
          retry: 3
          interval: 10s
        httpProbe/inputs:
          url: "http://user-service.microservices.svc.cluster.local:15000/ready"
          insecureSkipTLS: true
          method:
            get:
              criteria: "=="
              responseCode: "200"
      
      - name: traffic-routing-validation
        type: cmdProbe
        mode: Continuous
        runProperties:
          probeTimeout: 15s
          retry: 3
          interval: 20s
        cmdProbe/inputs:
          command: "curl"
          args:
            - "-H"
            - "x-test-route: canary"
            - "http://user-service.microservices.svc.cluster.local/api/test"
          source:
            image: "curlimages/curl:latest"
          comparator:
            type: "string"
            criteria: "contains"
            value: "canary-response"

This comprehensive guide provides a complete foundation for implementing chaos engineering with Litmus on Kubernetes. The examples demonstrate production-ready configurations, automated testing pipelines, observability integration, and incident response automation. By following these practices, organizations can build more resilient systems and improve their confidence in handling production failures.

Key takeaways include starting with simple experiments, gradually increasing complexity, integrating chaos testing into CI/CD pipelines, maintaining comprehensive observability, and automating incident response procedures. The combination of systematic chaos testing and automated response mechanisms creates a robust foundation for reliable distributed systems.