← Back to posts
chaos-engineering reliability kubernetes · · 15 min read

Best Self-Hosted Chaos Engineering Platforms: Litmus vs Chaos Mesh vs Chaos Toolkit 2026

Complete guide to self-hosted chaos engineering in 2026. Compare Litmus Chaos, Chaos Mesh, and Chaos Toolkit — open-source platforms for building resilient systems.

OS
Editorial Team

Why Chaos Engineering Matters in 2026

Every system fails eventually. The question isn’t if something will break — it’s when, and how well your team responds when it does. Chaos engineering is the disciplined practice of running controlled experiments on production and staging systems to uncover weaknesses before they cause real outages.

The idea is straightforward: if you know your system can survive a database node going offline, a network partition between microservices, or a sudden CPU spike, you can sleep better at night. But running these experiments manually is impractical at scale. That’s where chaos engineering platforms come in.

In 2026, three open-source platforms dominate the self-hosted chaos engineering landscape: Litmus Chaos, Chaos Mesh, and Chaos Toolkit. Each takes a different approach to injecting failures, measuring impact, and building confidence in your infrastructure. This guide covers all three — how they work, how to deploy them, and which one fits your stack.

What Is Chaos Engineering?

Chaos engineering is the practice of experimentally probing a distributed system to build confidence in its ability to withstand turbulent conditions. The core workflow follows four steps:

  1. Define a steady state — What does “healthy” look like? (e.g., API latency < 200ms, error rate < 0.1%)
  2. Hypothesize — “If we kill pod X, the system should recover within 30 seconds with no user-facing errors.”
  3. Inject failure — Use a chaos tool to deliberately break something in a controlled way.
  4. Observe and verify — Did the system behave as expected? If not, you found a weakness to fix.

The benefits are concrete:

  • Catch hidden single points of failure before they cause outages
  • Validate retry logic, circuit breakers, and failover paths under real conditions
  • Train your team to respond to incidents in a safe, controlled environment
  • Meet reliability SLAs by proactively proving your system can handle failure
  • Reduce mean time to recovery (MTTR) through practiced incident response

For teams running kubernetes, microservices, or any distributed architecture, chaos engineering isn’t optional — it’s a necessity.

Litmus Chaos: The CNCF Graduated Platform

Litmus Chaos is a CNCF-graduated project that provides a Kubernetes-native chaos engineering framework. It’s designed around the concept of “chaos experiments” — reusable, versioned definitions of what to break, how to break it, and what to measure.

Architecture

Litmus consists of three main components:

  • Litmus Portal — A web UI and control plane for managing chaos experiments across multiple clusters
  • Chaos Operator — A Kubernetes operator that watches for ChaosEngine custom resources and executes experiments
  • Chaos Hub — A catalog of pre-built chaos experiments (pod delete, network latency, disk fill, etc.)

Installation

1
2
3
4
5
6
7
8
9
# Add the Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm repo update

# Install Litmus on your Kubernetes cluster
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace \
  --version 3.0.0

Wait for all components to be ready:

1
2
3
kubectl get pods -n litmus
# You should see: litmusportal-auth-server, litmusportal-frontend,
# litmusportal-server, chaos-operator, and chaos-runner pods

Running Your First Experiment

Litmus uses ChaosEngine resources to define experiments. Here’s a pod-delete experiment targeting a specific deployment:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-pod-delete
  namespace: litmus
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            - name: FORCE
              value: "true"
        probe:
          - name: check-nginx-available
            type: readinessProbe
            mode: Continuous
            readinessProbe:
              type: k8sProbe
              k8sProbeInputs:
                group: ""
                version: "v1"
                resource: pods
                operation: present
                labelSelector: "app=nginx"

Apply it:

1
kubectl apply -f nginx-pod-delete.yaml

Litmus will randomly delete nginx pods every 15 seconds for 60 seconds while continuously checking that at least one pod remains available via the readiness probe. You can watch results in the Litmus Portal UI or via:

1
2
kubectl get chaosexperiment -n litmus
kubectl get chaosresult -n litmus

Key Features

  • CNCF graduated status means it’s production-ready with strong governance
  • Chaos Hub provides 50+ pre-built experiments out of the box
  • Schedules for running experiments on a recurring basis (e.g., every Friday at 3 AM)
  • Multi-cluster support through Litmus Portal for managing chaos across environments
  • GitOps integration — define experiments as YAML and store them in version control
  • Role-based access control for team-based chaos workflows
  • Prometheus integration for scraping metrics during experiments

When to Choose Litmus

Pick Litmus if you’re running Kubernetes, want a mature CNCF-graduated project, need a rich catalog of pre-built experiments, and prefer a web UI for managing chaos workflows alongside GitOps-style YAML definitions.

Chaos Mesh: Kubernetes-Native with Visual Workflows

Chaos Mesh is a cloud-native chaos engineering platform originally developed by PingCAP (the team behind TiDB). It provides a comprehensive set of fault types and a powerful visual workflow builder for orchestrating complex chaos experiments.

Architecture

Chaos Mesh runs as a set of Kubernetes controllers:

  • Chaos Dashboard — A web UI for creating, scheduling, and monitoring experiments
  • Chaos Controller Manager — Reconciles chaos CRDs and manages experiment lifecycle
  • Chaos Daemon — A DaemonSet that executes low-level fault injection on each node
  • Webhook — Validates chaos experiment configurations before execution

Installation

1
2
3
4
5
6
7
8
9
# Install Chaos Mesh via Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --version 2.6.2 \
  --set dashboard.service.type=NodePort

Verify the installation:

1
2
3
kubectl get pods -n chaos-mesh
# Should show: chaos-controller-manager, chaos-daemon (on each node),
# chaos-dashboard, and chaos-mesh components

Access the dashboard:

1
2
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333 in your browser

Fault Injection Types

Chaos Mesh supports an extensive range of fault types:

Fault TypeWhat It DoesUse Case
PodChaosKill, pause, or stress podsTest pod rescheduling and restart policies
NetworkChaosDelay, loss, duplicate, corrupt, partitionValidate network resilience and timeout handling
IOChaosInject latency, errors, or attrition in disk I/OTest database and storage layer resilience
StressChaosCPU and memory stress on podsValidate resource limits and auto-scaling
TimeChaosClock skew on podsTest time-sensitive logic (cert expiry, TTLs)
HTTPChaosAbort, delay, or replace HTTP requestsTest API gateway and service mesh resilience
JVMChaosInject faults into Java applicationsTest JVM-based microservices (Spring Boot, etc.)
DNSChaosReturn error or random responsesTest DNS fallback and caching behavior
PhysicalMachineChaosFaults on bare metal or VMsTest non-containerized infrastructure

Running a Network Chaos Experiment

Here’s how to simulate network latency between two services:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-to-database-latency
  namespace: production
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      "app": "api-gateway"
  target:
    mode: all
    selector:
      labelSelectors:
        "app": "database-proxy"
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"

This injects 200ms ± 50ms of latency (with 25% correlation) on all traffic from the API gateway to the database proxy for 5 minutes. Your application’s timeout and retry logic will be tested in real time.

Scheduling Recurring Experiments

Chaos Mesh supports cron-based scheduling through Schedule resources:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
spec:
  schedule: "0 3 * * 5"  # Every Friday at 3:00 AM
  type: "PodChaos"
  historyLimit: 5
  concurrencyPolicy: "Forbid"
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - "ecommerce"
      labelSelectors:
        "app": "payment-service"
    gracePeriod: 0

Key Features

  • 20+ fault injection types covering pods, networks, storage, JVM, HTTP, DNS, and physical machines
  • Visual workflow builder in the dashboard for creating multi-stage experiments
  • Namespace and label selectors for precise targeting
  • Cron scheduling with concurrency controls
  • Physical machine chaos for non-Kubernetes infrastructure
  • Namespace isolation — restrict chaos to specific namespaces for safety
  • RBAC integration with Kubernetes native authentication

When to Choose Chaos Mesh

Choose Chaos Mesh if you need the widest range of fault injection types, want to test non-Kubernetes infrastructure alongside Kubernetes workloads, prefer visual workflow construction, or need JVM-level chaos for Java-based microservices.

Chaos Toolkit: API-First and Platform-Agnostic

Chaos Toolkit takes a fundamentally different approach. Instead of a Kubernetes operator or a web dashboard, it provides a Python-based CLI and API for defining and running chaos experiments as JSON or YAML files. It’s platform-agnostic — it workdocker Kubernetes, AWS, Docker, bare metal, and virtually anything with an API.

Installation

1
2
3
4
5
6
7
8
# Install the core toolkit
pip install chaostoolkit

# Install extensions for your platform
pip install chaostoolkit-kubernetes
pip install chaostoolkit-aws
pip install chaostoolkit-docker
pip install chaostoolkit-libcloud

Verify installation:

1
2
chaos --version
chaos discover --help

Writing an Experiment

Chaos Toolkit experiments are defined in YAML. Here’s an experiment that tests whether your application survives a Kubernetes deployment restart:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
title: "Can our application survive a deployment restart?"
description: >
  This experiment verifies that the application remains available
  during a rolling deployment restart, with no more than 5% error rate.

configuration:
  kubernetes:
    namespace: production

steady-state-hypothesis:
  title: "Application responds normally"
  probes:
    - type: probe
      name: "application-must-respond"
      tolerance: 200
      provider:
        type: python
        module: chaostoolkit.probes
        func: call_url
        arguments:
          url: "https://api.example.com/health"
          timeout: 5
          verify_tls: true

method:
  - type: action
    name: "restart-deployment"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: delete_deployment
      arguments:
        name: my-application
        ns: production
    pauses:
      - after: 30

  - type: probe
    name: "verify-during-restart"
    provider:
      type: python
      module: chaostoolkit.probes
      func: call_url
      arguments:
        url: "https://api.example.com/health"
    tolerance: 200
    background: true
    pauses:
      - before: 5
      - after: 120

rollbacks:
  - type: action
    name: "roll-back-if-needed"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_deployment
      arguments:
        name: my-application
        ns: production
        revision: "previous"

Running the Experiment

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Run with default settings
chaos run experiment.yaml

# Run with detailed logging
chaos run experiment.yaml --log-level=debug

# Run in dry-run mode (no actual failures injected)
chaos run experiment.yaml --dry-run

# Generate an HTML report
chaos report journal.json --export-format=html --output=report.html

The toolkit generates a journal.json file with complete results, including probe outcomes, timing, and whether the steady-state hypothesis held throughout the experiment.

Discovery Mode

One of Chaos Toolkit’s most powerful features is automatic experiment discovery:

1
2
3
4
5
# Discover what chaos actions are available for your Kubernetes namespace
chaos discover chaosk8s --ns=production --target-application=my-app

# This generates a suggested experiment YAML based on your
# actual deployment configuration, probes, and available actions

Key Features

  • Platform-agnostic — works with Kubernetes, AWS, GCP, Azure, Docker, SSH, and any HTTP API
  • Python-based — easy to extend with custom probes and actions
  • Steady-state hypothesis — formal verification that the system remains healthy during chaos
  • Rollback support — automatic rollback if the experiment goes wrong
  • Journal and reporting — structured JSON output with HTML report generation
  • CI/CD integration — run chaos experiments as part of your deployment pipeline
  • Dry-run mode — validate experiment definitions without injecting real failures
  • Extension ecosystem — 30+ community extensions for different platforms and services

When to Choose Chaos Toolkit

Choose Chaos Toolkit if you need platform-agnostic chaos engineering, want to integrate experiments into CI/CD pipelines, prefer code-defined experiments over YAML CRDs, or need to test non-Kubernetes infrastructure like cloud APIs, databases, or bare metal servers.

Feature Comparison

FeatureLitmus ChaosChaos MeshChaos Toolkit
Primary PlatformKubernetesKubernetesMulti-platform
InterfaceWeb UI + YAML CRDsWeb UI + YAML CRDsCLI + YAML/JSON
Fault Types50+ experiments20+ fault typesUnlimited (extensible)
SchedulingBuilt-in cronBuilt-in cronExternal (cron, CI/CD)
Multi-ClusterYes (Litmus Portal)LimitedYes (config-driven)
CI/CD IntegrationVia GitOpsVia Argo WorkflowsNative (Python CLI)
Non-K8s SupportNoPhysical Machine modeYes (AWS, Docker, SSH)
JVM ChaosLimitedFull supportVia extensions
Steady-State ProbesYesYes (via probes)First-class citizen
Automatic RollbackManualManualBuilt-in
CommunityCNCF graduatedCNCF sandboxPython community
Learning CurveMediumMediumLow (if Python-familiar)
DashboardRich web UIRich web UICLI + HTML reports

Deployment Architecture Recommendations

Small Teams (1-5 Engineers)

Start with Chaos Toolkit. It has the lowest barrier to entry — install via pip, write YAML experiments, and run them from your laptop or CI pipeline. No Kubernetes operators to manage, no additional infrastructure to maintain.

1
2
3
# Minimal setup
pip install chaostoolkit chaostoolkit-kubernetes
# Write one experiment, run it weekly, iterate from there

Kubernetes-First Organizations

If your entire stack runs on Kubernetes, Litmus Chaos or Chaos Mesh are better fits. Litmus is the safer bet for teams that want CNCF-graduated stability and a rich experiment catalog. Chaos Mesh is better if you need advanced fault types like JVM chaos, DNS manipulation, or physical machine testing.

1
2
3
4
5
# Litmus for Kubernetes-native chaos
helm install litmus litmuschaos/litmus -n litmus --create-namespace

# Chaos Mesh for advanced fault injection
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

Hybrid Infrastructure

For organizations running a mix of Kubernetes, VMs, cloud services, and bare metal, Chaos Toolkit is the most practical choice. Its platform-agnostic design means one experiment framework can cover your entire infrastructure:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# One experiment targeting multiple platforms
method:
  - type: action
    name: "kill-k8s-pod"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: delete_pod
  - type: action
    name: "terminate-ec2-instance"
    provider:
      type: python
      module: chaosaws.ec2.actions
      func: terminate_instances
  - type: action
    name: "stop-docker-container"
    provider:
      type: python
      module: chaosdocker.container.actions
      func: stop_container

Best Practices for Self-Hosted Chaos Engineering

1. Start Small, Scale Gradually

Never begin chaos experiments in production. Follow this progression:

  1. Development environment — Validate experiments work correctly
  2. Staging environment — Test with production-like data and traffic
  3. Canary in production — Run on a single, non-critical service
  4. Full production — Only after multiple successful canary runs

2. Define Clear Steady-State Hypotheses

Every experiment needs measurable success criteria:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
steady-state-hypothesis:
  title: "Service remains healthy"
  probes:
    - tolerance:
        range: [0, 500]  # Latency under 500ms
      provider:
        func: check_latency
    - tolerance:
        range: [0, 0.01]  # Error rate under 1%
      provider:
        func: check_error_rate

3. Implement Blast Radius Controls

Limit the scope of every experiment:

  • Namespace isolation — Only target specific namespaces
  • Label selectors — Only affect pods with specific labels
  • Time windows — Run during low-traffic periods
  • Concurrency limits — Never run more than one experiment at a time
  • Kill switches — Have a manual abort mechanism ready

4. Monitor During Experiments

Always run chaos experiments alongside your observability stack:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# Example: Litmus experiment with Prometheus monitoring
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: "prometheus-metrics-check"
          type: promProbe
          promProbeInputs:
            comparator:
              type: "=="
              value: 1
              criteria: ">= 0.95"  # 95% availability
            query: |
              sum(rate(http_requests_total{code=~"2.."}[1m]))
              /
              sum(rate(http_requests_total[1m]))

5. Automate with GitOps

Store all chaos experiments in version control and run them through your CI/CD pipeline:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
# GitHub Actions workflow for weekly chaos tests
name: Weekly Chaos Experiments
on:
  schedule:
    - cron: "0 4 * * 5"  # Friday at 4 AM
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run chaos experiments
        run: |
          pip install chaostoolkit chaostoolkit-kubernetes
          chaos run experiments/production/*.yaml
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: chaos-report
          path: report.html

6. Build a Chaos Culture

The tool is only as effective as the team using it. Establish these practices:

  • Weekly game days — Schedule regular chaos experiments with the whole team
  • Blameless post-mortems — Treat every experiment failure as a learning opportunity
  • Chaos champions — Designate team members responsible for maintaining experiments
  • Runbook updates — Every experiment finding should update your incident runbooks
  • Metrics tracking — Track experiment frequency, failure rate, and MTTR improvements

Common Pitfalls to Avoid

  • Skipping the steady-state hypothesis — Without defined success criteria, you’re just breaking things randomly
  • Running experiments without monitoring — You won’t know if the experiment succeeded or just caused an undetected outage
  • Too broad a blast radius — Starting with “kill all pods in production” is not chaos engineering, it’s sabotage
  • No rollback plan — Always define what happens if the experiment goes wrong
  • One-and-done mentality — Chaos engineering is a continuous practice, not a one-time audit
  • Ignoring the human factor — Tools don’t build resilience; teams do. Include game days in your process

Final Verdict: Which Platform to Choose

ScenarioRecommended Platform
Kubernetes-only, want maturity and governanceLitmus Chaos
Kubernetes + need advanced fault types (JVM, DNS, network)Chaos Mesh
Multi-platform, CI/CD integration, Python-friendlyChaos Toolkit
Small team getting started quicklyChaos Toolkit
Enterprise with multiple clustersLitmus Chaos
Need visual workflow constructionChaos Mesh

The best approach for most organizations is to start with one platform, run experiments regularly, and expand coverage over time. Chaos engineering is not about breaking systems — it’s about building confidence that your systems won’t break when it matters most.

All three platforms are free, open-source, and self-hosted. Pick the one that matches your infrastructure, write your first experiment today, and start building systems that survive the real world.

Frequently Asked Questions (FAQ)

Which one should I choose in 2026?

The best choice depends on your specific requirements:

  • For beginners: Start with the simplest option that covers your core use case
  • For production: Choose the solution with the most active community and documentation
  • For teams: Look for collaboration features and user management
  • For privacy: Prefer fully open-source, self-hosted options with no telemetry

Refer to the comparison table above for detailed feature breakdowns.

Can I migrate between these tools?

Most tools support data import/export. Always:

  1. Backup your current data
  2. Test the migration on a staging environment
  3. Check official migration guides in the documentation

Are there free versions available?

All tools in this guide offer free, open-source editions. Some also provide paid plans with additional features, priority support, or managed hosting.

How do I get started?

  1. Review the comparison table to identify your requirements
  2. Visit the official documentation (links provided above)
  3. Start with a Docker Compose setup for easy testing
  4. Join the community forums for troubleshooting
Advertise here