Building resilient distributed systems requires more than unit tests and load tests. You need to verify your applications survive real-world failures: network latency, dropped packets, service crashes, and infrastructure outages. Fault injection tools let you deliberately break things in staging environments so your production systems can handle failures gracefully.
This guide compares three self-hosted fault injection and chaos testing tools: Toxiproxy (Shopify), Pumba (Aleksei Babenko), and Chaos Monkey (Netflix). Each represents a different approach to chaos engineering — from network-level proxy manipulation to container lifecycle disruption to infrastructure-level instance termination.
For teams already practicing chaos engineering at the kubernetes level, our chaos engineering guide covering Litmus, Chaos Mesh, and Chaos Toolkit provides complementary tools. This article focuses on the network, container, and infrastructure layers where fault injection starts.
Why Self-Host Fault Injection Tools?
Fault injection is a core practice of chaos engineering — the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. While cloud providers offer managed chaos testing services, self-hosted tools give you:
- Full control over failure scenarios — define exactly what breaks, when, and how
- No data leaving your network — all fault injection runs on your infrastructure
- CI/CD integration — run chaos tests automatically before every deployment
- Cost predictability — no per-test or per-hour charges from cloud vendors
- Customizable toxics — write your own fault injection logic when built-in options aren’t enough
Whether you are testing database connection resilience, verifying retry logic, or validating circuit breakers, self-hosted fault injection tools integrate directly into your development and staging pipelines.
Toxiproxy: Network Proxy for Deterministic Chaos
Toxiproxy was built by Shopify to prove that their applications don’t have single points of failure. It acts as a TCP proxy between your application and its dependencies (databases, caches, APIs), allowing you to inject specific network conditions on demand.
GitHub Stats: 11,971 stars · Last updated April 2026 · Written in Go
How Toxiproxy Works
Toxiproxy sits between your application and downstream services. When you configure a proxy, all traffic flows through Toxiproxy first, which can then apply “toxics” — fault conditions like latency, bandwidth limits, connection resets, or timeouts.
| |
Toxiproxy docker Compose Setup
| |
Toxiproxy Configuration
Create toxiproxy-config/proxies.json to define your proxies:
| |
Injecting Network Faults with Toxiproxy
Once the proxy is running, use the CLI or HTTP API to inject faults:
| |
Toxiproxy Toxic Types
| Toxic | Effect | Use Case |
|---|---|---|
latency | Adds delay to data transmission | Test timeout handling and retry logic |
bandwidth | Limits throughput rate | Verify behavior under slow network |
slow_close | Delays TCP socket close | Test connection pool resilience |
timeout | Stops all data transmission | Simulate complete network outage |
reset_peer | Sends TCP RST packets | Test reconnection logic |
limit_data | Closes connection after N bytes | Test partial response handling |
Toxiproxy’s strength is deterministic testing. You control exactly which connections are affected and when, making it ideal for automated test suites. For teams that also use API mock servers in their testing pipeline, Toxiproxy pairs well with tools like WireMock, Mockoon, and MockServer to create comprehensive test environments.
Pumba: Container-Level Chaos Engineering
Pumba brings chaos engineering directly to Docker and containerd containers. Inspired by Netflix Chaos Monkey, Pumba can kill, stop, pause containers, inject network delays, and stress-test resources — all from a single tool.
GitHub Stats: 3,011 stars · Last updated April 2026 · Written in Go
How Pumba Works
Pumba connects to the Docker API or containerd API to manage target containers. For network chaos, it creates a helper container that shares the target’s network namespace, allowing it to manipulate traffic using Linux tc (traffic control) and iptables.
Pumba supports two network chaos methods:
- netem — manipulates outgoing traffic using Linux
tc: delay, packet loss, corruption, duplication, rate limiting - iptables — manipulates incoming traffic using
iptables: packet loss with random or nth-packet matching
Pumba Docker Setup
Pumba runs as a container itself, mounting the Docker socket to manage other containers:
| |
Pumba Chaos Commands
| |
Pumba Network Chaos Options
| Command | Effect | Underlying Technology |
|---|---|---|
netem delay | Add latency to egress traffic | Linux tc (traffic control) |
netem loss | Drop outgoing packets | Linux tc netem |
netem corrupt | Corrupt packet data | Linux tc netem |
netem duplicate | Duplicate packets | Linux tc netem |
netem rate | Limit bandwidth | Linux tc TBF |
iptables loss | Drop incoming packets | Linux iptables |
Pumba’s key advantage is container lifecycle management. Unlike Toxiproxy (which only affects network traffic), Pumba can kill, stop, pause, and restart entire containers. Combined with its ability to stress CPU and memory, it provides comprehensive chaos testing for containerized environments.
Chaos Monkey: Infrastructure-Level Resilience Testing
Chaos Monkey is the original chaos engineering tool created by Netflix. It randomly terminates virtual machine instances and containers running in production, forcing engineers to build services that survive infrastructure failures.
GitHub Stats: 16,825 stars · Last updated January 2025 · Written in Go
How Chaos Monkey Works
Chaos Monkey integrates with Spinnaker, Netflix’s continuous delivery platform. It discovers all running applications and their instances, then randomly terminates a subset during business hours. The randomness ensures that failures cannot be predicted or worked around — your systems must be genuinely resilient.
Chaos Monkey operates at the infrastructure layer, terminating entire VMs or containers rather than manipulating individual network connections. This tests a different failure mode than Toxiproxy or Pumba.
Chaos Monkey Configuration
Chaos Monkey is configured through a YAML file that defines termination policies:
| |
Deployment with Spinnaker
| |
Chaos Monkey Scheduling
Chaos Monkey respects configurable business hours and can be tuned for different environments:
| Setting | Description | Recommended Value |
|---|---|---|
meanTimeBetweenKillsInWorkDays | Average days between terminations per group | 1-5 for staging, 5-30 for production |
minTimeBetweenKillsInWorkDays | Minimum gap between terminations | 1 day |
groupings | How to group instances for targeting | cluster, region, or app |
whitelist | Applications exempt from termination | Leave empty for maximum coverage |
Important note: Chaos Monkey requires Spinnaker as a dependency, which adds significant operational overhead. If you do not already run Spinnaker, consider Toxiproxy or Pumba for simpler setups.
Feature Comparison: Toxiproxy vs Pumba vs Chaos Monkey
| Feature | Toxiproxy | Pumba | Chaos Monkey |
|---|---|---|---|
| Primary Focus | Network proxy toxics | Container chaos | Infrastructure termination |
| Fault Types | Latency, bandwidth, timeout, reset | Kill, stop, pause, netem, stress | Instance termination |
| Network Control | TCP-level proxy | tc + iptables | None (full instance kill) |
| Container Support | Docker, standalone | Docker, containerd | Spinnaker-managed |
| CI/CD Integration | Excellent (HTTP API + CLI) | Good (CLI-driven) | Limited (requires Spinnaker) |
| Deterministic Testing | Yes (exact control) | Yes (targeted patterns) | No (random by design) |
| Setup Complexity | Low | Low | High (Spinnaker required) |
| Language | Go | Go | Go |
| Stars | 11,971 | 3,011 | 16,825 |
| Best For | Application-level resilience | Container resiliency | Infrastructure resiliency |
Choosing the Right Tool
Use Toxiproxy When:
- You need deterministic, reproducible fault injection for automated tests
- You want to test specific network conditions (latency, bandwidth, connection drops)
- Your architecture uses TCP-based services (databases, caches, message queues)
- You need fine-grained control over which connections are affected
- You want to integrate fault injection into CI/CD pipelines
Use Pumba When:
- You run Docker or containerd workloads and want container-level chaos
- You need to test container lifecycle failures (kill, stop, pause, restart)
- You want to stress-test CPU and memory alongside network chaos
- You need asymmetric network control (different rules for incoming vs outgoing)
- You prefer a single tool for multiple chaos types
Use Chaos Monkey When:
- You already run Spinnaker for continuous delivery
- You want to test infrastructure-level failures (VM termination)
- Your team is ready for production chaos engineering
- You need random, unpredictable failure injection (not deterministic testing)
- You operate at cloud scale with many instances per service
Combining Tools: A Real-World Testing Pipeline
The most resilient teams use multiple tools together. Here is a practical testing pipeline:
| |
This setup runs load tests while Toxiproxy injects network latency and Pumba introduces container-level disruptions, giving you a realistic picture of how your system handles compound failures.
If you are also interested in measuring system behavior under sustained traffic, check out our k6 vs Locust vs Gatling load testing comparison for complementary tools that pair well with fault injection.
FAQ
What is the difference between chaos engineering and fault injection?
Chaos engineering is the broader discipline of experimenting on systems to build confidence in their ability to withstand turbulent production conditions. Fault injection is a specific technique within chaos engineering where you deliberately introduce faults (network delays, service crashes, packet loss) to test how systems respond. Toxiproxy focuses on fault injection at the network layer, while Chaos Monkey practices chaos engineering at the infrastructure level.
Can I use Toxiproxy in production?
Toxiproxy is primarily designed for testing and staging environments. While it is production-capable, most teams use it during development and CI/CD to verify resilience before deploying. In production, tools like Chaos Monkey or Pumba (with careful scheduling) are more appropriate because they test real infrastructure failures rather than simulated network conditions.
Does Pumba require root access?
Pumba needs access to the Docker socket (/var/run/docker.sock) which typically requires root or docker group membership. For network chaos using tc and iptables, the helper container requires the NET_ADMIN capability. In production, configure appropriate RBAC policies and consider running Pumba in a dedicated chaos testing namespace.
How do I choose between Toxiproxy and Pumba?
Choose Toxiproxy if you need fine-grained control over individual network connections (e.g., adding latency to specific database queries). Choose Pumba if you want to test container lifecycle failures (killing, pausing, restarting containers) alongside network chaos. For comprehensive testing, use both: Toxiproxy for connection-level faults and Pumba for container-level disruptions.
Is Chaos Monkey still actively maintained?
Chaos Monkey’s last significant update was in early 2025, and it remains stable but not actively developed. The tool is mature and production-proven at Netflix’s scale. However, many teams have moved toward newer alternatives like Chaos Mesh or Litmus Chaos for Kubernetes-native chaos engineering. Chaos Monkey’s main requirement — Spinnaker integration — can be a barrier for teams not already using that platform.
How do I automate fault injection in my CI/CD pipeline?
Toxiproxy is the easiest to automate thanks to its HTTP API. A typical CI/CD flow looks like:
| |
Pumba can also be scripted via its CLI with --interval flags for recurring chaos patterns during extended test runs.