
What is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a distributed system to discover weaknesses before they cause production outages. By intentionally injecting failures — killing instances, inducing latency, exhausting resources — you build confidence that your system can survive the inevitable failures that occur in production.

Definition

Chaos engineering is the practice of running controlled experiments on production systems to validate that they will continue functioning despite the injection of failures and adverse conditions. The core principle is: "Assume your systems will fail, and prove they survive."

Rather than waiting for Murphy's Law to strike at 2 AM on a holiday weekend, chaos engineers proactively break things during business hours when the team is awake and monitoring. Each experiment validates a specific hypothesis about system resilience and reveals gaps in architecture, configuration, or observability.

The Netflix Origin Story: Chaos Monkey

In 2010, Netflix migrated from a traditional data center to Amazon Web Services. The cloud promised elasticity and reliability, but Netflix engineers weren't convinced. What if instances crashed without warning? What if entire availability zones failed?

Instead of hoping for the best, they built Chaos Monkey — a tool that randomly terminates EC2 instances in production every weekday. Initially controversial, this forced Netflix engineers to design systems that could survive instance failure. Load balancers detected dead instances and routed traffic elsewhere. Stateless services restarted automatically. Databases replicated across zones.

The result: Netflix's infrastructure became dramatically more resilient. When real AWS outages occurred — and they did — Netflix stayed online while competitors went down. What started as a provocative experiment became industry best practice.

Today, Chaos Monkey is open-source, and Netflix has extended it into a full chaos engineering suite called the Simian Army. The lesson is clear: controlled failure in production is safer than uncontrolled failure at 3 AM.

Core Principles of Chaos Engineering

Effective chaos engineering follows five principles:

Hypothesis-Driven Experiments

Start with a specific hypothesis: "If we kill one database replica, each remaining replica will handle 150% of its normal traffic" or "If we inject 500ms latency on one service, requests will time out and retry successfully." Before running the experiment, predict the outcome. Afterward, verify your prediction was correct.
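The predict-then-verify loop can be captured in a tiny record. This is an illustrative sketch, not part of any chaos tool; the class and metric names are invented:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ChaosHypothesis:
    """Record a prediction before the experiment; verify it afterward."""
    description: str
    predicted: Dict[str, float]                   # metric name -> predicted upper bound
    observed: Optional[Dict[str, float]] = None   # filled in after the run

    def holds(self) -> bool:
        # The hypothesis holds if every observed metric stayed within its bound.
        assert self.observed is not None, "run the experiment first"
        return all(self.observed[k] <= bound for k, bound in self.predicted.items())

h = ChaosHypothesis(
    description="Kill one DB replica: survivors absorb the traffic",
    predicted={"error_rate_pct": 0.5, "p99_latency_ms": 400},
)
h.observed = {"error_rate_pct": 0.2, "p99_latency_ms": 310}
print(h.holds())  # True: both metrics stayed within their predicted bounds
```

Writing the bounds down before the run keeps the experiment honest: you cannot quietly move the goalposts after seeing the graphs.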

Minimize Blast Radius

Start small. Kill one instance, not ten. Add 100ms latency, not 10 seconds. Affect 1% of traffic, not 50%. If something goes wrong, the blast radius is small. As confidence builds, gradually increase the scope.
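One way to keep the blast radius bounded is to make target selection explicit and reproducible. A minimal sketch; the instance names are hypothetical:

```python
import random

def pick_targets(fleet, fraction, seed=None):
    """Select a small, reproducible subset of the fleet to affect."""
    rng = random.Random(seed)  # seeding makes the experiment repeatable
    count = max(1, int(len(fleet) * fraction))  # at least one, never the whole fleet by accident
    return rng.sample(fleet, count)

fleet = [f"web-{i}" for i in range(200)]
print(pick_targets(fleet, 0.01, seed=7))  # 1% of a 200-instance fleet: 2 targets
```

As confidence builds, raising `fraction` is a one-line, reviewable change rather than an ad-hoc decision made mid-experiment.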

Constant Monitoring

Never run chaos experiments without robust monitoring. Watch error rates, latency percentiles (p50, p95, p99), resource utilization, business metrics, and log patterns. If anything deviates unexpectedly from baseline, stop the experiment immediately.
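The stop rule can be as simple as comparing live metrics against the baseline. A hedged sketch, assuming metrics arrive as plain dicts; the metric names and 2x tolerance are illustrative:

```python
def should_abort(baseline, current, tolerance=2.0):
    """Stop the experiment if any metric exceeds `tolerance` times its baseline."""
    return any(current[m] > baseline[m] * tolerance for m in baseline)

baseline = {"error_rate_pct": 0.1, "p99_latency_ms": 200}
print(should_abort(baseline, {"error_rate_pct": 0.15, "p99_latency_ms": 250}))  # False: within 2x
print(should_abort(baseline, {"error_rate_pct": 0.5, "p99_latency_ms": 220}))   # True: errors at 5x baseline
```

Wiring a check like this into an automatic kill switch means a misbehaving experiment ends in seconds, not when someone notices a dashboard.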

Learn and Improve

After each experiment, document what happened. Did the system behave as predicted? What surprised you? What architectural assumptions were wrong? Use these insights to improve code, configuration, runbooks, and monitoring.

Run During Business Hours

Schedule experiments when engineers are awake and monitoring dashboards. Running chaos at 2 AM defeats the purpose — you want to observe system behavior, not wake up to a PagerDuty alert.

Types of Chaos Experiments

Chaos engineers inject different types of failures depending on what they want to test:

Instance/Container Termination

Kill a random instance or pod. Does the load balancer detect it and route traffic elsewhere? Does the orchestrator restart it?

chaos_monkey() kills one EC2 instance every weekday
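The selection policy — one random victim per weekday, none on weekends — can be sketched as below. A real Chaos Monkey deployment terminates the victim through cloud APIs; this sketch models only the scheduling decision, and the instance IDs are made up:

```python
import random
from datetime import date

def pick_victim(instances, today, seed=None):
    """One random victim per weekday, none on weekends (Chaos Monkey's original cadence)."""
    if today.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return None
    return random.Random(seed).choice(instances)

fleet = ["i-0a1", "i-0b2", "i-0c3"]
print(pick_victim(fleet, date(2024, 1, 5), seed=1))  # a Friday: one victim chosen
print(pick_victim(fleet, date(2024, 1, 6), seed=1))  # a Saturday: None
```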

Network Partition (Split-Brain)

Simulate a network failure between datacenters or services. Does the system survive split-brain? Do clients time out appropriately?

Block traffic between database replicas for 30 seconds

Latency Injection

Add artificial delay to requests (e.g., 500ms, 1s, 5s). Test timeout handling and retry logic.

Inject 1000ms latency on 5% of database queries
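Application-layer latency injection can be as simple as a wrapper that delays a fraction of calls. An illustrative Python decorator; real deployments usually inject at the proxy or network layer instead, and the query function here is a stand-in:

```python
import functools
import random
import time

def inject_latency(rate=0.05, delay_ms=1000, rng=random.random):
    """Decorator: delay roughly `rate` of calls by `delay_ms` milliseconds."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if rng() < rate:                 # per-call injection decision
                time.sleep(delay_ms / 1000.0)
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_latency(rate=0.05, delay_ms=1000)
def run_query(sql):
    return "rows"  # stand-in for a real database call
```

Passing `rng` in explicitly keeps the injection decision testable and lets you disable it cleanly outside the experiment window.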

Resource Exhaustion

Consume CPU, memory, disk, or network bandwidth. Does the service degrade gracefully or crash?

Consume 90% of available memory on a service instance
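A crude CPU-exhaustion probe is just a bounded busy-loop; dedicated tools such as stress-ng do this far more thoroughly (and cover memory, disk, and network), but the sketch shows the idea:

```python
import time

def burn_cpu(seconds):
    """Busy-loop to saturate one core for `seconds` -- a crude CPU-exhaustion probe."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless arithmetic keeps the core busy
    return iterations
```

Run it alongside normal traffic and watch whether latency SLOs hold while one core is pegged. The deadline bounds the experiment, so it always ends on its own.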

Dependency Failure

Simulate failure of dependencies: shutdown the database, block DNS resolution, kill the cache layer.

Return 500 errors from the payment API for 60 seconds
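One way to fail a dependency without touching the dependency itself is to wrap its client so it returns errors for a fixed window. An illustrative sketch; the response shape is invented:

```python
import time

class FailureWindow:
    """Wrap a dependency call so it returns errors for a fixed window, then recovers."""
    def __init__(self, call, fail_for_s, clock=time.monotonic):
        self.call = call
        self.clock = clock
        self.fail_until = clock() + fail_for_s

    def __call__(self, *args, **kwargs):
        if self.clock() < self.fail_until:
            return {"status": 500, "body": "injected failure"}  # simulated outage
        return self.call(*args, **kwargs)                       # back to normal
```

Wrapping the payment-API client with `FailureWindow(client, 60)` reproduces the "500s for 60 seconds" scenario and lets you watch whether callers degrade gracefully and recover on their own.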

Packet Loss / Network Degradation

Drop 10-30% of packets to simulate poor network conditions. Test retry logic and circuit breakers.

Drop 20% of packets between load balancer and backend
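On Linux hosts, packet loss is commonly injected with the netem qdisc from the iproute2 suite. A command fragment, assuming the interface is named eth0 and you have root privileges; adjust the interface name for your host:

```shell
# Assumes a Linux host, interface eth0, and root privileges.
# Drop 20% of outbound packets on eth0:
sudo tc qdisc add dev eth0 root netem loss 20%

# Observe retries and circuit breakers, then remove the rule:
sudo tc qdisc del dev eth0 root
```

Always pair the `add` with the `del`: a forgotten netem rule is itself an unplanned chaos experiment.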

Chaos Engineering Tools

Modern chaos engineering platforms automate experiment design, execution, and analysis:

Tool            Focus                                       Use Case
Chaos Monkey    Random instance termination                 AWS EC2 resilience testing
Gremlin         Comprehensive chaos (enterprise)            Production chaos at scale (SaaS platform)
Litmus          Kubernetes-native chaos                     Chaos engineering in Kubernetes clusters
Toxiproxy       Network simulation (latency, packet loss)   Testing service dependencies and timeouts
Pumba           Docker container chaos                      Testing containerized applications
Chaos Toolkit   Declarative experiments (JSON/YAML)         Framework-agnostic chaos testing

Choosing a tool: Start with open-source tools (Chaos Monkey, Litmus, Toxiproxy) in staging. As your chaos practice matures, consider enterprise platforms (Gremlin) for better controls, reporting, and coordination across teams.
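For a sense of what a declarative experiment looks like, here is roughly the shape of a Chaos Toolkit experiment file (the tool accepts JSON or YAML). The URL and script name are placeholders, and the provider fields should be checked against the Chaos Toolkit documentation before use:

```json
{
  "version": "1.0.0",
  "title": "Service survives the loss of one backend instance",
  "description": "Hypothesis: the health endpoint stays green while one instance is down.",
  "steady-state-hypothesis": {
    "title": "Health endpoint responds",
    "probes": [
      {
        "type": "probe",
        "name": "api-is-healthy",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://example.internal/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-one-instance",
      "provider": {
        "type": "process",
        "path": "kill-instance.sh"
      }
    }
  ]
}
```

The steady-state hypothesis is checked before and after the method runs, which encodes the predict-then-verify loop directly in the experiment file.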

Game Days: Chaos on Steroids

A Game Day is a scheduled, coordinated chaos engineering event where the entire on-call team participates. Instead of running one small experiment, the chaos team runs dozens of simultaneous injections to simulate a major incident.

Example Game Day Scenario

T+0:00 Kill 3 backend instances in the US-East region

T+2:00 Inject 2000ms latency on the payment API

T+5:00 Fail over the primary database replica

T+8:00 Simulate a cascade: 30% packet loss on the queue

T+15:00 Trigger DNS resolution failures

During the Game Day, engineers:

  • Monitor dashboards and respond to alerts (just like a real incident)
  • Discover bugs in runbooks and alerting logic
  • Identify blind spots in observability
  • Practice incident communication and coordination
  • Validate that on-call engineers follow documented procedures

Pro tip: Record what happens during the Game Day. Afterward, conduct a thorough post-incident review (PIR). What went well? What surprised you? What should we fix before a real incident occurs?

Why Monitoring is Essential for Chaos Engineering

You cannot run chaos experiments safely without excellent monitoring. Good monitoring serves three critical functions:

1. Establish Baseline Behavior

Before you inject chaos, measure normal system behavior. What's the baseline error rate? Latency? CPU usage? Memory? Disk I/O? You need this data to know what "abnormal" looks like.

Example baseline: p99 latency = 200ms, error rate = 0.1%, CPU = 45%
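Computing those baseline percentiles from a window of samples is straightforward with the standard library. A sketch; the metric names and the sample window are illustrative:

```python
from statistics import quantiles

def latency_baseline(samples_ms):
    """Collapse a window of latency samples into p50/p95/p99 baseline figures."""
    q = quantiles(samples_ms, n=100)  # 99 cut points; q[i] is roughly the (i+1)th percentile
    return {"p50_ms": q[49], "p95_ms": q[94], "p99_ms": q[98]}

window = list(range(1, 101))  # pretend these are per-request latencies in ms
print(latency_baseline(window))
```

In practice you would compute this over a representative window (same day of week, same traffic pattern) so the baseline reflects the conditions the experiment will run under.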

2. Observe System Behavior During Chaos

While the chaos experiment runs, watch metrics constantly. Does error rate spike? Does latency increase? Do timeouts occur? How do dependent services react? If anything deviates far beyond the expected range, stop the experiment.

3. Verify Recovery

After chaos ends, confirm the system recovers completely. Do metrics return to baseline? Is there any lingering degradation? How long does recovery take? Document these recovery metrics.
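Recovery verification can be automated as a polling loop that waits for metrics to return to within tolerance of baseline. A sketch, assuming metrics come back as dicts; a real loop would sleep between polls (e.g. 10 seconds per check):

```python
def wait_for_recovery(read_metrics, baseline, tolerance=1.1, max_checks=30):
    """Poll metrics until every one is back within `tolerance` of its baseline.

    Returns the number of checks taken, or None if recovery never happened.
    """
    for check in range(1, max_checks + 1):
        m = read_metrics()  # in real use, sleep between polls
        if all(m[k] <= baseline[k] * tolerance for k in baseline):
            return check
    return None

readings = iter([{"error_rate_pct": 2.0}, {"error_rate_pct": 0.5}, {"error_rate_pct": 0.1}])
print(wait_for_recovery(lambda: next(readings), {"error_rate_pct": 0.1}))  # 3: recovered on the third check
```

The number of checks taken is itself a recovery metric worth recording: time-to-recover trending upward across experiments is an early warning.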

Golden signals to monitor: Latency (p50, p95, p99), error rate, traffic volume, and saturation (CPU, memory, disk, network). For business-critical systems, also track conversion rate, transaction count, and revenue.

Getting Started with Chaos Engineering

Don't jump straight to production. Build experience gradually:

Phase 1: Staging (1-2 weeks)

Run simple experiments (kill one container, add 100ms latency) in staging. Get comfortable with tools and processes. No customer impact if something goes wrong.

Phase 2: Off-Peak Production (2-4 weeks)

Graduate to production during low-traffic times (Sunday 3 AM). Run slightly larger experiments (kill 2 instances, 500ms latency). Have an on-call engineer monitoring.

Phase 3: Business Hours Production (ongoing)

Run experiments during business hours (2-4 PM) when the full team is online. Scale to larger failures. Schedule regular Game Days to practice incident response.

Frequently Asked Questions

What is chaos engineering?

Chaos engineering is a discipline of experimenting on a software system in production to build confidence in the system's ability to withstand turbulent conditions. Instead of waiting for failures to occur naturally, chaos engineers proactively inject failures and observe how the system responds. This reveals weaknesses before they cause real outages.

Why did Netflix invent Chaos Monkey?

Netflix built Chaos Monkey in 2010 because they wanted to move to cloud infrastructure but didn't trust the new environment. Rather than hoping the cloud would be reliable, they built a tool that randomly terminated EC2 instances in production. This forced them to design systems that could handle instance failure gracefully. The strategy worked — their infrastructure became much more resilient.

What types of chaos experiments can I run?

Common chaos experiments include: terminating instances or containers, inducing network latency or packet loss, killing database connections, exhausting CPU or memory, degrading disk I/O, simulating DNS failures, and triggering application-level exceptions. The key is to start small (single instance) and gradually increase blast radius.

Is chaos engineering the same as disaster recovery testing?

No. Disaster recovery (DR) testing validates that you can restore from backups — it's a planned, scripted process. Chaos engineering tests system resilience under stress — it's exploratory and hypothesis-driven. Both are important: DR ensures data recovery, chaos ensures the system can survive failures without needing recovery.

Do I need to run chaos experiments in production?

Ideally, yes. Pre-production chaos testing is safer and faster, but production is more realistic. Many teams start in staging, graduate to off-peak production times, then expand to business hours once confidence builds. The key is starting small and monitoring closely.

What monitoring do I need for chaos experiments?

Excellent monitoring is essential. Before injecting chaos, establish baseline metrics: response time, error rate, CPU usage, memory, disk I/O, and business metrics (conversion rate, API latency percentiles). During the experiment, watch these metrics closely. If they degrade unexpectedly, stop the experiment. After recovery, analyze what happened.


Run chaos experiments with confidence. AtomPing's multi-region monitoring instantly detects when experiments impact availability. Monitor real-time system behavior across 10 European regions. Free forever plan includes 50 monitors.
