Chaos engineering is a disciplined approach to finding weaknesses before they become outages by intentionally injecting controlled faults into systems. Born at Netflix from cloud migration challenges, the methodology replaces reactive firefighting with systematic experimentation as the way organizations build resilience. The core insight: systems will fail, and the question is whether you discover weaknesses during a planned experiment or during a 3 AM production incident. Unlike traditional testing, which validates what you expect to work, chaos engineering reveals what you don't yet know can break.
What This Cheat Sheet Covers
This topic spans 22 focused tables and 148 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Principles
The five foundational principles of chaos engineering, codified at principlesofchaos.org, define what separates disciplined resilience testing from random destruction. Each principle addresses a specific failure mode of poorly run experiments.
| Principle | Example | Description |
|---|---|---|
| Build a hypothesis around steady-state behavior | `latency_p99 < 500ms`; `error_rate < 0.1%`; `throughput > 1000 rps` | • Define measurable normal behavior using business metrics before injecting failures • Experiments validate whether the system returns to steady state after the fault is removed. |
| Vary real-world events | Inject EC2 termination; simulate network partition; exhaust memory pools | • Focus on realistic failure scenarios that actually occur in production environments • Avoid synthetic or unrealistic faults. |
| Minimize blast radius | Target 1% of instances; limit to single AZ; single service scope | • Start with the smallest possible impact and expand gradually • Always implement abort mechanisms and rollback triggers. |
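
The rows above can be tied together in a small experiment loop. The following is a minimal sketch, not a reference implementation: the `Metrics` fields, `in_steady_state`, `pick_targets`, and `run_experiment` are hypothetical names, the thresholds are the example values from the table, and `inject`, `revert`, and `read_metrics` stand in for whatever fault-injection tool and monitoring system you actually use.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Metrics:
    latency_p99_ms: float
    error_rate: float       # fraction, so 0.001 means 0.1%
    throughput_rps: float


def in_steady_state(m: Metrics) -> bool:
    """Steady-state hypothesis using the example thresholds from the table."""
    return (
        m.latency_p99_ms < 500
        and m.error_rate < 0.001
        and m.throughput_rps > 1000
    )


def pick_targets(instances: list, fraction: float = 0.01,
                 seed: Optional[int] = None) -> list:
    """Limit the blast radius to a small random subset (at least one instance)."""
    count = max(1, int(len(instances) * fraction))
    return random.Random(seed).sample(instances, count)


def run_experiment(instances: list,
                   inject: Callable[[list], None],
                   revert: Callable[[list], None],
                   read_metrics: Callable[[], Metrics],
                   fraction: float = 0.01,
                   seed: Optional[int] = None) -> str:
    """Return 'skipped', 'aborted', 'failed', or 'passed'."""
    if not in_steady_state(read_metrics()):
        return "skipped"        # never start from an unhealthy baseline
    targets = pick_targets(instances, fraction, seed)
    try:
        inject(targets)
        if not in_steady_state(read_metrics()):
            return "aborted"    # abort trigger: hypothesis violated under fault
    finally:
        revert(targets)         # rollback runs even when the experiment aborts
    # After the fault is removed, the system should be back in steady state.
    return "passed" if in_steady_state(read_metrics()) else "failed"
```

A real run would poll metrics repeatedly while the fault is active rather than sampling once. The single check here only shows where the abort decision sits relative to injection and rollback.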