Chaos Engineering is a disciplined approach to identifying failures before they become outages by intentionally injecting controlled faults into systems. Born at Netflix from cloud migration challenges, this methodology transforms how organizations build resilience through systematic experimentation rather than reactive fire-fighting. The core insight: systems will fail — the question is whether you discover weaknesses during a planned experiment or during a 3 AM production incident. Unlike traditional testing that validates what you expect to work, chaos engineering reveals what you don't yet know can break.
What This Cheat Sheet Covers
This cheat sheet spans 20 focused tables and 124 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Principles
| Principle | Example | Description |
|---|---|---|
| Define steady state | `latency_p99 < 500ms`; `error_rate < 0.1%`; `throughput > 1000 rps` | Define measurable normal behavior using business metrics before injecting failures. Experiments validate whether the system returns to steady state. |
| Vary real-world events | Inject EC2 termination; simulate network partition; exhaust memory pools | Focus on realistic failure scenarios that actually occur in production environments. Avoid synthetic or unrealistic faults. |
| Run experiments in production | Start in staging, progress to production canaries, then full production | Ultimately, chaos experiments must test the actual production system under real load. Staging approximations miss critical interactions. |