Chaos Engineering Cheat Sheet

Updated 2026-05-28

Chaos Engineering is a disciplined approach to identifying failures before they become outages by intentionally injecting controlled faults into systems. Born at Netflix from cloud migration challenges, this methodology changes how organizations build resilience: through systematic experimentation rather than reactive fire-fighting. The core insight is that systems will fail; the question is whether you discover weaknesses during a planned experiment or during a 3 AM production incident. Unlike traditional testing, which validates what you expect to work, chaos engineering reveals what you don't yet know can break.

What This Cheat Sheet Covers

This cheat sheet spans 22 focused tables and 148 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.

Table 1: Core Principles
Table 2: Experiment Design Stages
Table 3: Failure Injection Techniques
Table 4: Infrastructure Failure Scenarios
Table 5: Application-Level Failures
Table 6: Network Chaos Experiments
Table 7: Observability Requirements
Table 8: Safety Controls and Guardrails
Table 9: Industry Tools and Platforms
Table 10: Specialized Chaos Tools
Table 11: GameDays and Disaster Recovery
Table 12: Chaos in CI/CD Pipelines
Table 13: Maturity Model Stages
Table 14: Organizational Considerations
Table 15: Measuring Resilience Impact
Table 16: Security Chaos Engineering
Table 17: Continuous vs Scheduled Chaos
Table 18: Anti-Patterns to Avoid
Table 19: Cloud Provider Chaos Options
Table 20: Advanced Experimentation Techniques
Table 21: Serverless Chaos Engineering
Table 22: AI/LLM Chaos Engineering

Table 1: Core Principles

The five foundational principles of chaos engineering, codified at principlesofchaos.org, define what separates disciplined resilience testing from random destruction. Each principle addresses a specific failure mode of poorly run experiments.

Principle	Example	Description
Steady State Hypothesis	`latency_p99 < 500ms`; `error_rate < 0.1%`; `throughput > 1000 rps`	Define measurable normal behavior using business metrics before injecting failures. Experiments then test whether that steady state holds during the fault and whether the system returns to it after the fault is removed.
Vary Real-World Events	Inject EC2 termination; simulate network partition; exhaust memory pools	Focus on realistic failure scenarios that actually occur in production environments. Avoid synthetic or unrealistic faults.
Minimize Blast Radius	Target 1% of instances; limit to a single AZ; single service scope	Start with the smallest possible impact and expand gradually. Always implement abort mechanisms and rollback triggers.
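
The rows above can be tied together in a small sketch: check the steady state thresholds from the table, pick roughly 1% of the fleet as targets, and abort with a rollback as soon as a threshold is breached. This is only an illustration under stated assumptions, not any tool's API. The names `Metrics`, `steady_state_violations`, `pick_targets`, and `run_experiment` are hypothetical, and the `inject`, `rollback`, and `read_metrics` callables stand in for whatever fault injector and monitoring system you actually use.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """One reading of the three example metrics (hypothetical shape)."""
    latency_p99_ms: float
    error_rate: float      # fraction, so 0.001 means 0.1%
    throughput_rps: float


def steady_state_violations(m: Metrics) -> list[str]:
    """Return the names of thresholds from the table that this reading breaks."""
    violations = []
    if not m.latency_p99_ms < 500:
        violations.append("latency_p99")
    if not m.error_rate < 0.001:
        violations.append("error_rate")
    if not m.throughput_rps > 1000:
        violations.append("throughput")
    return violations


def pick_targets(instances, fraction=0.01, seed=None):
    """Pick a small random subset: `fraction` of the fleet, but at least one."""
    if not instances:
        return []
    count = max(1, math.floor(len(instances) * fraction))
    return random.Random(seed).sample(list(instances), count)


def run_experiment(targets, inject, rollback, read_metrics, checks=5):
    """Inject a fault, poll metrics, and abort on the first violation.

    The rollback always runs once the fault has been injected.
    """
    if steady_state_violations(read_metrics()):
        return "skipped: system not in steady state"
    inject(targets)
    try:
        for _ in range(checks):
            violations = steady_state_violations(read_metrics())
            if violations:
                return "aborted: " + ", ".join(violations)
        return "passed"
    finally:
        rollback(targets)
```

Checking the steady state before injecting anything keeps the experiment from running against a system that is already degraded, and putting the rollback in a `finally` block means the fault is removed whether the run passes, aborts, or raises.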
