Software resilience patterns are architectural strategies designed to build fault-tolerant, self-healing systems that continue functioning despite failures, network issues, or overload conditions. In distributed systems, where failures are inevitable rather than exceptional, resilience engineering shifts from preventing failures to designing systems that gracefully handle them. These patterns—from circuit breakers that prevent cascading failures to chaos engineering that deliberately injects faults—form the foundation of modern production systems at scale. Understanding not just what each pattern does but when and why to apply it transforms fragile systems into robust, production-ready architectures that survive the chaos of real-world operations.
What This Cheat Sheet Covers
This topic spans 18 focused tables and 153 indexed concepts. Below is a table-by-table outline, running from foundational concepts through advanced details.
Table 1: Circuit Breaker States & Behavior
The circuit breaker pattern monitors calls to a downstream service and stops requests when failure rates exceed a threshold, giving the service time to recover. Mastering the state machine — including when transitions occur and what thresholds trigger them — is essential for tuning circuit breakers without causing false trips or masking real outages.
| State / Setting | Example | Description |
|---|---|---|
| Closed | `circuitBreaker.state = CLOSED`; `request → downstream service` | Requests flow normally to the downstream service. A failure counter tracks errors against a threshold (e.g., 5 failures in 10 seconds) before the circuit opens. |
| Open | `circuitBreaker.state = OPEN`; `request → immediate FailFastException` | All requests fail immediately without calling the service, which protects the downstream service from further load. The circuit transitions to half-open after a timeout period (e.g., 60 seconds). |
| Half-open | `circuitBreaker.state = HALF_OPEN`; `limited test requests → service` | A limited number of test requests (e.g., 3) are allowed through to check whether the service has recovered. Success transitions the circuit to closed; failure transitions it back to open. |
| Failure threshold | `failureThreshold = 5`; `errorPercentage = 50%` | The trigger condition for opening the circuit. It can be an absolute count (5 failures) or a percentage (50% error rate) measured within a sliding time window. |
| Slow call threshold | `slowCallDurationThreshold = 2s`; `slowCallRateThreshold = 50%` | The circuit opens when the percentage of slow calls exceeds the threshold, even if those calls succeed. This protects the downstream service from load that would eventually time out. These settings are specific to Resilience4j. |
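
The state machine in the table can be sketched in a few dozen lines. The following is a minimal, single-threaded illustration, not the API of Resilience4j or any other library: the names `CircuitBreaker`, `FailFastError`, and the `clock` parameter are hypothetical, the defaults mirror the example values in the table, and the sketch assumes a count-based failure threshold and that all trial requests must succeed before the circuit closes. Slow-call tracking and thread safety are left out.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class FailFastError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    # Hypothetical sketch: parameter names and defaults are assumptions
    # taken from the example values in the table above.
    def __init__(self, failure_threshold=5, window_seconds=10.0,
                 open_timeout=60.0, half_open_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_timeout = open_timeout
        self.half_open_calls = half_open_calls
        self.clock = clock            # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = deque()       # timestamps of recent failures
        self.opened_at = 0.0
        self.trial_successes = 0      # successes seen while half-open

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state is State.OPEN:
            if now - self.opened_at < self.open_timeout:
                raise FailFastError("circuit is open")
            # Timeout elapsed: let trial requests probe the service.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state is State.HALF_OPEN:
            self._open(now)           # any trial failure reopens the circuit
            return
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_calls:
                self.state = State.CLOSED
                self.failures.clear()

    def _open(self, now):
        self.state = State.OPEN
        self.opened_at = now
        self.failures.clear()
```

Wrapping a downstream call then looks like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any function that raises on failure. Passing the clock in as a parameter is a design choice that lets the open timeout be exercised in tests without sleeping.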