Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations challenges, originally developed at Google to manage large-scale production systems. SRE bridges the traditional divide between development and operations by treating operations as a software problem, using automation, error budgets, and service level objectives to balance the tension between releasing new features and maintaining system stability. At its core, SRE embraces calculated risk rather than pursuing perfection—availability targets like 99.9% explicitly acknowledge that some downtime is acceptable, and the remaining error budget becomes a shared resource that development and operations teams negotiate to make data-driven decisions about velocity versus reliability.
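The error-budget arithmetic behind an availability target can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the month length is assumed to be an average month (365.25 / 12 days), which is the convention that yields roughly 43.8 minutes for a 99.9% target.

```python
# Assumption: an "average month" of 365.25 / 12 days. A 30-day month
# would give 43.2 minutes for the same 99.9% target.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def error_budget_minutes(slo: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Allowed downtime for an availability target given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * period_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, period_minutes)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Monthly budget for a 99.9% availability target.
    print(f"{error_budget_minutes(0.999):.1f} min/month")  # 43.8 min/month
    # Share of the budget left after 10 minutes of downtime this month.
    print(f"{budget_remaining(0.999, 10.0):.0%} of budget left")
```

A team might compare the remaining fraction against a threshold it has agreed on in advance to decide whether to keep shipping features or shift effort toward reliability work.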
What This Cheat Sheet Covers
This cheat sheet spans 15 focused tables and 114 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core SRE Principles
| Principle | Example | Description |
|---|---|---|
| Embracing risk | Target 99.9% uptime (about 43.8 min downtime/month) | • Accept calculated risk rather than chase 100% reliability • Push features faster within the defined error budget |
| Service level objectives (SLOs) | API latency p99 < 200 ms | • Quantitative reliability targets agreed between product and SRE • The foundation for error budgets |
| Eliminating toil | Automating manual password resets | • Reduce repetitive manual work that scales linearly with service growth • Aim for <50% toil time |
| Symptom-based alerting | Alert on symptom (user-facing errors), not cause (disk space) | • Monitor user-facing symptoms, not internal metrics • Focus on what users experience |