Site Reliability Engineering (SRE) Cheat Sheet

Updated 2026-05-28

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations challenges, originally developed at Google to manage large-scale production systems. SRE bridges the traditional divide between development and operations by treating operations as a software problem, using automation, error budgets, and service level objectives to balance the tension between releasing new features and maintaining system stability. At its core, SRE embraces calculated risk rather than pursuing perfection—availability targets like 99.9% explicitly acknowledge that some downtime is acceptable, and the remaining error budget becomes a shared resource that development and operations teams negotiate to make data-driven decisions about velocity versus reliability.
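The error-budget idea above can be made concrete with a little arithmetic. The following is a minimal sketch, not a standard API: the function names, the request-count framing of the budget, and the `reserve` threshold are all assumptions chosen for illustration.

```python
def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = allowed_failures(slo, total_requests)
    if allowed == 0:
        # No traffic in the window means there is no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed


def can_release(slo: float, total_requests: int, failed_requests: int,
                reserve: float = 0.1) -> bool:
    """Hypothetical release gate: proceed only while more than `reserve`
    of the budget is left. The 10% default is an illustrative assumption."""
    return budget_remaining(slo, total_requests, failed_requests) > reserve


if __name__ == "__main__":
    # 99.9% target over 1,000,000 requests tolerates roughly 1,000 failures.
    print(allowed_failures(0.999, 1_000_000))
    # 250 failures so far leaves roughly three quarters of the budget.
    print(budget_remaining(0.999, 1_000_000, 250))
    print(can_release(0.999, 1_000_000, 250))
```

A gate like `can_release` is one way teams might turn the remaining budget into a velocity-versus-reliability decision; the actual policy is something each team negotiates.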

What This Cheat Sheet Covers

This topic spans 16 focused tables and 141 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core SRE Principles
Table 2: Service Level Concepts (SLI, SLO, SLA)
Table 3: Golden Signals of Monitoring
Table 4: Incident Response & On-Call Practices
Table 5: Toil Reduction & Automation
Table 6: Monitoring, Alerting & Observability
Table 7: Capacity Planning & Scaling
Table 8: Reliability Testing Types
Table 9: Deployment & Release Strategies
Table 10: Blameless Culture & Postmortems
Table 11: Reliability Patterns & Architectural Practices
Table 12: SRE Team Models & Organization
Table 13: DevOps vs SRE Comparison
Table 14: Advanced SRE Practices
Table 15: Common SRE Metrics & KPIs
Table 16: SRE Tools & Technologies

Table 1: Core SRE Principles

SRE's power comes not from any single technique but from the interplay of its principles — error budgets make risk quantifiable, toil tracking keeps engineering work meaningful, and simplicity is the discipline that holds everything together. Understanding why each principle exists is more important than memorising the names.

Principle	Example	Description
Embracing Risk	Target 99.9% uptime (about 43.8 min downtime in an average month)	• Accept calculated risk rather than chase 100% reliability • ship features faster while staying within the defined error budget
Service Level Objectives (SLOs)	API latency p99 < 200 ms	• Quantitative reliability targets agreed between product and SRE • the foundation for error budgets
Eliminating Toil	Automating manual password resets	• Reduce repetitive manual work that scales linearly with service growth • keep toil below 50% of each engineer's time
Monitoring Distributed Systems	Alert on the symptom (user-facing errors), not the cause (disk space)	• Alert on user-facing symptoms rather than internal causes • focus on what users actually experience
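The two numeric examples in the table can be checked with a short sketch. The function names are hypothetical, and the nearest-rank percentile is only one of several common percentile definitions, so treat this as an illustration rather than how any particular monitoring tool computes p99.

```python
def allowed_downtime_minutes(slo: float, period_days: float = 30.0) -> float:
    """Minutes of downtime an availability target permits over a period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_days * 24 * 60


def p99_ms(latencies_ms: list) -> float:
    """99th-percentile latency using the nearest-rank method (an assumption;
    real monitoring systems may interpolate or use histograms instead)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, then convert the 1-based rank to an index.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms: list, threshold_ms: float = 200.0) -> bool:
    """True when p99 latency is strictly below the threshold."""
    return p99_ms(latencies_ms) < threshold_ms


if __name__ == "__main__":
    # 30-day month versus an average month of 365.25 / 12 days.
    print(round(allowed_downtime_minutes(0.999, 30), 1))
    print(round(allowed_downtime_minutes(0.999, 365.25 / 12), 1))
    # One slow request in a hundred still passes; two do not.
    print(meets_latency_slo([100] * 99 + [500]))
    print(meets_latency_slo([100] * 98 + [500] * 2))
```

The 43.8 min figure in the table matches an average month of 365.25 / 12 days; a 30-day window gives 43.2 min instead, so it is worth stating which window an SLO uses.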
