Site Reliability Engineering (SRE) Cheat Sheet

Updated 2026-03-19

Site Reliability Engineering (SRE) is a discipline, originally developed at Google to manage large-scale production systems, that applies software engineering principles to infrastructure and operations problems. SRE bridges the traditional divide between development and operations by treating operations as a software problem, using automation, error budgets, and service level objectives (SLOs) to balance releasing new features against maintaining system stability. At its core, SRE embraces calculated risk rather than pursuing perfection: an availability target such as 99.9% explicitly acknowledges that some downtime is acceptable, and the resulting error budget (the 0.1% of unreliability the target permits) becomes a shared resource that development and operations teams negotiate over to make data-driven decisions about velocity versus reliability.
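The error budget idea above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLO; the function name `error_budget_report`, its parameters, and the "freeze releases when the budget is spent" rule are illustrative assumptions, not something prescribed by this cheat sheet.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (99.9%). All names are hypothetical.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
        # Assumed policy: pause risky releases once the budget is exhausted.
        "release_freeze_suggested": consumed >= 1.0,
    }


if __name__ == "__main__":
    report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"Allowed failures: {report['allowed_failures']:.0f}")
    print(f"Budget remaining: {report['budget_remaining']:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are permitted, so 400 failures leave about 60% of the budget for further releases or experiments.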

What This Cheat Sheet Covers

This topic spans 15 focused tables and 114 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Core SRE Principles
  • Table 2: Service Level Concepts (SLI, SLO, SLA)
  • Table 3: Golden Signals of Monitoring
  • Table 4: Incident Response & On-Call Practices
  • Table 5: Toil Reduction & Automation
  • Table 6: Monitoring, Alerting & Observability
  • Table 7: Capacity Planning & Scaling
  • Table 8: Deployment & Release Strategies
  • Table 9: Blameless Culture & Postmortems
  • Table 10: Reliability Patterns & Architectural Practices
  • Table 11: SRE Team Models & Organization
  • Table 12: DevOps vs SRE Comparison
  • Table 13: Advanced SRE Practices
  • Table 14: Common SRE Metrics & KPIs
  • Table 15: SRE Tools & Technologies

Table 1: Core SRE Principles

| Principle | Example | Description |
|---|---|---|
| Embracing Risk | Target 99.9% uptime (43.8 min downtime/month) | Accepting calculated risk rather than 100% reliability; push features faster within defined error budget limits |
| Service Level Objectives (SLOs) | API latency p99 < 200 ms | Quantitative reliability targets agreed between product and SRE; the foundation for error budgets |
| Eliminating Toil | Automating manual password resets | Reducing repetitive manual work that scales linearly with service growth; aim for <50% toil time |
| Monitoring Distributed Systems | Alert on symptom (user-facing errors), not cause (disk space) | Monitor user-facing symptoms, not internal metrics; focus on what users experience |
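
The example figures in Table 1 can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, a month is taken as 365.25 / 12 days (which is what yields the 43.8-minute figure), and p99 is computed with the simple nearest-rank method rather than whatever a particular monitoring system uses.

```python
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # assumption: average calendar month


def allowed_downtime_minutes(availability_target: float,
                             days: float = AVERAGE_DAYS_PER_MONTH) -> float:
    """Minutes of downtime an availability target permits over a period."""
    return (1.0 - availability_target) * days * 24 * 60


def percentile_nearest_rank(samples: list, pct: int) -> float:
    """Return the pct-th percentile (1-100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(pct / 100 * n), avoiding float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms: list, threshold_ms: float = 200.0, pct: int = 99) -> bool:
    """True when the chosen percentile latency is strictly below the threshold."""
    return percentile_nearest_rank(latencies_ms, pct) < threshold_ms


if __name__ == "__main__":
    print(f"99.9% allows {allowed_downtime_minutes(0.999):.1f} min/month")
    healthy = [50.0] * 100
    slow_tail = [50.0] * 98 + [250.0, 300.0]
    print("healthy meets SLO:", meets_latency_slo(healthy))
    print("slow tail meets SLO:", meets_latency_slo(slow_tail))
```

Over a fixed 30-day window the same target allows 43.2 minutes instead, so it is worth stating which window an SLO uses.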