Incident management is the structured approach to detecting, responding to, and learning from service disruptions that impact customers or business operations. Born from Site Reliability Engineering (SRE) and DevOps practices, modern incident management balances speed with coordination, which requires clear roles, escalation paths, and communication protocols to minimize customer impact. The blameless postmortem, a core practice borrowed from aviation safety, shifts focus from individual blame to system-level learning, creating psychological safety that encourages honest reporting and deeper root cause analysis. Understanding incident severity classification, SLA definitions, and metrics like mean time to recovery (MTTR) is foundational not just for restoring service, but for building organizational resilience and preventing recurrence.
What This Cheat Sheet Covers
This cheat sheet spans 22 focused tables and 111 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Incident Severity Levels
Incident severity classification creates a shared language for prioritizing response efforts based on customer impact and business criticality. Organizations typically use either P0-P4 priority levels or SEV0-SEV5 severity tiers to determine response times, escalation paths, and resource allocation. Clear severity definitions reduce confusion during high-pressure moments, ensure appropriate stakeholder notification, and drive SLA compliance.
| Level | Example | Description |
|---|---|---|
| | Complete platform outage affecting all customers | • Total service unavailability with no workaround • All-hands-on-deck response required immediately • Typically reserved for business-ending failures |
| | Core payment processing down for 50% of users | • Core functionality broken for majority of users • Immediate response within 15 minutes • Executive notification required |
| | Search feature degraded; manual workaround available | • Important features degraded but service remains partially functional • Response within 1 hour • Impacts subset of users |
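
The mapping from impact to severity and response target can be sketched in a few lines of Python. This is a minimal illustration, not a standard. The tier labels `SEV0`, `SEV1`, and `SEV2` are assumed names (the table rows do not name their levels). The `classify` thresholds are also assumptions, loosely based on the example column. Only the response targets (immediate, 15 minutes, 1 hour) come from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    label: str                  # hypothetical tier name, not taken from the table
    response_within: timedelta  # target time to first response


# Response targets follow the three table rows; the labels are assumed.
POLICIES = {
    "SEV0": SeverityPolicy("SEV0", timedelta(0)),           # respond immediately
    "SEV1": SeverityPolicy("SEV1", timedelta(minutes=15)),
    "SEV2": SeverityPolicy("SEV2", timedelta(hours=1)),
}


def classify(affected_fraction: float, core_broken: bool,
             has_workaround: bool) -> SeverityPolicy:
    """Pick a tier from customer impact. Thresholds are illustrative assumptions."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    # Total outage: core functionality down for everyone, no workaround.
    if core_broken and affected_fraction >= 1.0 and not has_workaround:
        return POLICIES["SEV0"]
    # Core functionality broken for at least half of users.
    if core_broken and affected_fraction >= 0.5:
        return POLICIES["SEV1"]
    # Degraded but partially functional, or only a subset of users affected.
    return POLICIES["SEV2"]


def response_deadline(detected_at: datetime, policy: SeverityPolicy) -> datetime:
    """Latest acceptable time for the first response under the given policy."""
    return detected_at + policy.response_within


if __name__ == "__main__":
    policy = classify(affected_fraction=0.5, core_broken=True, has_workaround=False)
    print(policy.label, response_deadline(datetime(2024, 1, 1, 12, 0), policy))
```

Keeping the policy table as data, separate from the classification logic, lets an organization adjust response targets without touching the decision rules. Real definitions usually weigh more inputs than these three, such as revenue impact or data loss.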