Backend observability encompasses the practices, tools, and methodologies for understanding the internal state of distributed systems through their external outputs: primarily metrics, logs, and traces. The discipline emerged as microservices and cloud-native architectures made traditional monitoring insufficient. It is no longer enough to check whether a server is up; you must understand how requests flow through dozens of services, where latency spikes occur, and why errors happen. Modern observability combines application performance monitoring (APM), distributed tracing with OpenTelemetry, structured logging, metric collection with Prometheus, and incident response workflows into a unified approach. The key insight: observability isn't about collecting more data. It's about asking better questions when things break, using context propagation to connect dots across services, establishing Service Level Objectives that align reliability investments with business needs, and keeping costs under control with intelligent telemetry filtering and sampling.
What This Cheat Sheet Covers
This topic spans 23 focused tables and 159 indexed concepts. The outline below goes table by table, from foundational concepts through advanced details.
Table 1: The Three Pillars of Observability
The three pillars—metrics, logs, and traces—each answer a different question: how much (metrics), what happened (logs), and why (traces). The power of modern observability comes from correlating all three using shared identifiers rather than treating them as separate silos.
| Pillar | Example | Description |
|---|---|---|
| Metrics | `http_requests_total{method="GET", status="200"} 45231` | • Numerical measurements aggregated over time windows • Cheap to store and query, ideal for trend analysis and alerting, but lack context about individual requests |
| Logs | `{"timestamp": "2026-05-28T10:23:45Z", "level": "error", "trace_id": "a3f2...", "msg": "DB timeout"}` | • Discrete event records capturing what happened at a specific moment • Provide rich context and debugging details, but are expensive at scale and hard to aggregate across services |
| Traces | `Trace: a3f2... → API Gateway (12ms) → Auth Service (45ms) → DB Query (203ms)` | • End-to-end request journeys showing execution flow across distributed services • Reveal latency bottlenecks and service dependencies that metrics and logs miss in isolation |
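
The correlation idea, tying all three pillars together through a shared identifier, can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration: the in-memory `metrics`, `logs`, and `spans` stores stand in for real backends, and `handle_request`, `log_event`, `span`, and `correlate` are hypothetical helpers, not part of OpenTelemetry, Prometheus, or any other library.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for a metrics backend,
# a log pipeline, and a trace collector.
metrics = Counter()
logs = []
spans = []


def log_event(level, msg, trace_id):
    """Emit a structured (JSON) log line that carries the trace ID."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "trace_id": trace_id,
        "msg": msg,
    }
    logs.append(json.dumps(record))


@contextmanager
def span(name, trace_id):
    """Record how long a named unit of work took within a trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append(
            {"trace_id": trace_id, "name": name, "duration_ms": duration_ms}
        )


def handle_request(method, fail_db=False):
    """Simulate one request that produces a metric, a log, and spans."""
    trace_id = uuid.uuid4().hex  # one ID shared by every signal
    status = "200"
    with span("API Gateway", trace_id):
        with span("DB Query", trace_id):
            if fail_db:
                log_event("error", "DB timeout", trace_id)
                status = "500"
    # The metric is aggregated by labels only; it has no per-request context.
    metrics[(method, status)] += 1
    return trace_id


def correlate(trace_id):
    """Gather the logs and spans that share a given trace ID."""
    parsed = [json.loads(line) for line in logs]
    return {
        "logs": [r for r in parsed if r["trace_id"] == trace_id],
        "spans": [s for s in spans if s["trace_id"] == trace_id],
    }
```

Calling `handle_request("GET", fail_db=True)` and passing the returned ID to `correlate` yields the error log and both spans for that one request, while the `metrics` counter only shows that some `("GET", "500")` request happened. That gap is the reason the shared `trace_id` matters: the metric raises the alert, and the identifier leads from it to the specific logs and spans.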