LLM Observability Cheat Sheet

Updated 2026-04-28

LLM observability is the practice of monitoring, measuring, and understanding the behavior of large language models in production, so that teams can track quality, performance, cost, and security across AI applications. Unlike traditional software observability, it must account for the non-deterministic nature of generative AI by tracking prompt inputs, model outputs, token usage, latency, hallucinations, and user feedback across complex multi-step workflows. As LLMs power increasingly critical business applications in 2026, observability has shifted from a nice-to-have debugging tool to production infrastructure essential for reliability, compliance, and cost control. The key mental model: treat LLM observability as distributed tracing for AI. Every request becomes a trace with nested spans capturing retrieval, reasoning, generation, and tool calls, and quality metrics are evaluated at each step before responses reach users.

What This Cheat Sheet Covers

This topic spans 19 focused tables and 204 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Observability Concepts
Table 2: Performance Metrics
Table 3: Cost Tracking
Table 4: Quality Metrics
Table 5: Tracing and Debugging
Table 6: Evaluation Frameworks
Table 7: Observability Platforms and Tools
Table 8: Guardrails and Safety Monitoring
Table 9: RAG-Specific Observability
Table 10: Agent Observability
Table 11: Error Handling and Reliability
Table 12: Alerting and Anomaly Detection
Table 13: Streaming and Real-Time Monitoring
Table 14: Caching and Optimization
Table 15: Model and Prompt Management
Table 16: Compliance and Governance
Table 17: Fine-Tuning and Training Metrics
Table 18: Data Drift and Quality Monitoring
Table 19: MCP Observability

Table 1: Core Observability Concepts

Before you can monitor an LLM app, you need the vocabulary the rest of the discipline is built on. These primitives—traces, spans, sessions, telemetry—let you treat a single user request as a structured, inspectable record rather than an opaque black box, and most are borrowed directly from distributed-tracing standards like OpenTelemetry and its GenAI conventions.

Concept	Example	Description
Trace	Complete execution path from user query through LLM calls to final response	End-to-end record of a request's journey through the system, capturing all operations as nested spans with timing and metadata.
Span	Single LLM call, vector search, or tool execution within a trace	Individual unit of work within a trace; each span has a start time, duration, and attributes such as model name or token count.
Session	`session_id: "user_123_conv_45"` groups multiple traces for one conversation	Collection of traces tied to a single user journey or conversation thread, enabling analysis of multi-turn interactions.
Metric	Token usage per request, p95 latency, cost per query	Quantitative measurement aggregated over time, such as throughput, latency percentiles, error rates, or token counts.
Log	`[INFO] User prompt: "Summarize quarterly earnings"`	Textual record of events with structured or unstructured data, including prompts, completions, and system messages.
Instrumentation	Adding OpenTelemetry SDK to capture LLM calls automatically	Code or framework integration that emits telemetry data from application code without manual logging for every operation.
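
To make the relationship between these primitives concrete, here is a minimal, standard-library-only sketch of how a session, a trace, and nested spans fit together, and how a metric can be derived from span attributes. The `Tracer` and `Span` classes and the `total_tokens` helper are hypothetical names invented for illustration; they are not the OpenTelemetry SDK. The attribute keys are modeled on the OpenTelemetry GenAI conventions, but treat the exact names as an assumption and check the current specification before relying on them.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work: an LLM call, a vector search, a tool execution."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float
    duration_s: float = 0.0
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Hypothetical in-memory tracer; a real system would export spans to a backend."""

    def __init__(self, session_id):
        self.session_id = session_id  # groups every trace in one conversation
        self.spans = []               # finished spans, in completion order
        self._stack = []              # currently open spans, innermost last
        self._trace_id = None

    @contextmanager
    def span(self, name, **attributes):
        # A span opened with no parent is a root span and starts a new trace.
        if not self._stack:
            self._trace_id = uuid.uuid4().hex
        parent = self._stack[-1] if self._stack else None
        current = Span(
            name=name,
            trace_id=self._trace_id,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            start=time.perf_counter(),
            attributes={"session.id": self.session_id, **attributes},
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - current.start
            self._stack.pop()
            self.spans.append(current)


def total_tokens(spans, trace_id):
    """Metric: sum input and output tokens over all spans of one trace."""
    return sum(
        s.attributes.get("gen_ai.usage.input_tokens", 0)
        + s.attributes.get("gen_ai.usage.output_tokens", 0)
        for s in spans
        if s.trace_id == trace_id
    )


# One user request becomes one trace with nested spans.
tracer = Tracer(session_id="user_123_conv_45")
with tracer.span("handle_query") as root:
    with tracer.span("vector_search", top_k=3) as search:
        pass  # retrieval would happen here
    with tracer.span("llm_call") as llm:
        # Token counts are made-up values standing in for a model response.
        llm.attributes["gen_ai.request.model"] = "example-model"
        llm.attributes["gen_ai.usage.input_tokens"] = 120
        llm.attributes["gen_ai.usage.output_tokens"] = 30

trace_tokens = total_tokens(tracer.spans, root.trace_id)
```

In this sketch the child spans share the root span's `trace_id` and point back to it through `parent_id`, which is what lets a backend rebuild the nested view of a request. Calling `tracer.span(...)` again after the root span closes would start a second trace under the same `session_id`, which is how multi-turn conversations are grouped.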
