LangSmith, built by LangChain, is a unified DevOps platform for developing, debugging, testing, deploying, and monitoring LLM applications and AI agents. It provides framework-agnostic observability with comprehensive tracing, evaluation datasets, online and offline evaluations, prompt management, human-in-the-loop workflows, and production monitoring to help teams move from prototype to production. In 2026, LangSmith expanded significantly with LangSmith Engine (autonomous failure clustering plus PR proposals), SmithDB (a purpose-built Rust/DataFusion database, up to 15x faster), Context Hub (versioned agent context management), LLM Gateway (runtime spend limits plus PII redaction), and Sandboxes GA (hardware-virtualized microVMs for safe agent code execution). LangSmith's core differentiator remains end-to-end visibility into agent execution via traces, which lets developers understand, evaluate, and continuously improve complex multi-step LLM and agent workflows.
What This Cheat Sheet Covers
This topic spans 22 focused tables and 163 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Concepts
The vocabulary of LangSmith — understanding these building blocks is prerequisite for everything else in the platform. The trace → run → feedback hierarchy maps directly to how observability data flows into evaluations, annotation queues, and Engine-driven improvements.
| Concept | Example | Description |
|---|---|---|
| Trace | Single request → LLM → retrieval → response (top-level execution unit) | • End-to-end execution path capturing the full request lifecycle • contains all runs (spans) for a single request • analogous to a trace in distributed tracing (OpenTelemetry) • includes input, output, metadata, timestamps, cost |
| Run | Individual LLM call, retriever step, or tool invocation within a trace | • Individual step within a trace, similar to an OpenTelemetry span • tracks a single operation (LLM, chain, tool, retriever) • includes token counts, latency, cost • nested for complex workflows |
| Dataset | `{"input": "What is AI?", "expected": "..."}` (versioned test-case collection) | • Curated test cases for evaluation • supports CSV, JSON, JSONL plus file attachments (images, PDFs, audio, video) • versioned: a new version is created on every change, and experiments can be pinned to a specific version |
| Experiment | Run application on dataset → compare v1 vs v2 prompts | • Evaluation run on a dataset that produces scores and metrics (accuracy, latency, cost) • supports a comparison view for A/B testing and baseline pinning for regression detection |
| Evaluator | LLM-as-judge, code-based, or human reviewer (scores outputs on criteria) | • Scoring function for evaluation • types: LLM-as-judge, code-based, human (annotation queues), composite (weighted multi-score) • applied to experiments or online runs • now reusable across projects |
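
To make the relationships in Table 1 concrete, here is a minimal plain-Python sketch of how a trace nests runs, and how a code-based evaluator scores an application over a dataset to produce an experiment result. It deliberately does not use the LangSmith SDK: `Run`, `Trace`, `exact_match`, `run_experiment`, and the two toy applications are hypothetical names invented for illustration, and the real platform stores far more per run (latency, cost, metadata, timestamps).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Run:
    """One step inside a trace (hypothetical model, not the SDK's class)."""
    name: str
    run_type: str  # e.g. "chain", "llm", "tool", "retriever"
    total_tokens: int = 0
    children: list["Run"] = field(default_factory=list)

    def tokens(self) -> int:
        # Roll token counts up through nested child runs.
        return self.total_tokens + sum(c.tokens() for c in self.children)


@dataclass
class Trace:
    """Top-level unit: one root run plus feedback scores attached to it."""
    root: Run
    feedback: dict[str, float] = field(default_factory=dict)


def exact_match(output: str, expected: str) -> float:
    """Code-based evaluator: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    app: Callable[[str], str],
    dataset: list[dict[str, str]],
    evaluators: dict[str, Callable[[str, str], float]],
) -> dict[str, float]:
    """Run `app` on every example and return the mean score per evaluator."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    totals = {name: 0.0 for name in evaluators}
    for example in dataset:
        output = app(example["input"])
        for name, evaluator in evaluators.items():
            totals[name] += evaluator(output, example["expected"])
    return {name: total / len(dataset) for name, total in totals.items()}


# A trace with one root chain run and two nested child runs.
root = Run(
    name="qa_chain",
    run_type="chain",
    children=[
        Run(name="retrieve_docs", run_type="retriever"),
        Run(name="generate_answer", run_type="llm", total_tokens=120),
    ],
)
trace = Trace(root=root)
trace.feedback["exact_match"] = 1.0  # feedback is keyed by evaluator name

# A tiny dataset and two toy "application versions" to compare.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def app_v1(question: str) -> str:
    return {"2+2": "4"}.get(question, "unknown")


def app_v2(question: str) -> str:
    return {"2+2": "4", "capital of France": "paris"}.get(question, "unknown")


scores_v1 = run_experiment(app_v1, dataset, {"exact_match": exact_match})
scores_v2 = run_experiment(app_v2, dataset, {"exact_match": exact_match})
print(root.tokens(), scores_v1, scores_v2)
```

Comparing `scores_v1` with `scores_v2` mirrors the v1-versus-v2 comparison described in the Experiment row: the dataset and evaluator stay fixed, and only the application changes.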