LangSmith, built by LangChain, is a unified DevOps platform for developing, debugging, testing, deploying, and monitoring LLM applications and AI agents. It provides framework-agnostic observability with comprehensive tracing, evaluation datasets, online and offline evaluations, prompt management, human-in-the-loop workflows, and production monitoring to help teams move applications from prototype to production. In 2025–2026, LangSmith expanded significantly with the Insights Agent (automated production trace analysis), Multi-turn Evals (conversation-level evaluation), OpenTelemetry (OTEL) integration, LangSmith Fleet (a no-code platform for building and managing AI agent fleets), and LangSmith Sandboxes (secure microVM-based code execution for agents, private preview March 2026). LangSmith's core differentiator remains end-to-end visibility into agent execution via traces and runs: detailed logging, cost tracking, latency metrics (P50/P99), and automated evaluations at scale let developers see what happens inside complex multi-step LLM workflows.
What This Cheat Sheet Covers
This topic spans 20 focused tables and 131 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Concepts
| Concept | Example | Description |
|---|---|---|
| Trace | Single request → LLM → retrieval → response; top-level execution unit | • End-to-end execution path capturing the full request lifecycle • contains all runs (spans) for a single request • analogous to a trace in distributed tracing (OpenTelemetry) • includes input, output, metadata, timestamps, cost |
| Run | Individual LLM call, retriever step, or tool invocation within a trace | • Individual step within a trace • similar to a span in OpenTelemetry • tracks a single operation (LLM, chain, tool, retriever) • includes token counts, latency, cost • nests to represent complex workflows |
| Dataset | Collection of input-output pairs: `{"input": "What is AI?", "expected": "..."}`; versioned test cases | • Curated test cases for evaluation • contains example inputs and expected outputs • supports CSV, JSON, and JSONL formats plus file attachments (images, PDFs, audio, video) • versioned: a new version is created on every change |
| Experiment | Run application on dataset → compare v1 vs v2 prompts | • Evaluation run on a dataset producing scores and metrics (accuracy, latency, cost) • supports a comparison view for A/B testing • baseline pinning for regression detection |
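
The trace/run relationship in Table 1 can be pictured as a tree: a trace is the root run plus everything nested beneath it, and trace-level totals are sums over that tree. The sketch below is a minimal illustration in plain Python and does not use the LangSmith SDK. The `Run` and `Trace` classes, their field names, and the sample values are hypothetical, and the latency total assumes steps execute sequentially.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Run:
    """One step in a trace (hypothetical model, not the LangSmith SDK type)."""
    name: str
    run_type: str          # e.g. "llm", "chain", "tool", "retriever"
    latency_ms: float      # time spent in this step itself, excluding children
    total_tokens: int = 0
    cost_usd: float = 0.0
    children: List["Run"] = field(default_factory=list)

    def walk(self) -> Iterator["Run"]:
        """Yield this run and every nested run, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """Top-level unit: the root run plus everything nested under it."""
    root: Run

    def total_tokens(self) -> int:
        return sum(r.total_tokens for r in self.root.walk())

    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.root.walk())

    def total_latency_ms(self) -> float:
        # Assumption: steps run one after another, so per-step times add up.
        return sum(r.latency_ms for r in self.root.walk())


# Illustrative values only: a chain that retrieves documents, then calls an LLM.
trace = Trace(
    root=Run("rag_pipeline", "chain", latency_ms=5.0, children=[
        Run("retrieve_docs", "retriever", latency_ms=120.0),
        Run("generate_answer", "llm", latency_ms=800.0,
            total_tokens=350, cost_usd=0.002),
    ])
)

print([r.name for r in trace.root.walk()])
print(trace.total_tokens(), trace.total_latency_ms())
```

Keeping per-run figures on each node and aggregating on demand mirrors how the table describes the split: runs carry token counts, latency, and cost, while the trace reports the totals for the whole request.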