LangSmith Cheat Sheet

Updated 2026-05-28


LangSmith, built by LangChain, is a unified DevOps platform for developing, debugging, testing, deploying, and monitoring LLM applications and AI agents. It provides framework-agnostic observability: tracing, evaluation datasets, online and offline evaluations, prompt management, human-in-the-loop workflows, and production monitoring, all aimed at helping teams move from prototype to production. In 2026, LangSmith expanded significantly with LangSmith Engine (autonomous failure clustering + PR proposals), SmithDB (purpose-built Rust/DataFusion database, up to 15x faster), Context Hub (versioned agent context management), LLM Gateway (runtime spend limits + PII redaction), and Sandboxes GA (hardware-virtualized microVMs for safe agent code execution). Its core differentiator remains end-to-end visibility into agent execution via traces, which lets developers understand, evaluate, and continuously improve complex multi-step LLM and agent workflows.

What This Cheat Sheet Covers

This topic spans 22 focused tables and 163 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Concepts
Table 2: Tracing and Observability
Table 3: Datasets and Examples
Table 4: Offline Evaluations
Table 5: Online Evaluations and Monitoring
Table 6: LangSmith Engine
Table 7: Human Annotation Workflows
Table 8: Prompt Engineering
Table 9: Context Hub
Table 10: Python SDK
Table 11: JavaScript/TypeScript SDK
Table 12: OpenTelemetry (OTEL) Integration
Table 13: Framework Integrations
Table 14: REST API
Table 15: Production Deployment
Table 16: LangSmith Fleet (Agent Platform)
Table 17: Self-Hosted LangSmith
Table 18: Pricing and Plans
Table 19: Comparison with Alternatives
Table 20: Best Practices
Table 21: Common Use Cases
Table 22: Advanced Features

Table 1: Core Concepts

The vocabulary of LangSmith: these building blocks are a prerequisite for everything else in the platform. A trace contains runs, and feedback (scores) attaches to runs; this trace → run → feedback hierarchy maps directly to how observability data flows into evaluations, annotation queues, and Engine-driven improvements.
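The hierarchy can be pictured with a small data model. This is a minimal sketch in plain Python, not the LangSmith SDK: the class names `Trace`, `Run`, and `Feedback` and all of their fields are hypothetical, chosen only to mirror the concepts in this table.

```python
import uuid
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Feedback:
    """A score attached to a run (hypothetical shape, for illustration only)."""
    key: str
    score: float


@dataclass
class Run:
    """One step in a trace: an LLM call, chain, tool, or retriever."""
    name: str
    run_type: str  # e.g. "llm", "chain", "tool", "retriever"
    latency_ms: float = 0.0
    total_tokens: int = 0
    children: List["Run"] = field(default_factory=list)
    feedback: List[Feedback] = field(default_factory=list)

    def walk(self) -> Iterator["Run"]:
        """Yield this run, then every nested run, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """The top-level unit: one request and every run it produced."""
    root: Run
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def total_tokens(self) -> int:
        """Sum token counts over all runs in the trace."""
        return sum(run.total_tokens for run in self.root.walk())


# One request: a chain that retrieves documents, then calls an LLM.
trace = Trace(
    root=Run(
        "rag_pipeline",
        "chain",
        children=[
            Run("retrieve_docs", "retriever", latency_ms=40.0),
            Run("generate_answer", "llm", latency_ms=900.0, total_tokens=350),
        ],
    )
)
trace.root.feedback.append(Feedback(key="correctness", score=1.0))

print(trace.total_tokens())                    # 350
print([run.name for run in trace.root.walk()])
```

Nesting runs under a single root is what lets per-step numbers (tokens, latency, cost) roll up to the trace level, while feedback stays attached to the specific run it scores.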

Concept	Example	Description
Trace	Single request → LLM → retrieval → response (top-level execution unit)	• End-to-end execution path capturing the full request lifecycle • contains all runs/spans for a single request • analogous to a trace in distributed tracing (OpenTelemetry) • includes input, output, metadata, timestamps, cost
Run (Span)	Individual LLM call, retriever step, or tool invocation within a trace	• Individual step within a trace, similar to OpenTelemetry spans • tracks a single operation (LLM, chain, tool, retriever) • includes token counts, latency, cost • nested for complex workflows
Dataset	`{"input": "What is AI?", "expected": "..."}` (versioned test-case collection)	• Curated test cases for evaluation • supports CSV, JSON, JSONL + file attachments (images, PDFs, audio, video) • versioned: a new version is created on every change • pin experiments to a specific version
Experiment	Run application on dataset → compare v1 vs v2 prompts	• Evaluation run on a dataset producing scores and metrics (accuracy, latency, cost) • supports comparison view for A/B testing and baseline pinning for regression detection
Evaluator	LLM-as-judge, code-based, or human reviewer (scores outputs on criteria)	• Scoring function for evaluation • types: LLM-as-judge, code-based, human (annotation queues), composite (weighted multi-score) • applied to experiments or online runs • now reusable across projects
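How a dataset, an experiment, and evaluators fit together can be sketched in a few lines. This is an illustration in plain Python under stated assumptions, not the LangSmith SDK: `run_experiment`, `composite`, the two evaluators, and the two toy apps are all hypothetical names, and the 3:1 weighting is arbitrary.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Evaluator = Callable[[str, Example], float]

# A tiny dataset in the same shape as the table's example row.
dataset: List[Example] = [
    {"input": "What is AI?", "expected": "artificial intelligence"},
    {"input": "What is an LLM?", "expected": "large language model"},
]


def contains_expected(output: str, example: Example) -> float:
    """Code-based evaluator: 1.0 if the expected phrase appears in the output."""
    return 1.0 if example["expected"].lower() in output.lower() else 0.0


def is_concise(output: str, example: Example) -> float:
    """Code-based evaluator: 1.0 if the output has at most 12 words."""
    return 1.0 if len(output.split()) <= 12 else 0.0


def composite(weights: Dict[Evaluator, float]) -> Evaluator:
    """Combine several evaluators into one weighted score in [0, 1]."""
    total = sum(weights.values())

    def score(output: str, example: Example) -> float:
        return sum(w * ev(output, example) for ev, w in weights.items()) / total

    return score


def run_experiment(app: Callable[[str], str],
                   examples: List[Example],
                   evaluator: Evaluator) -> float:
    """Run the app on every example and return the mean evaluator score."""
    scores = [evaluator(app(ex["input"]), ex) for ex in examples]
    return sum(scores) / len(scores)


def app_v1(question: str) -> str:
    # Baseline: always gives the same answer, whatever the question.
    return "AI stands for artificial intelligence."


def app_v2(question: str) -> str:
    # Candidate: answers each question specifically.
    answers = {
        "What is AI?": "AI is short for artificial intelligence.",
        "What is an LLM?": "An LLM is a large language model.",
    }
    return answers[question]


judge = composite({contains_expected: 3.0, is_concise: 1.0})

print(run_experiment(app_v1, dataset, judge))  # 0.625
print(run_experiment(app_v2, dataset, judge))  # 1.0
```

Running both versions against the same dataset with the same evaluator is what makes the two scores comparable; pinning the dataset version, as the Dataset row notes, keeps that comparison stable over time.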
