Large language model (LLM) evaluation is the systematic process of assessing model performance across multiple dimensions, from factual accuracy and reasoning to safety, bias, and production efficiency. The field has rapidly expanded from simple benchmark scoring to encompass agentic evaluation (multi-step planning and tool use), multi-turn conversational testing, and real-time production monitoring. The core challenge is that no single metric captures usefulness, trustworthiness, and production-readiness simultaneously; only a layered approach combining automated benchmarks, LLM-as-judge, and human review delivers reliable signal. A 37% gap between lab benchmark scores and real-world deployment performance, reported in enterprise AI studies, is the strongest argument for treating evaluation as a continuous discipline rather than a one-time exercise.
What This Cheat Sheet Covers
This topic spans 26 focused tables and 119 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Evaluation Paradigms
Before picking any specific benchmark or metric, it helps to know the broad families of evaluation and what each trades off. Benchmarks give reproducibility but age into contamination; LLM-as-judge scales cheaply but inherits the judge's biases; human review is the gold standard but slow and costly. Most serious evaluation layers several of these together precisely because no single one is enough.
| Paradigm | Example | Description |
|---|---|---|
| Benchmark-based evaluation | MMLU: 14,042 multiple-choice questions across 57 subjects | • Standardized test suites with fixed questions and scoring rules • Enables reproducible comparison across models but risks data contamination as benchmarks age |
| LLM-as-judge | GPT-4 rates outputs on a 1-5 scale for helpfulness | • Uses a powerful LLM to score another model's outputs • Scales cheaply but inherits the judge model's biases and may favor outputs stylistically similar to its own |
| Human evaluation | Crowdsourced annotators compare two responses and choose a winner | • Gold standard for judging subjective qualities • Expensive, slow, and subject to inconsistency between annotators |
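
To make the three paradigms concrete, here is a minimal sketch of how each one typically reduces to a number: exact-match accuracy for a multiple-choice benchmark, a mean rating for an LLM judge on a 1-5 scale, and a pairwise win rate for human preference votes. The function names, the tie-counts-as-half convention, and the toy data are illustrative assumptions, not part of any specific evaluation framework.

```python
from collections import Counter
from statistics import mean


def benchmark_accuracy(predictions, gold):
    """Fraction of multiple-choice answers that exactly match the gold label."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def mean_judge_score(scores, low=1, high=5):
    """Average judge rating, after checking every rating is on the low..high scale."""
    if not scores:
        raise ValueError("no judge scores provided")
    for s in scores:
        if not low <= s <= high:
            raise ValueError(f"score {s} is outside the {low}-{high} scale")
    return mean(scores)


def pairwise_win_rate(votes, model="A"):
    """Share of pairwise comparisons won by `model`.

    Assumption for this sketch: each vote is "A", "B", or "tie",
    and a tie counts as half a win.
    """
    if not votes:
        raise ValueError("no votes provided")
    counts = Counter(votes)
    return (counts[model] + 0.5 * counts["tie"]) / len(votes)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print("benchmark accuracy:", benchmark_accuracy(["B", "C", "A", "D"], ["B", "C", "D", "D"]))
    print("mean judge score:", mean_judge_score([4, 5, 3, 4]))
    print("human win rate (A):", pairwise_win_rate(["A", "B", "A", "tie"]))
```

Reporting the three numbers side by side, rather than collapsing them into one score, keeps each paradigm's weakness visible: a high benchmark accuracy with a low human win rate, for instance, is a hint that the benchmark may no longer reflect what users care about.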