Large Language Model (LLM) evaluation is the systematic process of assessing model performance across multiple dimensions, from factual accuracy and reasoning to safety, bias, and production efficiency. The field has expanded rapidly from simple benchmark scoring to include agentic evaluation (multi-step planning and tool use), multi-turn conversational testing, and real-time production monitoring. The core challenge is that no single metric captures usefulness, trustworthiness, and production-readiness at once; only a layered approach that combines automated benchmarks, LLM-as-judge, and human review delivers a reliable signal. A 37% gap between lab benchmark scores and real-world deployment performance, reported in enterprise AI studies, is the most important reason to treat evaluation as a continuous discipline rather than a one-time exercise.
What This Cheat Sheet Covers
This topic spans 26 focused tables and 119 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Evaluation Paradigms
| Paradigm | Example | Description |
|---|---|---|
| Automated benchmarks | MMLU: 14,042 multiple-choice questions across 57 subjects | • Standardized test suites with fixed questions and scoring rules • Enable reproducible comparison across models, but risk data contamination as benchmarks age |
| LLM-as-judge | GPT-4 rates outputs on a 1-5 scale for helpfulness | • Uses a powerful LLM to score another model's outputs • Scales cheaply, but inherits the judge model's biases and may favor outputs stylistically similar to the judge's own |
| Human review | Crowdsourced annotators compare two responses and choose a winner | • Gold standard that provides ground-truth judgment on subjective qualities • Expensive and slow, with annotator-consistency challenges |
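
The three paradigms in Table 1 each reduce to a small scoring routine. The following is a minimal Python sketch under stated assumptions, not a reference implementation: the function names (`benchmark_accuracy`, `mean_judge_score`, `pairwise_win_rate`) are hypothetical, `toy_judge` is a placeholder standing in for a real LLM call, and counting a tie as half a win is one common convention rather than something the table prescribes.

```python
from collections import Counter
from typing import Callable, Sequence


def benchmark_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy over multiple-choice letters such as 'A'-'D'."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no items to score")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def mean_judge_score(outputs: Sequence[str], judge: Callable[[str], int]) -> float:
    """Average 1-5 rating from a judge callable.

    `judge` is an assumption: in practice it would wrap a call to a
    judge model and parse the rating out of its reply.
    """
    if not outputs:
        raise ValueError("no outputs to score")
    scores = [judge(text) for text in outputs]
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("judge scores must lie in the range 1-5")
    return sum(scores) / len(scores)


def pairwise_win_rate(votes: Sequence[str]) -> float:
    """Share of pairwise comparisons won by model 'A'.

    Each vote is 'A', 'B', or 'tie'; a tie counts as half a win
    (an assumed convention).
    """
    counts = Counter(votes)
    total = counts["A"] + counts["B"] + counts["tie"]
    if total == 0:
        raise ValueError("no valid votes to score")
    return (counts["A"] + 0.5 * counts["tie"]) / total


def toy_judge(text: str) -> int:
    """Placeholder judge: rates by word count, clamped to 1-5. Not a real LLM."""
    return min(5, max(1, len(text.split())))


if __name__ == "__main__":
    print(benchmark_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"]))
    print(mean_judge_score(["ok", "a longer answer here"], toy_judge))
    print(pairwise_win_rate(["A", "A", "B", "tie"]))
```

Keeping the three scorers separate mirrors the layered approach: each yields its own number, and a disagreement between them (for example, high benchmark accuracy alongside a low human win rate) is itself a useful signal.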