LLM Evaluation Cheat Sheet

Updated 2026-04-28

Large Language Model (LLM) evaluation is the systematic process of assessing model performance across multiple dimensions — from factual accuracy and reasoning to safety, bias, and production efficiency. The field has rapidly expanded from simple benchmark scoring to encompass agentic evaluation (multi-step planning and tool use), multi-turn conversational testing, and real-time production monitoring. The core challenge is that no single metric captures usefulness, trustworthiness, and production-readiness simultaneously; only a layered approach combining automated benchmarks, LLM-as-judge, and human review delivers reliable signal. A 37% gap between lab benchmark scores and real-world deployment performance — documented in enterprise AI studies — is the most important reason to treat evaluation as a continuous discipline, not a one-time exercise.

What This Cheat Sheet Covers

This cheat sheet spans 26 focused tables and 119 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Evaluation Paradigms
Table 2: Major Benchmark Suites
Table 3: Code Generation Benchmarks
Table 4: Mathematics Reasoning Benchmarks
Table 5: Multimodal Evaluation Benchmarks
Table 6: Language Understanding Benchmarks
Table 7: Safety and Alignment Benchmarks
Table 8: Conversational and Chat Evaluation
Table 9: Multi-Turn Evaluation Metrics
Table 10: Contamination-Resistant Benchmarks
Table 11: Automated Evaluation Metrics (Reference-based)
Table 12: Automated Evaluation Metrics (Reference-free)
Table 13: Code Evaluation Metrics
Table 14: RAG-Specific Metrics
Table 15: LLM-as-a-Judge Evaluation Methods
Table 16: Human Evaluation Methods
Table 17: Bias and Fairness Metrics
Table 18: Safety Evaluation Metrics
Table 19: Production Performance Metrics
Table 20: Calibration and Uncertainty Metrics
Table 21: Retrieval Metrics (for RAG)
Table 22: Contamination Detection Methods
Table 23: Adversarial Robustness Evaluation
Table 24: Instruction Following Metrics
Table 25: Agent Evaluation Benchmarks
Table 26: Emerging Evaluation Approaches

Table 1: Core Evaluation Paradigms

Paradigm	Example	Description
Benchmark-based	`MMLU: 14,042 multiple-choice questions across 57 subjects`	• Standardized test suites with fixed questions and scoring rules • enables reproducible comparison across models but risks data contamination as benchmarks age.
LLM-as-a-Judge	`GPT-4 evaluates outputs on 1-5 scale for helpfulness`	• Uses a powerful LLM to score another model's outputs • scales cheaply but inherits judge model biases and may favor outputs stylistically similar to the judge.
Human evaluation	`Crowdsourced annotators rate two responses, choose winner`	• Gold-standard providing ground truth judgment on subjective qualities • expensive, slow, and faces annotator consistency challenges.
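
The three paradigms in Table 1 each reduce to a simple aggregate score. The sketch below is illustrative only: the function names (`benchmark_accuracy`, `mean_judge_score`, `pairwise_win_rate`), the `"A"`/`"B"`/`"tie"` vote labels, and the convention of counting a tie as half a win are assumptions made for this example, not part of any benchmark's official scoring code.

```python
from collections import Counter


def benchmark_accuracy(predictions, answers):
    """Benchmark-based: fraction of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def mean_judge_score(scores, low=1, high=5):
    """LLM-as-a-Judge: average of per-response ratings on a fixed scale."""
    if not scores:
        raise ValueError("need at least one score")
    for s in scores:
        if not low <= s <= high:
            raise ValueError(f"score {s} is outside the {low}-{high} scale")
    return sum(scores) / len(scores)


def pairwise_win_rate(votes, model="A"):
    """Human evaluation: share of pairwise comparisons won by `model`.

    Each vote is "A", "B", or "tie"; a tie counts as half a win
    (an assumption chosen for this sketch).
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    return (counts[model] + 0.5 * counts["tie"]) / len(votes)


if __name__ == "__main__":
    print(benchmark_accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))
    print(mean_judge_score([4, 5, 3]))
    print(pairwise_win_rate(["A", "B", "tie", "A"]))
```

In practice the hard part is upstream of these aggregates: extracting the chosen option from free-form model output, prompting the judge consistently, and checking annotator agreement before trusting the win rate.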
