AI evaluation frameworks and LLM benchmarking turn model behavior into measurable evidence across knowledge, reasoning, coding, retrieval, safety, cost, and production reliability. Practitioners use them to compare models, catch regressions, design custom quality gates, and decide whether a system is ready for users. The key mental model is measurement triangulation: public benchmarks reveal broad capability, task-specific evals reveal product fit, and production telemetry reveals whether the system still works after real users, data drift, and prompt changes enter the loop.
What This Cheat Sheet Covers
This topic spans 12 focused tables and 101 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Evaluation Paradigms and Benchmark Roles
Evaluation starts by choosing the right kind of evidence. Public benchmarks, offline test sets, human review, LLM judges, and production experiments answer different questions, so a reliable evaluation program usually combines several rather than treating one leaderboard as truth.
| Paradigm | Example | Description |
|---|---|---|
| Public benchmark | `run HELM scenario -> compare model scores` | Standardized tasks with fixed protocols for cross-model comparison. |
| Offline evaluation | `dataset + candidate_prompt -> scored report` | Repeatable pre-release testing on held-out examples. |
| Regression testing | `old_score=0.84, new_score=0.79 -> block` | Detects quality drops after model, prompt, tool, or retrieval changes. |
| Human evaluation | 3 annotators rate helpfulness 1-5 | Expert or crowd judgments for subjective quality. |
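
The score-drop example in the table (an old score of 0.84, a new score of 0.79, and a blocked release) can be sketched as a small quality gate. This is a minimal illustration, not a reference implementation: the names `regression_gate`, `GateResult`, and `max_drop`, and the 0.02 tolerance, are assumptions made for this sketch and do not come from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool   # True when the change may ship
    delta: float   # new_score - old_score (negative means a drop)


def regression_gate(old_score: float, new_score: float,
                    max_drop: float = 0.02) -> GateResult:
    """Fail the gate when new_score falls more than max_drop below old_score.

    max_drop is a hypothetical tolerance; pick it from the observed
    run-to-run noise of your own eval set.
    """
    delta = new_score - old_score
    return GateResult(passed=delta >= -max_drop, delta=delta)


# The scores from the table: a drop of about 0.05 exceeds the 0.02 tolerance.
result = regression_gate(old_score=0.84, new_score=0.79)
print("pass" if result.passed else "block")  # prints "block"
```

A fixed tolerance is the simplest choice. A gate that compares confidence intervals over repeated runs is more robust when scores are noisy, at the cost of more evaluation compute.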