AI Evaluation Frameworks and LLM Benchmarking Cheat Sheet

Updated 2026-05-19

AI evaluation frameworks and LLM benchmarking turn model behavior into measurable evidence across knowledge, reasoning, coding, retrieval, safety, cost, and production reliability. Practitioners use them to compare models, catch regressions, design custom quality gates, and decide whether a system is ready for users. The key mental model is measurement triangulation: public benchmarks reveal broad capability, task-specific evals reveal product fit, and production telemetry reveals whether the system still works after real users, data drift, and prompt changes enter the loop.

What This Cheat Sheet Covers

This topic spans 12 focused tables and 101 indexed concepts. The outline below lists each table, from foundational concepts through advanced details.

Table 1: Evaluation Paradigms and Benchmark Roles
Table 2: General Capability Benchmarks
Table 3: Coding and Agent Benchmarks
Table 4: RAGAS and RAG Metrics
Table 5: Evaluation Frameworks and Tools
Table 6: LLM-as-a-Judge Patterns
Table 7: Pairwise Ranking and Preference Methods
Table 8: Custom Metric and Rubric Design
Table 9: Evaluation Dataset Construction
Table 10: Contamination and Overfitting Risks
Table 11: Human, Blind, Automated, and Hybrid Evaluation
Table 12: CI, Release Gates, and Production Monitoring

Table 1: Evaluation Paradigms and Benchmark Roles

Evaluation starts by choosing the right kind of evidence. Public benchmarks, offline test sets, human review, LLM judges, and production experiments answer different questions, so a reliable evaluation program usually combines several rather than treating one leaderboard as truth.

Paradigm	Example	Description
Benchmark Evaluation	`run HELM scenario -> compare model scores`	Standardized tasks with fixed protocols for cross-model comparison.
Offline Evaluation	`dataset + candidate_prompt -> scored report`	Repeatable pre-release testing on held-out examples.
Regression Evaluation	`old_score=0.84` `new_score=0.79 -> block`	Detects quality drops after model, prompt, tool, or retrieval changes.
Human Evaluation	`3 annotators rate helpfulness 1-5`	Expert or crowd judgments for subjective quality.
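
The regression and human-evaluation rows above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `regression_gate` and `mean_rating`, the `max_drop` tolerance of 0.02, and the sample ratings are all assumptions made for the example. A real gate would usually also account for score variance across runs.

```python
from statistics import mean


def regression_gate(old_score: float, new_score: float, max_drop: float = 0.02) -> str:
    """Return "block" when new_score falls more than max_drop below old_score.

    max_drop is a hypothetical tolerance chosen for illustration.
    """
    drop = old_score - new_score
    return "block" if drop > max_drop else "pass"


def mean_rating(ratings: list[int]) -> float:
    """Average annotator ratings on a 1-5 scale, rejecting out-of-range values."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Scores taken from the Regression Evaluation row of the table.
print(regression_gate(0.84, 0.79))

# Three hypothetical annotator ratings for one response.
print(mean_rating([4, 5, 3]))
```

An absolute drop threshold is the simplest choice; a relative threshold or a statistical test may fit better when scores are noisy or the eval set is small.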
