LLM Evaluation Cheat Sheet

Updated 2026-04-28

Large Language Model (LLM) evaluation is the systematic process of assessing model performance across multiple dimensions, from factual accuracy and reasoning to safety, bias, and production efficiency. The field has expanded rapidly from simple benchmark scoring to include agentic evaluation (multi-step planning and tool use), multi-turn conversational testing, and real-time production monitoring. The core challenge is that no single metric captures usefulness, trustworthiness, and production-readiness at once; only a layered approach combining automated benchmarks, LLM-as-judge, and human review delivers reliable signal. A reported 37% gap between lab benchmark scores and real-world deployment performance, attributed to enterprise AI studies, is a strong reason to treat evaluation as a continuous discipline rather than a one-time exercise.

What This Cheat Sheet Covers

This topic spans 26 focused tables and 119 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Evaluation Paradigms
Table 2: Major Benchmark Suites
Table 3: Code Generation Benchmarks
Table 4: Mathematics Reasoning Benchmarks
Table 5: Multimodal Evaluation Benchmarks
Table 6: Language Understanding Benchmarks
Table 7: Safety and Alignment Benchmarks
Table 8: Conversational and Chat Evaluation
Table 9: Multi-Turn Evaluation Metrics
Table 10: Contamination-Resistant Benchmarks
Table 11: Automated Evaluation Metrics (Reference-based)
Table 12: Automated Evaluation Metrics (Reference-free)
Table 13: Code Evaluation Metrics
Table 14: RAG-Specific Metrics
Table 15: LLM-as-a-Judge Evaluation Methods
Table 16: Human Evaluation Methods
Table 17: Bias and Fairness Metrics
Table 18: Safety Evaluation Metrics
Table 19: Production Performance Metrics
Table 20: Calibration and Uncertainty Metrics
Table 21: Retrieval Metrics (for RAG)
Table 22: Contamination Detection Methods
Table 23: Adversarial Robustness Evaluation
Table 24: Instruction Following Metrics
Table 25: Agent Evaluation Benchmarks
Table 26: Emerging Evaluation Approaches

Table 1: Core Evaluation Paradigms

Before picking any specific benchmark or metric, it helps to know the broad families of evaluation and what each trades off. Benchmarks give reproducibility but age into contamination; LLM-as-judge scales cheaply but inherits the judge's biases; human review is the gold standard but slow and costly. Most serious evaluation layers several of these together precisely because no single one is enough.

| Paradigm | Example | Description |
|---|---|---|
| Benchmark-based | MMLU: 14,042 multiple-choice questions across 57 subjects | Standardized test suites with fixed questions and scoring rules. Enables reproducible comparison across models but risks data contamination as benchmarks age. |
| LLM-as-a-Judge | GPT-4 evaluates outputs on a 1-5 scale for helpfulness | Uses a powerful LLM to score another model's outputs. Scales cheaply but inherits the judge model's biases and may favor outputs stylistically similar to the judge's own. |
| Human evaluation | Crowdsourced annotators rate two responses and choose a winner | Gold standard for judging subjective qualities. Expensive, slow, and prone to inconsistency between annotators. |
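The three paradigms each reduce to a simple aggregate once the raw judgments are collected. The following is a minimal sketch, not a reference implementation: the function names (`accuracy`, `mean_judge_score`, `win_rate`, `cohen_kappa`) and the toy inputs are illustrative assumptions, and Cohen's kappa is brought in here as one common way to quantify the annotator-consistency problem the table mentions.

```python
from collections import Counter


def accuracy(predictions, answers):
    """Benchmark-based: fraction of multiple-choice items answered correctly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def mean_judge_score(scores, low=1, high=5):
    """LLM-as-a-judge: average rating on a fixed scale (1-5 by default)."""
    if not scores:
        raise ValueError("need at least one score")
    if any(not (low <= s <= high) for s in scores):
        raise ValueError("score outside the rating scale")
    return sum(scores) / len(scores)


def win_rate(votes, model="A"):
    """Human pairwise comparison: share of votes won by `model`.

    Each vote is a model label or the string "tie"; counting a tie as
    half a win is an assumption of this sketch.
    """
    if not votes:
        raise ValueError("need at least one vote")
    wins = sum(v == model for v in votes)
    ties = sum(v == "tie" for v in votes)
    return (wins + 0.5 * ties) / len(votes)


def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two annotators' labels."""
    if not rater1 or len(rater1) != len(rater2):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    print(mean_judge_score([4, 5, 3]))
    print(win_rate(["A", "B", "A", "tie"]))
    print(cohen_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

A low kappa between annotators, or between a judge model and human raters, suggests the scores from that layer should be treated with caution before they are compared across models.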
