LLM Evaluation Cheat Sheet

Updated 2026-04-28

Large Language Model (LLM) evaluation is the systematic process of assessing model performance across multiple dimensions, from factual accuracy and reasoning to safety, bias, and production efficiency. The field has expanded rapidly from simple benchmark scoring to include agentic evaluation (multi-step planning and tool use), multi-turn conversational testing, and real-time production monitoring. The core challenge is that no single metric captures usefulness, trustworthiness, and production-readiness at once; reliable signal comes only from a layered approach that combines automated benchmarks, LLM-as-a-judge scoring, and human review. A 37% gap between lab benchmark scores and real-world deployment performance, reported in enterprise AI studies, is the strongest argument for treating evaluation as a continuous discipline rather than a one-time exercise.

What This Cheat Sheet Covers

This cheat sheet spans 26 focused tables and 119 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

  • Table 1: Core Evaluation Paradigms
  • Table 2: Major Benchmark Suites
  • Table 3: Code Generation Benchmarks
  • Table 4: Mathematics Reasoning Benchmarks
  • Table 5: Multimodal Evaluation Benchmarks
  • Table 6: Language Understanding Benchmarks
  • Table 7: Safety and Alignment Benchmarks
  • Table 8: Conversational and Chat Evaluation
  • Table 9: Multi-Turn Evaluation Metrics
  • Table 10: Contamination-Resistant Benchmarks
  • Table 11: Automated Evaluation Metrics (Reference-based)
  • Table 12: Automated Evaluation Metrics (Reference-free)
  • Table 13: Code Evaluation Metrics
  • Table 14: RAG-Specific Metrics
  • Table 15: LLM-as-a-Judge Evaluation Methods
  • Table 16: Human Evaluation Methods
  • Table 17: Bias and Fairness Metrics
  • Table 18: Safety Evaluation Metrics
  • Table 19: Production Performance Metrics
  • Table 20: Calibration and Uncertainty Metrics
  • Table 21: Retrieval Metrics (for RAG)
  • Table 22: Contamination Detection Methods
  • Table 23: Adversarial Robustness Evaluation
  • Table 24: Instruction Following Metrics
  • Table 25: Agent Evaluation Benchmarks
  • Table 26: Emerging Evaluation Approaches

Table 1: Core Evaluation Paradigms

Paradigm | Example | Description
Benchmark-based | MMLU: 14,042 multiple-choice questions across 57 subjects | Standardized test suites with fixed questions and scoring rules. Enables reproducible comparison across models, but risks data contamination as benchmarks age.
LLM-as-a-Judge | GPT-4 scores outputs on a 1-5 scale for helpfulness | Uses a powerful LLM to score another model's outputs. Scales cheaply, but inherits the judge model's biases and may favor outputs that are stylistically similar to the judge's own.
Human evaluation | Crowdsourced annotators compare two responses and choose a winner | The gold standard for judging subjective qualities. Expensive and slow, and annotator consistency is hard to maintain.
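The three paradigms in Table 1 reduce to three different scoring computations. The sketch below is illustrative only: the function names are hypothetical, `judge` stands in for whatever call wraps a real judge model, and majority voting with ties counted as half a win is one assumed aggregation rule among several in use.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Benchmark-based: fraction of questions whose predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)


def mean_judge_score(outputs: Sequence[str], judge: Callable[[str], int]) -> float:
    """LLM-as-a-judge: average of 1-5 scores.

    `judge` is a hypothetical callable; in practice it would prompt a judge
    model and parse an integer score from the reply.
    """
    if not outputs:
        raise ValueError("need at least one output")
    scores = []
    for text in outputs:
        score = judge(text)
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score: {score}")
        scores.append(score)
    return sum(scores) / len(scores)


def pairwise_win_rate(votes: Sequence[Sequence[str]]) -> Tuple[float, float]:
    """Human evaluation: each inner sequence holds "A"/"B" votes for one comparison.

    Returns (win rate of A under majority vote, with ties counted as 0.5,
    mean fraction of annotators who sided with the majority).
    """
    if not votes:
        raise ValueError("need at least one comparison")
    wins_a = 0.0
    agreement = 0.0
    for item_votes in votes:
        if not item_votes:
            raise ValueError("each comparison needs at least one vote")
        counts = Counter(item_votes)
        a, b = counts["A"], counts["B"]
        if a > b:
            wins_a += 1.0
        elif a == b:
            wins_a += 0.5
        agreement += max(a, b) / len(item_votes)
    return wins_a / len(votes), agreement / len(votes)
```

The agreement figure returned alongside the win rate is a crude consistency check; a low value suggests the annotator consistency problem noted in the table, and a chance-corrected statistic would be a more rigorous choice.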
