RAG Evaluation Cheat Sheet

Updated 2026-04-28

Retrieval-Augmented Generation (RAG) evaluation is the systematic process of measuring how effectively a RAG system retrieves relevant context from a knowledge base and generates accurate, grounded responses. Unlike traditional LLM evaluation, RAG assessment requires evaluating both retrieval quality (did we find the right documents?) and generation quality (is the answer accurate and faithful to the retrieved context?). The core challenge lies in the dual nature of failure modes—answers can be wrong because retrieval missed key information, because the generator hallucinated despite good context, or because the context itself was noisy, irrelevant, or stale. A comprehensive evaluation strategy extends beyond the standard four metrics (faithfulness, answer relevance, context precision, context recall) to include context trustworthiness, agentic task completion, multi-turn conversation quality, and security robustness—often implemented via automated metrics, LLM-as-a-judge techniques, CI/CD quality gates, and production runtime guardrails.

What This Cheat Sheet Covers

This topic spans 9 focused tables and 98 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Retrieval Metrics
Table 2: Generation Quality Metrics
Table 3: End-to-End RAG Metrics
Table 4: LLM-as-a-Judge Evaluation
Table 5: Evaluation Frameworks and Tools
Table 6: Testing and Dataset Strategies
Table 7: Production Monitoring and Observability
Table 8: Advanced Evaluation Techniques
Table 9: Evaluation Benchmarks and Datasets

Table 1: Core Retrieval Metrics

Before you can judge an answer, you have to judge whether the right documents were even fetched — that's what these metrics measure. They split into two camps: rank-unaware scores like Recall@K and Precision@K that ask "did we find the relevant chunks?", and rank-aware ones like MRR, MAP, and NDCG that reward putting the best result at the top. The context-prefixed entries (precision, recall, relevance, sufficiency) are the RAG-specific versions used by frameworks like RAGAS, several of which need a ground-truth answer to score against.
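To make the rank-unaware versus rank-aware split concrete, here is a minimal Python sketch that scores two orderings of the same four documents. The helper names (`recall_at_k`, `ndcg_at_k`) and the document IDs are hypothetical, and the NDCG variant shown assumes binary relevance with the common log2 position discount; treat it as an illustration rather than any framework's exact implementation.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Rank-unaware: fraction of relevant IDs present anywhere in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Rank-aware: binary-relevance NDCG with a log2 position discount."""
    # Position i (0-based) contributes 1 / log2(i + 2) when the document is relevant.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    # Ideal ordering places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical example: same documents, different order.
relevant = {"d1", "d2"}
best_first = ["d1", "d2", "d3", "d4"]
best_last = ["d3", "d4", "d1", "d2"]

recall_best_first = recall_at_k(best_first, relevant, 4)
recall_best_last = recall_at_k(best_last, relevant, 4)
ndcg_best_first = ndcg_at_k(best_first, relevant, 4)
ndcg_best_last = ndcg_at_k(best_last, relevant, 4)

print(recall_best_first, recall_best_last)
print(round(ndcg_best_first, 3), round(ndcg_best_last, 3))
```

Both orderings get the same Recall@4 because the same relevant documents are present, while NDCG@4 drops for the ordering that pushes them to the bottom.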

Metric	Example	Description
Hit Rate@K	`retrieved = [d1, d2, d3]` `hit_rate = 1 if relevant_doc in retrieved else 0`	• Measures whether at least one relevant document appears in the top-K retrieved results • binary per query (0 or 1), answering "did we find anything useful?"; usually averaged over queries, and equal to Recall@K only when each query has exactly one relevant document
Precision@K	`precision = relevant_in_topk / k` `# e.g., 3/5 = 0.6`	• Fraction of retrieved documents that are actually relevant • focuses on minimizing noise in the top-K results without considering ranking order
Recall@K	`recall = relevant_in_topk / total_relevant` `# e.g., 3/10 = 0.3`	• Fraction of all relevant documents that appear in the top-K results • measures coverage of the relevant set without penalizing rank position
F1@K	`f1 = 2 * (prec * rec) / (prec + rec)` `# P@5=0.6, R@5=0.3 → F1≈0.4`	• Harmonic mean of Precision@K and Recall@K • provides a single balanced score when both relevance and coverage matter equally
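Mean Reciprocal Rank (MRR)	`rr = 1 / rank_of_first_relevant` `# rank=2 → 0.5` `mrr = mean(rr_per_query)`	• Averages, over all queries, the reciprocal rank of the first relevant document (0 for a query with no relevant result) • rewards systems that place a relevant result highest; ranges 0 to 1
Mean Average Precision (MAP)	`ap = mean([precision@i for i in relevant_positions])` `# relevant at ranks 1 and 3 → (1/1 + 2/3) / 2` `map_score = mean(ap_per_query)`	• Averages the precision at each relevant document's position to get a per-query Average Precision, then averages that over all queries • rewards systems that rank all relevant items high across the full retrieved set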
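The formulas in the table can be sketched as plain Python functions. All function names, document IDs, and the two sample queries below are hypothetical, and retrieved lists are assumed to contain no duplicates. Average precision here follows the table's convention of averaging over the relevant documents that were actually retrieved; many definitions divide by the total number of relevant documents instead, which lowers the score when some are missed.

```python
from statistics import mean

def hit_rate_at_k(retrieved, relevant, k):
    """1 if any relevant document is in the top k, else 0."""
    return 1 if any(doc in relevant for doc in retrieved[:k]) else 0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def f1_at_k(retrieved, relevant, k):
    """Harmonic mean of Precision@K and Recall@K (0 when both are 0)."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Mean of precision@i taken at each rank i that holds a relevant document."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0

# Hypothetical evaluation set: (retrieved ranking, set of relevant IDs) per query.
queries = [
    (["d1", "d7", "d3", "d8", "d9"], {"d1", "d3"}),  # relevant at ranks 1 and 3
    (["d5", "d2", "d6"], {"d2"}),                    # relevant at rank 2
]

# MRR and MAP are means of the per-query scores.
mrr = mean(reciprocal_rank(ret, rel) for ret, rel in queries)
map_score = mean(average_precision(ret, rel) for ret, rel in queries)

first_ret, first_rel = queries[0]
print(hit_rate_at_k(first_ret, first_rel, 5))
print(precision_at_k(first_ret, first_rel, 5), recall_at_k(first_ret, first_rel, 5))
print(round(f1_at_k(first_ret, first_rel, 5), 3))
print(round(mrr, 3), round(map_score, 3))
```

The first query reproduces the table's `(1/1 + 2/3) / 2` average precision example, and the second reproduces the `rank=2 → 0.5` reciprocal rank example.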
