Retrieval-Augmented Generation (RAG) evaluation is the systematic process of measuring how effectively a RAG system retrieves relevant context from a knowledge base and generates accurate, grounded responses. Unlike traditional LLM evaluation, RAG assessment must cover both retrieval quality (did we find the right documents?) and generation quality (is the answer accurate and faithful to the retrieved context?). The core challenge is that failures can originate at either stage: an answer can be wrong because retrieval missed key information, because the generator hallucinated despite good context, or because the context itself was noisy, irrelevant, or stale. A comprehensive evaluation strategy extends beyond the standard four metrics (faithfulness, answer relevance, context precision, context recall) to include context trustworthiness, agentic task completion, multi-turn conversation quality, and security robustness. These are often implemented via automated metrics, LLM-as-a-judge techniques, CI/CD quality gates, and production runtime guardrails.
What This Cheat Sheet Covers
This topic spans 9 focused tables and 98 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Retrieval Metrics
| Metric | Example | Description |
|---|---|---|
| Hit Rate@K | `retrieved = [d1, d2, d3]`<br>`hit_rate = 1 if relevant_doc in retrieved else 0` | • Measures whether at least one relevant document appears in the top-K retrieved results<br>• Binary metric (0 or 1) that answers "did we find anything useful?" |
| Precision@K | `precision = relevant_in_topk / k`<br>`# e.g., 3/5 = 0.6` | • Fraction of retrieved documents that are actually relevant<br>• Focuses on minimizing noise in the top-K results without considering ranking order |
| Recall@K | `recall = relevant_in_topk / total_relevant`<br>`# e.g., 3/10 = 0.3` | • Fraction of all relevant documents that appear in the top-K results<br>• Measures coverage of the relevant set without penalizing rank position |
| F1@K | `f1 = 2 * (prec * rec) / (prec + rec)`<br>`# P@5=0.6, R@5=0.3 → F1=0.4` | • Harmonic mean of Precision@K and Recall@K<br>• Provides a single balanced score when both relevance and coverage matter equally |
| MRR (Mean Reciprocal Rank) | `rr = 1 / rank_of_first_relevant`<br>`# rank=2 → 0.5; MRR = mean of rr over queries` | • Measures the position of the first relevant document<br>• Rewards systems that place the correct result highest; ranges from 0 to 1 |
| MAP (Mean Average Precision) | `ap = mean(precision_at(i) for i in relevant_positions)`<br>`# relevant at ranks 1 and 3 → (1/1 + 2/3) / 2 ≈ 0.83; MAP = mean of ap over queries` | • Averages the precision values at each rank where a relevant document appears<br>• Rewards systems that rank all relevant items high across the full retrieved set |
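
The one-line examples in Table 1 can be expanded into a small runnable sketch. The function names below (`hit_rate_at_k`, `average_precision`, and so on) are illustrative and do not come from any particular library. The sketch assumes each retrieved list holds unique document IDs in ranked order and that relevance is binary. Average precision here divides by the number of relevant documents actually retrieved, which matches the `(1/1 + 2/3) / 2` example in the table; many definitions divide by the total number of relevant documents instead.

```python
from typing import Sequence, Set


def _hits_in_top_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> int:
    """Count relevant documents among the first k retrieved results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant)


def hit_rate_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if _hits_in_top_k(retrieved, relevant, k) > 0 else 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top k, divided by k."""
    return _hits_in_top_k(retrieved, relevant, k) / k if k > 0 else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top k, divided by all relevant documents."""
    if not relevant:
        return 0.0
    return _hits_in_top_k(retrieved, relevant, k) / len(relevant)


def f1_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k (0.0 when both are zero)."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@i over the ranks i that hold a relevant document.

    Assumption: normalizes by the number of relevant documents retrieved,
    not by the total number of relevant documents.
    """
    hits = 0
    precisions = []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_over_queries(scores: Sequence[float]) -> float:
    """Average per-query scores: reciprocal ranks give MRR, APs give MAP."""
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical query: 3 of 10 relevant documents show up in the top 5.
    retrieved = ["d1", "d2", "d3", "d4", "d5"]
    relevant = {"d1", "d3", "d5"} | {f"x{i}" for i in range(7)}
    print(precision_at_k(retrieved, relevant, 5))  # 3 hits / 5 slots
    print(recall_at_k(retrieved, relevant, 5))     # 3 hits / 10 relevant
    print(round(f1_at_k(retrieved, relevant, 5), 3))
```

Per-query `reciprocal_rank` and `average_precision` values become MRR and MAP once they are passed through `mean_over_queries` across the evaluation set.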