RAG Evaluation Cheat Sheet


Updated 2026-04-28

Retrieval-Augmented Generation (RAG) evaluation is the systematic process of measuring how effectively a RAG system retrieves relevant context from a knowledge base and generates accurate, grounded responses. Unlike traditional LLM evaluation, RAG assessment must cover both retrieval quality (did we find the right documents?) and generation quality (is the answer accurate and faithful to the retrieved context?). The core challenge is that a failure can originate at either stage: an answer can be wrong because retrieval missed key information, because the generator hallucinated despite good context, or because the context itself was noisy, irrelevant, or stale. A comprehensive evaluation strategy extends beyond the standard four metrics (faithfulness, answer relevance, context precision, context recall) to include context trustworthiness, agentic task completion, multi-turn conversation quality, and security robustness. These are often implemented via automated metrics, LLM-as-a-judge techniques, CI/CD quality gates, and production runtime guardrails.

What This Cheat Sheet Covers

This topic spans 9 focused tables and 98 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

• Table 1: Core Retrieval Metrics
• Table 2: Generation Quality Metrics
• Table 3: End-to-End RAG Metrics
• Table 4: LLM-as-a-Judge Evaluation
• Table 5: Evaluation Frameworks and Tools
• Table 6: Testing and Dataset Strategies
• Table 7: Production Monitoring and Observability
• Table 8: Advanced Evaluation Techniques
• Table 9: Evaluation Benchmarks and Datasets

Table 1: Core Retrieval Metrics

Metric | Example | Description
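Hit Rate (Hit@K)
retrieved = ["d1", "d2", "d3"]
relevant = {"d2"}
hit_rate = 1 if any(d in relevant for d in retrieved) else 0
• Measures whether at least one relevant document appears in the top-K retrieved results
• binary per query (0 or 1), usually averaged over all queries; answers "did we find anything useful?" and equals Recall@K only when each query has exactly one relevant document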
Precision@K
precision = relevant_in_topk / k
# e.g., 3/5 = 0.6
• Fraction of retrieved documents that are actually relevant
• focuses on minimizing noise in the top-K results without considering ranking order
Recall@K
recall = relevant_in_topk / total_relevant
# e.g., 3/10 = 0.3
• Fraction of all relevant documents that appear in the top-K results
• measures coverage of the relevant set without penalizing rank position
F1@K
f1 = 2 * (prec * rec) / (prec + rec)
# P@5=0.6, R@5=0.3 → F1 = 0.4
• Harmonic mean of Precision@K and Recall@K
• provides a single balanced score when both relevance and coverage matter equally
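Mean Reciprocal Rank (MRR)
rr = 1 / rank_of_first_relevant   # per query; rank=2 → 0.5
mrr = sum(rr_per_query) / num_queries
• Averages, over all queries, the reciprocal of the rank of the first relevant document
• rewards systems that place the first correct result highest; a query with no relevant result contributes 0, so the score ranges from 0 to 1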
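Mean Average Precision (MAP)
ap = mean([precision_at(i) for i in relevant_positions])
# relevant at ranks 1 and 3 → (1/1 + 2/3) / 2 ≈ 0.83
map_score = mean(ap_per_query)
• For each query, averages the precision values at each relevant document position (Average Precision), then takes the mean over all queries
• rewards systems that rank all relevant items high across the full retrieved set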
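
The metrics in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `runs` structure (a list of `(retrieved, relevant)` pairs), and the document IDs are all assumptions made for the example. Two conventions are also assumed here: Precision@K divides by `k` even if fewer than `k` documents were returned, and Average Precision divides by the total number of relevant documents, so relevant documents that were never retrieved lower the score.

```python
from statistics import mean
from typing import List, Sequence, Set, Tuple

# Hypothetical type: one entry per query, (ranked retrieved IDs, relevant IDs).
Run = Tuple[Sequence[str], Set[str]]


def hit_rate_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return float(any(doc in relevant for doc in retrieved[:k]))


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top k, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top k, divided by all relevant documents."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def f1_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of Precision@K and Recall@K (0.0 when both are zero)."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """MRR: reciprocal rank averaged over all queries."""
    return mean(reciprocal_rank(ret, rel) for ret, rel in runs)


def mean_average_precision(runs: List[Run]) -> float:
    """MAP: average precision averaged over all queries."""
    return mean(average_precision(ret, rel) for ret, rel in runs)


# Two illustrative queries with made-up document IDs.
runs: List[Run] = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # relevant at ranks 1 and 3
    (["d4", "d5", "d6"], {"d5"}),        # first relevant at rank 2
]
print(round(mean_reciprocal_rank(runs), 3))    # 0.75
print(round(mean_average_precision(runs), 3))  # 0.667
```

Keeping the per-query functions (`reciprocal_rank`, `average_precision`) separate from the averaged ones makes it easy to inspect which individual queries drag the aggregate score down.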
