Hallucinations in large language models are confident but factually incorrect, nonsensical, or ungrounded responses—a fundamental challenge that emerges from the probabilistic nature of token-by-token prediction in transformer architectures. Preventing hallucinations requires grounding outputs in verifiable sources, constraining generation behavior, and implementing multi-layered verification rather than relying solely on the model's training. Modern RAG pipelines have advanced substantially, with GraphRAG, speculative RAG, active retrieval (FLARE), and multimodal grounding significantly extending what's achievable in 2026. The key insight: effective hallucination prevention is an orchestration problem, combining prompt design, retrieval mechanisms, sampling strategies, and post-generation validation into a coherent system where each layer compensates for the others' weaknesses.
What This Cheat Sheet Covers
This cheat sheet spans 12 focused tables and 98 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Grounding Techniques
Grounding keeps model outputs anchored to real, retrievable evidence rather than parametric knowledge alone. Choosing the right retrieval architecture — basic RAG, hybrid, graph-based, or active — is the single highest-leverage decision in any hallucination-reduction system.
| Technique | Example | Description |
|---|---|---|
| Retrieval-Augmented Generation (RAG) | `query = "Tesla revenue 2025"`<br>`docs = retriever.search(query)`<br>`prompt = f"Based on: {docs}, answer: {query}"` | • Retrieves relevant documents from an external knowledge base before generation, anchoring responses in retrieved facts.<br>• Can reduce hallucination rate by over 40% when properly implemented. |
| Hybrid Search (BM25 + Dense) | `results_kw = bm25.search(query, top_k=20)`<br>`results_vec = vector_db.search(embed(query), top_k=20)`<br>`fused = reciprocal_rank_fusion(results_kw, results_vec)` | • Combines sparse keyword (BM25) and dense semantic search via Reciprocal Rank Fusion (RRF).<br>• BM25 catches exact terms and IDs that embeddings miss; dense search catches paraphrases that keyword search misses. |
| Cross-Encoder Reranking | `candidates = retrieve_top_k(query, k=50)`<br>`scores = cross_encoder.score(query, candidates)`<br>`reranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)`<br>`context = [doc for doc, _ in reranked[:5]]` | • Uses a cross-encoder model to re-score retrieved candidates against the query, selecting the highest-precision passages.<br>• Narrows a large recall pool to the most relevant context before generation. |
| Self-RAG | `# Model decides when to retrieve`<br>`output = self_rag_model.generate(prompt)`<br>`# Reflection tokens: [Retrieve], [IsREL], [IsSUP]` | • A fine-tuned LLM decides on demand whether to retrieve and critiques its own output via learned reflection tokens.<br>• Can outperform static RAG by skipping retrieval when it is unnecessary and by self-verifying generated claims. |
| Corrective RAG (CRAG) | `score = retrieval_evaluator(query, docs)`<br>`if score < threshold:`<br>&nbsp;&nbsp;&nbsp;&nbsp;`docs = web_search(query)`<br>`answer = llm(query, docs)` | • A lightweight retrieval evaluator scores the retrieved documents; if confidence is low, the pipeline falls back to web search.<br>• Applies a decompose-then-recompose algorithm to strip noise before passing context to the LLM. |
| FLARE (Active Retrieval) | `tokens, probs = llm.generate_with_probs(prompt)`<br>`if min(probs) < threshold:`<br>&nbsp;&nbsp;&nbsp;&nbsp;`new_docs = retrieve(form_query(tokens))`<br>&nbsp;&nbsp;&nbsp;&nbsp;`regenerate(prompt, new_docs)` | • Forward-Looking Active Retrieval: iteratively predicts upcoming tokens and triggers a new retrieval when low-confidence tokens are detected during generation.<br>• Unlike static RAG, retrieves multiple times throughout long-form generation, and only when needed. |
| Contextual Retrieval | `chunk_with_context = add_document_context(chunk)`<br>`embed_and_store(chunk_with_context)`<br>`retrieve_with_full_context(query)` | • Prepends document-level context to each chunk before embedding, so that chunks retain their meaning when retrieved in isolation.<br>• Improves retrieval accuracy by reducing the context lost to chunking. |
| Parent-Document Retrieval | `small_chunks = split(doc, size=100)`<br>`for chunk in small_chunks:`<br>&nbsp;&nbsp;&nbsp;&nbsp;`store_child(chunk, parent_id=doc.id)`<br>`# Retrieve child, return parent` | • Embeds small child chunks for high-precision retrieval, but returns the larger parent document for richer context.<br>• Combines precise recall with the broad context the LLM needs to avoid fabrication. |
| Multi-Query Retrieval | `variants = llm(f"Generate 5 rephrasings of: {query}").splitlines()`<br>`results = [doc for v in variants for doc in retriever.search(v)]`<br>`context = deduplicate(results)` | • Generates multiple semantically equivalent query variants to increase retrieval recall, bridging wording gaps between the user's question and the document vocabulary.<br>• Especially effective for vague queries in BM25/keyword search. |
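
The hybrid search entry in the table calls a `reciprocal_rank_fusion` helper without defining it. The following is a minimal sketch of one common way to write it, not a reference implementation. It assumes each retriever returns a list of document IDs ordered best-first, and it uses a smoothing constant of `k = 60`, a frequently used default. The document IDs and result lists are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(*rankings: Sequence[Hashable], k: int = 60) -> List[Hashable]:
    """Fuse several best-first rankings of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over the rankings in which it
    appears, with rank starting at 1. k = 60 is an assumed default.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists (document IDs, best match first).
results_kw = ["doc_a", "doc_b", "doc_c"]   # e.g. from a BM25 index
results_vec = ["doc_b", "doc_d", "doc_a"]  # e.g. from a vector index

fused = reciprocal_rank_fusion(results_kw, results_vec)
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only rank positions, it does not require BM25 scores and embedding similarities to be on a comparable scale. Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by only one retriever.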