Hallucinations in large language models are confident but factually incorrect, nonsensical, or ungrounded responses. They are a fundamental challenge that emerges from the probabilistic nature of token-by-token prediction in transformer architectures. Preventing them requires grounding outputs in verifiable sources, constraining generation behavior, and implementing multi-layered verification rather than relying solely on the model's training. Modern retrieval-augmented generation (RAG) pipelines have advanced substantially: hybrid retrieval, reranking, and self-reflective retrieval strategies each improve factual grounding. The key insight is that effective hallucination prevention is an orchestration problem, combining prompt design, retrieval mechanisms, sampling strategies, and post-generation validation into a coherent system where each layer compensates for the others' weaknesses.
What This Cheat Sheet Covers
This cheat sheet spans 12 focused tables and 85 indexed concepts. It proceeds table by table, from foundational concepts through advanced details.
Table 1: Grounding Techniques
| Technique | Example | Description |
|---|---|---|
| Retrieval-Augmented Generation (RAG) | `query = "Tesla revenue 2025"`<br>`docs = retriever.search(query)`<br>`prompt = f"Based on: {docs}, answer: {query}"` | • Retrieves relevant documents from an external knowledge base before generation, anchoring responses in retrieved facts rather than parametric knowledge alone.<br>• Can reduce the hallucination rate by over 40% when properly implemented. |
| Hybrid Retrieval | `results_kw = bm25.search(query, top_k=20)`<br>`results_vec = vector_db.search(embed(query), top_k=20)`<br>`fused = reciprocal_rank_fusion(results_kw, results_vec)` | • Combines sparse keyword search (BM25) and dense semantic search via Reciprocal Rank Fusion.<br>• Keyword search catches exact terms and IDs that embeddings miss; semantic search catches paraphrases that keyword search misses. |
| Reranking | `candidates = retrieve_top_k(query, k=50)`<br>`ranked = sorted(candidates, key=lambda c: cross_encoder.score(query, c), reverse=True)`<br>`context = ranked[:5]` | • Uses a cross-encoder model to re-score retrieved candidates against the query and keeps the highest-scoring passages.<br>• Narrows a large recall pool to the most relevant context before generation. |
| Self-RAG | `# Model decides when to retrieve`<br>`output = self_rag_model.generate(prompt)`<br>`# Reflection tokens: [Retrieve], [IsREL], [IsSUP]` | • A fine-tuned LLM decides on demand whether to retrieve documents and critiques its own output via learned reflection tokens.<br>• Outperforms static RAG by skipping retrieval when it is unnecessary and by self-verifying generated claims. |
| Corrective RAG (CRAG) | `score = retrieval_evaluator(query, docs)`<br>`if score < threshold: docs = web_search(query)`<br>`answer = llm(query, docs)` | • A lightweight retrieval evaluator scores the retrieved documents.<br>• If confidence is low, it falls back to web search and applies a decompose-then-recompose algorithm to strip noise before passing context to the LLM. |
| Source Citation | `response = model.generate(prompt)`<br>`citations = extract_sources(response)`<br>`return response + citations` | • Requires the model to explicitly cite sources for factual claims, making verification straightforward.<br>• Enables users to trace statements back to original documents and identify unsupported assertions. |
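
The Reciprocal Rank Fusion step in Table 1 can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes each retriever returns document IDs ordered best-first, the function name `reciprocal_rank_fusion` simply mirrors the table, the smoothing constant `k=60` is a commonly used default rather than something this cheat sheet specifies, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked lists of doc IDs; each doc scores sum(1 / (k + rank)).

    `k` is an assumed smoothing constant (60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword retriever and a vector retriever.
results_kw = ["doc_a", "doc_b", "doc_c"]
results_vec = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion(results_kw, results_vec)
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) accumulate two contributions and therefore rise above documents found by only one retriever. Because the fusion uses ranks rather than raw scores, BM25 and embedding similarities never need to be normalized onto a common scale.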