RAG (Retrieval Augmented Generation) Cheat Sheet

Updated 2026-04-28

Retrieval-Augmented Generation (RAG) is an LLM application pattern that retrieves external knowledge at query time and injects it into the model's context to produce grounded answers. Practitioners use RAG to reduce hallucinations, incorporate fresh or private data, and make outputs auditable through source citations. A useful mental model is that RAG is two coupled systems: an information retrieval system (indexes, rankers, filters) and a response synthesis system (prompting, citations, formatting). By 2026, production RAG has evolved well beyond naive chunk-and-retrieve: the dominant pattern is agentic RAG, where the LLM itself decides when, what, and how to retrieve. Most "RAG problems" are retrieval problems first: if the right evidence doesn't make it into context, generation cannot recover it.


Table 1: RAG Building Blocks (Conceptual)

Stage | Example | Description
Retrieval-Augmented Generation (RAG)
answer = LLM(question, context=top_k_docs)
Generates with retrieved evidence rather than relying only on parametric memory.
Ingestion
docs = loader.load_data()
Reads source data and converts it into document objects for downstream processing.
Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
Splits documents into smaller units so retrieval can target the right passages.
Embedding
emb = client.embeddings.create(model="text-embedding-3-small", input="...")
Maps text to vectors for similarity search in dense retrieval.
Indexing
index.add(xb)
Builds a search structure over vectors (exact or ANN) to enable fast retrieval.
Retrieval
docs = retriever.invoke("...")
Selects candidate chunks/documents relevant to the query.
Reranking
reranked = co.rerank(query=q, documents=texts)
Reorders retrieved candidates to improve top-k quality.
Synthesis
resp = query_engine.query("...")
Produces the final answer from retrieved context (often via an LLM).
Citations
resp = citation_engine.query("...")
Attaches sources to claims (usually at chunk-level granularity).
Evaluation
score = faithfulness
Measures retrieval + generation quality with task-appropriate metrics.
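
The stages above can be sketched end to end in plain Python. This is a minimal illustration under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "index" is a plain list searched exhaustively, and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, index, k=2):
    """Exact (brute-force) top-k retrieval over (chunk, vector) pairs."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Synthesis input: number the retrieved chunks and append the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer only using the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Ingestion + chunking (chunks are given directly here), then embedding + indexing.
chunks = [
    "HNSW is a graph-based index for approximate nearest neighbor search.",
    "BM25 is a term-based ranking function for keyword retrieval.",
    "Reranking reorders retrieved candidates to improve top-k quality.",
]
index = [(c, embed(c)) for c in chunks]

question = "What is BM25 used for?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

Swapping `embed` for a real embedding call and the list for an ANN index changes the components but not the control flow.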

Table 2: Chunking and Splitting

Splitter | Example | Description
RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter(separators=["\n\n","\n"," ",""], chunk_size=1000, chunk_overlap=200)
• Default general-purpose splitter that tries separators in order to form sized chunks
• a widely used default for general-purpose text.
SemanticChunker
SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
• Splits at semantic boundary breaks in embedding space
• keeps topically coherent passages together.
TokenTextSplitter
TokenTextSplitter(chunk_size=512, chunk_overlap=64)
Splits by token count (useful when you need predictable model-context usage).
SentenceSplitter
SentenceSplitter(chunk_size=512, chunk_overlap=50)
Splits text into sentence-based chunks with size/overlap controls.
MarkdownHeaderTextSplitter
MarkdownHeaderTextSplitter(headers_to_split_on=[("#","h1"),("##","h2")])
Splits Markdown while preserving header structure as metadata.
HTMLHeaderTextSplitter
HTMLHeaderTextSplitter(headers_to_split_on=[("h1","h1"),("h2","h2")])
Splits HTML by header tags to keep section semantics.
CharacterTextSplitter
CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=200)
Simple splitter using a fixed separator and target chunk size.
Contextual Chunking
contextualized_chunk = f"CONTEXT: {llm(doc, chunk)}\n\n{chunk}"
• Prepends an LLM-generated description of where each chunk fits in its document
• Anthropic reports a 49% reduction in retrieval failures when contextualized chunks are used for both embeddings and BM25.
ParentDocumentRetriever
ParentDocumentRetriever(vectorstore=vs, docstore=store, child_splitter=small_splitter)
Indexes small child chunks for precise recall but returns the larger parent chunk to the LLM for richer context.
chunk_overlap
RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
Repeats tail tokens/characters across chunks to reduce boundary misses.
Metadata
Document(page_content=text, metadata={"source": url, "page": 12})
Carries provenance and filters (e.g., doc id, page, section) through retrieval.
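
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window splitter. It is a sketch, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries separators in order); `split_with_overlap` is a hypothetical helper that counts characters only.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
# Each chunk starts with the last 3 characters of the previous one.
```

The repeated tail means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.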

Table 3: Embedding Models

Model | Example | Description
text-embedding-3-small
client.embeddings.create(model="text-embedding-3-small", input="...")
• OpenAI's cost-effective embedding model
• 1536-dim, supports dimension reduction to 512 with minimal recall loss.
text-embedding-3-large
client.embeddings.create(model="text-embedding-3-large", input="...")
• OpenAI's high-accuracy model
• 3072-dim, supports reduction to 256+
• strong English performance across MTEB.
Cohere embed-v4
co.embed(texts=[...], model="embed-v4.0", input_type="search_document")
• Multimodal and multilingual
• 1536-dim, scores ~65.2 on MTEB
• best for non-English corpora and mixed-modality inputs.
Voyage voyage-3-large
vo.embed(texts, model="voyage-3-large", input_type="document")
• Voyage AI's flagship model
• outperforms text-embedding-3-large on MTEB
• domain variants available (code, finance, law).
BGE-M3
model = BGEM3FlagModel("BAAI/bge-m3"); model.encode(texts)
• Open-source
• supports dense, sparse, and multi-vector retrieval in one model
• self-hostable on a single GPU.
E5-Mistral
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
• Open-source 7B model
• competitive with proprietary models on MTEB
• instruction-tuned for strong passage retrieval.
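
The dimension reduction mentioned for the `text-embedding-3` models can be illustrated as truncation followed by re-normalization. This is a sketch that assumes the model was trained so that leading dimensions carry most of the signal; `shorten_embedding` is a hypothetical helper, and hosted APIs may offer the same effect through a request parameter instead.

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    truncated = vec[:dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # nothing to rescale
    return [x / norm for x in truncated]

full = [0.6, 0.0, 0.8, 0.0]          # unit-length toy 4-dim "embedding"
short = shorten_embedding(full, 2)   # 2-dim version, unit length again
```

Re-normalizing matters because cosine and dot-product search only agree on unit-length vectors.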

Table 4: Dense Indexing and ANN

Index | Example | Description
HNSW
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
• Graph-based ANN structure
• the de-facto standard in production vector DBs due to low query latency at high recall.
IVFFlat
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
• Inverted-file ANN: partitions vectors into lists and probes a subset at query time
• lower build cost than HNSW.
IndexFlatL2
index = faiss.IndexFlatL2(d)
• Exact L2 distance brute-force baseline in FAISS
• useful for small corpora or ground-truth evaluation.
IndexFlatIP
index = faiss.IndexFlatIP(d)
Exact inner-product search baseline in FAISS.
Product Quantization (PQ)
index = faiss.IndexPQ(d, m, nbits)
Compresses each vector into m sub-vector codes of nbits bits each (m·nbits/8 bytes instead of 4·d bytes for float32), reducing memory and speeding up search at some cost in recall.
Scalar Quantization (SQ)
ScalarQuantization(type=ScalarType.INT8)
• Quantizes each dimension to 8-bit integers
• lighter than PQ with minimal recall loss
• widely used in Qdrant.
Cosine
ORDER BY embedding <=> $1 LIMIT 10
Common similarity for normalized embeddings (implemented as distance operator in pgvector).
DotProduct
scores = x @ q
• Inner product similarity
• equivalent to cosine when vectors are L2-normalized.
Normalization
faiss.normalize_L2(x)
Makes cosine similarity retrieval equivalent to dot-product search.
ColBERT / Multi-Vector
rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0"); rag.search(q, k=5)
• Stores per-token embeddings
• scores with MaxSim (for each query token, take the maximum similarity over all document tokens, then sum across query tokens)
• near cross-encoder accuracy at near bi-encoder speed.
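
Two claims from this table are easy to check numerically: after L2 normalization, dot-product ranking matches cosine ranking, and MaxSim sums each query token's best match. The following is a small pure-Python sketch with made-up vectors; the helper names are illustrative and not taken from FAISS or any ColBERT library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def maxsim(query_vecs, doc_vecs):
    """Late interaction: best-matching doc token per query token, summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

q = [3.0, 4.0]
docs = [[1.0, 0.0], [6.0, 8.0], [0.0, 2.0]]

# Ranking by cosine on raw vectors ...
by_cosine = sorted(range(len(docs)), key=lambda i: cosine(q, docs[i]), reverse=True)

# ... equals ranking by dot product on normalized vectors.
qn = l2_normalize(q)
docs_n = [l2_normalize(d) for d in docs]
by_dot = sorted(range(len(docs)), key=lambda i: dot(qn, docs_n[i]), reverse=True)

# MaxSim over per-token vectors (already unit length in this toy example).
score = maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.6, 0.8]])
```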

Table 5: Vector Databases

Database | Example | Description
pgvector
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
• PostgreSQL extension
• keeps vectors and app data in the same table/transaction
• best default for <5M vectors if you already run Postgres.
Pinecone
index.upsert(vectors=[(id, emb, meta)]); index.query(vector=q, top_k=10)
• Fully-managed SaaS
• zero-ops serverless, auto-scales to billions of vectors
• supports sparse-dense hybrid search.
Qdrant
client.search("docs", query_vector=q, query_filter=Filter(...), limit=10)
• Open-source Rust-native DB
• rich payload filtering, scalar/product quantization, self-hosted or Qdrant Cloud.
Weaviate
collection.query.near_text("...", limit=10)
• Open-source
• built-in vectorization modules (auto-embeds raw text), built-in hybrid BM25+vector search
• GraphQL API.
Milvus / Zilliz
client.search(collection_name="docs", data=[q_emb], limit=10)
• Open-source
• GPU-accelerated, billion-scale distributed clusters
• Zilliz Cloud is the managed version.
Chroma
collection.query(query_texts=["..."], n_results=10)
• Embedded or client-server
• zero-setup developer experience
• best for prototyping and local development.
LanceDB
table.search(q_emb).limit(10).to_arrow()
• Zero-copy columnar storage (Lance format)
• embedded/in-process, disk-based indexing for larger-than-RAM datasets.

Table 6: Sparse and Hybrid Retrieval

Retriever | Example | Description
HybridSearch
collection.query.hybrid(query="...", alpha=0.5, limit=10)
• Combines sparse and dense vector signals into one ranked list
• covers both semantic similarity and exact-match needs.
BM25
\text{score}(q,d)=\sum_{t\in q} \text{IDF}(t)\cdot\frac{f(t,d)\cdot(k_1+1)}{f(t,d)+k_1\cdot(1-b+b\cdot\frac{\lvert d \rvert}{\text{avgdl}})}
• Term-based ranking model used for keyword (sparse) retrieval
• strong on exact terms, error codes, and product names.
MMR
\arg\max_{d\in R\setminus S}\left[\lambda\,\text{sim}(d,q)-(1-\lambda)\max_{d'\in S}\text{sim}(d,d')\right]
• Diversifies selected chunks by trading off relevance vs redundancy
• avoids returning near-duplicate passages.
MetadataFiltering
where={"path":["source"],"operator":"Equal","valueString":"handbook"}
Restricts candidates to documents matching structured metadata predicates before vector search.
Fusion (RRF)
\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}
• Rank aggregation to merge results from multiple retrievers without needing calibrated scores
• k=60 is common.
TopK
retriever = vectorstore.as_retriever(search_kwargs={"k": 10}); docs = retriever.invoke(q)
Controls how many candidates are returned from retrieval.
ScoreThreshold
vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8})
Filters results by minimum similarity score before passing context to the LLM.
Similarity (BM25 config)
"similarity": {"default": {"type": "BM25", "k1": 1.2, "b": 0.75}}
Configures per-field BM25 scoring parameters in Elasticsearch.
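
The BM25 and MMR formulas in this table can be written out directly. The sketch below makes several assumptions: tokenization is lowercase whitespace splitting, the IDF term (left unspecified in the formula above) uses one common smoothed variant, and MMR receives precomputed similarity values rather than computing them from embeddings.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (an assumed variant; other definitions exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def mmr(query_sims, doc_sims, k, lam=0.5):
    """Greedy MMR. query_sims[i] = sim(d_i, q); doc_sims[i][j] = sim(d_i, d_j)."""
    selected, remaining = [], list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def gain(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = [
    "error code E1234 means the disk is full",
    "the disk cleanup tool frees space",
    "network timeouts are retried automatically",
]
scores = bm25_scores("error E1234", docs)

# Documents 0 and 1 are near-duplicates, so MMR skips 1 in favor of 2.
picked = mmr(
    query_sims=[0.9, 0.85, 0.5],
    doc_sims=[[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]],
    k=2,
)
```

The exact-match query illustrates why sparse retrieval is strong on error codes: only the document containing the literal token scores above zero.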

Table 7: Query Transformation

Technique | Example | Description
QueryRewrite
"Rewrite the question for search: ..."
Converts user input into a search-optimized query string to improve vector recall.
MultiQueryRetriever
mqr = MultiQueryRetriever.from_llm(retriever, llm)
Uses an LLM to generate multiple query variants and unions retrieved docs for better recall.
HyDE
hyp = llm("Write a passage answering: ...")
docs = retriever.invoke(hyp)
• Retrieves using embeddings of a hypothetical document generated from the query
• helps with vague or sparse questions.
SelfQueryRetriever
sq = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info)
Uses an LLM to translate natural language into a structured query + metadata filters.
Decomposition
subqs = ["...", "..."]
Splits a complex question into sub-questions that are answered individually and combined.
Step-Back Prompting
broader = llm(f"What is the more general question behind: '{q}'")
docs = retriever.invoke(broader)
• First retrieves for a broader abstraction of the question before the specific query
• improves recall on specific questions.
Routing
route = router.invoke({"question": q})
• Selects a retriever/index/tool based on query intent or domain
• avoids wasting retrieval on the wrong corpus.
Expansion
q' = q + synonyms(q)
Adds related terms/phrases to improve recall in sparse retrieval.
Rephrase
docs = retriever.invoke(rephrased_q)
Normalizes user phrasing to reduce retrieval mismatch from casual or ambiguous language.
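
The multi-query idea (generate variants, retrieve for each, union the results) can be sketched without any framework. Everything here is a stand-in: `generate_variants` is a hypothetical function that fakes the LLM paraphrasing step with string replacement, and `keyword_retriever` is a toy word-overlap ranker over an in-memory corpus.

```python
CORPUS = {
    "doc1": "how to fix a disk error",
    "doc2": "steps to resolve a disk failure",
    "doc3": "configuring network proxies",
}

def generate_variants(question):
    """Hypothetical stand-in for an LLM call that paraphrases the question."""
    paraphrase = question.replace("fix", "resolve").replace("error", "failure")
    return [question, paraphrase]

def keyword_retriever(query, k=2):
    """Toy retriever: rank documents by number of words shared with the query."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:k]

def multi_query_retrieve(question, k=2):
    """Retrieve for every variant and union the results, keeping first-seen order."""
    seen, merged = set(), []
    for variant in generate_variants(question):
        for doc_id in keyword_retriever(variant, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

single = keyword_retriever("fix disk error", k=1)
multi = multi_query_retrieve("fix disk error", k=1)
```

The paraphrase surfaces `doc2`, which the original wording ranks below the cutoff. That is the recall gain the technique is after.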

Table 8: Reranking and Fusion

Ranker | Example | Description
Cross-encoder reranking
scores = cross_encoder.predict([(q, d) for d in docs])
• Scores each query-document pair jointly with full attention for higher-precision top-k
• compute cost is small vs LLM call.
Cohere Rerank
co.rerank(model="rerank-english-v3.0", query=q, documents=texts, top_n=10)
• API reranker producing a relevance-ordered list with scores
• Anthropic reports that adding reranking on top of contextual retrieval reduces retrieval failures by 67%.
BGE Reranker
reranker = FlagReranker("BAAI/bge-reranker-v2-m3"); scores = reranker.compute_score([(q,d) for d in docs])
• Open-source cross-encoder reranker
• multilingual, competitive with commercial rerankers
• self-hostable.
Reciprocal Rank Fusion (RRF)
\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}
• Fuses multiple ranked lists without needing calibrated similarity scores
• k=60 is a common default.
EnsembleRetriever
ens = EnsembleRetriever(retrievers=[r1, r2], weights=[0.5, 0.5])
Combines multiple retrievers and applies rank fusion to merge results.
ContextualCompressionRetriever
cc = ContextualCompressionRetriever(base_retriever=r, base_compressor=compressor)
Retrieves then compresses documents to only the query-relevant parts, reducing context noise.
Similarity cutoff
post = SimilarityPostprocessor(similarity_cutoff=0.8)
Drops nodes below a minimum similarity threshold before synthesis.
Deduplication
unique = list({d.page_content: d for d in docs}.values())
Removes duplicate chunks before feeding context to the LLM.
TopN
top_n=10
Limits reranker output to the highest-scoring items only.
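
Reciprocal Rank Fusion is short enough to write out in full. This sketch follows the formula in the table with k=60; the document ids and the two input rankings are made up, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["a", "b", "c"]    # e.g. results from a vector retriever
sparse = ["b", "d", "a"]   # e.g. results from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks are used, the dense and sparse scores never need to be on a comparable scale.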

Table 9: Query Engines and Answer Grounding

Engine | Example | Description
RetrieverQueryEngine
qe = RetrieverQueryEngine.from_args(retriever=retriever)
resp = qe.query("...")
LlamaIndex query engine that retrieves nodes then synthesizes a response.
CitationQueryEngine
qe = CitationQueryEngine.from_args(index=index)
resp = qe.query("...")
Generates answers with inline source citations anchored to retrieved chunks.
Grounding
"Answer only using the provided context."
• Forces the LLM to base claims on retrieved evidence rather than latent knowledge
• core anti-hallucination mechanism.
PromptTemplate
prompt = ChatPromptTemplate.from_messages([("system", "..."), ("human", "{question}")])
Parameterizes prompts so retrieval context and user input can be inserted reliably.
ResponseMode
response_mode=ResponseMode.COMPACT
Controls how retrieved text is composed into prompts and how answers are formed.
ContextWindow
max_output_tokens=512
Limits generation length (and indirectly budgets room for retrieved context).
Streaming
stream=True
• Streams partial tokens while a completion is being generated
• reduces perceived latency.
CiteGranularity
citation_chunk_size=512
Sets the chunk size used to form citation units for per-source attribution.
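
Grounding and citations can be combined in the prompt and then checked after generation. The following is a sketch under stated assumptions: `build_grounded_prompt` and `check_citations` are hypothetical helpers, the source records are invented, and the answer string is hard-coded in place of a real model response.

```python
import re

def build_grounded_prompt(question, chunks):
    """Number each chunk so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer only using the provided context. "
        "Cite sources as [n] after each claim. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{sources}\n\nQuestion: {question}"
    )

def check_citations(answer, num_sources):
    """Return (valid, invalid) citation numbers found in the answer."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return sorted(valid), sorted(cited - valid)

chunks = [
    {"source": "handbook.pdf#p12", "text": "Employees accrue 20 vacation days per year."},
    {"source": "handbook.pdf#p13", "text": "Unused days expire on March 31."},
]
prompt = build_grounded_prompt("How many vacation days do I get?", chunks)

# Hard-coded stand-in for a model response that cites a non-existent source.
answer = "You accrue 20 days per year [1]. They roll over forever [3]."
valid, invalid = check_citations(answer, len(chunks))
```

A citation pointing outside the supplied sources is a cheap signal that a claim may not be grounded, though a valid number alone does not prove the claim is supported.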

Table 10: Advanced RAG Architectures

Pattern | Example | Description
Agentic RAG
tools=[search_knowledge_base]; agent.run(question)
• LLM decides when, what, and how to retrieve as a tool call
• handles multi-step and conditional retrieval needs.
GraphRAG
graphrag.query(query_type="local", query="...")
Microsoft's approach: extracts a knowledge graph from the corpus, builds community summaries, and retrieves via local or global search.
Corrective RAG (CRAG)
if evaluator.score(doc, q) == "incorrect": docs = web_search(q)
Lightweight evaluator grades each retrieved document (correct/ambiguous/incorrect) and triggers web search on failures.
Self-RAG
# model uses reflection tokens: [Retrieve], [ISREL], [ISSUP]
Trains a single LM to adaptively retrieve on-demand and self-critique retrieved passages and its own generations.
Multi-Hop RAG
for hop in range(MAX_HOPS): docs=retrieve(q); q=refine(q,docs)
• Chains multiple retrieval steps where each hop's results inform the next query
• needed for questions spanning multiple documents.
Contextual Retrieval
chunk = f"{llm_context(doc, chunk)}\n\n{chunk}"; embed(chunk)
• Anthropic technique: prepends chunk-specific context before embedding and BM25 indexing
• reduces retrieval failures by 49%, and by 67% when combined with reranking (figures reported by Anthropic).
Adaptive RAG
route = classifier(query); pipeline = routes[route]
A query complexity classifier routes each query to the appropriate pipeline — no retrieval, single-hop, or multi-hop — saving cost on simple queries.
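
The Corrective RAG control flow can be sketched with stubs. Assumptions are significant here: `grade` uses word overlap with arbitrary thresholds in place of a trained evaluator, `web_search` is a hypothetical placeholder that returns a canned string, and the question is assumed to be non-empty.

```python
def grade(doc, question):
    """Hypothetical evaluator: word overlap stands in for a trained grader."""
    q_terms = set(question.lower().split())
    overlap = len(q_terms & set(doc.lower().split())) / len(q_terms)
    if overlap >= 0.6:
        return "correct"
    if overlap >= 0.3:
        return "ambiguous"
    return "incorrect"

def web_search(question):
    """Hypothetical fallback; a real system would call a search API here."""
    return [f"web result for: {question}"]

def corrective_retrieve(question, retrieved):
    """Drop documents graded incorrect; add web results unless one is graded correct."""
    grades = [grade(d, question) for d in retrieved]
    kept = [d for d, g in zip(retrieved, grades) if g != "incorrect"]
    if "correct" not in grades:
        kept += web_search(question)
    return kept

q = "reset admin password"
confident = corrective_retrieve(q, ["how to reset the admin password", "office lunch menu"])
fallback = corrective_retrieve(q, ["office lunch menu"])
mixed = corrective_retrieve(q, ["password policy overview"])
```

The three calls cover the three branches: internal evidence only, web search only, and both when the evaluator is unsure.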

Table 11: Storage, Persistence, and Caching

Store | Example | Description
VectorStore
vectors = embed(texts); upsert(vectors, metadata)
Persists embeddings + payloads for similarity search at retrieval time.
DocStore
docstore.add_documents(docs)
• Stores full documents (separate from chunk/node indexes)
• used by ParentDocumentRetriever.
IndexStore
index_store.persist(persist_dir="./storage")
Persists index metadata/structures for reload without rebuilding.
StorageContext
ctx = StorageContext.from_defaults(persist_dir="./storage")
Bundles storage backends used by an index/query pipeline.
Persistence
CREATE TABLE items(id bigserial, content text, embedding vector(1536));
Makes embeddings/queryable data durable in a database.
Incremental Indexing
if doc.hash != stored.hash: re_embed(doc)
Re-indexes only changed documents to keep the vector store current without full rebuilds.
EmbeddingCache
cache_key = sha256(text)
• Avoids recomputing embeddings for identical inputs
• critical for cost control at scale.
Namespace
index.query(namespace="prod", vector=q, top_k=10)
Separates tenant or environment data within one vector index for multi-tenant isolation.
TTL
SET key value EX 3600
Expires cached retrieval/generation artifacts after a time-to-live window.
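
The EmbeddingCache and Incremental Indexing rows work well together: hash the content, skip unchanged documents, and never embed the same text twice. This is a minimal in-memory sketch; `fake_embed` is a hypothetical stand-in for a paid embedding call, and a production cache would live in a persistent store.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class EmbeddingCache:
    """Memoize embeddings by the SHA-256 of the input text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text):
        key = content_hash(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text):
    """Hypothetical stand-in for a paid embedding API call."""
    return [float(len(text)), float(text.count(" "))]

def sync(docs, stored_hashes, cache):
    """Re-embed only documents whose content hash changed; return their ids."""
    changed = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if stored_hashes.get(doc_id) != digest:
            cache.get(text)
            stored_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

cache = EmbeddingCache(fake_embed)
hashes = {}
first = sync({"a": "alpha text", "b": "beta text"}, hashes, cache)
second = sync({"a": "alpha text", "b": "beta text v2"}, hashes, cache)
```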

Table 12: RAG Evaluation Metrics and Frameworks

Metric | Example | Description
Faithfulness
faithfulness ∈ [0,1]
• Measures consistency with retrieved context
• the primary anti-hallucination metric.
Answer Relevancy
answer_relevancy ∈ [0,1]
Measures how well the answer addresses the question.
Context Precision
context_precision ∈ [0,1]
Measures whether retrieved contexts are useful for answering the question.
Context Recall
context_recall ∈ [0,1]
Measures how much of the needed information is present in retrieved contexts.
Factual Correctness
factual_correctness ∈ [0,1]
Measures whether the answer is factually correct against a reference.
Noise Sensitivity
noise_sensitivity ∈ [0,1]
Measures robustness to irrelevant context — does the answer degrade when noisy chunks are included?
Context Entities Recall
context_entities_recall ∈ [0,1]
Measures entity recall over retrieved context vs reference.
Aspect Critic
aspect_critic
LLM-judge style metric for assessing a specific aspect of the output (e.g., conciseness, harmlessness).
DeepEval
def test_rag(): assert_test(LLMTestCase(input=q, actual_output=answer, retrieval_context=chunks), [AnswerRelevancyMetric()])
• pytest-compatible LLM testing framework with 14+ metrics
• designed for CI/CD quality gates.
Arize Phoenix
px.launch_app(); tracer_provider = register(project_name="rag")
• Open-source AI observability platform
• OpenTelemetry-based tracing with built-in RAG evaluators
• self-hostable.
LangSmith
results = langsmith.evaluate(lambda inputs: chain.invoke(inputs), data="rag-eval", evaluators=[my_evaluator])
• LangChain-native tracing and evaluation platform
• deep visibility into chain execution steps and LLM calls.
Golden Dataset
[{"question": q, "ground_truth": a, "source_docs": []}]
• Curated Q&A set with approved answers and source documents
• used to benchmark pipeline changes before deployment.
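
A golden dataset also supports simple, deterministic retrieval checks alongside the LLM-judged metrics above. The sketch below computes ID-based precision@k and recall@k. It is a simplified stand-in and not the Ragas definition of context precision or recall; the dataset rows and retrieval runs are invented.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

golden = [
    {"question": "q1", "source_docs": ["d1", "d2"]},
    {"question": "q2", "source_docs": ["d3"]},
]
# Hypothetical retriever output per question, best match first.
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d3"]}

report = {
    row["question"]: (
        precision_at_k(runs[row["question"]], row["source_docs"], k=2),
        recall_at_k(runs[row["question"]], row["source_docs"], k=2),
    )
    for row in golden
}
```

Checks like this are cheap enough to run on every pipeline change, with LLM-judged metrics reserved for a smaller sample.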

Table 13: Observability and Security

Signal | Example | Description
GenAI spans
gen_ai.operation.name = "chat"
Standardizes tracing semantics for GenAI operations (inference, retrieval, tools) via OpenTelemetry.
Retrieval span
gen_ai.retrieval.count = 10
Captures retrieval metadata (chunk count, latency, scores) to debug relevance vs latency tradeoffs.
Token metrics
gen_ai.client.token.usage
Records token usage to monitor cost and performance over time.
Langfuse
langfuse = Langfuse(); trace = langfuse.trace(name="rag-query")
Open-source LLM observability platform with trace-based debugging, evals, and a prompt management UI.
Prompt Injection
"Ignore previous instructions and ..."
Attacker-controlled text, supplied directly by the user or planted in retrieved documents (indirect injection), attempts to override system/developer intent.
Sensitive Info Disclosure
"Print your hidden prompt"
Leakage of secrets, system prompts, or private data via model outputs.
Data Poisoning
"Upload malicious docs into the KB"
Corrupts the retrieval corpus so the model is grounded in incorrect or malicious context.
Insecure Output Handling
render_html(llm_output)
Treating model output as trusted can lead to downstream injection or execution.
Model DoS
"Summarize 10MB of text"
Attacks that drive excessive compute/cost via large inputs or adversarial usage.
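
For the Insecure Output Handling row, the usual mitigation is to treat model output like any other untrusted input. Here is a minimal sketch that uses only the standard library; `render_llm_output` is a hypothetical helper, and real applications would normally rely on their template engine's auto-escaping.

```python
import html

def render_llm_output(llm_output):
    """Escape model output before embedding it in an HTML page."""
    return f'<div class="answer">{html.escape(llm_output)}</div>'

# A response that carries markup, for example from a poisoned document.
rendered = render_llm_output('<script>alert("x")</script>')
```

The same principle applies to any downstream sink: model output passed to SQL, shell commands, or tool arguments needs that sink's own escaping or parameterization.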

Table 14: Multimodal RAG

Technique | Example | Description
ColPali
model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); model.index(pdf_folder)
• VLM-based retriever that produces multi-vector embeddings directly from document page images via late interaction
• no OCR needed.
Byaldi
docs_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); results = docs_model.search(query, k=3)
Python library wrapping ColPali with a familiar API for indexing PDFs and searching by visual content.
VLM Synthesis
vl_model.generate(images=retrieved_pages, text=query)
Uses a Vision Language Model (e.g., Qwen2-VL, GPT-4V) to answer based on retrieved document page images.
OCR Pipeline
text = ocr_engine.extract(page_image); embed(text)
• Traditional text-extraction pipeline before embedding
• can be outperformed by ColPali-style retrieval on documents with complex layouts, tables, and figures.
