Retrieval-Augmented Generation (RAG) is an LLM application pattern that retrieves external knowledge at query time and injects it into the model's context to produce grounded answers. Practitioners use RAG to reduce hallucinations, incorporate fresh or private data, and make outputs auditable through source citations. A useful mental model is that RAG is two coupled systems: an information retrieval system (indexes, rankers, filters) and a response synthesis system (prompting, citations, formatting). By 2026, production RAG has evolved well beyond naive chunk-and-retrieve: the dominant pattern is agentic RAG, where the LLM itself decides when, what, and how to retrieve. Most "RAG problems" are retrieval problems first: if the right evidence doesn't make it into context, the generation step can't recover it.
14 tables, 119 concepts.
Table 1: RAG Building Blocks (Conceptual)
| Stage | Example | Description |
|---|---|---|
| RAG (retrieve, then generate) | `answer = LLM(question, context=top_k_docs)` | Generates with retrieved evidence rather than relying only on parametric memory. |
| Document loading | `docs = loader.load_data()` | Reads source data and converts it into document objects for downstream processing. |
| Chunking | `splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)` | Splits documents into smaller units so retrieval can target the right passages. |
| Embedding | `emb = client.embeddings.create(model="text-embedding-3-small", input="...")` | Maps text to vectors for similarity search in dense retrieval. |
| Indexing | `index.add(xb)` | Builds a search structure over vectors (exact or ANN) to enable fast retrieval. |
| Retrieval | `docs = retriever.invoke("...")` | Selects candidate chunks/documents relevant to the query. |
| Reranking | `reranked = co.rerank(query=q, documents=texts)` | Reorders retrieved candidates to improve top-k quality. |
| Response synthesis | `resp = query_engine.query("...")` | Produces the final answer from retrieved context (often via an LLM). |
| Citation | `resp = citation_engine.query("...")` | Attaches sources to claims (usually at chunk-level granularity). |
| Evaluation | `score = faithfulness` | Measures retrieval and generation quality with task-appropriate metrics. |
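
The stages in Table 1 can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names, not any library's API. The resulting prompt would be passed to whichever LLM client you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Number each chunk so the answer can cite it as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer only using the provided context.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


chunks = [
    "HNSW is a graph-based index for approximate nearest neighbor search.",
    "BM25 is a term-based ranking function for keyword retrieval.",
    "Reranking reorders retrieved candidates with a cross-encoder.",
]
question = "What kind of index is HNSW?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

Swapping the toy `embed` for a real model and the sorted scan for an ANN index changes the components but not the shape of the pipeline.
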
Table 2: Chunking and Splitting
| Splitter | Example | Description |
|---|---|---|
| Recursive character splitting | `RecursiveCharacterTextSplitter(separators=["\n\n","\n"," ",""], chunk_size=1000, chunk_overlap=200)` | General-purpose splitter that tries separators in order to form sized chunks; commonly recommended as the default starting point in chunking benchmarks. |
| Semantic chunking | `SemanticChunker(embeddings, breakpoint_threshold_type="percentile")` | Splits at semantic boundary breaks in embedding space; keeps topically coherent passages together. |
| Token splitting | `TokenTextSplitter(chunk_size=512, chunk_overlap=64)` | Splits by token count (useful when you need predictable model-context usage). |
| Sentence splitting | `SentenceSplitter(chunk_size=512, chunk_overlap=50)` | Splits text into sentence-based chunks with size/overlap controls. |
| Markdown header splitting | `MarkdownHeaderTextSplitter(headers_to_split_on=[("#","h1"),("##","h2")])` | Splits Markdown while preserving header structure as metadata. |
| HTML header splitting | `HTMLHeaderTextSplitter(headers_to_split_on=[("h1","h1"),("h2","h2")])` | Splits HTML by header tags to keep section semantics. |
| Character splitting | `CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=200)` | Simple splitter using a fixed separator and target chunk size. |
| Contextual chunking | `contextualized_chunk = f"CONTEXT: {llm(doc, chunk)}\n\n{chunk}"` | Prepends an LLM-generated description of where each chunk fits in its document. Anthropic reported a 35% drop in top-20 retrieval failures with contextual embeddings alone and 49% when combined with contextual BM25. |
| Parent-document retrieval | `ParentDocumentRetriever(vectorstore=vs, docstore=store, child_splitter=small_splitter)` | Indexes small child chunks for precise matching but returns the larger parent chunk to the LLM for richer context. |
| Chunk overlap | `RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)` | Repeats tail tokens/characters across adjacent chunks to reduce boundary misses. |
| Chunk metadata | `Document(page_content=text, metadata={"source": url, "page": 12})` | Carries provenance and filter fields (e.g., doc id, page, section) through retrieval. |
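
To make the overlap and metadata rows above concrete, here is a minimal fixed-size character chunker. It is a sketch, not a replacement for the library splitters in the table: `chunk_text` and its dictionary layout are hypothetical, and it ignores separators entirely.

```python
def chunk_text(
    text: str, source: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[dict]:
    """Split text into fixed-size character windows that share an overlap.

    Each chunk carries provenance metadata (source and start offset) so it can
    be filtered and cited after retrieval.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {
                "text": text[start:start + chunk_size],
                "metadata": {"source": source, "start": start},
            }
        )
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


pieces = chunk_text("abcdefghij", source="demo.txt", chunk_size=4, chunk_overlap=2)
```

With `chunk_size=4` and `chunk_overlap=2`, each chunk repeats the last two characters of the previous one, which is the property that protects sentences straddling a boundary.
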
Table 3: Embedding Models
| Model | Example | Description |
|---|---|---|
| OpenAI text-embedding-3-small | `client.embeddings.create(model="text-embedding-3-small", input="...")` | OpenAI's cost-effective embedding model; 1536-dim, supports dimension reduction to 512 with minimal recall loss. |
| OpenAI text-embedding-3-large | `client.embeddings.create(model="text-embedding-3-large", input="...")` | OpenAI's high-accuracy model; 3072-dim, supports reduction down to 256 dimensions; strong English performance across MTEB. |
| Cohere embed-v4.0 | `co.embed(texts=[...], model="embed-v4.0", input_type="search_document")` | Multimodal and multilingual; 1536-dim, with a reported MTEB score of about 65.2; a strong fit for non-English corpora and mixed-modality inputs. |
| Voyage voyage-3-large | `vo.embed(texts, model="voyage-3-large", input_type="document")` | Voyage AI's flagship model; reported to outperform text-embedding-3-large on MTEB; domain variants available (code, finance, law). |
| BGE-M3 | `model = BGEM3FlagModel("BAAI/bge-m3"); model.encode(texts)` | Open-source; supports dense, sparse, and multi-vector retrieval in one model; self-hostable on a single GPU. |
| E5-Mistral-7B-Instruct | `model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")` | Open-source 7B model; competitive with proprietary models on MTEB; instruction-tuned for strong passage retrieval. |
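
The dimension reduction mentioned for the text-embedding-3 models can be requested through the API's `dimensions` parameter, or approximated client-side by truncating and re-normalizing. The sketch below shows only the client-side arithmetic; `shorten_embedding` is a hypothetical helper, and it assumes the model was trained so that leading dimensions carry most of the signal, which is not true of every embedding model.

```python
import math


def shorten_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length.

    Assumption: the embedding model supports shortening (for example, it was
    trained with a Matryoshka-style objective). For other models, truncation
    can discard arbitrary information.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


short = shorten_embedding([3.0, 4.0, 12.0], dim=2)
```

Re-normalizing matters because downstream cosine or dot-product search usually assumes unit-length vectors.
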
Table 4: Dense Indexing and ANN
| Index | Example | Description |
|---|---|---|
| HNSW | `CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);` | Graph-based ANN structure; the de facto standard in production vector DBs because it gives low query latency at high recall. |
| IVFFlat | `CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);` | Inverted-file ANN that partitions vectors into lists and probes a subset at query time; lower build cost than HNSW. |
| Flat L2 (exact) | `index = faiss.IndexFlatL2(d)` | Exact brute-force L2-distance baseline in FAISS; useful for small corpora or ground-truth evaluation. |
| Flat inner product (exact) | `index = faiss.IndexFlatIP(d)` | Exact inner-product search baseline in FAISS. |
| Product quantization (PQ) | `index = faiss.IndexPQ(d, m, nbits)` | Compresses each vector into m sub-quantizer codes of nbits bits each (m·nbits/8 bytes instead of 4·d for float32), trading some recall for lower memory and faster search. |
| Scalar quantization (Qdrant) | `ScalarQuantization(scalar=ScalarQuantizationConfig(type=ScalarType.INT8))` | Quantizes each dimension to an 8-bit integer; lighter than PQ with minimal recall loss; widely used in Qdrant. |
| Cosine distance | `ORDER BY embedding <=> $1 LIMIT 10` | Orders by cosine distance (the `<=>` operator in pgvector), the common choice for normalized embeddings. |
| Inner product | `scores = x @ q` | Inner-product similarity; equivalent to cosine similarity when vectors are L2-normalized. |
| L2 normalization | `faiss.normalize_L2(x)` | Normalizes vectors in place so that inner-product search is equivalent to cosine similarity. |
| ColBERT (late interaction) | `rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0"); rag.search(q, k=5)` | Stores per-token embeddings and scores with MaxSim (for each query token, the maximum similarity over document tokens, summed across query tokens); near cross-encoder accuracy at near bi-encoder speed. |
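
Two rows above lend themselves to a small worked example: after L2 normalization the dot product equals cosine similarity, and MaxSim sums, over query tokens, the best match among document tokens. The following is a pure-Python sketch of that scoring rule only; `maxsim` is a hypothetical helper and the two-dimensional "token embeddings" are made up for illustration.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the max cosine
    similarity to any document token."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    # On unit vectors the dot product is the cosine similarity.
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[2.0, 0.0], [0.0, 3.0]]  # one document token aligned with each query token
doc_b = [[1.0, 1.0]]              # a single token halfway between both
score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

`doc_a` wins because every query token finds an exact match, which is the behavior that lets late interaction reward token-level evidence that a single pooled vector would blur.
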
Table 5: Vector Databases
| Database | Example | Description |
|---|---|---|
| pgvector | `CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);` | PostgreSQL extension; keeps vectors and app data in the same table/transaction; best default for fewer than 5M vectors if you already run Postgres. |
| Pinecone | `index.upsert(vectors=[(id, emb, meta)]); index.query(vector=q, top_k=10)` | Fully managed SaaS; zero-ops serverless, auto-scales to billions of vectors; supports sparse-dense hybrid search. |
| Qdrant | `client.search("docs", query_vector=q, query_filter=Filter(...), limit=10)` | Open-source Rust-native DB; rich payload filtering, scalar/product quantization; self-hosted or Qdrant Cloud. |
| Weaviate | `collection.query.near_text("...", limit=10)` | Open-source; built-in vectorization modules (auto-embeds raw text) and built-in hybrid BM25+vector search; GraphQL API. |
| Milvus | `client.search(collection_name="docs", data=[q_emb], limit=10)` | Open-source; GPU-accelerated, billion-scale distributed clusters; Zilliz Cloud is the managed version. |
| Chroma | `collection.query(query_texts=["..."], n_results=10)` | Embedded or client-server; zero-setup developer experience; best for prototyping and local development. |
| LanceDB | `table.search(q_emb).limit(10).to_arrow()` | Zero-copy columnar storage (Lance format); embedded/in-process, with disk-based indexing for larger-than-RAM datasets. |
Table 6: Sparse and Hybrid Retrieval
| Retriever | Example | Description |
|---|---|---|
| Hybrid search | `collection.query.hybrid(query="...", alpha=0.5, limit=10)` | Combines sparse and dense signals into one ranked list; covers both semantic similarity and exact-match needs. |
| BM25 | $\text{score}(q,d)=\sum_{t\in q} \text{IDF}(t)\cdot\frac{f(t,d)\cdot(k_1+1)}{f(t,d)+k_1\cdot\left(1-b+b\cdot\frac{\lvert d \rvert}{\text{avgdl}}\right)}$ | Term-based ranking model used for keyword (sparse) retrieval; strong on exact terms, error codes, and product names. |
| Maximal marginal relevance (MMR) | $\arg\max_{d\in R\setminus S}\left[\lambda\,\text{sim}(d,q)-(1-\lambda)\max_{d'\in S}\text{sim}(d,d')\right]$ | Diversifies selected chunks by trading off relevance against redundancy; avoids returning near-duplicate passages. |
| Metadata filtering | `where={"path":["source"],"operator":"Equal","valueString":"handbook"}` | Restricts candidates to documents matching structured metadata predicates before vector search. |
| Reciprocal rank fusion (RRF) | $\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}$ | Rank aggregation that merges results from multiple retrievers without needing calibrated scores; k=60 is common. |
| Top-k | `retriever = vectorstore.as_retriever(search_kwargs={"k": 10})` | Controls how many candidates are returned from retrieval. |
| Similarity score threshold | `retriever = vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8})` | Filters results by minimum similarity score before passing context to the LLM. |
| BM25 parameters (Elasticsearch) | `"similarity": {"default": {"type": "BM25", "k1": 1.2, "b": 0.75}}` | Configures BM25 scoring parameters in the Elasticsearch index settings; `default` applies index-wide, and custom named similarities can be assigned per field. |
Table 7: Query Transformation
| Technique | Example | Description |
|---|---|---|
| Query rewriting | "Rewrite the question for search: ..." | Converts user input into a search-optimized query string to improve vector recall. |
| Multi-query retrieval | `mqr = MultiQueryRetriever.from_llm(retriever, llm)` | Uses an LLM to generate multiple query variants and unions the retrieved docs for better recall. |
| HyDE (hypothetical document embeddings) | `hyp = llm("Write a passage answering: ..."); docs = retriever.invoke(hyp)` | Retrieves using the embedding of a hypothetical document generated from the query; helps with vague or sparse questions. |
| Self-query retrieval | `sq = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info)` | Uses an LLM to translate natural language into a structured query plus metadata filters. |
| Query decomposition | `subqs = ["...", "..."]` | Splits a complex question into sub-questions that are answered individually and combined. |
| Step-back prompting | `broader = llm(f"What is the more general question behind: '{q}'"); docs = retriever.invoke(broader)` | Retrieves for a broader abstraction of the question before the specific query; improves recall on narrowly phrased questions. |
| Query routing | `route = router.invoke({"question": q})` | Selects a retriever/index/tool based on query intent or domain; avoids wasting retrieval on the wrong corpus. |
| Query expansion | `q' = q + synonyms(q)` | Adds related terms/phrases to improve recall in sparse retrieval. |
| Query rephrasing | `docs = retriever.invoke(rephrased_q)` | Normalizes user phrasing to reduce retrieval mismatch from casual or ambiguous language. |
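
The multi-query row above boils down to "retrieve for each variant, then union while preserving order". Here is a library-free sketch of that merge step; `multi_query_retrieve` is a hypothetical helper, and the `rewrite` and `retrieve` callables are stubs standing in for an LLM and a real retriever.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    rewrite: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve for the original question and each rewritten variant,
    returning the union of results in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in [question, *rewrite(question)]:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Stubs for illustration only: a real system would call an LLM and a vector store.
fake_index = {
    "how do I reset my password": ["doc_reset"],
    "password reset steps": ["doc_reset", "doc_account"],
    "recover account access": ["doc_recovery"],
}


def fake_rewrite(question: str) -> list[str]:
    return ["password reset steps", "recover account access"]


def fake_retrieve(query: str) -> list[str]:
    return fake_index.get(query, [])


results = multi_query_retrieve("how do I reset my password", fake_rewrite, fake_retrieve)
```

The variants surface `doc_account` and `doc_recovery`, which the original phrasing alone would have missed; a reranker is usually applied afterwards because the union has no meaningful global order.
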
Table 8: Reranking and Fusion
| Ranker | Example | Description |
|---|---|---|
| Cross-encoder reranking | `scores = cross_encoder.predict([(q, d) for d in docs])` | Scores each query-document pair jointly with full attention for higher-precision top-k; compute cost is small relative to the LLM call. |
| Cohere Rerank | `co.rerank(model="rerank-english-v3.0", query=q, documents=texts, top_n=10)` | API reranker producing a relevance-ordered list with scores. In Anthropic's contextual retrieval experiments, adding a reranker on top of contextual embeddings and contextual BM25 cut top-20 retrieval failures by 67%. |
| BGE reranker | `reranker = FlagReranker("BAAI/bge-reranker-v2-m3"); scores = reranker.compute_score([(q,d) for d in docs])` | Open-source cross-encoder reranker; multilingual, competitive with commercial rerankers; self-hostable. |
| Reciprocal rank fusion (RRF) | $\text{RRF}(d)=\sum_i \frac{1}{k+\text{rank}_i(d)}$ | Fuses multiple ranked lists without needing calibrated similarity scores; k=60 is a common default. |
| Ensemble retriever | `ens = EnsembleRetriever(retrievers=[r1, r2], weights=[0.5, 0.5])` | Combines multiple retrievers and applies weighted rank fusion to merge results. |
| Contextual compression | `cc = ContextualCompressionRetriever(base_retriever=r, base_compressor=compressor)` | Retrieves, then compresses documents to only the query-relevant parts, reducing context noise. |
| Similarity cutoff | `post = SimilarityPostprocessor(similarity_cutoff=0.8)` | Drops nodes below a minimum similarity threshold before synthesis. |
| Deduplication | `unique = list({d.page_content: d for d in docs}.values())` | Removes duplicate chunks before feeding context to the LLM. |
| Top-n truncation | `top_n=10` | Limits reranker output to the highest-scoring items only. |
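
The RRF formula in the table translates almost directly into code. The sketch below assumes each retriever returns an ordered list of document ids with rank 1 as the best; `rrf` is a hypothetical helper name.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    Documents missing from a list simply contribute nothing for that list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c"]   # e.g. results from a vector search
sparse = ["b", "d", "a"]  # e.g. results from a BM25 search
fused = rrf([dense, sparse])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `b` edges out `a` because it ranks first and second across the two lists, while `a` ranks first and third. Only ranks are used, so the raw scores of the two retrievers never need to be on the same scale.
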
Table 9: Query Engines and Answer Grounding
| Engine | Example | Description |
|---|---|---|
| RetrieverQueryEngine | `qe = RetrieverQueryEngine.from_args(retriever=retriever); resp = qe.query("...")` | LlamaIndex query engine that retrieves nodes, then synthesizes a response. |
| CitationQueryEngine | `qe = CitationQueryEngine.from_args(index=index); resp = qe.query("...")` | Generates answers with inline source citations anchored to retrieved chunks. |
| Grounding instruction | "Answer only using the provided context." | Instructs the LLM to base claims on retrieved evidence rather than latent knowledge; a core anti-hallucination mechanism. |
| Prompt template | `prompt = ChatPromptTemplate.from_messages([("system", "..."), ("human", "{question}")])` | Parameterizes prompts so retrieval context and user input can be inserted reliably. |
| Response mode | `response_mode=ResponseMode.COMPACT` | Controls how retrieved text is composed into prompts and how answers are formed. |
| Output token limit | `max_output_tokens=512` | Limits generation length (and indirectly budgets room for retrieved context). |
| Streaming | `stream=True` | Streams partial tokens while a completion is being generated; reduces perceived latency. |
| Citation chunk size | `citation_chunk_size=512` | Sets the chunk size used to form citation units for per-source attribution. |
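
Grounding, citation units, and context budgeting interact: whatever is packed into the prompt is all the model can cite. A minimal sketch of that packing step follows. It is not how any of the engines above work internally; `pack_context` is a hypothetical helper, and whitespace-separated words are used as a crude stand-in for tokens.

```python
def pack_context(chunks: list[dict], budget: int) -> tuple[str, list[str]]:
    """Pack best-first chunks into a numbered context block within a budget.

    Assumptions: `chunks` is already sorted by relevance, each chunk has
    "text" and "source" keys, and word count approximates token count.
    Returns the context string and the sources in citation order.
    """
    lines: list[str] = []
    sources: list[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk["text"].split())
        if used + cost > budget:
            break  # stop at the first chunk that no longer fits
        sources.append(chunk["source"])
        lines.append(f"[{len(sources)}] {chunk['text']}")
        used += cost
    return "\n".join(lines), sources


ranked = [
    {"text": "a b c", "source": "handbook.pdf#p3"},
    {"text": "d e", "source": "faq.md"},
    {"text": "f g h i", "source": "wiki/page"},
]
context, cited = pack_context(ranked, budget=5)
```

Because the citation number `[n]` is the position in `cited`, an answer that references `[2]` can be mapped back to `faq.md` without any further lookup.
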
Table 10: Advanced RAG Architectures
| Pattern | Example | Description |
|---|---|---|
| Agentic RAG | `tools=[search_knowledge_base]; agent.run(question)` | The LLM decides when, what, and how to retrieve through tool calls; handles multi-step and conditional retrieval needs. |
| GraphRAG | `graphrag.query(query_type="local", query="...")` | Microsoft's approach: extracts a knowledge graph from the corpus, builds community summaries, and retrieves via local or global search. |
| Corrective RAG (CRAG) | `grade = evaluator.score(doc, q); if grade == "incorrect": web_search(q)` | A lightweight evaluator grades each retrieved document (correct/ambiguous/incorrect) and triggers web search on failures. |
| Self-RAG | `# model uses reflection tokens: [Retrieve], [ISREL], [ISSUP]` | Trains a single LM to retrieve on demand and to self-critique both the retrieved passages and its own generations. |
| Multi-hop retrieval | `for hop in range(MAX_HOPS): docs=retrieve(q); q=refine(q,docs)` | Chains multiple retrieval steps where each hop's results inform the next query; needed for questions spanning multiple documents. |
| Contextual retrieval | `chunk = f"{llm_context(doc, chunk)}\n\n{chunk}"; embed(chunk)` | Anthropic technique: prepends chunk-specific context before embedding and BM25 indexing; reported to reduce top-20 retrieval failures by 49%, and by 67% when combined with reranking. |
| Adaptive RAG | `route = classifier(query); pipeline = routes[route]` | A query-complexity classifier routes each query to the appropriate pipeline (no retrieval, single-hop, or multi-hop), saving cost on simple queries. |
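
The corrective pattern in the table is essentially a graded fallback. The sketch below is a deliberate simplification of CRAG (it ignores the "ambiguous" branch and any knowledge refinement), and every callable passed in is a stub: `corrective_retrieve`, `retrieve`, `grade`, and `web_search` are hypothetical names, not part of any library.

```python
from typing import Callable


def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], str],
    web_search: Callable[[str], list[str]],
    min_relevant: int = 1,
) -> tuple[list[str], str]:
    """Keep documents graded "correct"; fall back to web search if too few remain.

    Returns the documents to use and a label naming where they came from.
    """
    docs = retrieve(question)
    relevant = [d for d in docs if grade(question, d) == "correct"]
    if len(relevant) >= min_relevant:
        return relevant, "knowledge_base"
    return web_search(question), "web_search"


# Stub components for illustration only.
def stub_grade(question: str, doc: str) -> str:
    return "correct" if doc.startswith("relevant") else "incorrect"


def stub_web_search(question: str) -> list[str]:
    return ["web result"]


hit = corrective_retrieve(
    "q", lambda q: ["relevant: x", "noise"], stub_grade, stub_web_search
)
miss = corrective_retrieve("q", lambda q: ["noise"], stub_grade, stub_web_search)
```

Returning the provenance label alongside the documents makes it easy to log how often the fallback fires, which is a useful health signal for the knowledge base itself.
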
Table 11: Storage, Persistence, and Caching
| Store | Example | Description |
|---|---|---|
| Vector store | `vectors = embed(texts); upsert(vectors, metadata)` | Persists embeddings and payloads for similarity search at retrieval time. |
| Document store | `docstore.add_documents(docs)` | Stores full documents (separate from chunk/node indexes); used by ParentDocumentRetriever. |
| Index store | `index_store.persist(persist_dir="./storage")` | Persists index metadata/structures for reload without rebuilding. |
| Storage context | `ctx = StorageContext.from_defaults(persist_dir="./storage")` | Bundles the storage backends used by an index/query pipeline. |
| Database table (pgvector) | `CREATE TABLE items(id bigserial, content text, embedding vector(1536));` | Makes embeddings and queryable data durable in a database. |
| Incremental indexing | `if doc.hash != stored.hash: re_embed(doc)` | Re-indexes only changed documents to keep the vector store current without full rebuilds. |
| Embedding cache | `cache_key = sha256(text)` | Avoids recomputing embeddings for identical inputs; critical for cost control at scale. |
| Namespaces | `index.query(namespace="prod", vector=q, top_k=10)` | Separates tenant or environment data within one vector index for multi-tenant isolation. |
| Cache TTL | `SET key value EX 3600` | Expires cached retrieval/generation artifacts after a time-to-live window. |
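
The incremental-indexing and embedding-cache rows both rest on content hashing. A minimal in-memory sketch follows; `EmbeddingCache`, `content_hash`, and `needs_reembedding` are hypothetical names, and a production system would typically back the cache with a persistent store rather than a dictionary.

```python
import hashlib
from typing import Callable, Optional


def content_hash(text: str) -> str:
    """Stable fingerprint of the text, used as cache key and change detector."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(text: str, stored_hash: Optional[str]) -> bool:
    """True when the document is new or its content changed since indexing."""
    return stored_hash != content_hash(text)


class EmbeddingCache:
    """Memoizes an embedding function by content hash."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying model was called

    def get(self, text: str) -> list[float]:
        key = content_hash(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Toy embedding function for illustration only.
cache = EmbeddingCache(lambda text: [float(len(text))])
first = cache.get("hello")
second = cache.get("hello")  # served from the cache, no second model call
```

Keying on the hash of the exact input text means any change to chunking or preprocessing naturally invalidates the affected entries.
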
Table 12: RAG Evaluation Metrics and Frameworks
| Metric | Example | Description |
|---|---|---|
| Faithfulness | `faithfulness ∈ [0,1]` | Measures consistency of the answer with the retrieved context; the primary anti-hallucination metric. |
| Answer relevancy | `answer_relevancy ∈ [0,1]` | Measures how well the answer addresses the question. |
| Context precision | `context_precision ∈ [0,1]` | Measures whether retrieved contexts are useful for answering the question. |
| Context recall | `context_recall ∈ [0,1]` | Measures how much of the needed information is present in retrieved contexts. |
| Factual correctness | `factual_correctness ∈ [0,1]` | Measures whether the answer is factually correct against a reference. |
| Noise sensitivity | `noise_sensitivity ∈ [0,1]` | Measures robustness to irrelevant context: does the answer degrade when noisy chunks are included? |
| Context entities recall | `context_entities_recall ∈ [0,1]` | Measures entity recall over retrieved context versus the reference. |
| Aspect critic | `aspect_critic` | LLM-judge style metric for assessing a specific aspect of the output (e.g., conciseness, harmlessness). |
| DeepEval | | pytest-compatible LLM testing framework with 14+ metrics; designed for CI/CD quality gates. |
| Arize Phoenix | `px.launch_app(); tracer_provider = register(project_name="rag")` | Open-source AI observability platform; OpenTelemetry-based tracing with built-in RAG evaluators; self-hostable. |
| LangSmith | `client = langsmith.Client(); client.run_on_dataset(dataset_name="rag-eval", llm_or_chain_factory=chain)` | LangChain-native tracing and evaluation platform; deep visibility into chain execution steps and LLM calls. |
| Golden dataset | `[{"question": q, "ground_truth": a, "source_docs": []}]` | Curated Q&A set with approved answers and source documents; used to benchmark pipeline changes before deployment. |
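
A golden dataset in the shape shown above can drive a quick retrieval check even before any LLM-judge metric is wired in. The sketch below computes plain set-based recall@k and precision@k over document ids. These are deliberately simpler than the LLM-based metrics in the table and are not the Ragas definitions; `evaluate_retrieval` and the stub retriever are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k (0.0 if none expected)."""
    expected = set(relevant)
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    expected = set(relevant)
    return sum(1 for doc_id in top if doc_id in expected) / len(top)


def evaluate_retrieval(
    golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 5
) -> dict[str, float]:
    """Average recall@k and precision@k over a golden dataset."""
    if not golden:
        return {"recall@k": 0.0, "precision@k": 0.0}
    recalls, precisions = [], []
    for example in golden:
        retrieved = retrieve(example["question"])
        recalls.append(recall_at_k(retrieved, example["source_docs"], k))
        precisions.append(precision_at_k(retrieved, example["source_docs"], k))
    n = len(golden)
    return {"recall@k": sum(recalls) / n, "precision@k": sum(precisions) / n}


golden = [
    {"question": "q1", "ground_truth": "a1", "source_docs": ["d1", "d2"]},
    {"question": "q2", "ground_truth": "a2", "source_docs": ["d3"]},
]
stub_results = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d7", "d6"]}
report = evaluate_retrieval(golden, lambda q: stub_results[q], k=2)
```

Running this in CI against a fixed golden set gives an inexpensive regression signal for retrieval changes; generation-side metrics such as faithfulness still need a judge model or human review.
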
Table 13: Observability and Security
| Signal | Example | Description |
|---|---|---|
| GenAI semantic conventions | `gen_ai.operation.name = "chat"` | Standardizes tracing semantics for GenAI operations (inference, retrieval, tools) via OpenTelemetry. |
| Retrieval attributes | `gen_ai.retrieval.count = 10` | Captures retrieval metadata (chunk count, latency, scores) to debug relevance vs latency tradeoffs. |
| Token usage | `gen_ai.client.token.usage` | Records token usage to monitor cost and performance over time. |
| Langfuse | `langfuse = Langfuse(); trace = langfuse.trace(name="rag-query")` | Open-source LLM observability platform with trace-based debugging, evals, and a prompt management UI. |
| Prompt injection | "Ignore previous instructions and ..." | Attacker-controlled input attempts to override system/developer intent, including through retrieved context. |
| Sensitive information disclosure | "Print your hidden prompt" | Leakage of secrets, system prompts, or private data via model outputs. |
| Data poisoning | "Upload malicious docs into the KB" | Corrupts the retrieval corpus so the model is grounded in incorrect or malicious context. |
| Insecure output handling | `render_html(llm_output)` | Treating model output as trusted can lead to downstream injection or execution. |
| Model denial of service | "Summarize 10MB of text" | Attacks that drive excessive compute/cost via large inputs or adversarial usage. |
Table 14: Multimodal RAG
| Technique | Example | Description |
|---|---|---|
| ColPali | `model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); model.index(pdf_folder)` | VLM-based retriever that produces multi-vector embeddings directly from document page images via late interaction; no OCR needed. |
| byaldi | `docs_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2"); results = docs_model.search(query, k=3)` | Python library wrapping ColPali with a familiar API for indexing PDFs and searching by visual content. |
| VLM generation | `vl_model.generate(images=retrieved_pages, text=query)` | Uses a Vision Language Model (e.g., Qwen2-VL, GPT-4V) to answer based on retrieved document page images. |
| OCR pipeline | `text = ocr_engine.extract(page_image); embed(text)` | Traditional text-extraction pipeline before embedding; often outperformed by ColPali-style retrieval on documents with complex layouts, tables, and figures. |
References
Official Documentation
- https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/
- https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/
- https://docs.llamaindex.ai/en/stable/examples/query_engine/citation_query_engine/
- https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/
- https://docs.llamaindex.ai/en/stable/api_reference/postprocessor/similarity/
- https://docs.llamaindex.ai/en/stable/module_guides/deploying/response_synthesizers/
- https://docs.llamaindex.ai/en/stable/module_guides/storing/docstores/
- https://docs.llamaindex.ai/en/stable/module_guides/storing/index_stores/
- https://docs.llamaindex.ai/en/stable/module_guides/storing/storage_context/
- https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine/
- https://python.langchain.com/api_reference/core/retrievers/langchain_core.retrievers.BaseRetriever.html
- https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html
- https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html
- https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html
- https://python.langchain.com/api_reference/text_splitters/markdown/langchain_text_splitters.markdown.MarkdownHeaderTextSplitter.html
- https://python.langchain.com/api_reference/text_splitters/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html
- https://python.langchain.com/docs/how_to/semantic-chunker/
- https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.ensemble.EnsembleRetriever.html
- https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.contextual_compression.ContextualCompressionRetriever.html
- https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html
- https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html
- https://python.langchain.com/docs/how_to/self_query/
- https://python.langchain.com/docs/how_to/routing/
- https://python.langchain.com/docs/concepts/prompt_templates/
- https://python.langchain.com/docs/concepts/text_splitters/
- https://python.langchain.com/docs/how_to/vectorstore_retriever/
- https://platform.openai.com/docs/guides/prompt-engineering
- https://platform.openai.com/docs/guides/embeddings
- https://platform.openai.com/docs/guides/text-generation
- https://platform.openai.com/docs/guides/streaming-responses
- https://docs.cohere.com/reference/rerank
- https://docs.cohere.com/docs/rerank-overview
- https://docs.cohere.com/reference/embed
- https://docs.voyageai.com/docs/embeddings
- https://faiss.ai/index.html
- https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexFlatIP.html
- https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexFlatL2.html
- https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexPQ.html
- https://faiss.ai/cpp_api/namespace/namespacefaiss.html
- https://github.com/pgvector/pgvector
- https://docs.pinecone.io/
- https://docs.pinecone.io/guides/indexes/use-namespaces
- https://qdrant.tech/documentation/
- https://qdrant.tech/documentation/guides/quantization/
- https://docs.weaviate.io/weaviate/search/hybrid
- https://docs.weaviate.io/weaviate/search/filters
- https://docs.weaviate.io/weaviate/concepts/data
- https://docs.weaviate.io/
- https://milvus.io/docs/overview.md
- https://docs.trychroma.com/
- https://lancedb.github.io/lancedb/
- https://www.elastic.co/docs/reference/elasticsearch/index-settings/similarity
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevancy/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_entities_recall/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/aspect_critic/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness/
- https://docs.confident-ai.com/
- https://phoenix.arize.com/
- https://docs.smith.langchain.com/
- https://langfuse.com/docs
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/
- https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/
- https://owasp.org/www-project-top-10-for-large-language-model-applications/
- https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- https://genai.owasp.org/llmrisk/llm02-sensitive-information-disclosure/
- https://genai.owasp.org/llmrisk/llm03-data-poisoning/
- https://genai.owasp.org/llmrisk/llm04-insecure-output-handling/
- https://genai.owasp.org/llmrisk/llm05-model-denial-of-service/
- https://redis.io/docs/latest/develop/data-types/strings/
- https://microsoft.github.io/graphrag/
- https://www.microsoft.com/en-us/research/project/graphrag/
- https://huggingface.co/BAAI/bge-m3
- https://huggingface.co/BAAI/bge-reranker-v2-m3
- https://huggingface.co/intfloat/e5-mistral-7b-instruct
- https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_vlms
- https://github.com/stanford-futuredata/ColBERT
- https://github.com/AnswerDotAI/byaldi
- https://supabase.com/docs/guides/database/extensions/pgvector
- https://learn.microsoft.com/en-us/azure/developer/ai/advanced-retrieval-augmented-generation
- https://www.anthropic.com/news/contextual-retrieval
Academic Papers
- https://arxiv.org/abs/2005.11401
- https://arxiv.org/pdf/2005.11401
- https://arxiv.org/abs/2212.10496
- https://arxiv.org/pdf/2212.10496
- https://arxiv.org/abs/2310.11511
- https://arxiv.org/abs/2401.15884
- https://arxiv.org/abs/2501.09136
- https://arxiv.org/abs/2407.01449
- https://arxiv.org/abs/2403.14403
- https://arxiv.org/abs/2310.06117
- https://arxiv.org/abs/1603.09320
- https://arxiv.org/abs/2511.00444
- https://dl.acm.org/doi/10.1145/1571941.1572114
- https://dl.acm.org/doi/10.1145/290941.291025
- https://cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
- https://aclanthology.org/2023.acl-long.99/
- https://openreview.net/forum?id=ogjBpZ8uSi
Technical Blogs & Tutorials
- https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
- https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
- https://weaviate.io/blog/hybrid-search-explained
- https://www.postgresql.org/about/news/pgvector-080-released-2952/
- https://aws.amazon.com/blogs/database/optimize-generative-ai-applications-with-pgvector-indexing-a-deep-dive-into-ivfflat-and-hnsw-techniques/
- https://redis.io/blog/get-better-rag-responses-with-ragas/
- https://qdrant.tech/blog/rag-evaluation-guide/
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://www.mongodb.com/docs/atlas/ai-integrations/langchain/parent-document-retrieval/
- https://www.sbert.net/docs/cross-encoder.html
- https://www.sbert.net/examples/applications/cross-encoder/README.html
- https://nlp.stanford.edu/IR-book/html/htmledition/query-expansion-1.html
- https://blog.premai.io/rag-chunking-strategies-the-2026-benchmark-guide/
- https://blog.premai.io/rag-evaluation-metrics-frameworks-testing-2026/
- https://blog.premai.io/best-embedding-models-for-rag-2026-ranked-by-mteb-score-cost-and-self-hosting/
- https://www.firecrawl.dev/blog/best-chunking-strategies-rag
- https://www.firecrawl.dev/blog/best-vector-databases
- https://callsphere.tech/blog/rag-architecture-patterns-2026-retrieval-augmented-generation
- https://rapidclaw.dev/blog/rag-architecture-ai-agents-guide-2026
- https://dev.to/young_gao/rag-is-not-dead-advanced-retrieval-patterns-that-actually-work-in-2026-2gbo
- https://encore.dev/articles/best-vector-databases
- https://procogia.com/unlocking-rags-potential-mastering-advanced-techniques-part-1/
- https://www.lancedb.com/blog/modified-rag-parent-document-bigger-chunk-retriever-62b3d1e79bc6
- https://www.meilisearch.com/blog/graph-rag
- https://graphrag.com/concepts/intro-to-graphrag/
- https://weaviate.io/blog/late-interaction-overview
- https://blog.gopenai.com/the-fidelity-crisis-in-rag-why-late-interaction-colbert-is-the-4k-image-of-search-vs-e978d96b25b8
- https://deepeval.com/blog/deepeval-vs-trulens
- https://www.getmaxim.ai/articles/the-5-best-rag-evaluation-tools-you-should-know-in-2026/
- https://rhesis.ai/post/best-llm-evaluation-testing-tools
- https://ragflow.io/blog/rag-review-2025-from-rag-to-context
- https://blog.starmorph.com/blog/rag-techniques-compared-best-practices-guide
GitHub Repositories & Code Examples
- https://github.com/langchain-ai/langchain
- https://github.com/run-llama/llama_index
- https://github.com/facebookresearch/faiss
- https://github.com/microsoft/graphrag
- https://github.com/PranavGovindu/Self-Corrective-Agentic-RAG
Video Resources
- https://www.youtube.com/watch?v=j66Db1SB1YY
- https://www.youtube.com/watch?v=nMCII_xtUbw
- https://www.youtube.com/watch?v=VfjIYjYFVt4
- https://www.youtube.com/watch?v=0fackgiKTiA
- https://www.youtube.com/watch?v=-1zMU1a625E
- https://www.youtube.com/watch?v=vT-DpLvf29Q