Context engineering is the systematic discipline of designing, structuring, and managing the information fed to Large Language Models (LLMs) and AI agents to optimize their performance, accuracy, and efficiency. Unlike prompt engineering, which focuses on crafting individual queries, context engineering operates at the systems level: it manages the full data environment, including memory, external knowledge, tool definitions, conversation history, and environmental signals. Gartner declared 2026 "The Year of Context," and the field has formalized four core strategies: write (save context outside the window), select (pull relevant context in), compress (reduce tokens while preserving signal), and isolate (split context across agents and environments). As the challenge shifts from "what can we fit?" to "what should we include and how should we organize it?", context engineering applies information architecture, relevance ranking, dynamic adaptation, and governance so that AI systems receive the right information at the right time without overwhelming their attention budget or incurring excessive costs.
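The four strategies named above (write, select, compress, isolate) can be illustrated with a small sketch. Everything here is hypothetical and chosen for illustration only: the `ContextManager` class, its method names, the word-overlap scoring, and the whitespace-based token count are assumptions, not an established API. A real system would likely use a vector store, a proper tokenizer, and summarization rather than truncation.

```python
from dataclasses import dataclass, field


@dataclass
class ContextManager:
    """Hypothetical helper illustrating write / select / compress / isolate."""

    # Storage that lives outside the model's context window.
    notes: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, text: str) -> None:
        # write: persist information outside the context window.
        self.notes[key] = text

    def select(self, query: str, k: int = 2) -> list[str]:
        # select: pull in up to k notes that share words with the query.
        # Word overlap is a toy stand-in for embedding similarity.
        q_words = set(query.lower().split())
        scored = [
            (len(q_words & set(text.lower().split())), key)
            for key, text in self.notes.items()
        ]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self.notes[key] for score, key in scored[:k] if score > 0]

    def compress(self, text: str, max_tokens: int) -> str:
        # compress: naive truncation, treating each whitespace-separated
        # word as one token (an assumption, not a real tokenizer).
        return " ".join(text.split()[:max_tokens])

    def isolate(self, tasks: dict[str, str]) -> dict[str, str]:
        # isolate: give each sub-agent only the context selected for its task.
        return {agent: "\n".join(self.select(q)) for agent, q in tasks.items()}


cm = ContextManager()
cm.write("auth", "API keys rotate every ninety days")
cm.write("billing", "Invoices are issued on the first day of each month")

print(cm.select("when do api keys rotate"))
print(cm.compress("a b c d e", 3))
print(cm.isolate({"billing_agent": "when are invoices issued"}))
```

Each method maps to one strategy, so a production version can swap out a single piece (for example, replacing truncation with summarization) without touching the others.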
What This Cheat Sheet Covers
This topic spans 14 focused tables and 132 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Context Engineering Techniques
The foundational toolkit every context engineer uses, from dynamic retrieval to caching and compression, determines the quality and cost of every LLM interaction. Mastering these ten techniques before tuning advanced patterns yields the highest return on effort.
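As a rough illustration of how two of these techniques interact, the sketch below packs already-ranked documents into a fixed token budget and then assembles a prompt. The function names (`count_tokens`, `pack_documents`, `build_prompt`) are hypothetical, and the word-count tokenizer is an assumption standing in for a model-specific tokenizer. The input list is assumed to be sorted by relevance, for example by a retriever and reranker.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is one token.
    # Swap in the model's real tokenizer for accurate counts.
    return len(text.split())


def pack_documents(ranked_docs: list[str], max_tokens: int) -> list[str]:
    """Greedily keep documents, in rank order, that still fit the budget."""
    kept: list[str] = []
    used = 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > max_tokens:
            continue  # skip this one; a shorter, lower-ranked doc may still fit
        kept.append(doc)
        used += cost
    return kept


def build_prompt(query: str, ranked_docs: list[str], max_tokens: int) -> str:
    """Inject the packed documents ahead of the query."""
    context = "\n".join(pack_documents(ranked_docs, max_tokens))
    return f"Context:\n{context}\nQuery: {query}"


docs = [
    "RAG retrieves documents at query time",    # 6 words
    "Caching reuses a static prompt prefix",    # 6 words
    "Chunking splits documents into segments",  # 5 words
]
print(build_prompt("What is RAG?", docs, max_tokens=11))
```

Skipping an oversized document instead of stopping at it is a design choice: it fills the budget more completely, at the cost of sometimes including a lower-ranked document in place of a higher-ranked one.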
| Technique | Example | Description |
|---|---|---|
| Retrieval-augmented generation (RAG) | `query = "What is RAG?"`<br>`docs = retriever.get_relevant(query)`<br>`context = f"Context: {docs}\nQuery: {query}"` | • Dynamically retrieves relevant documents from external knowledge bases and injects them into the context window • Reduces hallucination and enables up-to-date information without retraining |
| Prompt caching | `# Reuse prefix across requests`<br>`cached_prefix = system_prompt + docs`<br>`response = llm(cached_prefix + query)` | • Stores frequently reused context (e.g., system prompts, static docs) in a cache • Reduces latency by up to 80% and input token costs by up to 90% |
| Semantic chunking | `chunks = semantic_splitter.split(text, max_size=512, overlap=50)` | • Divides documents into meaningful segments based on topic boundaries and semantic coherence rather than arbitrary character counts • Improves retrieval precision |
| Contextual chunk enrichment | `chunk_with_context = f"{doc_summary}\n{chunk}"` | • Prepends each chunk with document-level context before embedding • Reduces ambiguity and improves retrieval accuracy by up to 49% |
| Reranking | `results = initial_retrieval(query)`<br>`reranked = cross_encoder.rank(query, results)[:top_k]` | • Applies a second-stage neural model (cross-encoder) to reorder retrieved documents by true relevance • Significantly boosts precision over vector similarity alone |
| Token budgeting | `used = count_tokens(context)`<br>`if used > max_tokens: context = truncate(context)` | • Manages the finite token budget by selecting, prioritizing, and structuring information to fit within model limits • Preserves critical content |