AI Agents Cheat Sheet

Updated 2026-05-28

AI agents are autonomous systems built on large language models that perceive their environment, reason through complex tasks, and take actions with external tools to achieve goals. Unlike traditional chatbots that simply respond to queries, agents run a continuous think-act-observe loop and plan each next step from the outcome of the last. The defining characteristic is tool use: agents don't just generate text; they execute functions, query databases, call APIs, and connect to tools, other agents, and user interfaces through standardized protocols like MCP, A2A, and AG-UI. This shift from prediction to execution makes agents the foundation of agentic AI, turning LLMs from assistants into operational systems capable of multi-step workflows, self-correction, and long-horizon task completion. Understanding agent architecture (perception, reasoning engines, memory systems, and orchestration patterns) is essential for building reliable production agents in 2026. Frameworks like LangGraph, CrewAI, OpenAI Agents SDK, Google ADK, and PydanticAI provide production-ready primitives, context engineering has emerged as a first-class discipline for controlling what an agent sees at each step, and evaluation approaches such as LLM-as-Judge, along with tools like DeepEval, enable systematic quality measurement.

Table 1: Core Agent Concepts

These are the building blocks every AI agent shares regardless of framework: a control loop that thinks-acts-observes, structured tool invocation, a reasoning engine, perception of inputs, and a calibrated level of autonomy. Grasp these and the framework-specific patterns in later tables become small variations on the same theme.

Concept	Example	Description
Agent Loop	`while not done:` `thought = think(observation)` `action = decide(thought)` `observation = execute(action)`	• Continuous think-act-observe cycle where the agent reasons about current state, takes an action, observes the result, then decides the next step • the foundational execution pattern popularized by ReAct, where each iteration depends on the latest observation rather than a fixed plan.
Tool Use	`tools = [calculator, web_search, db_query]` `agent.invoke("What's 2^16?", tools)`	• Broad pattern where the agent invokes external functions or APIs beyond text generation, with the runtime (not the LLM) executing the call and feeding the result back • tools are defined with name, description, and JSON schema for parameters.
Function Calling	`{"name": "get_weather", "args": {"city": "NYC"}}`	• LLM-side mechanism that outputs structured JSON specifying which function to call and with what arguments • the model only proposes the call; the application or agent runtime is what actually executes it and returns the result.
Agentic Workflow	Goal → Plan → Execute → Observe → Adjust → Repeat	• Dynamic, goal-driven process where the agent decomposes the task, takes actions, evaluates outcomes, and adapts its strategy • distinct from a linear prompt chain, which follows fixed predefined steps and is a workflow, not an agent.
Perception	`raw_input = {"text": query, "context": session}` `structured_data = parse(raw_input)`	• Transforms raw inputs (text, API responses, sensor data) into structured representations the reasoning engine can act on • includes parsing, normalization, filtering noise, and entity extraction — more than just receiving text.
Reasoning Engine	LLM + prompting strategy (ReAct/CoT) + memory + planning	• Core decision-making component that weighs inputs, selects actions, and generates plans for the agent • typically an LLM operating inside a scaffold of memory, planning, and prompting strategies — the agent is the orchestrated system, not the bare LLM.
Action Space	`["send_email", "query_db", "http_get", "file_write"]`	• Set of available tools and operations the agent is allowed to invoke • defines the boundary of what the agent can accomplish in its environment; tools outside this set cannot be selected by the LLM.
Computer-Using Agent (CUA)	`agent.screenshot()` `agent.click(x=320, y=240)` `agent.type("search query")`	• Agent uses multimodal vision to interpret a screen then plans and executes keyboard/mouse actions, enabling control of any GUI without a dedicated API • pure vision-based CUAs (OpenAI CUA, Azure Responses API) handle any OS or app; DOM-aware variants (browser-use, UFO) add structural hints for higher accuracy on constrained targets.
Autonomy	Agent decides when to stop vs. asks user for approval per action	• Degree to which the agent makes decisions without human intervention, measured as a spectrum rather than a binary • ranges from fully autonomous to selective human-in-the-loop (HITL) gates on high-stakes actions — the dominant production pattern in 2026.
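
The loop, tool-use, and action-space rows above can be tied together in a minimal sketch. It assumes a stand-in `fake_llm` function in place of a real model call and a one-entry hypothetical tool registry; none of these names come from a specific framework.

```python
from typing import Callable

# Hypothetical tool registry: this dictionary is the agent's action space.
TOOLS: dict[str, Callable[[str], str]] = {
    "power": lambda arg: str(int(arg.split("^")[0]) ** int(arg.split("^")[1])),
}

def fake_llm(observation: str) -> dict:
    """Stand-in for a model call: proposes a tool call or a final answer."""
    if observation.startswith("task:"):
        # The model only proposes the call as structured data.
        return {"type": "tool_call", "name": "power", "args": "2^16"}
    return {"type": "final", "answer": f"The result is {observation}."}

def run_agent(task: str, max_steps: int = 5) -> str:
    observation = f"task: {task}"
    for _ in range(max_steps):            # bounded loop instead of `while not done`
        decision = fake_llm(observation)  # think and decide
        if decision["type"] == "final":
            return decision["answer"]
        if decision["name"] not in TOOLS:  # tool is outside the action space
            observation = f"error: unknown tool {decision['name']}"
            continue
        # The runtime, not the model, executes the call and feeds back the result.
        observation = TOOLS[decision["name"]](decision["args"])
    return "stopped: step budget exhausted"

answer = run_agent("What's 2^16?")
print(answer)
```

The step budget is one simple way to bound autonomy; a human-approval gate before executing high-stakes tools would slot in at the same point as the dispatch line.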

Table 2: Architecture Patterns

These patterns describe how an agent system organizes reasoning, decision-making, and control flow. The biggest source of confusion is what happens WHEN: ReAct decides one step at a time, Plan-and-Execute commits to a full plan upfront, Reflection critiques its own output, and Tree-of-Thought explores several candidate branches before committing.

Pattern	Example	Description
ReAct	`Thought: Need price data` `Action: query_db("SELECT price")` `Observation: $42` `Thought: Now calculate...`	• Reasoning and Acting interleaved • agent generates Thought explaining reasoning, executes Action, receives Observation of result, then generates next Thought • think-act-observe pattern improves reliability and debuggability.
Reflection	`output = generate()` `critique = evaluate(output, criteria)` `if not good: revise(critique)`	• Agent self-evaluates outputs against quality criteria, then iterates to improve • generates, critiques, and refines until meeting standards or max attempts.
Evaluator-Optimizer	`output = generator.propose(task)` `feedback = evaluator.critique(output, criteria)` `if not done: generator.revise(feedback)`	• Generator produces output, evaluator critiques against criteria, loop repeats until quality threshold met • enables autonomous quality assurance via two cooperating agents instead of one.
Planning	Goal → Subtasks → Order → Execute	• Agent decomposes complex goals into actionable subtasks, determines execution order, then coordinates completion • enables long-horizon task solving.
Chain-of-Thought	`Let's think step-by-step:` `1. Parse question` `2. Identify data needed` `3. Calculate result`	• Prompts agent to show intermediate reasoning steps • improves accuracy on complex tasks by making thought process explicit before answering.
Router	`if query_type == "sql":` `route_to(sql_agent)` `else:` `route_to(general_agent)`	• Conditional dispatcher that directs requests to specialized agents or tools based on content analysis • enables modular, expert-based architectures.
Orchestrator-Worker	`supervisor.assign(task, worker_agents)` `results = await all_workers()` `supervisor.synthesize(results)`	• Hierarchical delegation where supervisor breaks work into parallel subtasks, assigns to worker agents, aggregates results • scales complex workflows efficiently.
Agent Handoff	`handoff(source=triage_agent,` `target=billing_agent,` `condition="billing question")`	• One agent transfers control to another specialized agent mid-conversation • enables seamless delegation where the receiving agent continues with full context.
Plan-and-Execute	`plan = planner.create(goal)` `for step in plan:` `execute(step)`	• Separates planning from execution • planner creates full task breakdown upfront, then executor runs each step • clearer than pure ReAct for multi-step workflows.
Tree-of-Thought	Explore multiple reasoning paths in tree structure, backtrack if needed	• Agent generates several candidate reasoning paths as a search tree, scores each branch, and expands or backtracks (breadth-first or depth-first) before committing • useful when a single linear reasoning path may fail.
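
As a rough illustration of the Reflection row, the sketch below runs a generate, critique, revise loop with a cap on attempts. The `generate` and `critique` functions are hypothetical stand-ins for model calls, and the word-count criterion is only an assumption chosen to keep the example deterministic.

```python
from typing import Optional

def generate(task: str, feedback: Optional[str] = None) -> str:
    """Stand-in generator: the first draft is too short, the revision adds detail."""
    if feedback is None:
        return "Agents use tools."
    return "Agents use tools, observe the results, and adjust their next step."

def critique(draft: str, min_words: int = 8) -> Optional[str]:
    """Stand-in evaluator: returns feedback, or None when the draft passes."""
    words = len(draft.split())
    if words < min_words:
        return f"Too short ({words} words); expand to at least {min_words}."
    return None

def reflect(task: str, max_attempts: int = 3) -> tuple:
    """Generate, critique, and revise until the draft passes or attempts run out."""
    draft = generate(task)
    for attempt in range(1, max_attempts + 1):
        feedback = critique(draft)
        if feedback is None:
            return draft, attempt
        draft = generate(task, feedback)
    return draft, max_attempts

final_draft, attempts = reflect("Explain agents in one sentence")
print(attempts, final_draft)
```

Splitting `generate` and `critique` across two separate agents, rather than one agent playing both roles, turns the same loop into the Evaluator-Optimizer pattern.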

Table 3: Memory Systems

Memory turns stateless LLMs into agents that learn across sessions. This table maps the canonical short-term vs long-term split, the three cognitive long-term types (episodic, semantic, procedural), production runtimes like Mem0, Letta (MemGPT) and Zep/Graphiti, and the working/shared memory layers that coordinate state within and across agents.

Type	Example	Description
Short-Term Memory	Current conversation context in prompt	• Thread-scoped context for the active session • holds recent messages and intermediate outputs inside the model's context window; cleared when the thread ends.
Long-Term Memory	Vector DB storing past interactions	• Persistent storage across threads and sessions • agent retrieves relevant historical context to inform current decisions • critical for maintaining user preferences and learning.
Mem0	`mem0.add(messages, user_id="u1")` `results = mem0.search("user prefs", user_id="u1")`	• Dedicated memory layer that dynamically extracts, consolidates, and retrieves salient facts from conversations across sessions • its authors report a 26% relative improvement over OpenAI's memory on an LLM-as-a-Judge metric and 91% lower p95 latency than a full-context baseline; integrates with 21 frameworks and 20 vector stores.
Zep/Graphiti	`graphiti.add_episode(messages)` `results = graphiti.search("user prefs",` `center_node_uuid)`	• Temporal knowledge graph memory storing facts as edges with bi-temporal validity windows • its authors report gains over baselines on cross-session temporal reasoning (94.8% vs MemGPT's 93.4% on DMR; up to 18.5% accuracy gain on LongMemEval).
Episodic Memory	`"On 2026-01-15, user preferred JSON output"`	• Stores specific past events tied to a time and place • lets the agent recall "what happened when" instead of only "what is true in general."
Semantic Memory	`"User always wants reports in PDF format"`	• Stores durable facts, preferences, and learned knowledge that are true in general • extracted and consolidated from episodic traces into atomic, deduplicated entries.
Procedural Memory	Learned workflows like "how to file a bug report"	• Encodes how to perform tasks — communication styles, formatting rules, action sequences • captured from feedback and reused as the agent's default behavior.
Letta (MemGPT)	`agent = create_agent(model="openai/gpt-4o")` `agent.send_message("Prefer JSON output")`	• OS-inspired virtual context management with tiered memory (core in-context blocks, recall, archival storage) • the agent self-edits its own memory blocks via tool calls (`memory_insert`, `memory_replace`).
Working Memory	Variables tracking current task state	• Active scratchpad for the current task • holds intermediate results, loop counters, and task-specific variables that the agent manipulates while reasoning — distinct from raw conversation history.
Shared Memory	`redis.set("team_context", state)` `other_agent.get("team_context")`	• Cross-agent state enabling coordination • multiple agents read/write a common store (Redis, shared DB) so they stay aligned in multi-agent systems.
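
The short-term versus long-term split can be sketched in a few lines. This is an illustration under loud assumptions: the class names are hypothetical, and the keyword-overlap search is a toy stand-in for the embedding-based retrieval a vector database would perform.

```python
from collections import deque

class ShortTermMemory:
    """Thread-scoped window: only the most recent messages stay in context."""
    def __init__(self, max_messages: int = 3):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

class LongTermMemory:
    """Persistent per-user facts; keyword overlap stands in for vector search."""
    def __init__(self):
        self.facts: dict[str, list[str]] = {}

    def add(self, user_id: str, fact: str) -> None:
        self.facts.setdefault(user_id, []).append(fact)

    def search(self, user_id: str, query: str, k: int = 1) -> list[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(fact.lower().split())), fact)
            for fact in self.facts.get(user_id, [])
        ]
        scored = [item for item in scored if item[0] > 0]  # drop non-matches
        scored.sort(key=lambda item: item[0], reverse=True)
        return [fact for _, fact in scored[:k]]

stm = ShortTermMemory(max_messages=2)
for text in ["first", "second", "third"]:
    stm.add("user", text)          # "first" falls out of the window

ltm = LongTermMemory()
ltm.add("u1", "User prefers reports in PDF format")
ltm.add("u1", "User lives in Berlin")
hits = ltm.search("u1", "which format for reports")
print(list(stm.messages), hits)
```

Keying long-term facts by `user_id` mirrors the per-user scoping shown in the Mem0 row, while the bounded deque mirrors a context window that forgets the oldest turns.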

Table 4: Multi-Agent Systems

This table covers the common topologies multi-agent systems use to coordinate work — when control is centralized, when it isn't, and how agents share state. The split between hierarchical (a supervisor delegates downward) and network/collaborative (peers interact as equals) is the most-confused distinction, and how state is shared (direct messages vs a shared blackboard) is the second.

Pattern	Example	Description
Hierarchical	`supervisor → specialist_a, specialist_b → workers`	• A central supervisor delegates tasks to specialized agents, which may themselves supervise sub-teams • communication and routing flow through the supervisor, results aggregate upward • mirrors an org chart and makes ownership and debugging easier than a flat mesh.
Agent-as-Tool	`orchestrator.tools = [agent_a.as_tool(), agent_b.as_tool()]`	• A sub-agent is wrapped as a callable tool so an orchestrating agent invokes it via standard function calling • unlike handoffs (which transfer control), agent-as-tool invocations return a result to the calling agent, which retains control; preferred when the caller needs the sub-result before deciding its next step.
Collaborative	Agents share a scratchpad of messages and discuss until one says `FINAL ANSWER`	• Peer agents interact as equals to solve a problem jointly, with no fixed authority • they exchange messages, debate, and negotiate solutions until they reach consensus.
Blackboard	`board.write("draft", text)` then a reviewer agent triggers when it sees a new draft and writes back `board.write("review", feedback)`	• Agents read and write to a shared workspace rather than messaging each other directly • each agent watches the board and activates when relevant data appears, enabling loose coupling and emergent coordination without a central orchestrator.
Sequential	`Agent A → Agent B → Agent C` (pipeline)	• Agents are chained in a predefined linear order, each processing the previous agent's output • suited to step-by-step workflows where each stage depends on the one before it.
Parallel (Concurrent)	Three agents analyze the same stock concurrently, then a final step merges their answers	• Multiple agents work on independent subtasks at the same time, with no agent reading another's in-progress output • a coordinator aggregates results when they finish, cutting overall latency for decomposable work.
Network	Any agent can call any other agent's tool directly; routing is decided at runtime	• A decentralized many-to-many mesh where any agent can communicate with any other and decide which agent runs next • no fixed hierarchy or predefined order — flexible but harder to debug than supervised topologies.
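
The Blackboard row is the least familiar topology for many readers, so here is a minimal in-process sketch. The `Blackboard` class and the `reviewer` callback are hypothetical; a production system would more likely put the board in a shared store and run the agents as separate processes.

```python
class Blackboard:
    """Shared workspace: agents register interest in keys instead of messaging peers."""
    def __init__(self):
        self.entries = {}
        self.watchers = []  # list of (key, callback) pairs

    def watch(self, key, callback):
        self.watchers.append((key, callback))

    def write(self, key, value):
        self.entries[key] = value
        # Activate every agent that is watching this key.
        for watched_key, callback in list(self.watchers):
            if watched_key == key:
                callback(self, value)

def reviewer(board, draft):
    """Reviewer agent: triggers on a new draft and writes its feedback back."""
    word_count = len(draft.split())
    board.write("review", f"Reviewed draft of {word_count} words: looks good")

board = Blackboard()
board.watch("draft", reviewer)
board.write("draft", "Agents coordinate through a shared board")
print(board.entries["review"])
```

The writer never references the reviewer directly, which is the loose coupling the table describes; adding a second watcher on `"review"` would extend the chain without touching existing agents.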

Table 5: Communication Protocols

Agents rarely work alone — they reach out to tools and data, hand off work to other agents, and stream results back to users. This table covers the three protocols that now define those boundaries (MCP, A2A, AG-UI) along with the classic messaging patterns (Pub-Sub, Request-Response, Message Queue) that still underpin agent transports.

Protocol	Example	Description
Model Context Protocol (MCP)	`mcp_server.list_tools()` `mcp_server.call_tool("get_data", args)`	• Open standard originated by Anthropic (November 2024) for connecting LLM applications to external tools, resources, and prompts • uses a host–client–server architecture over JSON-RPC 2.0 with Streamable HTTP transport (HTTP+SSE deprecated since the March 2025 spec revision) • the canonical agent-to-tools/data layer, distinct from agent-to-agent.
Agent-to-Agent (A2A)	`agent_a.send(agent_b, message)` `response = agent_b.process_and_reply()`	• Inter-agent communication protocol originated by Google, now donated to the Linux Foundation • agents publish agent cards at `/.well-known/agent-card.json` and exchange task-lifecycle messages to delegate work • IBM's ACP (Agent Communication Protocol) was officially incorporated into A2A under the Linux Foundation in August 2025.
AG-UI	`agent.emit(TextMessageStart(id))` `agent.emit(ToolCallStart(id, name))`	• Open, event-based protocol for the agent ↔ user-interface boundary, born from CopilotKit's work with LangGraph and CrewAI • streams text chunks, tool calls, state updates, and human-in-the-loop events over SSE or WebSocket • natively supported by Amazon Bedrock AgentCore Runtime, LangGraph, CrewAI, Microsoft Agent Framework, Google ADK, and others.
Pub-Sub	`agent.subscribe("topic/events")` `publish("topic/events", data)`	• One-to-many broadcast pattern — every subscriber to a topic receives its own copy of each message • publishers don't know how many subscribers exist, which decouples senders from receivers • natural fit for event-driven fan-out across many agents.
Request-Response	`response = await agent.call(request)`	• Synchronous query-reply pattern — the caller blocks until the callee returns a response • simplest model with strong consistency and easy debugging, but produces tight runtime coupling and risks cascading failures • the baseline pattern HTTP / REST inherits.
Message Queue	`queue.push(task)` `task = queue.pop()` `worker.execute(task)`	• Point-to-point asynchronous delivery: each message is consumed by one worker, typically in FIFO order • decouples producers from consumers and buffers spikes, and most brokers add retry and dead-letter handling • the workhorse pattern for background work distribution (e.g. SQS, RabbitMQ queues).
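
The difference between Pub-Sub fan-out and Message Queue point-to-point delivery is easiest to see side by side. The sketch below uses in-memory stand-ins (`PubSub` and `WorkQueue` are hypothetical names), not a real broker client.

```python
from collections import defaultdict, deque

class PubSub:
    """One-to-many: every subscriber inbox receives its own copy of each message."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, inbox):
        self.subscribers[topic].append(inbox)

    def publish(self, topic, message):
        for inbox in self.subscribers[topic]:
            inbox.append(message)

class WorkQueue:
    """Point-to-point: each task is handed to exactly one consumer, in FIFO order."""
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        return self.tasks.popleft() if self.tasks else None

# Pub-Sub: both agents see the same event.
bus = PubSub()
agent_a_inbox, agent_b_inbox = [], []
bus.subscribe("topic/events", agent_a_inbox)
bus.subscribe("topic/events", agent_b_inbox)
bus.publish("topic/events", "price_changed")

# Message queue: two workers split the tasks between them.
queue = WorkQueue()
queue.push("task-1")
queue.push("task-2")
worker_a_task = queue.pop()
worker_b_task = queue.pop()
print(agent_a_inbox, agent_b_inbox, worker_a_task, worker_b_task)
```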

Table 6: Agent Frameworks

The agent-framework landscape in 2026 is crowded, and each tool below picks a different bet: graph-based stateful runtimes, role-based crews, conversational multi-agent systems, type-safe structured output, model-driven SDKs, or lightweight code-first harnesses. Use this table to match a project's needs (durability, model-agnosticism, multi-agent style) to the framework whose design philosophy fits best.

Framework	Example	Description
LangGraph	`StateGraph` with nodes, edges, and checkpointers	• Graph-based orchestration for stateful, cyclical workflows • models agents as state machines with conditional routing and durable persistence via checkpointers • production-grade.
LangChain	`create_agent(model, tools=[...])` with chains, tools, memory	• Flexible toolkit for building LLM applications • provides abstractions for prompts, tools, memory, and a `create_agent` entry point • code-first with extensive integrations.
CrewAI	`Crew` of agents with `role`, `goal`, `backstory`	• Role-based collaboration where agents simulate team dynamics • supports `Process.sequential` and `Process.hierarchical` (requires a `manager_llm` or `manager_agent`) • fast prototyping for multi-agent workflows.
OpenAI Agents SDK	`Agent(name="Assistant", tools=[...])` `Runner.run_sync(agent, query)`	• Official OpenAI framework, the production-ready replacement for the experimental Swarm library • core primitives: Agents, Handoffs, Guardrails, and Agent-as-Tool patterns (plus Sessions and Tracing) • Python-first with built-in tracing and MCP support.
AutoGen	`ConversableAgent` with multi-agent conversations	• Conversational agents that communicate via message passing • emphasizes agent-to-agent dialogue and group chats for task solving • Microsoft-backed.
Claude Agent SDK	`async for msg in query(` `prompt="Fix the bug",` `options=ClaudeAgentOptions())`	• Anthropic's official SDK exposing the same agent harness that powers Claude Code as a library • built-in tools for file reading, command execution, and web search • available in Python and TypeScript.
Google ADK	`SequentialAgent`, `ParallelAgent`, `LoopAgent`, `LlmAgent`	• Google's modular framework with workflow agents (Sequential, Parallel, Loop) for deterministic flow and `LlmAgent` for LLM-driven dynamic routing • model-agnostic • multi-language (Python, Go, Java, TypeScript).
PydanticAI	`agent = Agent('openai:gpt-5.2',` `output_type=MyModel)`	• Type-safe Python framework by the Pydantic team, FastAPI-like for GenAI • structured output with automatic Pydantic validation, MCP/A2A/AG-UI integration • built-in evals and Logfire observability.
Semantic Kernel	`kernel.add_plugin(MyPlugin(), plugin_name="X")` plus planners	• Microsoft framework optimised for enterprise scenarios • tight Azure integration, C# / Python / Java support • emphasises plugins as reusable skills the kernel and planners can invoke.
Strands Agents	`agent = Agent(model=BedrockModel(...), tools=[...])`	• AWS open-source SDK with a model-driven approach • model-agnostic, supporting Bedrock, Anthropic, OpenAI, Gemini, and Ollama with one-line provider swaps • native MCP support, tool hot-reloading, and a Swarm multi-agent pattern.
smolagents	`agent = CodeAgent(tools=[tool], model=model)` `agent.run("What is the weather?")`	• Hugging Face's lightweight agent library built around code-first actions • `CodeAgent` writes and executes Python snippets as actions (must be sandboxed); `ToolCallingAgent` is the JSON-tool-call alternative • supports local and remote models.
Agno	`agent = Agent(model=OpenAIChat(), tools=[...])` `agent.print_response("Summarize this")`	• Fast, lightweight framework for building multi-modal agents • supports memory, knowledge, reasoning, and teams • headline claim: agent instantiation on the order of microseconds, far faster than heavier frameworks.
Claw Code	`git clone github.com/ultraworkers/claw-code` `./target/debug/claw prompt "Refactor auth"`	• Open-source Rust-based CLI agent harness (April 2026) inspired by Claude Code • multi-provider support (Anthropic, xAI, OpenAI-compatible, DashScope) with a tiered permission system and session persistence • build-from-source only (the `cargo install` stub is deprecated).

Table 7: Tool Integration

Function calling tells you that an LLM can ask for a tool — this table covers the runtime plumbing that actually makes tools work. It maps the lifecycle from defining the schema, discovering what's available, executing the call, parsing the result, and scheduling multiple calls in parallel or in sequence, including the efficient code-execution pattern for large MCP server ecosystems.

Technique	Example	Description
Tool Schema	`{"name": "get_weather",` `"description": "...",` `"parameters": {...}}`	• JSON-Schema description of the tool's name, when to use it, and the shape of its parameters • the model reads this metadata to decide whether and how to invoke the tool — implementation code lives in your runtime, never in the schema.
Tool Discovery	`tools = mcp_client.list_tools()`	• Runtime enumeration of available tools via `tools/list` over JSON-RPC • lets agents pick up new capabilities without redeployment, but every advertised schema costs context tokens.
Tool Execution	`name, args = parse(tool_call)` `result = tools[name](**args)`	• Agent runtime invokes the function the LLM selected, parsing the call name and JSON arguments • the LLM only generates the call; if your code skips dispatch, no real result exists and the model may fabricate an observation instead.
Structured Output	`response_format={"type":` `"json_schema", "strict": true,` `"schema": {...}}`	• Constrained decoding masks any token that would violate the supplied JSON Schema, guaranteeing schema-conformant output • stronger than the older JSON mode (which only guarantees valid JSON syntax) • supported by OpenAI, Anthropic, Google, and AWS Bedrock; guarantees shape, not semantic correctness.
Parallel Tool Use	`calls = [get_weather("NYC"),` `get_weather("LA")]` `results = await asyncio.gather(*calls)`	• The model emits multiple tool calls in one turn, and the runtime dispatches them concurrently • works only when calls share no data dependency — hidden dependencies (e.g. fetch-then-update on shared state) create race conditions.
Tool Chaining	`user_id = get_user(name)` `orders = get_orders(user_id)`	• Sequential composition where the output of tool A feeds the input of tool B • the data dependency forces serial execution and is the structural opposite of parallel tool use.
Tool Result Parsing	`obs = clean(tool_output)` `messages.append(` `{"role": "tool", "content": obs})`	• Converts raw tool output into a clean observation message the LLM can reason over • typically JSON-stringifies structured data and strips noise like rate-limit headers, request IDs, and other internal metadata.
Code Execution via MCP	`import * as gdrive from './servers/google-drive'` `const doc = await gdrive.getDocument({documentId})`	• Presents MCP servers as code APIs on a filesystem rather than direct tool calls: the agent reads only the tool definitions it needs and processes data in a code execution environment before returning results to the model • reported to reduce token usage by up to 98.7% (about 150,000 tokens down to 2,000 in Anthropic's published example) by avoiding upfront loading of all tool schemas and keeping intermediate results out of the model context window.
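
Tool execution, parallel tool use, tool chaining, and result parsing can be shown in one small runtime sketch. The three tools are hypothetical stubs, and the call shape (a `name` plus a JSON-string `arguments` field) is an assumption modeled loosely on common function-calling formats rather than any one provider's exact schema.

```python
import asyncio
import json

# Hypothetical async tools; real ones would call APIs or databases.
async def get_user(name: str) -> dict:
    return {"user_id": 7, "name": name}

async def get_orders(user_id: int) -> list:
    return [{"order": "A-1", "user_id": user_id}]

async def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 20}

TOOLS = {"get_user": get_user, "get_orders": get_orders, "get_weather": get_weather}

async def execute(tool_call: dict) -> dict:
    """Dispatch one model-proposed call and wrap the result as a tool message."""
    name = tool_call["name"]
    args = json.loads(tool_call["arguments"])  # assumed: arguments arrive as a JSON string
    if name not in TOOLS:
        content = json.dumps({"error": f"unknown tool: {name}"})
    else:
        content = json.dumps(await TOOLS[name](**args))
    return {"role": "tool", "content": content}

async def main():
    # Parallel tool use: the two calls share no data dependency.
    parallel_calls = [
        {"name": "get_weather", "arguments": '{"city": "NYC"}'},
        {"name": "get_weather", "arguments": '{"city": "LA"}'},
    ]
    weather = await asyncio.gather(*(execute(call) for call in parallel_calls))

    # Tool chaining: get_orders needs the user_id that get_user returns.
    user_msg = await execute({"name": "get_user", "arguments": '{"name": "Ada"}'})
    user_id = json.loads(user_msg["content"])["user_id"]
    orders_msg = await execute(
        {"name": "get_orders", "arguments": json.dumps({"user_id": user_id})}
    )
    return weather, orders_msg

weather, orders = asyncio.run(main())
print(weather, orders)
```

Returning an error payload for unknown tools, instead of raising, gives the model an observation it can react to on the next turn.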

Table 8: State Management

State management is what lets an AI agent survive crashes, hand off conversations between sessions, and explore alternative trajectories without losing the original run. The patterns below cover the two dominant approaches in production today — LangGraph's checkpointer-based persistence and Temporal's event-sourced durable execution — and the supporting concepts (threads, typed schemas, rollback) that make either approach safe at scale.

Concept	Example	Description
Checkpointing	`graph.compile(checkpointer=PostgresSaver(...))` `graph.get_state(config)`	• Saving agent state as a snapshot at every super-step boundary • enables pause/resume, time-travel debugging, and recovery from node failures • critical for long-running agents.
State Persistence	`PostgresSaver.from_conn_string(DB_URI)` `SqliteSaver`, `RedisSaver`	• Durable storage of agent state across process restarts • production agents swap `InMemorySaver` for Postgres/SQLite/Redis backends so threads survive crashes and redeploys.
Thread Management	`config = {"configurable": {"thread_id": "user-123"}}` `graph.invoke(input, config)`	• Isolating parallel agent sessions • `thread_id` is the primary key under which checkpoints are stored, giving each user or session independent state with no crosstalk.
State Schema	`class AgentState(TypedDict):` `messages: Annotated[list, add_messages]` `step_count: int`	• Typed definition of agent state structure (TypedDict, dataclass, or Pydantic BaseModel) • reducers like `add_messages` tell LangGraph how to merge updates — without them, the last write wins.
Durable Execution	`@workflow.defn` `class AgentWorkflow:` `async def run(self): ...`	• Agent workflows that survive crashes and restarts via platforms like Temporal • recovery works by replaying the recorded Event History against deterministic workflow code — not by loading a saved state blob.
State Rollback	`fork_cfg = graph.update_state(old_cfg, {"x": "new"})` `graph.invoke(None, fork_cfg)`	• Forking from an earlier checkpoint with modified state to explore an alternative path • `update_state` is non-destructive — the original history is preserved, enabling safe "undo" and experimentation.
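
To make checkpointing, thread isolation, and non-destructive rollback concrete without depending on a particular library, here is a plain-Python sketch. `CheckpointStore` and its methods are hypothetical and only imitate the ideas in the table; they are not the LangGraph or Temporal APIs.

```python
import copy

class CheckpointStore:
    """Keeps an append-only list of state snapshots per thread_id."""
    def __init__(self):
        self.history = {}  # thread_id -> list of snapshots

    def save(self, thread_id, state):
        # Deep copy so later mutations of `state` cannot alter the snapshot.
        self.history.setdefault(thread_id, []).append(copy.deepcopy(state))
        return len(self.history[thread_id]) - 1

    def load(self, thread_id, index=-1):
        return copy.deepcopy(self.history[thread_id][index])

    def fork(self, thread_id, index, new_thread_id, updates):
        """Start a new thread from an earlier snapshot with modified state."""
        state = self.load(thread_id, index)
        state.update(updates)
        self.save(new_thread_id, state)
        return state

store = CheckpointStore()
state = {"messages": [], "step_count": 0}
for step in ["plan", "search", "answer"]:
    state["messages"].append(step)
    state["step_count"] += 1
    store.save("user-123", state)       # snapshot after every step

resumed = store.load("user-123")         # resume from the latest snapshot
forked = store.fork("user-123", 0, "user-123-fork", {"x": "new"})
print(resumed, forked)
```

Forking into a new thread leaves the original three snapshots untouched, which is the property that makes "undo" and what-if exploration safe.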

Table 9: Execution Patterns

How an agent is invoked controls its throughput, perceived latency, and integration shape. These five patterns — synchronous, asynchronous, streaming, event-driven, and batch — solve different problems and frequently get confused with one another, especially async vs parallel and streaming vs async.

Pattern	Example	Description
Synchronous Execution	`result = agent.run(query)` `print(result)`	• Blocking call that waits for agent completion before returning • simpler to reason about but locks the caller until the full response is ready.
Asynchronous Execution	`task = asyncio.create_task(agent.run(query))` `result = await task`	• Non-blocking invocation that yields control while the agent works • concurrency (single-threaded event loop), not parallel execution; lets the caller interleave other I/O.
Streaming	`async for token in agent.stream():` `print(token, end="")`	• Agent emits partial outputs (token deltas via SSE) as they are generated • cuts Time-to-First-Token and improves perceived latency without changing total generation time.
Event-Driven	`agent.on("tool_call", log_callback)` `agent.on("error", retry_callback)`	• Callback-based execution where typed events (run lifecycle, tool calls, state deltas, errors) fire registered handlers • canonical pattern in the AG-UI protocol for agent-UI integration.
Batch Processing	`batch = client.messages.batches.create(requests=[...])` `# poll, then read results`	• Submit many requests in one job processed asynchronously over up to 24 hours • 50% cheaper on OpenAI and Anthropic; suited to evals, classification, bulk generation.
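
Async and streaming are often conflated, so the sketch below shows both using only `asyncio`. `run_agent` and `stream_agent` are hypothetical stand-ins, with `asyncio.sleep` simulating model latency.

```python
import asyncio

async def run_agent(query: str) -> str:
    """Asynchronous call: yields control while 'waiting on the model'."""
    await asyncio.sleep(0.01)
    return f"answer to {query}"

async def stream_agent(query: str):
    """Streaming call: emits partial output as each piece becomes available."""
    for token in ["answer", " to", f" {query}"]:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token

async def main():
    # Asynchronous execution: two runs interleave on one event loop.
    results = await asyncio.gather(run_agent("a"), run_agent("b"))
    # Streaming: the caller can act on each token before the run finishes.
    tokens = [token async for token in stream_agent("a")]
    return results, tokens

results, tokens = asyncio.run(main())
print(results, tokens)
```

The two are orthogonal: the asynchronous call still returns its answer in one piece, while the streaming call changes when output becomes visible, not how long generation takes.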

Table 10: Reasoning Techniques

Reasoning techniques shape how an agent thinks through a problem before answering — from solving with no examples (zero-shot) to ensembling many sampled reasoning paths (self-consistency) to looping with the environment (agentic reasoning). The right choice depends on the task's complexity, the cost budget, and whether the agent needs to ground itself in external observations.

Technique	Example	Description
Zero-Shot	`"Translate this to French: Hello"`	• Agent solves the task with no examples in the prompt, relying solely on instruction and pre-training • no weight updates — distinct from fine-tuning • fastest and cheapest, but less reliable for complex or ambiguous tasks.
Few-Shot	`Examples:` `Q: 2+2 A: 4` `Q: 3+5 A: 8` `Now: 7+9 = ?`	• A handful of input-output example pairs are placed in the prompt before the actual query • the model conditions on these demonstrations in-context — no gradient updates (Brown et al., GPT-3, 2020) • teaches task format and output style purely through demonstration.
Self-Consistency	`Run the same CoT prompt 5 times with temperature > 0, return the majority answer`	• Samples multiple independent chain-of-thought paths for the same question, then returns the answer that appears most often (Wang et al., 2022) • replaces greedy decoding with sample-and-vote aggregation • boosts hard-reasoning accuracy (e.g. +17.9% on GSM8K) at roughly N× the token cost.
Graph-of-Thought	`Nodes = LLM thoughts, edges = dependencies; aggregate, refine, loop`	• Generalizes Tree-of-Thought to an arbitrary graph of thoughts (Besta et al., 2023), enabling aggregation across branches, cycles, and feedback-loop refinement • unlike a tree, separate reasoning paths can be merged into a synergistic thought • trades higher orchestration cost for quality gains on elaborate problems.
Agentic Reasoning	`Thought → Action → Observation → Thought → … (ReAct loop)`	• Closed-loop reasoning where the agent thinks, takes an action (tool call, API), observes the outcome, and adjusts the next thought • canonical instantiation is ReAct (Yao et al., 2022) — interleaved thought-action-observation traces • grounds the agent in the environment, which sharply reduces the fact-hallucination that pure chain-of-thought suffers from.
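
Self-consistency reduces to sample-and-vote, which a few lines can show. In this sketch `sample_answer` is a hypothetical stand-in that replays canned answers; a real implementation would sample chain-of-thought completions at temperature above zero and extract each final answer.

```python
from collections import Counter
from itertools import cycle

# Canned final answers standing in for five sampled reasoning paths.
_CANNED_ANSWERS = cycle(["16", "16", "15", "16", "14"])

def sample_answer(question: str) -> str:
    """Stand-in for one sampled chain-of-thought run; returns its final answer."""
    return next(_CANNED_ANSWERS)

def self_consistency(question: str, n: int = 5) -> tuple:
    """Sample n answers and return the majority answer with its vote share."""
    answers = [sample_answer(question) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

answer, agreement = self_consistency("What is 7 + 9?")
print(answer, agreement)
```

The vote share doubles as a rough confidence signal: low agreement can be a cue to sample more paths or escalate to a stronger model.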

Table 11: Planning Strategies

Five strategies an agent uses to turn a goal into action. They differ in when the plan is built, how it is structured, and what the agent does when reality fails to cooperate — from a one-shot decomposition into subtasks, through multi-level hierarchies, to runtime replanning and the upfront Planner-Worker-Solver split that ReWOO introduces.

Strategy	Example	Description
Task Decomposition	`"Write report" → ["research", "outline", "draft", "edit"]`	• Breaks a complex goal into subtasks • the agent identifies the logical steps required to achieve the objective, each simpler than the original goal.
Hierarchical Planning	High-level plan → Detailed sub-plans for each step	• Multi-level decomposition in which the agent plans at several granularities • a top-level strategy is refined into tactical execution steps.
Dynamic Replanning	Adjust plan when action fails or new info appears	• Agent updates its strategy at runtime based on execution results • abandons unsuccessful paths and generates new plans in response to changing conditions.
Contingency Planning	If primary approach fails, execute backup plan	• Creates alternative strategies upfront • the agent has predefined fallbacks for anticipated failure modes.
ReWOO	Planner generates full tool-use plan upfront without intermediate observations	• Reasoning WithOut Observation — separates planning from tool execution • planner creates complete action sequence before any tool is called, reducing redundant LLM calls • more token-efficient than ReAct for predictable workflows.
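
A ReWOO-style Planner-Worker-Solver split might look like the sketch below. The plan format with `#E1`-style evidence placeholders follows the general idea of the technique, but the function names, tools, and tuple layout are assumptions made for illustration.

```python
def planner(goal: str) -> list:
    """Stand-in planner: emits the whole tool plan before anything runs.
    '#E1' is a placeholder for evidence that does not exist yet."""
    return [
        ("#E1", "get_user", "Ada"),
        ("#E2", "get_orders", "#E1"),
    ]

# Hypothetical tools returning plain strings.
TOOLS = {
    "get_user": lambda name: "user-7",
    "get_orders": lambda user_id: f"2 orders for {user_id}",
}

def worker(plan: list) -> dict:
    """Executes each step in order, substituting earlier evidence into arguments."""
    evidence = {}
    for variable, tool, argument in plan:
        for name, value in evidence.items():
            argument = argument.replace(name, value)
        evidence[variable] = TOOLS[tool](argument)
    return evidence

def solver(goal: str, evidence: dict) -> str:
    """Stand-in solver: combines the collected evidence into a final answer."""
    return f"{goal}: {evidence['#E2']}"

goal = "Orders for Ada"
evidence = worker(planner(goal))
result = solver(goal, evidence)
print(result)
```

No model call happens between tool executions here, which is where the token savings over a ReAct loop come from; the trade-off is that the plan cannot adapt if a step returns something unexpected.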

Table 12: Error Handling

Production agents fail constantly: providers throttle, networks blip, models time out, and downstream tools return garbage. These five patterns are the standard distributed-systems toolkit applied to LLM agents — retry transient errors with backoff and jitter, fall back to alternatives, trip a circuit breaker before a retry storm crashes a recovering provider, degrade gracefully when full output is impossible, and propagate unrecoverable errors up to a supervisor.

Technique	Example	Description
Retry with Backoff	`for attempt in range(3):` `try: return call_api()` `except Exception: sleep(2**attempt + random.random())`	• Automatic retry with exponentially increasing delays plus random jitter • returns on the first success • handles transient failures like rate limits or network glitches.
Fallback Strategies	`try: use_gpt4()` `except Exception: use_gpt35()`	• Alternative approaches when primary fails • agent switches to backup model, tool, or method if first choice unavailable.
Circuit Breaker	After N failures, stop trying for cooldown period	• Prevents cascading failures • temporarily disables failing service to allow recovery rather than overwhelming it with retries.
Graceful Degradation	Return partial results when full task impossible	• Agent completes what it can even when encountering errors • provides best-effort output rather than total failure.
Error Propagation	Pass error context upward in multi-agent hierarchy	• Bubbles failures to supervisor agents who can make recovery decisions • maintains error visibility while delegating handling.
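
The retry and circuit-breaker rows can be combined into a small sketch. Everything named here (`TransientError`, `CircuitOpenError`, `retry_with_backoff`, `CircuitBreaker`, and the default thresholds) is a hypothetical illustration, not an API from any particular framework. The `sleep` and `clock` parameters are injectable so the logic can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, network blip)."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Retry fn on TransientError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate to the caller or supervisor
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter spreads out retry storms


class CircuitBreaker:
    """Open after N consecutive failures, then allow one trial call after cooldown."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("service disabled during cooldown")
            self.opened_at = None  # half-open: let a single trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A common arrangement is to wrap the provider call in the breaker and the breaker call in the retry helper, so that retries stop as soon as the breaker opens and the error propagates upward.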

Table 13: Evaluation & Testing

Metrics, frameworks, and benchmark suites that production teams use to answer two distinct questions about an agent: did it succeed? and did it succeed for the right reasons? Outcome-style measures (Task Success Rate, SWE-bench, GAIA) sit alongside trajectory-style measures (Trajectory Analysis, Tool Accuracy) and the human-in-the-loop and LLM-as-Judge graders that score everything in between.

Metric	Example	Description
Task Success Rate	`successful_tasks / total_tasks`	• Headline outcome metric: the fraction of evaluation tasks the agent completes against the success criteria • because it is a ratio, it lets you compare agent versions apples-to-apples even when benchmark sizes differ.
Trajectory Analysis	Evaluate the reasoning path and tool calls, not just the final answer	• Inspects the full transcript: reasoning steps, tool calls, intermediate state • catches agents that pass via lucky paths and reveals why failures happen, not just that they happened.
Tool Accuracy	`correct_tool_calls / total_tool_calls`	• Action-layer metric: did the agent select the right tools with the right arguments? • foundational for tool-using agents — poor tool selection cascades into everything else.
Hallucination Rate	`fabricated_facts / total_statements`	• Frequency of invented information not supported by the provided context • measured against ground truth or a retrieval context; lower is better.
LLM-as-Judge	`judge_llm.score(output, rubric)`	• Use a stronger LLM to grade outputs against a rubric or pick a winner in a pairwise comparison • scales human evaluation but inherits judge biases like position, verbosity, and self-enhancement.
SWE-bench	Resolve real GitHub issues from popular Python repos	• Standard benchmark for coding agents — an agent passes only if its patch makes the hidden test suite go from failing to passing • SWE-bench Verified is near-saturated in 2026 (top agents ~100%); SWE-bench Pro on Scale AI's SEAL leaderboard is the new standard (multi-language, harder harness — same top agent drops to ~46%) • data contamination on Verified led OpenAI to stop reporting those scores.
GAIA	466 real-world tasks mixing web browsing, file parsing, multi-document reasoning; top agents ~75% in 2026	• General AI Assistants benchmark that chains tool use, web browsing, and reasoning across three difficulty levels • progress from ~20% (2023) to 74.5% in early 2026; the same model scores up to 7 points differently across orchestration frameworks — the scaffold, not just the model, determines results.
τ-bench	Agent chats with LLM-simulated user AND calls tool APIs; pass^k measures reliability across k re-runs	• Real-world conversational agent benchmark with domain-specific policies (retail and airline) testing multi-turn interaction, tool use, and rule-following • measures reliability via pass^k — top models drop from ~45–71% on pass^1 to ~25% on pass^8, revealing production unreliability hidden by single-run averages.
DeepEval	`assert_test(test_case, [ToolCorrectnessMetric()])`	• Open-source, pytest-style LLM evaluation framework (Apache 2.0) • ships deterministic metrics (Tool Correctness) alongside LLM-as-Judge metrics (G-Eval, Hallucination, RAGAS, Task Completion).
HAL (Holistic Agent Leaderboard)	Princeton's standardized, cost-aware leaderboard covering GAIA, SWE-bench, WebArena, TAU-bench, and more	• Unified cost-aware evaluation harness (accepted ICLR 2026) for reproducible comparison across benchmarks and agent frameworks • tracks cost-performance Pareto frontier; agents can be 100× more expensive for only 1% accuracy gain — a one-dimensional leaderboard hides this.
Human Feedback	Expert review of agent transcripts or user satisfaction ratings	• Gold-standard grading for subjective qualities like helpfulness, tone, and edge-case judgment • expensive and slow, so often reserved for calibrating LLM judges and spot-checking.
Benchmark Datasets	AgentBench (OS/DB/web tasks), WebArena (web automation), MMLU (knowledge), HumanEval (code)	• Standardised public test sets that enable apples-to-apples comparison across models • each measures something specific, so one benchmark alone never proves general capability • WebArena success rates rose from 15% (2023) to 74.3% (2026) • saturated benchmarks lose signal.
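
The ratio metrics in this table are simple to compute once runs are recorded. The sketch below shows Task Success Rate, a position-wise Tool Accuracy, and pass^k. The function names are hypothetical, the exact-match definition of a "correct" tool call is one assumption among several reasonable ones, and the pass^k estimator used is the combinatorial form C(c, k) / C(n, k) averaged over tasks, where c of n recorded trials succeeded.

```python
from math import comb


def task_success_rate(results: list[bool]) -> float:
    """successful_tasks / total_tasks."""
    return sum(results) / len(results)


def tool_accuracy(expected: list[tuple], actual: list[tuple]) -> float:
    """Fraction of actual calls whose (name, arguments) match the expected
    call at the same position. Assumption: strict positional exact match."""
    if not actual:
        return 0.0
    correct = sum(e == a for e, a in zip(expected, actual))
    return correct / len(actual)


def pass_hat_k(trials_per_task: list[list[bool]], k: int) -> float:
    """Estimated chance that k independent re-runs of a task ALL succeed,
    averaged over tasks. Each inner list holds the outcomes of n trials."""
    scores = []
    for trials in trials_per_task:
        n, c = len(trials), sum(trials)
        if k > n:
            raise ValueError("k cannot exceed the number of recorded trials")
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)
```

With two tasks, one passing 4 of 4 trials and one passing 2 of 4, `pass_hat_k(..., 1)` is 0.75 while `pass_hat_k(..., 4)` falls to 0.5, which is the reliability gap that single-run averages hide.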

Table 14: Observability & Debugging

Without observability, multi-step agent failures are nearly un-debuggable: each LLM call, tool invocation, and sub-agent decision happens behind the model's reasoning, and a flat log won't tell you which step caused the wrong answer. This table covers the layered toolkit teams reach for in production — tracing for causal execution trees, logging and real-time monitoring for what's happening now, dedicated platforms like LangSmith, Langfuse, and Laminar, callback hooks for low-overhead instrumentation, and replay for reproducing intermittent bugs from saved state.

Tool	Example	Description
Tracing	`@traceable` `def assistant(q): ...`	• Captures the execution tree as parent-child spans across every LLM call, tool use, and sub-agent step • reveals causality and timing that flat logs cannot.
Logging	`logger.info(f"Agent chose tool: {tool_name}")`	• Time-stamped records of discrete events written to a persistent store • useful for post-mortem analysis and compliance auditing, but lacks span-level causality.
Real-Time Monitoring	Dashboard showing trace count, latency p50/p99, error rate, cost	• Live production visibility with prebuilt panels for traces, LLM calls, tools, and costs • threshold alerts fire when error rate or latency cross configured limits.
LangSmith	`export LANGSMITH_TRACING=true` `# traces auto-captured`	• LangChain's hosted observability platform with LangGraph Studio IDE for visual agent debugging, 1-click deployment, and zero-setup tracing for LangChain/LangGraph apps • full OTel support as of March 2026; self-hosting is Enterprise only.
Langfuse	`@observe(as_type="agent")` `def run_agent(q): ...`	• Open-source LLM engineering platform (MIT) from an independent team, not LangChain • framework-agnostic via OpenTelemetry with first-class free self-hosting; strong evaluation, prompt management, and dataset workflows.
Laminar	`@observe(name="agent")` `async def run_agent(q): ...` `# session replay synced to trace`	• Real-time agent debugging platform built around a span-tree causal model and first-class Replay workflows • browser-agent session replay is synced to traces — useful for debugging what a CUA actually saw; native OTel ingestion and data-volume pricing distinguish it from LangSmith and Langfuse.
Callback Handlers	`on_llm_start`, `on_tool_end`, `on_error`	• Observer-only event hooks triggered at lifecycle points without modifying chain logic • attach via `RunnableConfig` for per-request scoping.
Replay	`graph.invoke(None, checkpoint_config)`	• Re-executes nodes from a saved checkpoint to reproduce a bug or test a fix • in LangGraph, `get_state_history` lists checkpoints you can replay or fork from.
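
To make the difference between flat logs and span trees concrete, here is a framework-free sketch of parent-child tracing. The `traced` decorator, the `SPANS` list, and the two example functions are hypothetical and are not the LangSmith, Langfuse, or Laminar APIs; real platforms export spans to a backend instead of an in-memory list.

```python
import contextvars
import functools
import time
import uuid

# The span currently executing in this context, or None at the root.
_current_span = contextvars.ContextVar("current_span", default=None)
SPANS: list[dict] = []  # finished spans; stand-in for an exporter


def traced(fn):
    """Record a span for each call and link it to the enclosing span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = {
            "id": uuid.uuid4().hex[:8],
            "name": fn.__name__,
            "parent_id": parent["id"] if parent else None,
            "error": None,
        }
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            span["error"] = repr(exc)  # keep the failure on the span itself
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            _current_span.reset(token)
            SPANS.append(span)
    return wrapper


@traced
def search_tool(query: str) -> str:
    return f"results for {query}"


@traced
def run_agent(question: str) -> str:
    return "answer based on " + search_tool(question)
```

After `run_agent("hi")`, the `search_tool` span carries the `run_agent` span's id as its `parent_id`, which is the causal link a plain `logger.info` line does not record.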

Table 15: Retrieval-Augmented Generation (RAG) for Agents

RAG techniques connect an agent to external knowledge so its answers can be grounded in fresh, domain-specific facts instead of memorized weights. The patterns below stack: query transformation reshapes the input, vector or graph retrieval pulls candidates, reranking sharpens them, and an agentic controller decides when any of this is even worth running.

Technique	Example	Description
Vector Search	`query_vec = embed(query)` `results = vector_db.search(query_vec, k=5)`	• Semantic retrieval of relevant documents using embedding similarity (e.g. cosine, dot product) • agent queries a vector store to augment reasoning with external facts.
Agentic RAG	Agent decides when to retrieve, what to query, how to use results	• Agent controls retrieval rather than always fetching upfront • reasons step-by-step about necessity, formulates queries, may iterate or skip retrieval entirely for simple questions.
Query Transformation	Original query → hypothetical answer (HyDE) → embed and retrieve against that	• Pre-retrieval step that rewrites or expands the query to close the query–document vocabulary gap • includes HyDE, multi-query, and step-back prompting.
Reranking	`candidates = retrieve(query, k=20)` `top_results = reranker.rank(query, candidates, k=5)`	• Post-retrieval step that re-scores candidates using a cross-encoder (or LLM judge) that sees query and document jointly • lifts precision after high-recall vector retrieval.
Graph RAG	Query a knowledge graph for entity relationships and community summaries	• Retrieves structured knowledge from a graph of entity nodes and relationship edges • supports multi-hop reasoning that flat vector similarity cannot, with substantial gains in answer comprehensiveness on global questions (Microsoft Research, 2024).
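
The retrieve-then-rerank stacking described above can be sketched without any external service. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, and `coverage_score` is a hypothetical joint scorer standing in for a cross-encoder. Only the two-stage shape carries over to a real system.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int) -> list[str]:
    """Stage 1: high-recall candidate fetch by embedding similarity."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def coverage_score(query: str, doc: str) -> float:
    """Toy joint scorer: share of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / len(q_terms)


def rerank(query: str, candidates: list[str], k: int,
           score=coverage_score) -> list[str]:
    """Stage 2: re-score each (query, document) pair and keep the top k."""
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:k]


DOCS = [
    "how to reset your password",
    "password policy and rotation rules",
    "reset the router to factory settings",
    "billing and invoices",
]
```

A typical call is `rerank(q, retrieve(q, DOCS, k=3), k=2)`: the first stage casts a wide net cheaply, and the second spends more compute per pair on a much smaller set.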

Table 16: Context Management

Every agent runs on a finite token budget — what you feed the model, what it has said so far, what it retrieved, and what it generates all share the same window. This table covers the techniques that decide what stays, what gets compressed, and what gets cached so an agent can run for hours without the model losing focus or burning your budget on the same tokens twice.

Technique	Example	Description
Context Window	Claude Opus 4.6 / Sonnet 4.6: 1M tokens (GA); Gemini 2.5 Pro: 1M tokens	• All tokens the model can reference in one call, including the response it generates • covers system prompt, history, retrieved docs, current query, and output • 1M-token windows are now generally available from Anthropic (Opus 4.6 / Sonnet 4.6) and Google.
Context Overflow Handling	Stop with `model_context_window_exceeded`, then summarize and continue	• Strategies for the hard token limit • modern APIs return an explicit error or stop reason at the boundary rather than silently dropping tokens • common responses are truncation, summarization (compaction), or splitting the work across calls.
Prompt Caching	`cache_control: {type: "ephemeral"}` on a static system prompt	• Reuses a static prompt prefix across requests by storing its processed tokens server-side • cache hits require an exact (hash-level) match of the prefix up to the breakpoint • default TTL is 5 minutes (Anthropic) • cache reads are billed at a fraction of fresh input tokens, which yields roughly a 90% cost reduction on cached tokens in production.
Semantic Caching	Embed query; if cosine similarity to a stored query exceeds the threshold, return the cached answer	• Reuses a prior response when a new query is semantically close to one already answered • matches by embedding similarity, not by exact text — so the model is skipped entirely on a hit • research shows ~31% of LLM queries exhibit semantic similarity; lives in the application layer, distinct from provider-side prompt caching.
Prompt Compression	LLMLingua drops low-information tokens to shrink a prompt up to ~20x	• Cuts tokens while preserving meaning for the model • LLMLingua-family methods use a small language model to score and drop low-information tokens at the token level • distinct from summarization, which paraphrases rather than removes tokens.
Dynamic Context Selection	"Just-in-time" loading: agent reads `file_paths`, queries a DB, or calls a tool only when needed	• Agent decides what to load for the current step instead of stuffing everything up front • keeps the active window small and task-focused, avoiding context rot from irrelevant tokens • Anthropic's Claude Code uses this pattern with `grep`, `head`, and stored references.
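
The overflow-handling row mentions summarization (compaction) as a response to the hard token limit. A minimal sketch of that idea follows, assuming a crude whitespace token counter and a stub summarizer. `count_tokens`, `naive_summary`, and `compact` are hypothetical names; a real agent would use the provider's tokenizer and an LLM-written summary.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def naive_summary(messages: list[str], max_tokens: int) -> str:
    """Stub summarizer: keeps the first words of the older turns.
    A real implementation would ask a model to write the summary."""
    words = " ".join(messages).split()
    return "[summary] " + " ".join(words[: max_tokens - 1])


def compact(system: str, history: list[str], budget: int,
            summary_reserve: int = 20, summarize=naive_summary) -> list[str]:
    """Keep the system prompt and the newest turns that fit the budget,
    and replace everything older with one summary message."""
    remaining = budget - count_tokens(system) - summary_reserve
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()
    older = history[: len(history) - len(kept)]
    window = [system]
    if older:
        window.append(summarize(older, summary_reserve))
    window.extend(kept)
    return window
```

Reserving the summary's tokens before filling the window is the design choice that keeps the result under the budget. Without the reserve, adding the summary afterwards could push the request back over the limit.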

Table 17: Security & Safety

Agentic systems amplify traditional LLM risks because the model can act, hold credentials, and chain tool calls — so a single manipulated prompt or poisoned tool description can escalate into data exfiltration or destructive action. The defenses below come from the OWASP Top 10 for Agentic Applications (2026), OWASP's AI Agent Security Cheat Sheet, and current vendor guardrail toolkits; they work as layered controls, not silver bullets.

Technique	Example	Description
Prompt Injection	Hidden instructions in a retrieved document override the agent's system prompt	• Untrusted content embeds instructions the model treats as commands • direct (in user input) or indirect (in retrieved docs, tool output, images, emails) • OWASP LLM01 and the root vector for most agent breaches.
Input Validation	Strip known injection patterns and segregate untrusted content with delimiters before the model sees it	• Treat every external string as untrusted — user input, retrieved docs, tool output, email bodies • sanitize, length-limit, and clearly mark data vs. instructions • one layer of defense in depth, never sufficient alone.
Guardrails	`nemoguardrails` checks input and output flows around every LLM call	• Runtime constraints that wrap the model (input, dialog, retrieval, execution, and output rails) • NeMo Guardrails, Guardrails AI, Llama Guard, Azure Prompt Shields • programmable middleware — not the same as the model's own safety training.
AI Gateway Guardrails	API base URL → AI gateway (Bifrost, Portkey) → model provider	• Enforces safety policies once at the gateway layer for all model traffic — PII redaction, prompt injection defense, content filtering — without modifying individual agent codebases • production pattern: gateway covers OWASP LLM01/02/05/08; application-level rails (NeMo, Guardrails AI) handle conversational scope and excessive agency.
Sandboxing	Run agent-generated code in a gVisor or Firecracker microVM with no host filesystem access	• Isolate agent code execution so a successful exploit cannot reach the host or sensitive data • containers, gVisor user-space kernel, Firecracker/Kata microVMs, WebAssembly • limits blast radius; does not prevent the exploit itself.
Tool Poisoning	Malicious MCP server hides instructions inside a tool description the user never sees	• Attacker-controlled tool metadata (descriptions, schemas) manipulates the agent at registration time • a supply-chain-style indirect prompt injection unique to tool registries like MCP • enables "rug pulls" and cross-server shadowing of trusted tools.
Action Approval	Agent proposes `transfer_funds`; execution blocks until a human approves the exact parameters	• Human-in-the-loop gate for high-impact or irreversible actions only • bind approval to actor, tool, target, and parameters with a short expiry • applied selectively so routine low-risk actions are not slowed.
Access Control	`if user.role != "admin": deny_tool("delete_db")`	• Least-privilege scoping of tool capabilities based on the calling user and session context • per-tool permission lists, read-vs-write splits, scoped tokens • mitigates OWASP LLM06 Excessive Agency.
Non-Human Identity (NHI)	Each agent gets a unique service identity with rotated short-lived credentials and an explicit owner	• Treat every agent as its own identity with creation, rotation, and revocation lifecycle • machine identities now outnumber humans by an order of magnitude or more in many enterprises • Entro's 2025 report found 97% of NHIs hold excessive privileges.
OWASP Agentic Top 10	Goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents	• Peer-reviewed 2026 risk framework for autonomous agents, published December 2025 • complements the OWASP LLM Top 10 (model-level) with action-level risks • describes a breach progression, not just an isolated checklist.
Agent Governance Toolkit	`pip install agent-governance-toolkit`; YAML policy evaluated on every tool call	• Microsoft open-source runtime security layer (April 2026, MIT license) covering all 10 OWASP Agentic risks • stateless policy engine targeting sub-millisecond p99 enforcement • adapters for LangChain, AutoGen, CrewAI, OpenAI Agents SDK, Semantic Kernel.
Audit Logging	Record agent ID, session, tool, parameters, decision, and outcome for every action	• Tamper-evident activity trail of decisions, tool calls, approvals, and outcomes • redact secrets before writing, log decision metadata for high-risk actions • powers anomaly detection, incident response, and compliance evidence.
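
The Access Control and Action Approval rows can be sketched together: a least-privilege check per role, plus an approval that is bound to the exact actor, tool, and parameters and expires quickly. All names here (`ROLE_TOOLS`, `HIGH_RISK`, `issue_approval`, `authorize`) and the hard-coded demo secret are illustrative assumptions, not part of any toolkit named in the table.

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # assumption: real systems load this from a secret manager

ROLE_TOOLS = {
    "viewer": {"read_record"},
    "admin": {"read_record", "delete_record"},
}
HIGH_RISK = {"delete_record"}  # tools that always need a human approval


def _digest(actor: str, tool: str, params: dict, expires_at: float) -> str:
    payload = json.dumps([actor, tool, params, expires_at], sort_keys=True)
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_approval(actor, tool, params, ttl=60, now=None) -> dict:
    """Called after a human approves: binds the approval to this exact action."""
    expires_at = (time.time() if now is None else now) + ttl
    return {"expires_at": expires_at,
            "sig": _digest(actor, tool, params, expires_at)}


def authorize(role, actor, tool, params, approval=None, now=None):
    """Return (allowed, reason) for a proposed tool call."""
    now = time.time() if now is None else now
    if tool not in ROLE_TOOLS.get(role, set()):
        return False, "tool not permitted for role"
    if tool in HIGH_RISK:
        if approval is None:
            return False, "approval required"
        if now > approval["expires_at"]:
            return False, "approval expired"
        expected = _digest(actor, tool, params, approval["expires_at"])
        if not hmac.compare_digest(expected, approval["sig"]):
            return False, "approval does not match this action"
    return True, "ok"
```

Binding the signature to the parameters means an approval for deleting record 7 cannot be replayed to delete record 8, and the short expiry limits how long a leaked approval stays useful.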

Table 18: Cost Optimization

Token bills scale faster than features once an agent moves to production, so cost control is a first-class engineering concern. Agents make 3–10× more LLM calls than simple chatbots, and output tokens cost 3–8× more than input tokens across major providers — making cascading model routing and caching the highest-leverage levers available.

Technique	Example	Description
Model Selection	Route classification to Haiku, save Sonnet/Opus for reasoning	• Task-aware model routing • pick the cheapest model that meets the quality bar for each task class, escalating only on low-confidence responses • the price gap between flagship and small models is roughly 15–190× per token, so routing the simple majority away from the flagship dominates other savings.
Cascade Routing	Small model → confidence check → escalate to frontier only if confidence below threshold	• Dynamic confidence-based model escalation — each query is sent to the cheapest model first; only low-confidence or high-entropy responses escalate to a larger model • well-implemented cascades can reduce costs by up to 87% by routing ~90% of queries to small models; OpenAI's GPT-5 architecture uses this pattern internally.
Prompt Caching	Cache a stable system prompt + tools prefix; vary only the user message	• Reusing repeated prompt portions across calls • on Anthropic, cache writes cost 1.25× base input (5-min TTL) or 2× (1-hour TTL), and cache reads cost 0.1× base input (~90% savings on cached tokens) • only pays off above the reuse break-even; a breakpoint on a changing block writes every request and never reads, increasing the bill.
Batch API	Submit overnight classification of 10,000 documents	• Asynchronous batch processing at a 50% discount on both input and output tokens • results returned within 24 hours (often faster) on OpenAI and Anthropic — suitable for evaluations, backfills, and analytics, never for real-time user requests.
Output Token Limits	`max_tokens=200` for summaries vs `max_tokens=2000` for essays	• Capping generation length with `max_tokens` / `max_completion_tokens` • bounds the most expensive token class (output is 3–8× input price on most models) and prevents pathological long responses; the cap limits only generation, not the input prompt.
Early Stopping	Break the ReAct loop when the final-answer tool fires	• Agent terminates the reasoning loop once the goal is reached rather than burning through a fixed iteration budget • each loop iteration is one LLM call, so stopping early on success and detecting redundant tool loops removes the tail of wasted spend.
LLM FinOps	Tag every call with `feature_id` + `tenant_id`; alert at 50/80/100% of monthly budget	• Applying the financial-operations discipline to AI inference spend • tag-then-aggregate per feature/agent/tenant, then measure cost-per-outcome (resolved ticket, completed task) instead of cost-per-token; without per-agent attribution, kill switches and budgets cannot be enforced.
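
Two rows above lend themselves to small worked examples: the prompt-caching break-even and cascade routing. The sketch uses the Anthropic multipliers quoted in the table (1.25× or 2× for a write, 0.1× for a read) and assumes every request after the first lands within the TTL. The `cascade` function and its confidence scores are hypothetical; how a model's confidence is obtained is left open.

```python
def cached_prefix_cost(n_requests: int, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Cost of one prefix over n requests, in units of one uncached pass.
    Assumes one cache write followed by n - 1 cache hits inside the TTL."""
    if n_requests < 1:
        return 0.0
    return write_mult + read_mult * (n_requests - 1)


def break_even_requests(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest request count at which caching is cheaper than not caching."""
    n = 1
    while cached_prefix_cost(n, write_mult, read_mult) >= n:
        n += 1
    return n


def cascade(query, models, threshold=0.8):
    """Try models cheapest first and escalate on low confidence.
    models: list of (name, fn) where fn(query) returns (answer, confidence)."""
    for name, model in models[:-1]:
        answer, confidence = model(query)
        if confidence >= threshold:
            return name, answer
    name, model = models[-1]  # the last model answers unconditionally
    return name, model(query)[0]
```

Under these assumptions the 5-minute cache pays for itself on the second request and the 1-hour cache on the third, while a prefix that changes on every request pays the write premium each time and never reaches a read.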

Table 19: Production Patterns

Shipping an agent to production exposes failure modes the demo never showed: retried payments creating duplicate charges, runaway tool loops draining budgets, silent half-completions, and crashed workers losing hours of progress. These five patterns are the safety rails that turn a clever agent into a reliable service — idempotent retries, selective human gates, outcome verification, hard time ceilings, and durable state on shutdown.

Pattern	Example	Description
Idempotency	`Idempotency-Key: <uuid>` on a retried POST	• Retry safety for write operations • same key returns the cached first result instead of repeating the side effect • critical for payments, emails, and any tool with external consequences.
Human-in-the-Loop	`interrupt()` pauses graph before sending an email	• Selective approval gates on high-risk or irreversible actions (5–15% of steps) • agent proceeds autonomously until it hits a gated tool, then waits for `Command(resume=...)`.
Closed-Loop Execution	After a tool call, read back state and verify the change landed	• Agent observes the actual outcome before the next step, not just the API status code • catches "200 OK but row never written" failures and re-plans.
Timeout Management	`await asyncio.wait_for(agent(), timeout=60)`	• Hard wall-clock ceiling on a single run • cancels the task and raises `TimeoutError` so runaway loops or hung tool calls can't burn budget forever.
Graceful Shutdown	Persist checkpoint to DynamoDB on SIGTERM, resume on restart	• Preserve work-in-progress by writing state to durable storage before exit • lets the next worker pick up from the last super-step instead of replaying from scratch.
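
The Idempotency and Timeout Management rows can be illustrated with a short sketch. `IdempotentExecutor` and `run_with_deadline` are hypothetical helpers, and the in-memory dictionary is an assumption made for brevity; a production store would need to be durable and to check and record keys atomically.

```python
import asyncio


class IdempotentExecutor:
    """Run a side-effecting action at most once per idempotency key."""

    def __init__(self):
        self._results = {}  # key -> first result (in-memory, for illustration only)

    def execute(self, key: str, action, *args):
        if key in self._results:
            return self._results[key]  # retry: replay the stored result
        result = action(*args)
        self._results[key] = result
        return result


async def run_with_deadline(coro_fn, timeout: float) -> dict:
    """Apply a hard wall-clock ceiling to one agent run."""
    try:
        value = await asyncio.wait_for(coro_fn(), timeout=timeout)
        return {"status": "ok", "value": value}
    except asyncio.TimeoutError:
        # wait_for has already cancelled the task at this point
        return {"status": "timeout", "value": None}
```

The caller generates the key once per logical operation (for example with `uuid.uuid4()`) and reuses it on every retry, so a retried payment tool call returns the first charge instead of creating a second one.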
