Prompt engineering is the practice of designing and optimizing textual instructions that guide large language models (LLMs) and other AI systems to generate desired outputs. Born from the rise of transformer-based models like GPT, Claude, and Gemini, prompt engineering has evolved from simple question-answer patterns into a sophisticated discipline involving reasoning frameworks, output control, and security considerations. As models grow more capable, the field is converging with context engineering β the broader practice of shaping all information a model receives β making the structure, format, and context of prompts as important as the words themselves.
16 tables, 108 concepts. Select a concept node to jump to its table row.
Table 1: Core Prompting Approaches
Foundational ways to shape an LLM's answer using the wording of a single prompt, before reaching for reasoning chains or tools. They differ mainly in how much you show the model (no examples, one, or several), how you frame the request (persona, situational context, explicit rules), and whether the model is asked to clarify the question for itself first.
| Technique | Example | Description |
|---|---|---|
Translate to French: Hello | β’ Model performs task without examples, relying solely on pre-training knowledge β’ fast but less reliable for complex or domain-specific tasks | |
English: cat β French: chatEnglish: dog β French: chienEnglish: bird β ? | β’ Provides 2β5 example input-output pairs before the query β’ significantly improves accuracy and consistency for nuanced tasks | |
Example: "angry" β negativeClassify: "delightful" β ? | β’ Single demonstration example β’ useful when task is straightforward but model needs format guidance | |
You are an expert oncologist.Explain CAR-T therapy. | β’ Assigns a persona or expertise to the model β’ most effective for controlling tone, style, and output format rather than expanding factual knowledge | |
List three benefits. Use bullet points.Keep under 50 words. | β’ Explicit directives on what, how, and constraints β’ essential for controlling output length, format, and style | |
Background: User is a beginner.Task: Explain neural networks. | Provides situational information (audience, constraints, domain) to shape response appropriately | |
Rephrase and expand this question,then answer: Why is the sky blue? | β’ Model rephrases the question before answering β’ improves accuracy by resolving ambiguity in the original phrasing |
Table 2: Reasoning and Decomposition Techniques
These prompts get a model to work through a problem instead of guessing an answer in one leap. Some show step-by-step reasoning, others split a task into smaller parts, branch and backtrack across options, or interleave reasoning with real tool calls. Their gains depend heavily on model scale, and a written reasoning trace is not a guaranteed account of what the model actually did.
| Method | Example | Description |
|---|---|---|
Q: 23 + 47 = ?A: 23 + 47 = 20 + 40 + 3 + 7 = 60 + 10 = 70 | β’ Prompts model to show step-by-step reasoning β’ an emergent ability that helps large models on math and logic, but can fail or hurt on small models | |
Let's think step by step. | β’ Triggers reasoning without examples β’ effective shortcut when few-shot is impractical; redundant on reasoning models (o1/o3/R1) | |
Generate 5 answers via CoT β select majority answer | β’ Samples multiple independent reasoning paths and takes the majority answer β’ improves reliability, but costs N times the tokens and latency | |
Evaluate 3 approaches β explore best 2 β backtrack if stuck | β’ Models reasoning as branching exploration with self-evaluation and backtracking β’ uses search over partial paths, handling planning and multi-path problems | |
Step 1: Simplify equationStep 2: Solve for x using Step 1 | β’ Decomposes a problem into ordered subproblems, each fed the previous answer β’ generalizes to problems harder than the examples shown | |
Thought: Need population dataAction: search("France population")Observation: 67M β Answer | β’ Interleaves reasoning traces with tool actions and observations β’ grounds reasoning in retrieved results, reducing hallucination | |
First, devise a plan to solve this.Then carry out the plan step by step. | β’ Model plans subtasks before executing them β’ a zero-shot method that reduces the missing-step and calculation errors of zero-shot CoT | |
Before answering, what generalprinciples apply to this problem? | β’ Model identifies high-level concepts or first principles before specifics β’ improves reasoning on knowledge-intensive and abstract problems | |
thought_1 + thought_2 β aggregated_insightLoop back for refinement | β’ Organizes reasoning as a directed graph that can merge branches and loop back β’ most flexible for complex interdependent reasoning | |
Walk me through this contextstep by step, summarizing as you go. | β’ Segments and analyzes long or chaotic contexts methodically β’ plug-and-play technique for tasks with extended or noisy input | |
Cluster questions by diversity β auto-generate CoT demos | β’ Automatically constructs chain-of-thought demonstrations without manual effort β’ samples diverse questions so the occasional wrong auto-generated chain does little harm | |
Are follow-up questions needed?Yes: What is...? β intermediate answerFinal answer: ... | β’ Model generates and answers sub-questions before the main answer β’ improves compositional and multi-hop reasoning |
Table 3: Output Control and Formatting
These techniques shape what the model returns and how it is structured, so downstream code can parse it and humans can read it. A key distinction runs through the table: prompt-only instructions (asking for JSON, a length, or "do not" rules) are soft requests the model can miss, while API-level features like structured outputs and max_tokens enforce hard constraints.
| Technique | Example | Description |
|---|---|---|
Return as JSON: {"name": str, "age": int} | β’ Enforces a specific schema (JSON, XML, YAML) β’ only schema-enforced structured outputs (constrained decoding) guarantee conformance; a prompt-only "Return JSON" can still emit invalid or extra text | |
<context>text</context><instructions>summarize</instructions> | β’ Wraps prompt sections in semantic XML tags β’ reduces ambiguity by marking boundaries; an Anthropic best practice especially effective with Claude, but clear delimiters help most models | |
## Inputtext## Outputsummary | β’ Uses markers (###, ```, ---) to separate sections β’ reduces ambiguity about what content the model should process vs. generate | | |
Summarize in exactly 3 sentences.Keep under 100 tokens. | β’ Specifies word/sentence/token count β’ a stated count is a soft target the model may miss; max_tokens is a hard truncation that can cut output mid-word and break JSON | |
<summary> <title>...</title> <body>...</body></summary> | β’ Provides a markup skeleton for the model to fill β’ keeps nested or hierarchical output consistent; especially effective with XML | |
1. Extract entities2. Classify sentiment3. Return as table | β’ Numbered steps clarify sequence and expectations β’ improves task adherence when multiple operations are required | |
Do NOT include personal opinions.Avoid bullet points. | β’ Specifies what to exclude from output β’ unreliable on its own because models handle negation poorly; pair with positive framing (say what to include) |
Table 4: Advanced Reasoning Patterns
These patterns push a model past a single answer by adding structure: writing its own prompts or examples, critiquing and verifying its work, offloading hard computation to code, or trading layout for speed. Knowing what each one actually changes (and where it quietly fails) is what separates a reliable pipeline from a fragile one.
| Pattern | Example | Description |
|---|---|---|
Generate a prompt to classify movie reviews. | β’ Model writes or optimizes prompts for a task β’ enables iterative self-improvement and automated prompt engineering | |
First, list relevant facts about photosynthesis.Now answer: What role does chlorophyll play? | β’ Model generates intermediate knowledge before answering β’ improves factual accuracy on knowledge-intensive queries, using its own training (not external retrieval) | |
Draft β Critique your draft βRevise based on feedback β repeat | β’ Model iteratively generates, critiques, and refines its own output β’ no external model needed; lifts quality on generation tasks, but self-critique alone does not reliably fix reasoning errors | |
Answer β generate verification questions βanswer each independently β revise | β’ Model plans verification questions, answers them independently of the draft, then revises β’ significantly reduces hallucinations in factual tasks | |
Keywords: protein, folding, diseaseWrite an abstract. | Provides hints or cues (keywords, themes) to steer generation toward desired content; a small trained policy model can generate the hints for a frozen LLM | |
Write Python to solve: "If x^2 = 16, find x"def solve(): return sqrt(16) | β’ Model generates executable code as the reasoning step β’ offloads arithmetic to an interpreter that runs the code, for higher accuracy | |
First: generate outline with 5 sections.Then: write each section in parallel. | β’ Creates structural outline first, then parallelizes content generation β’ reduces latency by up to 2.4x for long outputs | |
Summary 1: sparse (50 words)Summary 2: denser (same length, +3 entities)Iterate 5 times | β’ Iteratively packs more entities into a fixed-length summary β’ produces human-preferred summaries by the later steps | |
Measure uncertainty on unlabeled questions β annotate most uncertain β add to few-shot pool | β’ Uses uncertainty sampling to select which examples a human should annotate β’ improves few-shot performance with minimal human labeling | |
Recall relevant problems similar to this,then solve by analogy. | β’ Model self-generates relevant examples before solving the task β’ eliminates manual few-shot curation; improves math and code reasoning | |
Generate propositions iteratively β verify each β accumulate into final answer | β’ Uses a proposer, verifier, and reporter to build the answer from verified steps β’ verifying each proposition before accumulating it is what sets it apart from plain chain-of-thought |
Table 5: Message Roles and Context Structure
Chat models read a list of role-tagged messages rather than one block of text. The system (and newer developer) role sets standing behavior, user carries the live request, and assistant holds the model's prior replies. Because each request is stateless, your app resends the whole list every turn to maintain context, and higher-privilege roles outrank the user role when instructions conflict.
| Role | Example | Description |
|---|---|---|
You are a helpful assistant specializing in Python. | β’ Sets global behavior, persona, and constraints β’ applied before all user messages as persistent context, but it is guidance, not a security boundary | |
How do I reverse a list in Python? | β’ Contains user query or command β’ the primary input the assistant responds to | |
Use list.reverse() or slicing: lst[::-1] | β’ Model's previous response, supplied back as history β’ you can also write one to prefill or steer the next answer | |
[user] "Define recursion"[] "..."[user] "Give example" | β’ Each request is stateless, so the client resends the full history every turn β’ longer chats cost more tokens and can exceed the context window | |
[] "Always respond in JSON format" | β’ OpenAI's newer app-developer instruction role β’ ranks above user messages in the instruction hierarchy and is meant to win conflicts |
Table 6: Prompt Chaining and Workflow Orchestration
Once a task is too big for one prompt, you compose several model calls into a pipeline. These patterns range from simple sequential chains to retrieval, tool use, routing, and self-directing agents. A key theme: the model proposes structured steps, but your code executes tools, routes branches, and enforces stop conditions.
| Technique | Example | Description |
|---|---|---|
Prompt 1: Extract entities β output_1Prompt 2: Classify entities from {output_1} | β’ Decomposes a task into sequential LLM calls β’ each prompt's output feeds the next, so steps stay simple and easy to debug | |
1. Retrieve docs about "mitochondria"2. Prompt: "Using {docs}, explain ATP synthesis" | β’ Fetches external documents at query time and adds them to the prompt β’ grounds answers in current or proprietary data without retraining the model | |
tools: [{"name": "get_weather", "parameters": {"location": "string"}}] | β’ Model selects a structured tool schema and emits name plus arguments β’ your application code runs the tool, so validate arguments before executing | |
Agent: Plan β Act β Observe β Refine β Act | β’ Model directs its own steps, choosing tools based on each result β’ loops toward a goal, so a max-iteration cap is needed to avoid runaway cost | |
If sentiment=negative: call escalation_promptElse: call thank_you_prompt | β’ Classifies the input, then routes it to a specialized prompt β’ separates concerns so each branch stays focused on one kind of case | |
Plan all tool calls upfront βexecute β synthesize | β’ Decouples planning from observation β’ a planner writes the full plan with placeholders, workers run tools, a solver combines results, cutting LLM calls vs ReAct |
Table 7: Sample Selection and Example Design
Which examples you put in a few-shot prompt, and in what order, often moves accuracy more than how many you add. This table covers the main ways to choose demonstrations, from query-matched and balanced sets to contrastive pairs, plus the biases that make order and label balance matter.
| Strategy | Example | Description |
|---|---|---|
Choose examples most similar to query via embedding distance | β’ Provides contextually relevant demonstrations β’ often outperforms random, but similar examples cluster and can lose diversity | |
2 positive, 2 negative, 1 neutral sentiment | β’ Ensures balanced coverage of categories β’ counters majority-label bias when data is imbalanced | |
Correct: "Step A β B β C"Incorrect: "Step A β C (missing B)" | β’ Shows both correct and incorrect cases β’ helps the model see which reasoning steps to avoid | |
Place most relevant or recent examples last | β’ LLMs exhibit recency bias β’ reordering the same examples can swing accuracy from near chance to near best | |
Pick 5 random examples from dataset | β’ Baseline approach β’ fast but mirrors data skew and ignores query relevance |
Table 8: Generation Parameters and Sampling
These settings control how a model turns its next-token probabilities into actual text: how much randomness to allow, which low-probability tokens to discard, how long to keep going, and when to stop. Tuning them well is the difference between focused, parseable output and creative-but-unreliable rambling.
| Parameter | Example | Description |
|---|---|---|
temperature=0.0 (deterministic)temperature=1.0 (creative) | β’ Controls randomness, not answer quality β’ lower = more focused/repetitive, higher = more diverse/creative β’ typical range 0β2; even 0 is not guaranteed bit-for-bit identical across runs | |
top_p=0.9 | β’ Keeps the smallest token set whose cumulative probability β₯ p, then samples from it β’ adapts the candidate count to the model's confidence β’ vendors recommend tuning temperature or top_p, not both | |
max_tokens=150 | β’ Hard cap on output length that truncates the moment it is hit β’ not a target length; can cut mid-sentence and break JSON β’ prevents runaway generation and controls cost | |
top_k=40 | β’ Restricts sampling to the k most likely tokens (a fixed count) β’ simpler than top-p but a blunt cutoff that ignores the distribution's shape | |
frequency_penalty=0.5 | Reduces repetition by penalizing tokens in proportion to how often they have already appeared (count-based) | |
presence_penalty=0.6 | Encourages topic diversity with a flat one-time penalty applied once a token has appeared at all, regardless of count | |
min_p=0.05 | β’ Keeps tokens above a fraction of the top token's probability (base value Γ top probability) β’ adaptive: strict when one token dominates, relaxed when the model is uncertain β’ pairs well with temperature > 1 | |
stop=["###", "\n\n"] | β’ Terminates generation when a specified string is produced (content-based, unlike the length-based max-tokens cap) β’ useful for structured outputs and preventing runaway text |
Table 9: Multimodal and Vision-Language Prompting
Multimodal prompting feeds a model more than text. You pass an image or audio clip alongside your question, and the model reasons over both. Keep in mind these models do not "see" or "hear" perfectly: they give approximate object counts, struggle with precise spatial detail, and can hallucinate text when reading documents, so verify anything high-stakes.
| Approach | Example | Description |
|---|---|---|
[image of chart]What trend does this show? | β’ Combines visual and textual input β’ model analyzes image content to answer text query | |
[image of room]How many chairs are visible? | Model performs object counting, detection, or scene understanding from image. Counts are approximate, so verify them | |
[scanned receipt]Extract total amount. | Reads and interprets text within images, including tables, forms, and structured documents. May hallucinate plausible but wrong values | |
[photo]Generate detailed caption. | Model produces natural language description of image content | |
[two images]Which object is larger? | Requires comparison or relational reasoning across visual inputs. Precise spatial localization is unreliable | |
[audio clip]Transcribe and summarize this meeting. | β’ Processes speech or audio input natively β’ supported by multimodal models like GPT-4o for transcription, analysis, and translation |
Table 10: Safety and Robustness
Securing an LLM application means assuming its prompts and the data it reads are adversarial. These techniques cover the layered defenses that matter most: keeping untrusted input from overriding your instructions, validating what the model emits, training models to refuse harmful requests, and testing your system the way an attacker would before it ships.
| Technique | Example | Description |
|---|---|---|
Use input handling, instruction delimiters, and privilege limits | Mitigates attacks where input tries to override developer instructions or exfiltrate data. OWASP ranks prompt injection as LLM01, the top LLM risk | |
Check output against a schema, encode it, or screen with a secondary LLM | Treats model output as untrusted before it reaches a browser, database, or shell, preventing XSS, SQL injection, or command execution | |
Model self-critiques against rules like Refuse harmful requests. Be helpful and honest. | A training method (RLAIF): the model critiques and revises its own answers against a set of principles, not a runtime word filter | |
Run adversarial probes such as injection and jailbreak attempts before launch | Adversarial testing to find vulnerabilities before attackers do. Expected by the NIST AI RMF and OWASP LLM Top 10 | |
Detect attempts to bypass safety via role-play, encoding, or indirection | Models trained to recognize and refuse disguised harmful requests, targeting the model's safety rules (distinct from injection) | |
Separate trusted instructions from untrusted external data using privilege boundaries | Prevents attackers from embedding hidden instructions in documents, emails, or tool outputs the model processes. The key risk for agentic and RAG systems |
Table 11: Emotion and Persona Techniques
These techniques shape how a model speaks and reasons by giving it a role, an audience, or an emotional frame. They mostly steer tone, depth, and perspective, and their effect on factual accuracy is far weaker and less reliable than popular advice suggests.
| Technique | Example | Description |
|---|---|---|
You are a Pulitzer Prize-winning journalist.Write a headline. | β’ Assigns specific expertise or identity β’ mainly shapes tone, depth, and style, and does not reliably boost factual accuracy | |
Summon three experts (security, UX, backend).Have them collaborate on a review. | β’ One model simulates multiple expert personas collaborating in a single self-collaboration β’ produces more thorough, multi-perspective outputs | |
This is very important to my career.Please give your best answer. | β’ Adds emotional stakes or urgency β’ reported gains in earlier studies, but effects are mixed and model-dependent, often weaker on frontier models | |
Put yourself in the reader's shoes.What would they find confusing? | β’ Two-stage perspective-taking: filter context to what a character knows, then answer from that view β’ improves reasoning about beliefs and supports more empathetic responses |
Table 12: Optimization and Automation
These methods move prompt work from hand-tuning to measured, repeatable engineering: tools that auto-generate and score prompts (APE, DSPy), ways to compare and version prompts in production (A/B testing, prompt versioning), a parameter-efficient training alternative (soft prompts), and an inference trick that reuses a repeated prefix to cut cost and latency (prompt caching).
| Method | Example | Description |
|---|---|---|
Generate prompt candidates β score on a dataset β select best performer | β’ A model proposes instruction candidates, which are then scored and filtered on a validation set β’ replaces manual trial-and-error | |
Define signatures β framework compiles and optimizes prompts from examples and a metric | β’ Declarative approach where prompts are compiled, then iteratively improved against a metric, not hand-written β’ discards variations that do not score better | |
Run variant A vs B on the same inputs β measure accuracy, latency, cost β deploy winner | β’ Empirical comparison to select the best prompt for production β’ needs enough samples for statistical confidence, since LLM outputs vary | |
Learn a few continuous embedding vectors prepended to the input while the model stays frozen | β’ Trains small learnable vectors, not readable text, leaving model weights frozen β’ parameter-efficient alternative to full fine-tuning | |
Track each prompt change as an immutable, identified version with eval metrics | β’ Manages prompt iterations in production β’ enables exact rollback, A/B testing, and regression tracking | |
Place static system instructions first β variable content last | β’ Providers reuse the computed prefix for an exact match, cutting cost up to ~90% and latency up to ~80% β’ caches the input prefix, never the response; supported by OpenAI, Anthropic, Google |
Table 13: Specialized Patterns and Emerging Techniques
These are newer or niche prompting patterns, several from single recent papers, that squeeze more reliability out of a model without touching its weights. They lean on tricks like self-reflection, voting across reasoning chains, picking complex examples, and even repeating the prompt, so treat the emerging ones as promising rather than settled and verify before production use.
| Pattern | Example | Description |
|---|---|---|
Review your answer. What could be improved?Revise β iterate | β’ Model self-critiques and writes a verbal reflection it stores in memory as context for the next attempt β’ reinforces the agent without any weight update or fine-tuning | |
Select few-shot examples with the most reasoning steps | β’ Prefers demonstrations with higher reasoning complexity (longer chains) β’ can also vote over the most complex chains at decoding; raises multi-step accuracy | |
Generate explanation tree β prune contradictory branches | β’ Builds an abductive, recursive tree of explanations β’ frames the answer as a satisfiability problem over their logical relations to find the most consistent one | |
Apply self-consistency to non-reasoning tasks (e.g., classification, extraction) | β’ Has the LLM itself pick the most consistent of several candidate answers β’ extends majority-voting benefits to free-form tasks where answers cannot be counted | |
What are the causes of inflation?What are the causes of inflation? | β’ Repeating the prompt twice gives a bidirectional-context effect in causal models, reported to help non-reasoning LLMs β’ doubles input token cost but adds no generated tokens or latency | |
Recurse on sub-steps β truncate context βvote across reasoning chains | β’ Combines recursive refinement, dynamic context truncation within a token budget, and majority voting β’ helps parameter-efficient models rival larger ones; voting can still fail under shared model bias |
Table 14: Prompting for Reasoning Models
Reasoning models such as OpenAI o1/o3, Claude with extended thinking, and DeepSeek-R1 think before they answer, so they reward concise goal statements over heavy step-by-step scaffolding. These techniques cover how to steer their hidden reasoning, tune its depth against cost and latency, and avoid instructions that older models needed but these models do not.
| Technique | Example | Description |
|---|---|---|
Solve for x where 3x + 7 = 22.Show the solution process and final result. | β’ State desired outcome clearly without prescribing steps β’ reasoning models (o1/o3/R1) perform best with concise goal statements | |
thinking: {type: "enabled", budget_tokens: 10000} | β’ Allocates a reasoning scratchpad for Claude models, drawn from max_tokens and at least 1024β’ model thinks step-by-step in a hidden block before producing the answer | |
reasoning_effort: "high" | β’ Adjusts how deeply the same model reasons before answering β’ "low" for simple tasks, "high" for complex problems; controls cost and latency | |
Do not add "think step by step" to o1/o3/R1 | β’ Reasoning models already reason internally β’ explicit CoT is redundant and can increase latency without benefit |
Table 15: Domain-Specific Applications
Prompt engineering plays out differently across common LLM tasks. Code and data work reward low temperature and strict structure, summarization and translation lean on dedicated techniques like Chain of Density and few-shot terminology, and grounded question answering depends on retrieval. Knowing the right tool and setting per task is what separates a reliable pipeline from a flaky one.
| Domain | Example | Description |
|---|---|---|
Write a Python function to merge two sorted lists. | β’ Produces runnable code, but output can look correct yet miss details like an import β’ favor low temperature and always test the result | |
Extract: name, email, phone from:"Contact John at john@ex.com" | β’ Pulls structured fields from unstructured text β’ structured outputs enforce a JSON schema at decode time, far more reliable than free-text or plain JSON mode | |
Summarize this article in 2 sentences. | β’ Condenses long text into key points β’ Chain of Density packs more entities into a fixed length and reduces lead bias | |
Write a haiku about autumn. | β’ Generates poetry, stories, or dialogue β’ higher temperature adds variety, but too high turns coherent text into nonsense | |
Translate to German: "Good morning" | β’ Converts text between languages β’ few-shot examples of approved terminology improve accuracy on domain terms | |
Based on: {document}, answer: Who founded the company? | β’ Provides a factual answer from context β’ RAG grounds answers in private or fresh sources, but is unneeded for general knowledge | |
Classify sentiment: "I loved this movie!" β positive | β’ Determines emotional tone β’ few-shot with diverse examples improves handling of sarcasm and mixed reviews |
Table 16: Anti-Patterns and Common Pitfalls
These are the prompt habits that quietly wreck output quality, run up cost, or produce confidently wrong answers. For each one, the fix is usually the opposite move: be specific, decompose, show examples, set limits, mark boundaries, ground recent facts with retrieval, and match sampling to the task. The last row is an evolving guideline, heavy step-by-step scaffolding helps weaker models but can over-constrain frontier ones.
| Pattern | Example | Description |
|---|---|---|
Tell me about AI. | β’ Lacks specificity β’ produces generic, unfocused output; always specify scope, audience, or format | |
Mixing 10 unrelated tasks in one prompt | β’ Splits the model's attention, so every task gets shallow output β’ better to chain or decompose into separate prompts | |
Zero-shot on nuanced classification | β’ Underperforms without demonstrations β’ 2-5 few-shot examples teach your exact criteria; more than that tends to plateau | |
No length constraint, leading to a 5000-word response | β’ Generates unnecessarily long outputs and runs up cost at scale β’ state the length in words; max-tokens only truncates, it does not shape length | |
Input: text here Output: more text (no clear boundary) | β’ Model confuses what to process vs. generate β’ use ### or ``` to separate sections | | |
What happened last week? (model trained months ago) | β’ Model cannot access real-time data and will confidently invent recent facts β’ ground it with RAG or tool use | |
Deterministic task with temperature=1.5 | β’ Excessive randomness where consistency is needed (temperature is not truthfulness) β’ tune temperature and top-p to the task | |
10-step procedural instructions for a frontier model on an open-ended task | β’ Over-constraining can hinder autonomous reasoning in frontier models β’ describe the desired result and let it choose the route; strict steps still suit procedural, schema-bound tasks |