LLM Guardrails and Safety Patterns Cheat Sheet

Updated 2026-05-18

LLM guardrails are runtime controls that validate, filter, and constrain large language model inputs and outputs against security, safety, and compliance policies before responses reach users or trigger downstream actions. Unlike traditional software guardrails that enforce deterministic rules, LLM guardrails must handle the non-deterministic, probabilistic nature of generative AI: detecting prompt injections, toxic content, hallucinations, PII leaks, and jailbreak attempts in natural language. As of 2026, guardrails have shifted from an optional safety layer to production infrastructure essential for enterprise AI applications, driven by regulatory requirements (GDPR, HIPAA, EU AI Act), security imperatives (OWASP LLM Top 10), and business risk (brand damage, compliance violations).

The key mental model: guardrails are defense-in-depth layers applied at input validation, model execution, output filtering, and retrieval, not a single checkpoint. Production systems typically combine 3–7 guardrail types in a layered architecture, trading added latency (5–150 ms overhead) for risk reduction.

Critical insight: no guardrail is perfect. Prompt injection remains unsolved in 2026, and adversarial attacks evolve faster than defenses. Effective safety requires treating guardrails as risk reduction layers, not hard security boundaries, backed by monitoring, red teaming, and incident response.

What This Cheat Sheet Covers

This topic spans 12 focused tables and 96 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Guardrail Rail Types and Execution Layers
Table 2: Core Detection Techniques and Approaches
Table 3: Prompt Injection and Jailbreak Defenses
Table 4: Content Safety and Toxicity Controls
Table 5: PII Detection and Data Protection
Table 6: Factuality and Hallucination Mitigation
Table 7: Framework and Platform Implementations
Table 8: Guardrails for RAG and Retrieval Systems
Table 9: Agent and Tool Execution Safety
Table 10: Observability, Monitoring, and Feedback Loops
Table 11: Bias Detection and Fairness Controls
Table 12: OWASP Top 10 LLM Vulnerabilities and Mitigations

Table 1: Guardrail Rail Types and Execution Layers

Guardrails operate at distinct pipeline stages—input validation runs before the LLM sees the request, output filtering runs after generation, retrieval guardrails inspect external documents, and dialogue rails shape conversational flow. Each layer addresses different threat vectors: input rails block malicious prompts, output rails catch toxic or hallucinated responses, retrieval rails prevent vector database poisoning, and dialogue rails enforce conversation boundaries. The layered approach provides defense-in-depth: if prompt injection bypasses input filters, output validation may still catch the malicious result. Production systems typically deploy 3–5 rail types simultaneously—the tradeoff is latency (each layer adds 10–50ms) versus coverage. Most enterprises start with input PII and output toxicity filters (highest ROI), then add prompt injection detection, RAG validation, and topic guardrails based on risk assessment. Architecture matters: running rails in parallel reduces latency but misses cross-layer dependencies; sequential execution catches more threats but increases response time.
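The sequential-versus-parallel tradeoff can be made concrete with a small sketch. It is illustrative only: the three rails (`redact_pii`, `check_injection`, `check_topic`), their simulated latencies, and the regex and phrase checks are hypothetical stand-ins, not any framework's API. Each rail returns a pass/fail flag and a possibly rewritten text.

```python
import asyncio
import re
import time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simplistic, illustration only


async def redact_pii(text: str) -> tuple[bool, str]:
    await asyncio.sleep(0.02)  # simulated 20 ms rail
    return True, EMAIL.sub("[EMAIL]", text)


async def check_injection(text: str) -> tuple[bool, str]:
    await asyncio.sleep(0.03)  # simulated 30 ms rail
    return "ignore previous instructions" not in text.lower(), text


async def check_topic(text: str) -> tuple[bool, str]:
    await asyncio.sleep(0.01)  # simulated 10 ms rail; always passes here
    return True, text


RAILS = [redact_pii, check_injection, check_topic]


async def run_sequential(text: str) -> tuple[bool, str]:
    """Each rail sees the previous rail's output; latencies add up."""
    for rail in RAILS:
        ok, text = await rail(text)
        if not ok:
            return False, text  # short-circuit on the first failure
    return True, text


async def run_parallel(text: str) -> tuple[bool, str]:
    """Every rail sees the original text; latency is roughly the slowest rail."""
    results = await asyncio.gather(*(rail(text) for rail in RAILS))
    # No rail's rewrite feeds another rail, so the original text is returned.
    return all(ok for ok, _ in results), text


async def timed(runner, text: str) -> tuple[bool, str, float]:
    start = time.perf_counter()
    ok, out = await runner(text)
    return ok, out, time.perf_counter() - start
```

In this sketch the sequential runner hands the redacted text to later rails, which is one form of the cross-layer dependency that the parallel runner gives up in exchange for lower latency. A real deployment might mix the two, for example running independent detectors in parallel and applying rewriting rails afterwards.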

Type	Example	Description
Input Guardrails (Pre-LLM)	`if detect_injection(user_msg):` `return "Blocked"`	Run before the LLM sees the request — validate input syntax, detect prompt injections, redact PII, check topic boundaries, enforce content policies, and sanitize user queries to prevent manipulation or data leakage before inference.
Output Guardrails (Post-LLM)	`if toxic_score > 0.8:` `response = fallback_msg`	Run after generation completes — filter toxic language, detect hallucinations, verify factual grounding, redact sensitive data, enforce structured schemas, and validate citation accuracy before the response reaches users or downstream systems.
Retrieval Guardrails (RAG-specific)	`if relevance_score < 0.5:` `reject_document()`	Validate external documents before context injection — check retrieval relevance, detect corpus poisoning, verify source trustworthiness, enforce context length limits, and prevent indirect prompt injection via RAG document manipulation.
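The three rail types in the table can be chained into one request path. The following is a minimal sketch under stated assumptions: the regex patterns, the word blocklist standing in for a toxicity classifier, the `RailResult` type, and the `llm` callable are all hypothetical placeholders. The 0.5 relevance and 0.8 toxicity thresholds are taken from the table's examples.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical patterns; real systems use trained detectors, not regexes alone.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKLIST = {"idiot", "stupid"}  # stand-in for a toxicity classifier
FALLBACK_MSG = "Sorry, I can't help with that."


@dataclass
class RailResult:
    allowed: bool
    text: str
    reasons: list[str] = field(default_factory=list)


def input_rail(user_msg: str) -> RailResult:
    """Pre-LLM: block likely injections, then redact email addresses."""
    if any(p.search(user_msg) for p in INJECTION_PATTERNS):
        return RailResult(False, "Blocked", ["prompt_injection"])
    redacted = EMAIL_PATTERN.sub("[EMAIL]", user_msg)
    reasons = ["pii_redacted"] if redacted != user_msg else []
    return RailResult(True, redacted, reasons)


def retrieval_rail(docs: list[tuple[str, float]], min_relevance: float = 0.5) -> list[str]:
    """RAG: drop low-relevance documents and ones carrying injected instructions."""
    return [
        text
        for text, score in docs
        if score >= min_relevance
        and not any(p.search(text) for p in INJECTION_PATTERNS)
    ]


def toxicity_score(text: str) -> float:
    """Toy scorer: 1.0 if any blocklisted word appears, otherwise 0.0."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return 1.0 if words & BLOCKLIST else 0.0


def output_rail(response: str, threshold: float = 0.8) -> RailResult:
    """Post-LLM: replace responses scored above the toxicity threshold."""
    if toxicity_score(response) > threshold:
        return RailResult(False, FALLBACK_MSG, ["toxic_output"])
    return RailResult(True, response)


def guarded_call(
    user_msg: str,
    docs: list[tuple[str, float]],
    llm: Callable[[str, list[str]], str],
) -> RailResult:
    """Run input, retrieval, and output rails around a single model call."""
    checked = input_rail(user_msg)
    if not checked.allowed:
        return checked  # the model is never called for blocked input
    context = retrieval_rail(docs)
    result = output_rail(llm(checked.text, context))
    result.reasons = checked.reasons + result.reasons
    return result
```

Passing the model in as a plain callable keeps the rails testable without a live LLM. Because the output rail runs regardless of what the input rail let through, an injection that slips past the input patterns can still be caught if it produces a flagged response.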
