Constitutional AI aligns large language models with human values by training them to follow a written set of ethical principles (a "constitution") rather than relying solely on extensive human feedback. The approach combines reinforcement learning from AI feedback (RLAIF) with self-critique, so that a model iteratively improves its own outputs against harmlessness, helpfulness, and honesty criteria. The field has expanded rapidly beyond rule-based constitutions: Anthropic's 2026 Claude constitution embraces reason-based alignment that teaches why rules matter rather than prescribing specific behaviors, while OpenAI's Deliberative Alignment teaches models the text of safety specifications so they can reason over it explicitly. As AI systems grow more capable and agentic, taking multi-step autonomous actions in the real world, alignment matters not just for chat responses but for entire pipelines, where misaligned behaviors such as self-preservation can cause serious harm.
What This Cheat Sheet Covers
This cheat sheet spans 13 focused tables and 125 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Foundational Alignment Approaches
Understanding the main paradigms for aligning LLMs helps practitioners choose the right training strategy. The landscape has evolved from purely human-feedback-driven RLHF toward hybrid approaches that use AI-generated preferences, explicit constitutions, and reasoning-based specification encoding.
| Method | Example | Description |
|---|---|---|
| Constitutional AI (CAI) | Self-critique → revision loop based on principles | • AI system learns to align with a written constitution through supervised learning and RLAIF • combines critique and revision phases for harmlessness training |
| Reinforcement Learning from Human Feedback (RLHF) | Train reward model on human preferences → optimize policy with PPO | • Three-stage process: supervised fine-tuning (SFT), reward model training on preference pairs, policy optimization via RL • industry standard underlying ChatGPT and early Claude |
| Reinforcement Learning from AI Feedback (RLAIF) | AI generates preference labels → train reward model | • Replaces human labelers with AI-generated preferences • scales more efficiently than RLHF while achieving comparable performance |
| Deliberative Alignment | Model recalls policy spec → reasons through it → answers | • Directly teaches reasoning models the text of safety specifications • explicitly reasons over policies at inference time; used for OpenAI o-series models • reduces both jailbreak success and overrefusal rates |
| Supervised Fine-Tuning (SFT) | Fine-tune on curated prompt-completion pairs | • Initial alignment step before RL training • teaches model basic desired behaviors through demonstration • foundation for subsequent RLHF/RLAIF |
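
The self-critique → revision loop in the table can be sketched in a few lines. This is a minimal illustration, not the implementation used by any lab: the `Model` type, the `CRITIQUE`/`REVISE` prompt templates, the `Revision` record, and `toy_model` are all hypothetical stand-ins so the example runs without an API. In a real pipeline, `model` would wrap an LLM call and the resulting (prompt, revised) pairs would feed the supervised fine-tuning stage.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt text in, completion text out.
Model = Callable[[str], str]


@dataclass
class Revision:
    """One training example produced by the loop (hypothetical record type)."""
    prompt: str
    initial: str
    revised: str
    critiques: List[str] = field(default_factory=list)


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str], rounds: int = 1) -> Revision:
    """Draft a response, then critique and revise it once per principle per round."""
    initial = model(prompt)
    response = initial
    critiques: List[str] = []
    for _ in range(rounds):
        for principle in principles:
            # Ask the model to critique its own response against one principle.
            critique = model(
                f"CRITIQUE\nPrinciple: {principle}\nResponse: {response}"
            )
            critiques.append(critique)
            # Ask the model to rewrite the response in light of that critique.
            response = model(
                f"REVISE\nCritique: {critique}\nResponse: {response}"
            )
    return Revision(prompt=prompt, initial=initial,
                    revised=response, critiques=critiques)


def toy_model(prompt: str) -> str:
    """Canned replies keyed on the template prefix; a placeholder for a real LLM."""
    if prompt.startswith("CRITIQUE"):
        return "The response should decline harmful requests."
    if prompt.startswith("REVISE"):
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is how to do it."


example = critique_and_revise(toy_model, "How do I pick a lock?", ["Be harmless."])
print(example.initial)
print(example.revised)
```

Keeping both `initial` and `revised` is a design choice of this sketch: the revised text can serve as an SFT target, while the pair can be reused as a candidate comparison when generating AI preference labels.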