Constitutional AI and Alignment Cheat Sheet

Updated 2026-03-17

Constitutional AI aligns large language models with human values by training them to follow an explicit set of written ethical principles (a "constitution") rather than relying solely on extensive human feedback. The approach combines self-critique and revision with reinforcement learning from AI feedback (RLAIF), so a model can iteratively improve against harmlessness, helpfulness, and honesty criteria. It addresses the scalability limits of traditional human feedback while keeping the guiding principles explicit and open to inspection. As AI systems grow more capable, constitutional alignment becomes more important for keeping them safe, interpretable, and aligned with societal values, even where their capabilities exceed what humans can directly oversee.
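The self-critique mechanism described above can be sketched as a short loop. This is a minimal illustration, not the published training pipeline: the two principles, the prompt templates, and the `toy_generate` stand-in for a language model call are all hypothetical and exist only to make the control flow concrete.

```python
from typing import Callable, List

# Hypothetical principles, written for illustration only.
CONSTITUTION: List[str] = [
    "Avoid content that could help someone cause harm.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 1,
) -> str:
    """Draft a response, then critique and revise it once per principle per round."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in principles:
            # Ask the model to critique its own response against one principle.
            critique = generate(
                f"Critique this response against the principle: {principle}\n"
                f"Prompt: {prompt}\nResponse: {response}"
            )
            # Ask the model to rewrite the response in light of that critique.
            response = generate(
                "Rewrite the response to address the critique.\n"
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
            )
    return response


def toy_generate(text: str) -> str:
    """Stand-in for a real model call (assumption: any str -> str callable works)."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "A more careful answer."
    return "A first-draft answer."


final = critique_and_revise("How do I stay safe online?", toy_generate, CONSTITUTION)
print(final)
```

In a supervised phase, the revised responses produced by a loop like this would typically be collected as fine-tuning targets; swapping `toy_generate` for a real model client is left as an assumption of the sketch.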

What This Cheat Sheet Covers

This topic spans 12 focused tables and 96 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Foundational Alignment Approaches
Table 2: HHH Principles and Safety Objectives
Table 3: Preference Learning and Reward Modeling
Table 4: Training Optimization Algorithms
Table 5: Self-Critique and Iterative Refinement
Table 6: Red Teaming and Safety Testing
Table 7: Alignment Attacks and Vulnerabilities
Table 8: Interpretability and Monitoring
Table 9: Governance and Evaluation Frameworks
Table 10: Advanced Training Techniques
Table 11: Evaluation Metrics and Benchmarks
Table 12: Emerging Safety Challenges

Table 1: Foundational Alignment Approaches

Method	Example	Description
Constitutional AI (CAI)	Self-critique → revision loop based on principles	• AI system learns to align with a written constitution through supervised learning and RLAIF • combines critique and revision phases for harmlessness training
Reinforcement Learning from Human Feedback (RLHF)	Train reward model on human preferences → optimize policy with PPO	• Three-stage process: supervised fine-tuning (SFT), reward model training on preference pairs, policy optimization using RL • widely used in practice, including in systems such as ChatGPT and Claude
Reinforcement Learning from AI Feedback (RLAIF)	AI generates preference labels → train reward model	• Replaces human labelers with AI-generated preferences • scales more readily than RLHF, with comparable performance reported on some tasks • reduces annotation costs
Supervised Fine-Tuning (SFT)	Fine-tune on curated prompt-completion pairs	• Initial alignment step before RL training • teaches model basic desired behaviors through demonstration • foundation for subsequent RLHF/RLAIF
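The "reward model training on preference pairs" step in the RLHF and RLAIF rows is commonly framed with a pairwise (Bradley-Terry style) objective: the loss is the negative log-probability that the preferred response scores higher than the rejected one. The sketch below assumes that framing and uses plain floats in place of a neural reward model; the function names are illustrative and not taken from any library.

```python
import math
from typing import List, Tuple


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: List[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_reward, rejected_reward) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward scores for three preference pairs.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(mean_preference_loss(example_pairs))
```

The same objective applies whether the preference labels come from human annotators (RLHF) or from an AI labeler (RLAIF); only the source of the `(chosen, rejected)` pairs changes. The loss shrinks as the reward model ranks the preferred response further above the rejected one.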
