Direct Preference Optimization (DPO) and Alignment Methods Cheat Sheet

Direct Preference Optimization (DPO) simplifies LLM alignment by eliminating the need for separate reward models, directly optimizing language models on preference data. This cheat sheet covers DPO fundamentals, variants (ORPO, SimPO, IPO, KTO, CPO), implementation with TRL, preference data collection, comparison with RLHF, evaluation metrics, failure modes, and production deployment best practices.

What This Cheat Sheet Covers

This topic spans 12 focused tables and 197 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: DPO Algorithm Fundamentals
Table 2: DPO vs RLHF Comparison
Table 3: Preference Data Collection & Formatting
Table 4: DPO Variants (ORPO/SimPO/IPO/KTO/CPO)
Table 5: Preference Pair Construction
Table 6: Loss Functions & Hyperparameters
Table 7: Reward Models & Bradley-Terry
Table 8: Evaluation Metrics & Benchmarks
Table 9: Failure Modes & Mitigation Strategies
Table 10: Training Best Practices & Optimization
Table 11: Implementation with TRL & Tools
Table 12: Production Deployment & Advanced Topics

Table 1: DPO Algorithm Fundamentals

The whole DPO trick lives in one move: reparameterize the reward as a function of the policy itself, so you never train a separate reward model. Everything else here (the Bradley-Terry loss, the frozen reference checkpoint, the β knob that controls KL drift) follows from that reparameterization. Together they explain why DPO is often quoted at roughly half the compute of RLHF and is typically more stable to train, although the exact savings depend on the setup.
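The reparameterization can be sketched in a few lines. This follows the derivation as it is usually presented; the symbols r (reward), Z(x) (partition function), and π* (optimal policy) do not appear elsewhere in this section and are introduced here only for illustration.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form optimum, with Z(x) normalizing over responses y
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Solve for the reward in terms of the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry depends only on reward differences, so \beta \log Z(x) cancels
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

% Substituting \pi_\theta for \pi^{*} gives the DPO loss
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\Big(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
```

Because Z(x) depends only on the prompt, it drops out of the preference probability, which is why no explicit reward model or partition function ever has to be computed.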

Technique	Example/Value	Description
Core Principle	Reward-free alignment	• DPO reparameterizes the reward as a function of the policy, eliminating the need for explicit reward model training • directly optimizes the LLM on preference pairs
Loss Function	Bradley-Terry model	• DPO loss: `-log(σ(β * log(π_θ(y_w | x) / π_ref(y_w | x)) - β * log(π_θ(y_l | x) / π_ref(y_l | x))))` • σ is the sigmoid, y_w and y_l are the chosen and rejected responses, and β sets the strength of the implicit KL penalty
Reference Model	Frozen SFT checkpoint	• Maintains a fixed reference policy (typically the SFT model) to compute log probability ratios • prevents excessive deviation from the base model
KL Regularization	Built-in via β parameter	• Implicit KL constraint through the log ratio term • a larger β keeps the policy closer to the reference model, which helps avoid reward hacking and distribution collapse
Training Objective	Maximize preference gap	Increases the log probability of chosen responses (y_w) and decreases the log probability of rejected responses (y_l), both measured relative to the reference model
Computational Efficiency	~50% less compute than RLHF (approximate; varies by setup)	• Each step needs a forward pass through the policy and the frozen reference model, plus a backward pass through the policy only • no reward model training or RL optimization loop needed
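The loss row above can be sketched for a single preference pair in plain Python. This is a minimal illustration, not TRL's implementation: the function names `log_sigmoid` and `dpo_loss` are hypothetical, and each input is assumed to be the summed token log probability of a full response given its prompt.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, chosen only for this sketch
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log pi(y | x) summed over the response tokens.
    """
    # Log ratios of policy to frozen reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between chosen and rejected, scaled by beta
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, so the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

At initialization the policy equals the reference, so every pair starts at a loss of log 2 (about 0.693). The loss falls only as the policy widens the chosen-versus-rejected gap beyond what the reference already assigns.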
