Direct Preference Optimization (DPO) simplifies LLM alignment by eliminating the need for separate reward models, directly optimizing language models on preference data. This cheat sheet covers DPO fundamentals, variants (ORPO, SimPO, IPO, KTO, CPO), implementation with TRL, preference data collection, comparison with RLHF, evaluation metrics, failure modes, and production deployment best practices.
What This Cheat Sheet Covers
This cheat sheet spans 12 focused tables and 197 indexed concepts. The table-by-table outline follows, running from foundational concepts through advanced details.
- Table 1: DPO Algorithm Fundamentals
- Table 2: DPO vs RLHF Comparison
- Table 3: Preference Data Collection & Formatting
- Table 4: DPO Variants (ORPO/SimPO/IPO/KTO/CPO)
- Table 5: Preference Pair Construction
- Table 6: Loss Functions & Hyperparameters
- Table 7: Reward Models & Bradley-Terry
- Table 8: Evaluation Metrics & Benchmarks
- Table 9: Failure Modes & Mitigation Strategies
- Table 10: Training Best Practices & Optimization
- Table 11: Implementation with TRL & Tools
- Table 12: Production Deployment & Advanced Topics
Table 1: DPO Algorithm Fundamentals
| Technique | Example/Value | Description |
|---|---|---|
| Core Principle | Reward-free alignment | DPO reparameterizes the reward as a function of the policy, eliminating the need to train an explicit reward model; it optimizes the LLM directly on preference pairs |
| Loss Function | Bradley-Terry model | DPO loss: $-\log \sigma\big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\big)$, where σ is the sigmoid, y_w is the chosen response, y_l is the rejected response, and β controls the strength of the implicit KL penalty toward the reference model |
| Reference Model | Frozen SFT checkpoint | Maintains a fixed reference policy (typically the SFT model) used to compute log probability ratios; anchors the policy so it does not deviate excessively from the base model |
| KL Regularization | Built-in via β parameter | Implicit KL constraint through the log ratio term; keeps the policy close to the reference model to avoid reward hacking and distribution collapse |
| Training Objective | Maximize preference gap | Widens the margin between the log probability ratios of chosen (y_w) and rejected (y_l) responses, each measured relative to the reference model; the loss depends only on this margin, so the absolute log probability of the chosen response is not guaranteed to rise |
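| Computational Efficiency | Roughly 50% less compute than RLHF (setup-dependent) | Requires forward and backward passes through the policy and forward-only passes through the frozen reference model; no reward model training, on-policy sampling, or RL optimization loop is needed |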
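
The loss in the table above can be sketched for a single preference pair in plain Python. This is a minimal illustration, not the TRL implementation: the function names (`log_sigmoid`, `dpo_loss`), the example log probabilities, and the default `beta=0.1` are assumptions chosen for the example, and each log probability is assumed to be the summed token log probability of a full response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, an assumption of this sketch
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log probabilities."""
    # Log ratios of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss depends only on the beta-scaled margin between the two ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Hypothetical log probabilities, for illustration only.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)    # policy equals reference
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)   # chosen up, rejected down
loss_both_down = dpo_loss(-13.0, -20.0, -12.0, -15.0)  # both down, margin still wider
```

When the policy equals the reference model, the margin is zero and the loss is log 2, which is the expected starting value of a DPO run. The third call shows that the loss can fall even when the chosen response's log probability drops, as long as the rejected response's log probability drops further.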