Direct Preference Optimization (DPO) simplifies LLM alignment by eliminating the need for separate reward models, directly optimizing language models on preference data. This cheat sheet covers DPO fundamentals, variants (ORPO, SimPO, IPO, KTO, CPO), implementation with TRL, preference data collection, comparison with RLHF, evaluation metrics, failure modes, and production deployment best practices.
What This Cheat Sheet Covers
This topic spans 12 focused tables and 197 indexed concepts. A table-by-table outline follows, from foundational concepts through advanced details.
Table 1: DPO Algorithm Fundamentals
The whole DPO trick lives in one move: reparameterize the reward as a function of the policy itself, so you never train a separate reward model. Everything else here (the Bradley-Terry loss, the frozen reference checkpoint, the beta knob that controls KL drift) follows from that reparameterization, and together they explain why DPO needs roughly half the compute of RLHF and is typically more stable to train.
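The reparameterization described above can be written out as a short worked equation. This is a sketch using the standard DPO notation, where Z(x) is a per-prompt normalizer (a symbol not used elsewhere in this sheet):

```latex
% Implicit reward defined by the policy and the frozen reference
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference probability: Z(x) cancels in the difference
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```

Because the normalizer drops out of the difference, the preference probability depends only on policy and reference log probabilities, which is why no separate reward model is needed.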
| Technique | Example/Value | Description |
|---|---|---|
| Core Principle | Reward-free alignment | DPO reparameterizes the reward as a function of the policy, eliminating explicit reward model training; the LLM is optimized directly on preference pairs |
| Loss Function | Bradley-Terry model | DPO loss: `-log(σ(β * log(π_θ(y_w\|x) / π_ref(y_w\|x)) - β * log(π_θ(y_l\|x) / π_ref(y_l\|x))))`, where σ is the sigmoid and β sets the strength of the KL constraint (larger β keeps the policy closer to the reference) |
| Reference Model | Frozen SFT checkpoint | Maintains a fixed reference policy (typically the SFT model) to compute log probability ratios; prevents excessive deviation from the base model |
| KL Regularization | Built-in via β parameter | Implicit KL constraint through the log ratio term; keeps the policy close to the reference model to avoid reward hacking and distribution collapse |
| Training Objective | Maximize preference gap | Increases the log probability of chosen responses (y_w) while decreasing the log probability of rejected responses (y_l), both measured relative to the reference model |
| Computational Efficiency | Roughly 50% less compute than RLHF (approximate, setup-dependent) | Requires forward passes through the policy and reference models and a backward pass through the policy only; no reward model training or RL optimization loop needed |
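
The loss in the table can be sketched for a single preference pair in plain Python. This is a minimal illustration, not the TRL implementation: the function names `log_sigmoid` and `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for the example, and each input is assumed to be the summed token log probability of a full response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default for illustration
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log probabilities."""
    # log(pi_theta(y|x) / pi_ref(y|x)) for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Beta-scaled preference margin between the two implicit rewards
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, so the loss equals log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: loss drops below log(2).
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

In practice the loss is averaged over a batch and the reference log probabilities are computed without gradients, since the reference model is frozen.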