DPO (Direct Preference Optimization)
- Definition
- A training technique that aligns language models with human preferences by optimizing directly on pairs of preferred and rejected responses, without training a separate reward model. DPO simplifies the RLHF pipeline while achieving comparable alignment quality.
- Why it matters
- DPO matters because it dramatically reduces the complexity and cost of alignment. Traditional RLHF requires training a separate reward model, running PPO (a reinforcement learning algorithm), and managing a complex, brittle training pipeline. DPO collapses this into a single supervised learning step. For smaller labs and companies fine-tuning their own models, DPO makes alignment accessible without the massive infrastructure investment that RLHF demands. This democratization of alignment is strategically important: it means any team with preference data can align a model, not just the well-funded labs with RL expertise.
- In practice
- The DPO paper from Stanford (Rafailov et al., 2023) showed that their approach matched or exceeded RLHF performance on summarization and dialogue tasks with significantly less compute. Meta used DPO (alongside other techniques) for Llama 3 alignment. Mistral and many open-source models adopted DPO as their primary alignment approach. The technique spawned variants like IPO (Identity Preference Optimization) and KTO (Kahneman-Tversky Optimization) that handle different data formats. By 2025, DPO and its variants had become the default alignment approach for the open-source community, while major labs continued using a combination of RLHF and DPO.
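The "single supervised learning step" above boils down to one loss over preference pairs: push the trained policy to favor the chosen response over the rejected one, relative to a frozen reference model. A minimal sketch in plain Python, where the function name, argument names, and β value are illustrative rather than taken from any particular library:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) rewritten stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# The loss shrinks as the policy favors the chosen response more
# strongly than the reference model does.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # margins equal
improved = dpo_loss(-9.5, -12.5, -10.0, -12.0)   # policy shifted toward chosen
```

In a real training run the log-probabilities come from summing per-token log-probs over each response, and the loss is averaged over a batch; β controls how far the policy may drift from the reference model.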
Related terms
Reinforcement Learning from Human Feedback (RLHF)
A training technique where human raters rank model outputs, and the model learns to prefer higher-ranked responses. RLHF is what makes AI assistants helpful, harmless, and conversational rather than just autocomplete.
Alignment
The challenge of making an AI system's goals and behaviors match human intentions and values. Misalignment risk grows as models become more capable, making this a top priority for safety teams.
SFT (Supervised Fine-Tuning)
The process of training a pre-trained model on a curated dataset of input-output examples that demonstrate the desired behavior. SFT is typically the first alignment step after pre-training, teaching the model to follow instructions and produce useful responses.
Constitutional AI
A training methodology developed by Anthropic where an AI model evaluates its own outputs against a written set of principles (a 'constitution') and self-corrects, reducing reliance on human feedback for safety alignment.