DPO

Definition

Plain language

A way to train a model to prefer better answers without the full reinforcement learning setup.

As stated in the literature

Direct Preference Optimization, a method that fine-tunes a policy from preference pairs by directly optimizing a classification-style loss without an explicit reward model.
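
For readers who want the formula behind "classification-style loss", the standard DPO objective can be written as below. The symbols are not defined in this entry and follow the usual convention: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model, \beta is a temperature-like hyperparameter, \sigma is the logistic sigmoid, and (x, y_w, y_l) is a prompt with its preferred and dispreferred responses drawn from the preference dataset \mathcal{D}.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% Implicit reward: the scaled log-ratio of policy to reference.
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The term inside the sigmoid is the difference of two implicit rewards, which is why the loss reads as binary classification of which response was preferred.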

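Why it matters: It reduces preference fine-tuning to a single supervised-style training stage, with no separate reward model to fit and no on-policy sampling loop. That makes it simpler and typically more stable to run than full RLHF, which is why so many open-weight models now use it or a variant.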

For example, given pairs of model responses where humans preferred A to B, DPO directly nudges the model to make A more likely and B less likely without ever training a reward model.
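
The nudge described above can be sketched for a single preference pair as follows. This is a minimal illustration, not a training loop: the function names `dpo_loss` and `softplus`, the default `beta=0.1`, and the choice to pass summed sequence log-probabilities as plain floats are all assumptions made for the example. In practice these values would come from a trainable policy and a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model (hypothetical inputs).
    """
    # Implicit rewards: beta times the policy-to-reference log-ratio.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is zero, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised A (chosen) and lowered B (rejected): loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Minimizing this quantity by gradient descent raises the log-probability of A relative to the reference and lowers that of B, with no reward model trained along the way.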

Heard on the show

“The "secretly a reward model" lineage — the DPO line of work — showed that the log-ratio recovers the reward.”

Episode 173 — The Free Step-Level Grader Hiding in Every RL Training Run

Mentioned in 5 episodes

173 · The Free Step-Level Grader Hiding in Every RL Training Run
172 · One Bad Token Can Sink a Model's Math, And You Can Delete It
091 · When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
008 · Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
004 · The Sycophancy Circuit That Survives Alignment Training

Related concepts

DPO Post-Training

Related terms

loss · policy · reward model