Definition
DPO (Direct Preference Optimization) trains a model directly on preference pairs without explicitly fitting a separate reward model or running RL. It’s much simpler than RLHF, often matches it on benchmarks, and has become a default for preference tuning.