Concept · 1 episode(s)

DPO

Definition

DPO (Direct Preference Optimization) fine-tunes a model directly on preference pairs (a preferred and a rejected response to the same prompt) without fitting a separate reward model or running an RL loop. It is considerably simpler to implement than RLHF, often matches it on benchmarks, and has become a common default for preference tuning.
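
To make "directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss. It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference copy; the function names and the default beta of 0.1 are illustrative choices, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; controls how far the policy may drift
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much the policy has raised the chosen
    response relative to the reference against how much it has
    raised the rejected one.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) is the same as softplus(-m)
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: no preference signal yet.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss is an ordinary binary classification loss on the margin, which is why no reward model or RL sampling loop is needed: gradients flow straight through the policy's log-probabilities.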

Episodes covering this