Concept · 4 episodes

DPO

Definition

DPO (Direct Preference Optimization) trains a language model directly on preference pairs (a preferred and a rejected response to the same prompt) using a simple classification-style loss, without fitting a separate reward model or running an RL loop. It is much simpler than PPO-based RLHF, often matches it on benchmarks, and has become a default for preference tuning.
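
To make the definition concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the value `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, scalar inputs).

    Each argument is the summed log-probability of a full response
    given the prompt; beta scales how strongly the policy is allowed
    to drift from the reference model (0.1 is an assumed value).
    """
    # How much more (or less) likely each response is under the
    # policy than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), written in a numerically
    # stable form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

In practice this is computed on batched tensors with automatic differentiation, and the loss is averaged over pairs. The scalar version here is only meant to show where the reward model disappears: the "reward" is just the scaled log-ratio between the policy and the reference.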

Episodes covering this

173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison·22 min·Jun 25, 2026
172
One Bad Token Can Sink a Model's Math, And You Can Delete It
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Ko, Kang, Lee · Seoul National University·22 min·Jun 25, 2026
091
When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Roy, Parbhoo · SIRE·24 min·May 28, 2026
004
The Sycophancy Circuit That Survives Alignment Training
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Pandey · Georgia Institute of Technology·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model