Definition
A way to train a model to prefer better answers without the full reinforcement learning setup.
Direct Preference Optimization, a method that fine-tunes a policy from preference pairs by directly optimizing a classification-style loss without an explicit reward model.