FPO · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A training method that adds a missing gradient term to RLHF so that the policy stops pushing the reward model toward distorted preferences each time the reward model is retrained.

As stated in the literature

Foresighted Policy Optimization, a Stackelberg-game-derived RLHF variant that adds an influence-function-based regularizer accounting for how the policy's outputs reshape the reward model upon retraining, reducing alignment collapse.

Also called: Foresighted Policy Optimization

Why it matters: It addresses a known failure mode in which RLHF's reward model and policy drift into a degenerate equilibrium, degrading alignment in ways that are not immediately visible.

For example, in standard RLHF the policy can learn to game the reward model into preferring increasingly strange outputs. FPO adds a term that says, in effect, 'don't move in directions that distort the reward model when it retrains.'
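
The feedback loop described above can be made concrete with a deliberately tiny toy. The sketch below is not the FPO algorithm from the paper: the quadratic losses, the scalar parameters, and every name and constant (`PHI_HUMAN`, `BETA`, `KL`, `retrain_reward_model`, `influence`, `train`) are assumptions made up for illustration. It only shows the general idea of using an influence-style derivative (how the retrained reward model moves when the policy moves) to penalize reward-model drift.

```python
# Toy illustration only: the losses, names, and constants below are hypothetical
# and are not taken from the FPO paper.

PHI_HUMAN = 1.0  # reward weight that human labels alone would produce
BETA = 0.5       # how strongly policy outputs leak into reward-model retraining
KL = 0.5         # strength of the pull back toward the reference policy (theta = 0)


def retrain_reward_model(theta):
    """Minimizer over phi of
    L(phi, theta) = 0.5*(phi - PHI_HUMAN)**2 + 0.5*BETA*(phi - theta)**2."""
    return (PHI_HUMAN + BETA * theta) / (1.0 + BETA)


def influence(theta):
    """d phi*/d theta via the implicit function theorem:
    -(d2L/dphi2)^-1 * d2L/(dphi dtheta). Constant here because L is quadratic."""
    hessian = 1.0 + BETA  # d2L/dphi2
    cross = -BETA         # d2L/(dphi dtheta)
    return -cross / hessian


def train(lam, steps=2000, lr=0.5):
    """Alternate reward-model retraining with one policy gradient step.

    lam = 0 is the naive update that treats the reward model as fixed.
    lam > 0 also descends the drift penalty 0.5*lam*(phi*(theta) - PHI_HUMAN)**2.
    """
    theta = 0.0
    for _ in range(steps):
        phi = retrain_reward_model(theta)
        # Gradient of the policy objective phi*theta - 0.5*KL*theta**2, phi held fixed.
        naive_grad = phi - KL * theta
        # Gradient of the drift penalty (per unit lam), by the chain rule through retraining.
        drift_grad = (phi - PHI_HUMAN) * influence(theta)
        theta += lr * (naive_grad - lam * drift_grad)
    return theta, retrain_reward_model(theta)


naive_theta, naive_phi = train(lam=0.0)
fore_theta, fore_phi = train(lam=4.0)
print(f"naive:       theta={naive_theta:.3f}, phi={naive_phi:.3f}")
print(f"foresighted: theta={fore_theta:.3f}, phi={fore_phi:.3f}")
```

In this toy, the naive run settles with the reward weight at about 2.0, double the human value of 1.0, because each retraining step absorbs the policy's own outputs. The penalized run settles near 1.27. The specific numbers carry no meaning beyond the toy. The point is only that a term built from the retraining derivative can damp the drift.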

Heard on the show

“They call this FPO — Foresighted Policy Optimization.”

Episode 025 — The Missing Gradient Term That Predicts Sycophancy in RLHF

Mentioned in 1 episode

Episode 025: The Missing Gradient Term That Predicts Sycophancy in RLHF

Related terms

alignment collapse · policy · regularization · reward model · RLHF · Stackelberg game