Glossary · Term

FPO

Definition

A training method that adds a missing gradient term to RLHF, so the policy accounts for how its own outputs distort the reward model once that model is retrained.

Foresighted Policy Optimization is an RLHF variant derived from a Stackelberg game. It adds an influence-function-based regularizer that accounts for how the policy's outputs reshape the reward model upon retraining, with the aim of reducing alignment collapse.
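
As a rough illustration of where such a term can come from, here is a generic Stackelberg (bilevel) sketch. The notation is an assumption made for this entry, not taken from the episode: \theta are the policy parameters, \phi are the reward model parameters, J is the policy objective, L_RM is the reward model's training loss on data generated by the policy, and H is the Hessian of that loss, assumed invertible.

```latex
% Follower: the reward model is retrained on data produced by the policy.
\phi^*(\theta) = \arg\min_{\phi} L_{\mathrm{RM}}(\phi, \theta)

% Leader: total derivative of the policy objective through the retrained reward model.
\frac{d}{d\theta} J\bigl(\theta, \phi^*(\theta)\bigr)
  = \nabla_{\theta} J
  + \left( \frac{\partial \phi^*}{\partial \theta} \right)^{\!\top} \nabla_{\phi} J

% Implicit function theorem applied to \nabla_{\phi} L_{\mathrm{RM}}(\phi^*(\theta), \theta) = 0.
\frac{\partial \phi^*}{\partial \theta}
  = - H^{-1} \, \nabla^2_{\phi\theta} L_{\mathrm{RM}},
\qquad H = \nabla^2_{\phi\phi} L_{\mathrm{RM}}
```

Under this assumed formulation, a training loop that treats \phi as fixed keeps only the first term of the total derivative. The second term, which involves the inverse Hessian in the same way influence functions do, is the kind of contribution the definition above calls the missing gradient term. The exact objective and regularizer used by FPO may differ from this sketch.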

Also called: Foresighted Policy Optimization

Mentioned in 1 episode

  1. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF