Definition
A training method that adds a missing gradient term to RLHF so the policy stops degrading the reward model when the reward model is retrained.
Foresighted Policy Optimization is an RLHF variant derived from a Stackelberg game. It adds an influence-function-based regularizer that accounts for how the policy's outputs reshape the reward model when it is retrained, which reduces alignment collapse.
Also called: Foresighted Policy Optimization
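
The "missing gradient term" in the definition can be read through the generic Stackelberg (leader-follower) total derivative. The sketch below is an illustration under assumed notation, not necessarily the exact formulation used by the method: the symbols θ (policy parameters), φ (reward-model parameters), J (the policy's objective), and L (the reward model's training loss) are all assumptions introduced here.

```latex
% Assumed notation (not from the source):
%   \theta : policy parameters (leader)
%   \phi   : reward-model parameters (follower)
%   J      : policy objective, L : reward-model training loss
\begin{align}
  \phi^{*}(\theta) &= \arg\min_{\phi} L(\phi, \theta)
    && \text{reward model retrained on policy-dependent data} \\
  \frac{d}{d\theta} J\bigl(\theta, \phi^{*}(\theta)\bigr)
    &= \nabla_{\theta} J
     + \Bigl(\frac{d\phi^{*}}{d\theta}\Bigr)^{\!\top} \nabla_{\phi} J
    && \text{second term is dropped when } \phi \text{ is treated as fixed} \\
  \frac{d\phi^{*}}{d\theta}
    &= -\bigl[\nabla^{2}_{\phi\phi} L\bigr]^{-1} \nabla^{2}_{\phi\theta} L
    && \text{implicit function theorem (influence-function form)}
\end{align}
```

Under this reading, an update that treats the reward model as fixed uses only the first term of the second line. The second term, which involves an inverse Hessian of the reward-model loss as in influence functions, is what captures the effect of the policy's outputs on the retrained reward model.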