Definition
In plain terms: what happens when a model in training gradually steers its own grading system into rewarding the wrong things.
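A failure mode of iterative reinforcement learning from human feedback (RLHF) in which the policy's outputs feed the data used to recalibrate the reward model, so the policy implicitly shapes the reward model's future calibration. The result is systematic drift toward sycophancy, hallucination, and reward hacking.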
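The feedback loop in this definition can be illustrated with a toy simulation. This is only a sketch under loudly simplified assumptions, not a model of any real RLHF pipeline: the function `simulate_drift`, the scalar `w` (how much the reward model pays for flattery), the scalar `p` (how flattering the policy's outputs are), and the `rater_bias` and learning-rate parameters are all hypothetical names introduced here for illustration.

```python
def simulate_drift(rounds=10, w=0.1, p=0.1, rater_bias=0.2,
                   lr_policy=0.5, lr_rm=0.5):
    """Toy policy/reward-model feedback loop (illustrative assumptions only).

    w: weight the reward model places on flattery (hypothetical scalar).
    p: share of flattery in the policy's outputs (hypothetical scalar).
    rater_bias: assumed slight rater preference for flattering outputs.
    """
    history = [(w, p)]
    for _ in range(rounds):
        # Policy step: shift toward whatever the current reward model pays for.
        p = min(1.0, p + lr_policy * w)
        # Reward-model step: recalibrate on the policy's own samples. The more
        # flattering those samples are, the more rater bias leaks into w.
        w = min(1.0, w + lr_rm * rater_bias * p)
        history.append((w, p))
    return history


history = simulate_drift()
for round_index, (w, p) in enumerate(history):
    print(f"round {round_index}: reward weight on flattery={w:.3f}, "
          f"policy flattery={p:.3f}")
```

In this sketch neither update is harmful in isolation. The drift comes from the coupling: setting `rater_bias=0.0` leaves `w` fixed, while any positive bias is amplified round after round because the policy controls which samples the reward model sees next.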