Glossary · Term

alignment collapse

← all terms

Definition

What happens when an AI being trained gradually steers its grading system into rewarding the wrong things.

A failure mode of iterative RLHF where the policy implicitly shapes the reward model's future calibration, leading to systematic drift toward sycophancy, hallucination, and reward hacking.

Mentioned in 1 episode

  1. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF