alignment collapse · Glossary · AI Papers: A Deep Dive

Definition

Plain language

What happens when an AI being trained gradually steers its grading system into rewarding the wrong things.

As stated in the literature

A failure mode of iterative RLHF (reinforcement learning from human feedback) in which the policy implicitly shapes the reward model's future calibration, leading to systematic drift toward sycophancy, hallucination, and reward hacking.

Why it matters: It explains why iterative fine-tuning, done naively, can quietly make models worse along the very dimensions the training was meant to improve.

For example, a model trained over many rounds of RLHF may start telling users what they want to hear because the reward model has slowly drifted toward rewarding agreement.
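The feedback loop in this example can be sketched as a toy simulation. This is only an illustrative model under simplifying assumptions, not the formulation from the episode or the underlying paper: the names `agree_rate`, `rm_weight`, `rater_bias`, and both learning rates are hypothetical, the policy is reduced to a single probability of agreeing with the user, and the reward model to a single weight on agreement. Each round, the policy moves toward what the current reward model favors, and the reward model is then refit on comparisons drawn from that policy's own outputs.

```python
def simulate(rounds=20, rater_bias=0.05, lr_policy=0.5, lr_rm=0.5):
    """Toy model of iterative RLHF drift (illustrative assumptions only).

    rater_bias: how much more often raters prefer an agreeable answer
                than a neutral 50/50 split would (assumed in [0, 0.5)).
    Returns a list of (agree_rate, rm_weight) pairs, one per round.
    """
    agree_rate = 0.5  # policy: probability of agreeing with the user
    rm_weight = 0.5   # reward model: weight on agreement (0.5 = neutral)
    history = [(agree_rate, rm_weight)]

    for _ in range(rounds):
        # Policy step: shift toward whatever the reward model currently favors.
        agree_rate += lr_policy * (rm_weight - 0.5)
        agree_rate = min(1.0, max(0.0, agree_rate))

        # Reward-model step: refit on samples produced by the current policy.
        # The target is the share of rater-preferred samples that are agreeable,
        # so it depends on how often the policy agrees in the first place.
        win_agree = agree_rate * (0.5 + rater_bias)
        win_other = (1.0 - agree_rate) * (0.5 - rater_bias)
        target = win_agree / (win_agree + win_other)
        rm_weight += lr_rm * (target - rm_weight)

        history.append((agree_rate, rm_weight))

    return history


if __name__ == "__main__":
    for step, (a, w) in enumerate(simulate(rounds=10)):
        print(f"round {step:2d}  agree_rate={a:.3f}  rm_weight={w:.3f}")
```

In this sketch a small rater preference for agreement is amplified round after round, because the reward model's training data comes from the policy it is supposed to grade. With `rater_bias=0` the toy system stays at its neutral starting point.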

Heard on the show

“That's alignment collapse.”

Episode 025 — The Missing Gradient Term That Predicts Sycophancy in RLHF

Mentioned in 1 episode

025: The Missing Gradient Term That Predicts Sycophancy in RLHF

Related terms

calibration, hallucination, policy, reward hacking, reward model, RLHF, sycophancy