Glossary · Term

commitment sharpening

Definition

Plain language

The tendency of instruction tuning to make a model commit more strongly to its top-ranked choice at each generation step, leaving less probability for the alternatives.

As stated in the literature

The hypothesis that RLHF and instruction tuning concentrate probability mass on top tokens, unifying observed phenomena like alignment tax, calibration loss, mode collapse, and confident hallucination under a single dispositional change.

Why it matters: It offers a single mechanism that may explain why aligned models hallucinate more confidently, lose calibration, and collapse into stylistic ruts.

For example, after instruction tuning, a model that used to spread probability across 'maybe,' 'perhaps,' and 'I'm not sure' now puts almost all its mass on a single confident phrasing.
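The shift in the example can be made concrete with a toy calculation. The sketch below is illustrative only: the candidate phrasings and logit values are invented, and dividing logits by a low temperature is used as a stand-in for whatever change tuning actually makes to the distribution, not as a claim about the real mechanism. It compares the top-token probability and the Shannon entropy of a next-token distribution before and after sharpening.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower values mean more concentrated mass."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical candidates and logits, chosen only for illustration.
candidates = ["definitely", "maybe", "perhaps", "I'm not sure"]
base_logits = [1.0, 0.8, 0.6, 0.4]

# Assumption: sharpening is modeled as temperature scaling of the same logits.
before = softmax(base_logits, temperature=1.0)
after = softmax(base_logits, temperature=0.1)

for label, probs in (("before", before), ("after", after)):
    top = max(range(len(probs)), key=probs.__getitem__)
    print(f"{label}: top={candidates[top]!r} "
          f"p_top={probs[top]:.3f} entropy={entropy(probs):.3f}")
```

In this toy setup the ranking of the candidates does not change; only the share of probability held by the top candidate grows and the entropy falls, which is the pattern the hypothesis describes.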

Related terms

alignment tax, calibration, hallucination, instruction tuning, mode collapse, RLHF, token