commitment failure · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A hallucination where a language model has the right answer in its head but locks onto a wrong word anyway.

As stated in the literature

A hallucination class in which the model assigns substantial probability mass to the correct concept across spelling variants yet greedy decoding emits a wrong-concept token; the dominant failure mode of large instruction-tuned LLMs on short-form QA.

Also called: commitment failures

Why it matters: It reframes a chunk of hallucinations as a decoding problem rather than a knowledge problem, suggesting fixes that don't require retraining.

For example, the correct answer's probability is split across many surface forms such as 'Einstein,' 'Albert Einstein,' and 'Einstien' that together sum to 70% but individually each fall below 15%, so greedy decoding picks a stray 'Edison' token that has 15%.
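The arithmetic behind this kind of failure can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the probabilities are made up, the extra surface forms ('A. Einstein', 'einstein') and distractors ('Tesla', 'Newton') are hypothetical, and the `concept_of` mapping stands in for whatever variant grouping a real analysis would use. The point is only that greedy decoding compares individual candidates, so a concept whose mass is split into pieces smaller than 15% each can lose to a single 15% wrong token even while holding 70% in total.

```python
from collections import defaultdict

# Hypothetical next-step distribution (illustrative numbers, not model output).
token_probs = {
    "Einstein": 0.14,
    "Albert Einstein": 0.14,
    "Einstien": 0.14,
    "A. Einstein": 0.14,
    "einstein": 0.14,
    "Edison": 0.15,
    "Tesla": 0.10,
    "Newton": 0.05,
}

# Hypothetical mapping from each surface form to the concept it expresses.
concept_of = {
    "Einstein": "Einstein",
    "Albert Einstein": "Einstein",
    "Einstien": "Einstein",
    "A. Einstein": "Einstein",
    "einstein": "Einstein",
    "Edison": "Edison",
    "Tesla": "Tesla",
    "Newton": "Newton",
}


def greedy_pick(probs):
    """Return the single candidate with the highest probability."""
    return max(probs, key=probs.get)


def concept_mass(probs, mapping):
    """Sum probability over all surface forms of the same concept."""
    mass = defaultdict(float)
    for token, p in probs.items():
        mass[mapping[token]] += p
    return dict(mass)


greedy_token = greedy_pick(token_probs)
masses = concept_mass(token_probs, concept_of)
best_concept = max(masses, key=masses.get)

print("greedy pick:", greedy_token)          # greedy pick: Edison
print("highest-mass concept:", best_concept)  # highest-mass concept: Einstein
print("Einstein mass:", round(masses["Einstein"], 2))  # Einstein mass: 0.7
```

Comparing `greedy_token` with `best_concept` is one simple way to flag a candidate commitment failure in this toy setting: the two disagree even though the correct concept holds most of the probability mass.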

Heard on the show

“They call hallucinations above that threshold ‘commitment failures.’”

Episode 070 — When Models Know the Answer But Say the Wrong Thing Anyway

Mentioned in 1 episode

070 · When Models Know the Answer But Say the Wrong Thing Anyway

Related terms

greedy decoding · hallucination · instruction tuning · token