Glossary · Term

commitment failure

← all terms

Definition

When a language model has the right answer in its head but locks onto a wrong word anyway.

A hallucination class in which the model assigns substantial probability mass to the correct concept across spelling variants yet greedy decoding emits a wrong-concept token; the dominant failure mode of large instruction-tuned LLMs on short-form QA.

Also called: commitment failures

Mentioned in 1 episode

  1. 070
    When Models Know the Answer But Say the Wrong Thing Anyway