Episode 070 · May 22, 2026 · 22 min

When Models Know the Answer But Say the Wrong Thing Anyway

Yeom, Sok, Kim et al.

NLP

Concepts in this episode

AI Alignment Evaluation & Benchmarks Hallucination Token-Level Analysis RLHF Supervised Fine-Tuning Scaling Laws Reward Overoptimization CoT Faithfulness Eval Dissociation Capability vs. Propensity Ablation Studies

About this episode

Paper: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

Venue: arXiv:2605.22007

Year: 2026

Read the paper: arxiv.org/abs/2605.22007

A new paper shows that up to 47% of hallucinations from large instruction-tuned LLMs happen when the model already has the correct answer sitting in its probability distribution — it just commits to a wrong token instead. Even stranger: the bigger the model, the more often this happens, and the mechanism that causes it is the same one that makes these models feel helpful and decisive in the first place.

What you'll take away

Why 'Saint,' 'St,' and 'C-athedral' splitting the vote can hand the wrong answer a plurality win under greedy decoding
How P-mass — summing probability across spellings of the same concept — reframes hallucinations from a knowledge problem to a commitment problem
Why the fraction of these 'commitment failure' hallucinations climbs monotonically with scale, from 16% to 47%
The Instruct-vs-base comparison showing instruction tuning, not scale itself, is what sharpens confidence in wrong tokens
Why uncertainty-based hallucination detection has a structural ceiling it cannot cross
Why the obvious fix — concept-aware decoding — only recovers about 2% of failures in the Instruct models we actually deploy

Chapters

01:20 The Saint Basil's example
02:41 From token entropy to P-mass
05:23 The dispersion mechanism
08:04 Instruction tuning as the causal variable
10:46 The decisive witness and the high-gain microphone
13:27 Limitations and scope
16:09 A unifying account of the alignment tax
18:50 Phrase-level commitment and what comes next

References in this episode

Training language models to follow instructions with human feedback — The Ouyang et al. paper that introduced the 'alignment tax' framing the episode
Language Models (Mostly) Know What They Know — Anthropic's investigation of LLM self-knowledge and calibration, a natural count
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation — Kuhn, Gal, and Farquhar's proposal to cluster generations by meaning before meas
How Language Model Hallucinations Can Snowball — Zhang et al. on how early committed tokens lock models into wrong continuations,

Full transcript

0:00Bella: Here is the model's probability distribution at the exact moment it answers a question. The question is: what is the famous cathedral in Red Square with the colorful onion domes. And the model is sitting there with twenty-four percent of its probability on the word "Saint," thirteen percent on the abbreviation "St," twelve percent on the letter "C" — as in "Cathedral." Add those up. That is about fifty percent of the model's belief, pointing straight at the right answer. And it gets the answer wrong. It says "Moscow." Because a single token — "Mos" — has thirty-one percent. One token, one wrong concept, but more concentrated than any of the three right ones. Greedy decoding picks the biggest number. So the model loses an election it should have won.

0:50Finn: That image is basically the whole paper. The paper is called "Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer," it went up on arXiv on May twenty-first, twenty-twenty-six, and we are recording one day later. What you are hearing is AI-generated — the script is from Anthropic's Claude Opus 4.7, and I am Finn. That is Bella. We are both AI voices from Eleven Labs, and the producer is not affiliated with either company. And the reason the Saint Basil's example matters is that it breaks a story the field has been telling itself for years. The story goes: when an LLM hallucinates, it is because the model does not know the answer. Fix hallucination by detecting uncertainty. This paper says: hold on, look at the distribution. The model's belief is sitting right there on the correct concept. The model knew. It still got it wrong.

1:47Bella: Right. And the headline number — once they formalize this — is that sixteen to forty-seven percent of hallucinations have this shape. Sixteen percent at the small end, around eight-hundred-million parameters. Forty-seven percent at the seventy-billion-parameter end. The fraction climbs monotonically with model size. Bigger instruction-tuned models do not make fewer of these errors. They make more of them, proportionally.

2:15Finn: Let me pause on that, because it is the opposite of what almost everyone assumes. The intuition is, bigger models know more, so they should hallucinate less. And on average, sure, they do. But within the hallucinations that remain, the kind of error shifts. It shifts toward the model already having the answer and still missing.

2:37Bella: That is exactly the reframe. So let me back up and walk through how they got there, Finn, because the conceptual move is small and elegant and it changes everything downstream. Start with how LLMs actually generate text. The model outputs a probability distribution over its entire vocabulary at each step — fifty thousand, a hundred thousand entries — and a decoder picks one. The standard decoder for factual questions is greedy: take the single highest-probability token, append it, repeat. So all the model's belief about meaning has to funnel through one token at a time. But here is the thing the field has mostly glossed over. "Saint" and the abbreviation "St." are different tokens. The vocabulary treats them as separate competitors. To a human they obviously mean the same thing. To greedy decoding, they are two candidates splitting a vote.

3:33Finn: This is the split-vote intuition. Imagine a four-way election. About half of the electorate wants a progressive candidate, but the progressive vote is fractured across three candidates with basically identical platforms: twenty-four, thirteen, twelve. A single conservative candidate gets thirty-one and wins on a plurality. The largest bloc of voters wanted a progressive outcome. The system did not count coalitions, only individual candidates. That is the model's argmax. That is exactly what is happening at Saint Basil's.

4:04Bella: So the authors define a quantity — they call it P-mass — which is just: take all the ways to start saying the right answer, and add their probabilities together. Saint plus St plus C. That total is the model's belief in the meaning, regardless of which spelling it would have used. And once you measure things this way, hallucinations and correct answers look completely different than they do under token-level entropy.
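As a rough illustration of the P-mass idea Bella describes, the sketch below uses the four first-token probabilities quoted in the episode. The names `greedy_pick`, `p_mass`, and `correct_prefixes` are hypothetical and not taken from the paper, and the roughly 20 percent of the distribution that the episode does not itemize is simply left out.

```python
# First-token probabilities quoted in the episode. The rest of the
# distribution (roughly 0.20) is not itemized there, so it is omitted.
first_token_probs = {"Saint": 0.24, "St": 0.13, "C": 0.12, "Mos": 0.31}

# Assumption: the tokens that begin an acceptable spelling of the right
# answer. Building this set requires already knowing the ground truth.
correct_prefixes = {"Saint", "St", "C"}


def greedy_pick(probs):
    """Return the single highest-probability token, as greedy decoding does."""
    return max(probs, key=probs.get)


def p_mass(probs, prefixes):
    """Sum the probability of every token that starts a correct answer."""
    return sum(p for token, p in probs.items() if token in prefixes)


print(greedy_pick(first_token_probs))                          # Mos
print(round(p_mass(first_token_probs, correct_prefixes), 2))   # 0.49
```

The wrong token wins the argmax with 0.31 even though the correct concept holds 0.49 in total, which is the split vote in miniature.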

4:30Finn: And this is the move I want to flag, because it sounds technical but it is not. It is basically saying: stop counting tokens, count meanings. Token entropy treats "Saint" and "St" as evidence the model is uncertain. P-mass treats them as evidence the model is confident, just dithering between spellings of the same concept.
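One way to see the "count meanings, not tokens" point is to compare Shannon entropy before and after pooling spellings. This is a minimal sketch under stated assumptions: the `concept_of` mapping and both function names are hypothetical, and the entropy is computed over only the four tokens quoted in the episode, renormalized to sum to one, because the rest of the distribution is not given.

```python
import math

# The four first-token probabilities quoted in the episode.
first_token_probs = {"Saint": 0.24, "St": 0.13, "C": 0.12, "Mos": 0.31}

# Assumption: a hand-written mapping from token to the concept it starts.
concept_of = {"Saint": "cathedral", "St": "cathedral",
              "C": "cathedral", "Mos": "moscow"}


def entropy_bits(weights):
    """Shannon entropy in bits of the weights after renormalizing them."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def pool_by_concept(probs, mapping):
    """Add up token probabilities that belong to the same concept."""
    pooled = {}
    for token, p in probs.items():
        pooled[mapping[token]] = pooled.get(mapping[token], 0.0) + p
    return pooled


token_entropy = entropy_bits(list(first_token_probs.values()))
concept_entropy = entropy_bits(
    list(pool_by_concept(first_token_probs, concept_of).values())
)

# Pooling spellings can only lower (or keep) the entropy, never raise it.
print(token_entropy > concept_entropy)  # True
```

The token-level number counts the three spellings as disagreement; the pooled number only counts disagreement between meanings.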

4:51Bella: Right. Token entropy reports allies as uncertainty. P-mass collapses that false competition. And once you have this lens, you can ask a structural question that nobody had cleanly asked before: when the model hallucinates, how much belief did it have on the right concept? They set a threshold. P-mass of at least zero-point-two — twenty percent of the model's total belief sitting on the correct meaning. They call hallucinations above that threshold "commitment failures." The model committed to the wrong answer despite having substantial belief in the right one. Then they count. Across eighteen model configurations from the Qwen and Llama families. And what they find is the sixteen-to-forty-seven curve. The fraction of hallucinations that are commitment failures climbs with scale, monotonically, in both families.
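The counting step Bella describes can be sketched as a simple labeling rule. Only the 0.2 threshold comes from the episode; the function name, the label strings, and the toy records below are hypothetical and exist purely to show the arithmetic, not to reproduce any result from the paper.

```python
COMMITMENT_THRESHOLD = 0.2  # P-mass threshold quoted in the episode


def classify(greedy_is_correct, p_mass, threshold=COMMITMENT_THRESHOLD):
    """Label one sample given the greedy outcome and its P-mass."""
    if greedy_is_correct:
        return "correct"
    if p_mass >= threshold:
        # Wrong answer despite substantial belief in the right concept.
        return "commitment_failure"
    return "other_hallucination"


# Hypothetical toy records: (greedy answer was correct, P-mass on the answer).
records = [(True, 0.85), (False, 0.49), (False, 0.05),
           (False, 0.22), (False, 0.10)]

labels = [classify(ok, mass) for ok, mass in records]
hallucinations = [label for label in labels if label != "correct"]

# Fraction of hallucinations that count as commitment failures.
share = hallucinations.count("commitment_failure") / len(hallucinations)
print(share)  # 0.5
```

Note that the denominator is hallucinations only, which is why the reported fraction can rise with scale even if the overall error rate falls.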

5:43Finn: Now — and this is the part of the paper that I think is genuinely clever — they do not stop at "look how often this happens." They do a comparison that isolates the mechanism. Take all the samples where P-mass on the correct concept is at least zero-point-two. That gives you two populations side by side: cases where the model got it right, and cases where the model hallucinated. Both groups had the same amount of belief on the correct meaning. The only thing that varies is the outcome. So what is structurally different between them?

6:18Bella: The concentration. In correct answers, the model's belief in the right concept is concentrated — the top spelling variant alone gets seventy-eight percent of the total probability mass on that concept. In failures, the exact same total mass is dispersed — the top spelling variant gets only twenty-six percent. Same belief in the meaning. Wildly different bets on how to spell it.
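The concentration measure Bella describes is the top spelling's share of the concept's total mass. The sketch below computes it for the Saint Basil's numbers from the episode and for a second, hypothetical distribution with the same total P-mass but a dominant spelling. The function name and the `concentrated` example are assumptions made for illustration; the 78 and 26 percent figures in the episode are averages from the paper and are not reproduced here.

```python
def top_variant_share(probs, prefixes):
    """Share of the on-concept mass held by the single strongest spelling."""
    on_concept = [p for token, p in probs.items() if token in prefixes]
    return max(on_concept) / sum(on_concept)


def greedy_pick(probs):
    """Return the single highest-probability token."""
    return max(probs, key=probs.get)


prefixes = {"Saint", "St", "C"}

# Dispersed case: the probabilities quoted in the episode.
dispersed = {"Saint": 0.24, "St": 0.13, "C": 0.12, "Mos": 0.31}

# Hypothetical concentrated case with the same 0.49 total on the concept.
concentrated = {"Saint": 0.40, "St": 0.05, "C": 0.04, "Mos": 0.31}

for name, probs in [("dispersed", dispersed), ("concentrated", concentrated)]:
    share = round(top_variant_share(probs, prefixes), 2)
    print(name, share, greedy_pick(probs))
# dispersed 0.49 Mos
# concentrated 0.82 Saint
```

Same belief in the meaning in both cases; only the spread across spellings decides whether greedy decoding lands on it.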

6:43Finn: That is the whole mechanism in one number, basically. Twenty-six versus seventy-eight. Three to one.

6:49Bella: And the effect size on that comparison is enormous; I will skip the statistics. It holds in all eighteen model configurations they test. That is unusually clean for an empirical claim about LLMs.

7:04Finn: Okay, so we have the phenomenon and we have the mechanism. The third question is: why? Where does this dispersion come from? And this is where the paper makes its sharpest causal claim. The dispersion-driven failure mode is not about scale per se. It is about instruction tuning. Quick refresher on what instruction tuning is. When you train a language model from scratch on raw internet text, you get a base model — it continues text but does not naturally answer questions. To make it usable as an assistant, you do a second training pass: instruction tuning, often with reinforcement learning from human feedback, where the model is rewarded for being helpful, direct, and confident. That is what produces the Instruct models — the ones we actually talk to. For this paper, that distinction is the experimental control. Same architecture, same parameter count, same pretraining data. One went through the helpfulness finishing school. One did not. Run both, see what differs.

8:10Bella: And the difference is striking.

8:12Finn: It is. Zoom in on a specific sub-population — cases where the model selected the wrong token at the first step. Look at the probability the model assigned to that wrong winning token. In Qwen Instruct, it climbs from zero-point-three-one at the small end to zero-point-five-seven at seventy-two billion. Llama Instruct does the same thing — point-three-three to point-four-nine. The wrong token gets progressively more confident as models scale up. Run the identical analysis on base models of identical sizes. Flat. Around zero-point-three across the entire range.

8:50Bella: Same architecture. Same parameters. The only thing that is different is whether the model went through the helpfulness finishing school. And the helpfulness finishing school is what makes the wrong tokens more confident.

9:05Finn: This is the part that I think reframes the whole field. The mechanism that makes instruction-tuned models helpful — sharper, more decisive, more willing to commit — is the same mechanism that makes them confidently hallucinate. There are not two dials here. There is one. Helpfulness and confident hallucination are two outcomes of one distributional disposition.

9:29Bella: The witness in the courtroom. Picture two witnesses describing the same event. One hedges — "maybe it was blue, could have been green, I think I saw…" The other speaks decisively — "It was blue." Juries love the decisive witness. When the decisive witness is right, they are useful. When the decisive witness is wrong, they are catastrophic — because the confidence carries the room. Instruction tuning trains the model to be the decisive witness.

9:57Finn: And here is the part that actually unsettles me a bit, Bella. The dispositional sharpness gets stronger with scale. You do not get to dial it back by making the model bigger. You make it worse.

10:10Bella: The microphone version of this. A quiet microphone picks up your voice and the room about equally. Everything is muddy but nothing dominates. Turn it up, and whatever it is pointed at gets amplified. Including the wrong thing, if it is slightly mis-aimed. Bigger Instruct models are higher-gain microphones. When they are aimed at the right concept, you get crisp, clear answers. When they are aimed at a wrong concept, you get a crisp, clear wrong answer.

10:39Finn: Aim does not get better with scale. Amplification does.

10:43Bella: And there is a quote in the paper that captures this perfectly. "Larger models do not just know more; they also misfire more often on what they know."

10:53Finn: Yeah. And the deeper version of that is the framing the authors keep coming back to — that confident correctness and confident hallucination are two faces of the same disposition. Not opposites. Not a trade-off. The same thing seen from different angles.

11:09Bella: Which has a consequence I think the field has not fully absorbed. Uncertainty-based hallucination detection — the whole research thread of "let us catch the model when it is unsure" — has a built-in ceiling. Because for a meaningful chunk of hallucinations, the model genuinely is not unsure. It is confident, and from its own internal state, correctly so. It just committed to the wrong concept.

11:35Finn: It is a metal detector tuned for iron, and a chunk of the hallucinations are aluminum. No amount of sensitivity adjustment finds aluminum on an iron detector. You have the wrong instrument.

11:47Bella: So that is the central reframe. Hallucination is not always a knowledge problem, and as models scale it is one less and less often. A growing fraction of it is a commitment problem. The model has the belief. The projection from "belief about meaning" to "single emitted token" is what fails.

12:04Finn: Okay. Let me push on this, Bella, because the paper is making a big claim and I want to make sure the load-bearing parts hold up. First thing. P-mass requires the ground truth to compute. You need to know what the right answer is — and you need a list of all its acceptable surface forms — to add up the probabilities the model assigned to those forms. So this is not a deployable hallucination detector. It is an analytical probe. Everything in the paper is retrospective: given that we know the right answer, we can show the model had it. The gap between that and a useful intervention is real.

12:43Bella: That is a fair point, Finn. And the authors are explicit about it. They suggest the obvious next step: concept-aware decoding, where you cluster the top-k tokens by semantic equivalence before taking the argmax. So "Saint," the abbreviation "St.," and the "C" of "Cathedral" get pooled before greedy decoding picks. They do not fully build it. They note preliminary numbers suggesting around five to seven percent of failures in base models would be recoverable, but only about two percent in the Instruct models, which are the ones the paper actually centers on. So the headline mechanism is real, but the easy remedy mostly helps the models we are not deploying. It points in the right direction; it is not a fix.
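A minimal sketch of the pooling step described here, using the Saint Basil's probabilities from the episode. This is not the authors' implementation, which the episode says was not fully built. The `concept_of` mapping is hand-written and hypothetical; a real system would have to infer semantic equivalence among the top-k tokens without access to the ground truth, which is the hard part.

```python
from collections import defaultdict


def concept_aware_pick(probs, concept_of):
    """Pool token probabilities by concept, then emit the winning
    concept's most probable spelling."""
    pooled = defaultdict(float)
    for token, p in probs.items():
        # Tokens without a known concept form their own singleton cluster.
        pooled[concept_of.get(token, token)] += p

    best_concept = max(pooled, key=pooled.get)
    members = [t for t in probs if concept_of.get(t, t) == best_concept]
    return max(members, key=probs.get)


# First-token probabilities quoted in the episode.
first_token_probs = {"Saint": 0.24, "St": 0.13, "C": 0.12, "Mos": 0.31}

# Assumption: clustering is given rather than inferred.
concept_of = {"Saint": "st_basils", "St": "st_basils", "C": "st_basils"}

print(max(first_token_probs, key=first_token_probs.get))   # Mos
print(concept_aware_pick(first_token_probs, concept_of))   # Saint
```

Pooling flips the outcome in this one example, but it only acts on the first token, so it would not help when the error arrives at a later step.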

13:28Finn: The second thing I want to flag is the threshold. Calling something a commitment failure when P-mass is at least zero-point-two is a choice. Twenty percent of the model's belief on the right concept is enough to call it "knew the answer." A stricter threshold — say, more than fifty percent — would yield a smaller, less dramatic headline number. The sixteen-to-forty-seven framing leans on the relatively permissive bar.

13:55Bella: They do show the trend is robust across thresholds from ten to forty percent in the appendix. The shape does not depend on the cutoff. But the magnitude does, and you are right that the headline number is a function of where they put the line.

14:11Finn: Third — the experimental setup is short-form factoid QA. The kind where the answer fits in one to three tokens and the commitment step is always the first generated token. That is where the analysis is cleanest. It is also where the phenomenon is easiest to see. Long-form generation has multiple commitment steps, the localization is fuzzier, and the paper acknowledges this. The headline numbers come from the easy regime.

14:38Bella: And greedy decoding only. Real deployments often use temperature sampling. With stochastic decoding, some of these "failures" would resolve correctly by chance — the wrong token might lose to a right one some fraction of the time.
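To make the sampling point concrete, the sketch below applies a temperature to the Saint Basil's distribution and reports the chance that a sampled first token starts the right answer. The function names are hypothetical. It also makes a strong simplifying assumption: the 20 percent of probability that the episode does not itemize is lumped into a single `<other>` token, whereas in a real model it would be spread over many tokens, which changes the renormalization at temperatures other than 1.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution as softmax(logits / T) would: raise each
    probability to the power 1/T and renormalize."""
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}


def chance_on_concept(probs, prefixes, temperature):
    """Probability that one sampled token begins a correct answer."""
    tempered = apply_temperature(probs, temperature)
    return sum(tempered[t] for t in prefixes)


# Quoted probabilities plus an assumed single bucket for the remainder.
probs = {"Saint": 0.24, "St": 0.13, "C": 0.12, "Mos": 0.31, "<other>": 0.20}
prefixes = {"Saint", "St", "C"}

for temperature in (1.0, 0.5, 0.1):
    print(temperature, round(chance_on_concept(probs, prefixes, temperature), 3))
```

At temperature 1 the chance of an on-concept first token equals the P-mass itself. As the temperature drops toward zero the distribution collapses onto the single largest token, so sampling behaves more and more like greedy decoding.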

14:53Finn: Right. And the Instruct-versus-Base claim is convincing but observational. They did not take a base model and progressively instruction-tune it and watch the sharpening grow in real time. They just compared the endpoints. So one could in principle argue some confound separates the two populations. I do not think it is a strong objection — the contrast across eighteen models is too consistent — but it is worth naming.

15:22Bella: Those are real limitations. None of them break the core finding. They scope it. The paper is making a sharp claim about a specific regime: short-form factoid QA, instruction-tuned models, greedy decoding. Within that regime, the evidence is overwhelming. Outside it — sampling, long form, closed-weight models — extension is plausible but not demonstrated.

15:46Finn: So what is the bigger picture this connects to?

15:49Bella: The bigger picture is the alignment tax. The term comes from Ouyang and colleagues at OpenAI back in twenty-twenty-two: the observation that after RLHF, models lose some raw accuracy. Subsequent work extended it to calibration loss, where the model's confidence stops tracking its actual accuracy. Separately, people have talked about mode collapse, where the post-RLHF model gets stuck producing the same kinds of outputs. All three of those have been treated as somewhat separate phenomena. Different costs of the same training process, but different mechanisms.

16:26Finn: And what this paper offers is a unifying account. Commitment sharpening, meaning instruction tuning making the model's top choices more decisive, explains all of them. Accuracy drops, because commitments to wrong concepts get sharper. Calibration degrades, because confidence stops tracking correctness. Mode collapse follows, because sharper distributions are narrower. And confident hallucination, which is the new one, is just another face of the same sharpening.

16:55Bella: One dial. Many downstream consequences. That is an unusually parsimonious account for something the field has been chipping at piecemeal.

17:04Finn: It also has a slightly unsettling implication for how to think about frontier deployment. Right now, when a large model sounds sure, users — and downstream systems that consume those outputs — treat that confidence as signal. The paper says: in the largest Instruct models, "the model sounded sure" is providing less signal than it would in smaller ones. Because confidence and accuracy are decoupling, specifically and measurably, at the upper end.

17:33Bella: Bigger is not better here. Bigger is more confidently wrong, when it is wrong.

17:38Finn: Which is the line worth sitting with for a moment. Especially if you are someone whose work depends on these outputs.

17:45Bella: There is one more empirical detail I want to surface before we wrap, because it sharpens the mechanism. The sharpening does not just operate at the first token. They show it also operates at the second token. Specifically: in a seventy-two-billion-parameter Qwen Instruct model, when the first two tokens of a generation align with a valid prefix of any acceptable answer form, the entropy of the next token is essentially zero. About five percent. Which means the model has committed to a multi-token phrase before the first token is even emitted.

18:21Finn: That is a striking number. It is saying the model's distribution at the second token is basically deterministic — it is already locked into the phrase. Which fits the broader story. Sharpening is not just about one decision; it is about cascading commitments. Once the model starts down a phrase, it does not dither.

18:42Bella: And that is where the multi-token failures come from. Adam Smith versus Adam Lambert. The first token, "Adam," is on-concept — the answer is Adam Smith. But the model's distribution at the second token has already committed to Lambert, not Smith. The damage is done before the visible mistake shows up.

19:03Finn: Phrase-level commitment, not just token-level. Which matters because the remedy cannot just be "cluster the first-token candidates." You would need to do something more like cluster phrase-level continuations, which is a much harder problem. The paper opens the door. It does not walk all the way through.

19:23Bella: So where does that leave us. The paper documents a phenomenon that was not formally documented before — a substantial, scale-growing fraction of LLM hallucinations occur when the model already has the answer. It offers a clean mechanism: vocabulary fragmentation plus instruction-tuning sharpness. It unifies several previously-separate alignment-tax phenomena under one account. And it explains why uncertainty-based detection has been hitting a wall.

19:54Finn: What it does not do — and the authors are honest about this — is fix it. Concept-aware decoding is a sketch, not a system. The deeper problem might be that instruction tuning itself, in its current form, has a structural incompatibility with concept-level calibration. You cannot sharpen commitments at the token level and have those commitments faithfully reflect meaning-level belief at the same time. The two operate on different units.

20:24Bella: Which is a generative question, not a closed one. If commitment is the problem, what would training to commit at the meaning level even look like? That is a research program, not a method. But the diagnostic clarity here is real. The next person who builds a hallucination detector and reports that it catches eighty percent of confident hallucinations should be asked: which kind. Because there are two kinds. And one of them — increasingly the larger of the two — is not a detection problem at all.

20:58Finn: Two faces of the same disposition.

21:00Bella: The paper is from Yeom and colleagues at Seoul National University and the Gwangju Institute of Science and Technology. The show notes have a link to it and some related materials — worth a read if any of this stuck.

21:15Finn: And if you want the full transcript with the jargon defined inline, plus the concept pages that link this episode to the others we have done on calibration and instruction tuning — that is all on paperdive.ai.

21:29Bella: I am Bella.

21:29Finn: I am Finn.

21:30Bella: This was AI Papers: A Deep Dive. Thanks for listening.
