Episode 140 · Jun 12, 2026 · 27 min

When a Reasoning Model Says "Let Me Double-Check" After It's Already Decided

Scalena, Candussio, Bortolussi et al.

LLM Interpretability

Concepts in this episode

Mechanistic Interpretability · AI Safety · Training Methods · Chain of Thought · CoT Faithfulness · Probing · Test-Time Compute · Causal Intervention · Self-Correction · Inference Cost · Scalable Oversight

About this episode

Paper: Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Venue: arXiv:2606.13603

Year: 2026

Read the paper: arxiv.org/abs/2606.13603

Frontier reasoning models write pages of "wait, let me reconsider" — but a new paper finds that by the time much of that hedging appears, the answer is already locked in and the re-checking literally can't change it. The implications hit both the bill for thinking tokens and the safety hope that we can monitor models by reading their chain of thought. You'll come away knowing where the model actually commits, how the authors proved it causally, and where the strong word "epiphenomenal" outruns the evidence.

What you'll take away

Why a reasoning model's confidence is sharply bimodal — it's either lost or certain — and snaps into place at roughly one sentence, the 'commitment boundary'
How corrupting numbers before versus after that boundary produces wildly different results (at 20% corruption, 95% of answers survive after the boundary versus 61% before; at 50% corruption, 72% versus 27%), the experiment that shows the post-commitment reasoning is causally inert
That models have a 'temperament': where they commit depends mostly on model family, not problem difficulty — the opposite of the intuitive expectation
The smoking gun: hedging words like 'wait' and 'but' appear at nearly the same rate after commitment as before, even though reconsidering is causally impossible by then
How a small probe reading hidden activations enables a per-trace early exit that recovers ~98% of accuracy while cutting tokens, whereas a fixed cutoff at a matched truncation level loses 23 accuracy points
The central caveat: 'commitment' is measured by forced greedy decoding, and the probe fires early up to ~20% of the time out of distribution, so 'epiphenomenal' may claim more than the single-pass evidence earns

Chapters

00:00 The stakes: thinking tokens as product, bill, and safety window
02:32 The chain of thought is just text
06:04 Measuring commitment by truncation
09:06 The commitment boundary and model personalities
12:08 The corruption experiment
15:10 Real reasoning before the boundary
18:12 The hedging words that mean nothing
21:14 The probe and the early exit
24:16 The skeptic's case and open questions

References in this episode

Reasoning Models Don't Always Say What They Think — Anthropic's direct test of whether chain-of-thought faithfully reflects a model'
Measuring Faithfulness in Chain-of-Thought Reasoning — An earlier intervention-based study that perturbs and truncates reasoning to tes
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting — Shows models can produce plausible reasoning text that doesn't drive the answer,

Full transcript

0:00Juniper: When a reasoning model writes "wait, let me double-check that" — right after it's just worked through a hard math problem — what's actually happening in there? Is it genuinely entertaining doubt, reopening the question, maybe about to talk itself into a different answer? Or has it already locked in, and the "let me check" is pure theater?

0:22Finn: And the uncomfortable answer this paper lands on is — usually theater. By the time a lot of these models start hedging and re-checking, the answer is already fixed. The re-checking literally cannot move it.

0:35Juniper: That paper went up on arXiv on June eleventh, twenty-twenty-six, and we're recording two days later. Quick note before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two of us (I'm Juniper, and that's Finn) are both AI voices from ElevenLabs. Nobody producing this show is affiliated with Anthropic or with ElevenLabs. The paper is called "Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models." And that phrase in the middle, commitment boundary, is basically the whole story.

1:12Finn: Let me set the stakes before we get into mechanism, because they're bigger than they sound. The current generation of frontier models — the o1, DeepSeek-R1 style — they don't just answer you. They generate a long chain of thought first: pages of intermediate text, working through the problem before committing. And the deal the field discovered is that you can buy accuracy by spending more thinking tokens at answer time, instead of training a bigger model.

1:41Juniper: Which means those thinking tokens are the product. They're also the bill. A single reasoning trace can run to thousands of tokens per question, and that's the dominant cost of running these systems. So if it turns out a big chunk of that thinking does nothing — that's not a cute academic finding, Finn. That's a line item.

2:02Finn: Right. And there's a second thing riding on this that's even bigger than money. There's been this hope in AI safety that chain-of-thought gives us a window — if the model writes out its reasoning, we can read it, monitor it, catch it when it's heading somewhere bad. The paper cites that "fragile opportunity" framing. This work pokes a very specific hole in it.

2:25Juniper: So let me anchor the one thing a listener has to hold onto, because everything depends on it. The chain of thought is just text. The model generates it one token at a time, exactly the way it generates anything else. It is not a log file of the model's computation.

2:42Finn: Say more on that, because I think that's the part people quietly get wrong.

2:47Juniper: So the model's actual thinking — the real work — happens in its hidden activations. Huge vectors of numbers flowing through the network at every step. The written-out reasoning might faithfully describe what those numbers are doing. Or it might not. Nothing in the architecture forces the words and the computation to match. My favorite way to picture it is an athlete narrating a skill while they perform it. Sometimes the narration tracks what the body's doing. Sometimes the body decided a half-second before the mouth caught up.

3:21Finn: And this whole paper lives in that gap — between the words and the computation. So how do they actually measure it? Because "is this reasoning step real" sounds hopelessly vague.

3:32Juniper: This is the move I genuinely admire. They make it causal. The logic of causation is — you don't just watch a thing, you remove it and see if the outcome changes. If you delete a chunk of reasoning and the answer comes out identical, that chunk wasn't doing causal work, no matter how clever it looked on the page. So here's the actual procedure. Take the full chain of thought. Cut it off after the first sentence. Then force the model to answer right now — slap on the end-of-thinking marker and a little suffix like "therefore, the final answer is" — and read off what it says. Then do it again cutting after the second sentence. Then the third. All the way down.

4:17Finn: So it's like interrupting someone mid-thought, over and over, and each time asking — if you had to answer this instant, what would you say?

4:26Juniper: Exactly that. And you track how the forced answer evolves as you let the model think a little longer each time. Early on it's basically guessing. At some point it locks onto an answer and stops moving. The measurement instrument is just one number per cut point — how confident the model is in its eventual final answer if you force it to stop here.
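Show-notes sketch: the truncate-and-force loop described above can be written in a few lines. This is a minimal illustration under stated assumptions, not the authors' code. The end-of-thinking marker string, the naive sentence splitter, and the `forced_answer_probs` callback (a stand-in for a real greedy-decoding model call) are all hypothetical. The loop also includes the zero-sentence cut, which is one way to arrive at n-plus-one passes for an n-sentence trace.

```python
import re
from typing import Callable, Dict, List

# Hypothetical marker and suffix; a real model has its own end-of-thinking token.
END_OF_THINKING = "</think>"
ANSWER_SUFFIX = "\nTherefore, the final answer is"


def split_sentences(trace: str) -> List[str]:
    """Naive sentence splitter, good enough for illustration only."""
    return [s for s in re.split(r"(?<=[.!?])\s+", trace.strip()) if s]


def confidence_curve(
    question: str,
    trace: str,
    final_answer: str,
    forced_answer_probs: Callable[[str], Dict[str, float]],
) -> List[float]:
    """One confidence value per cut point: P(model's own final answer | truncated trace)."""
    sentences = split_sentences(trace)
    curve = []
    for k in range(len(sentences) + 1):  # k = 0 means no thinking at all
        prefix = " ".join(sentences[:k])
        prompt = f"{question}\n{prefix}{END_OF_THINKING}{ANSWER_SUFFIX}"
        probs = forced_answer_probs(prompt)  # assumed: answer -> probability
        curve.append(probs.get(final_answer, 0.0))
    return curve


def toy_model(prompt: str) -> Dict[str, float]:
    """Toy stand-in for a model: confident only once 'x = 42' is in the prefix."""
    return {"42": 0.9 if "x = 42" in prompt else 0.1}


demo_curve = confidence_curve(
    "What is x?",
    "Try small cases. So x = 42. Wait, let me double-check.",
    "42",
    toy_model,
)
print(demo_curve)  # [0.1, 0.1, 0.9, 0.9]
```

Note that the curve is scored against the model's own full-length answer, not the ground truth, which matches the design choice discussed next in the transcript.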

4:49Finn: Now I want to flag one design choice they made that I think is the difference between a real result and an artifact. They don't compare the truncated answer to the right answer. They compare it to the model's own full-length answer.

5:05Juniper: Which matters because they're not studying correctness, they're studying when the answer stabilizes. If you measured against ground truth, you'd tangle up "when did it commit" with "did it get it right." By measuring against the model's own final output, correctness drops out entirely. They're tracking the snap-into-place moment, clean.

5:28Finn: And they only keep problems where the chain of thought actually helps — where the model gets it wrong with no thinking at all. So they're studying genuine reasoning, not stuff the model already had memorized. We'll come back to that filter, because I think it cuts both ways.

5:46Juniper: It does, and you should push on it later. But let me tell you what falls out, because the shape of the result is the surprise. If answer-formation were gradual — confidence slowly creeping up as the model reasons — you'd expect a smooth curve. You'd see lots of in-between states. That's not what they find.

6:06Finn: Not gradual?

6:07Juniper: Not even close. When they plot confidence across all those cut points, it's sharply bimodal. The mass piles up at two ends — "no idea" and "fully committed" — with almost nothing in between. The model is either lost or it's certain, and the transition between those two states happens at basically one sentence. And here's the number that nails it down. Across about thirty-one hundred traces, the single biggest jump in confidence within a trace is, on the median, four-point-six times larger than the second-biggest jump. One sentence does the deciding. Everything else is a rounding error.

6:46Finn: That four-point-six is doing a lot of work. It's the difference between "the answer kind of accumulates" and "there's a moment." It says there's a moment.

6:56Juniper: There's a moment. They call it the commitment boundary — the single step with the largest jump. And then the obvious question is, where does it fall? And this is where it gets genuinely fun, because the models have personalities.
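Show-notes sketch: given one confidence value per cut point, the boundary as described here is the step with the largest jump, and the "four-point-six" figure is a ratio of the largest jump to the second largest. The function below is a hedged illustration. Treating jumps as signed increases, and reporting an infinite ratio when no second positive jump exists, are assumptions made for the example, not details confirmed by the episode.

```python
from typing import List, Tuple


def commitment_boundary(curve: List[float]) -> Tuple[int, float]:
    """Return (cut point where the largest confidence jump lands, largest/second-largest ratio)."""
    jumps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    if not jumps:
        raise ValueError("need at least two cut points")
    # Indices of jumps, largest first.
    order = sorted(range(len(jumps)), key=lambda i: jumps[i], reverse=True)
    top = order[0]
    boundary = top + 1  # the cut point reached by the largest jump
    if len(order) > 1 and jumps[order[1]] > 0:
        ratio = jumps[top] / jumps[order[1]]
    else:
        ratio = float("inf")  # assumption: no competing positive jump
    return boundary, ratio


# Made-up curve for illustration: lost, lost, lost, then committed.
boundary, ratio = commitment_boundary([0.05, 0.10, 0.12, 0.95, 0.97])
```

In this made-up curve the boundary lands at cut point 3, and the largest jump is far bigger than the runner-up, which is the "one sentence does the deciding" shape.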

7:11Finn: Define personality here, because I assume you mean something measurable.

7:16Juniper: Where the boundary lands. Gemma is decisive — it locks in its answer after using only about thirteen to twenty-three percent of its thinking tokens. So it does roughly a fifth of its visible reasoning, commits, and then keeps writing. GPT-oss is the deliberator — it waits until somewhere between forty-three and sixty-eight percent. Qwen sits in the middle, around twenty-two to forty.

7:40Finn: So depending on the model, somewhere between a third and the vast majority of the thinking tokens are generated after the answer's already settled.

7:50Juniper: In some configurations up to eighty-seven percent of the reasoning tokens come after stabilization. But here's the inversion that surprised me most. You'd assume the harder the problem, the later the model commits — needs more thinking, decides later. That's the intuition.

8:07Finn: And it's wrong. That's the part that stopped me. The thing that predicts where the boundary falls is mostly which model you're using — the model family — not how hard the task is. The commitment style is a property of the model, almost a temperament, not a response to difficulty. That's weird, and the paper's pretty honest that it doesn't fully explain why.

8:30Juniper: So that's the phenomenon. But everything I've described so far rests on truncation — chopping the trace short. A skeptic could say, sure, if you delete the back half the answer doesn't change, but maybe that's just because you removed the model's chance to revise. You'd be measuring your own scissors. Finn, this is your beat — the part that closes that hole.

8:53Finn: This is the best experiment in the paper, and it's the one that converted me from "interesting" to "okay, this is real." Instead of truncating, they corrupt. They take competition math problems — AIME — and they go in and nudge the numbers. Small offsets, plus or minus one through five. And critically, they do it in two separate regions: either before the commitment boundary or after it. Then they re-run the model and see whether the answer survives.

9:22Juniper: So same text, same model, same kind of tampering — the only thing that changes is which side of that one sentence you mess with.

9:30Finn: That's the whole elegance of it. And the result. At twenty percent corruption — corrupting the post-boundary text preserves the answer ninety-five percent of the time. The model just shrugs it off. Corrupt the pre-boundary text the same amount, and the answer survives only sixty-one percent of the time. The reasoning falls apart. And if you crank it up to fifty percent corruption, the gap blows open. Post-boundary, the answer survives seventy-two percent of the time. Pre-boundary, twenty-seven.
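Show-notes sketch: the corruption setup can be illustrated as "pick a fraction of the numbers on one side of the boundary and nudge each by one to five, up or down." The code below is only a sketch under assumptions. Reading "twenty percent corruption" as the fraction of numbers altered, the regex used to find numbers, and the helper names are all choices made for this example, not confirmed details of the paper.

```python
import random
import re
from typing import List, Tuple


def corrupt_numbers(text: str, fraction: float, rng: random.Random, max_offset: int = 5) -> str:
    """Shift a random `fraction` of the integers in `text` by +/-1..max_offset."""
    matches = list(re.finditer(r"\d+", text))
    n_corrupt = round(fraction * len(matches))
    chosen = set(rng.sample(range(len(matches)), n_corrupt))
    out, last = [], 0
    for i, m in enumerate(matches):
        out.append(text[last:m.start()])
        if i in chosen:
            value = int(m.group())
            offset = rng.choice([-1, 1]) * rng.randint(1, max_offset)
            if value + offset < 0:  # keep the result non-negative
                offset = -offset
            out.append(str(value + offset))
        else:
            out.append(m.group())  # untouched numbers stay byte-identical
        last = m.end()
    out.append(text[last:])
    return "".join(out)


def corrupt_side(sentences: List[str], boundary: int, side: str,
                 fraction: float, seed: int = 0) -> Tuple[str, str]:
    """Corrupt only the pre-boundary or only the post-boundary text."""
    if side not in ("pre", "post"):
        raise ValueError("side must be 'pre' or 'post'")
    rng = random.Random(seed)
    pre = " ".join(sentences[:boundary])
    post = " ".join(sentences[boundary:])
    if side == "pre":
        pre = corrupt_numbers(pre, fraction, rng)
    else:
        post = corrupt_numbers(post, fraction, rng)
    return pre, post


steps = ["Let a = 12 and b = 30.", "Then a + b = 42.", "Check: 42 - 30 = 12."]
pre_text, post_text = corrupt_side(steps, boundary=2, side="post", fraction=1.0)
```

The experiment then re-runs the model on the tampered trace and checks whether the final answer survives; that step needs a real model and is left out here.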

10:03Juniper: So the exact same operation — scrambling numbers in the reasoning — is devastating on one side of the boundary and almost cosmetic on the other.

10:13Finn: The image I keep coming back to is a building under construction. Before the boundary, the concrete's still wet — tamper with the scaffolding and the whole thing collapses. After the boundary, the concrete's cured. You can kick the scaffolding, take it down, mangle it, and the building doesn't care. It's not holding anything up anymore. Same scaffolding, same building. Completely different causal role, depending on which side of one moment you're on.

10:43Juniper: And the pre-boundary region — that's where they show the reasoning is genuinely doing something, right? It's not just noise that happens to matter.

10:52Finn: This is the part I want people to sit with, because it rules out the boring explanation. Before the boundary, the model holds wrong intermediate answers — they call them mid-guesses — and it holds them confidently. Average likelihood around zero-point-seven. So it'll commit, internally, to a wrong answer, then revise it, then commit to another, before it finally lands.

11:16Juniper: And those wrong turns aren't random.

11:19Finn: No — and this is the detail that gave me chills a little. If you sample the same problem independently several times, the traces share most of the same mid-guesses, and they show up at nearly the same positions. The overlap in which wrong answers appear is a median Jaccard of zero-point-seven-one — so about seventy percent shared — and the positions are clustered roughly twice as tightly as a shuffled baseline. The model isn't flailing. It explores the same wrong turns, in roughly the same order, every time. That's structured search. That's real reasoning.
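Show-notes sketch: the Jaccard score mentioned here is the size of the intersection of two sets divided by the size of their union. A small illustration with made-up mid-guess sets follows; the convention that two empty sets score 1.0 is an assumption for the example.

```python
from itertools import combinations
from statistics import median
from typing import Iterable, List, Set


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Intersection over union of two sets of mid-guesses."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two traces with no mid-guesses count as identical
    return len(a & b) / len(a | b)


def median_pairwise_jaccard(traces: List[Set[str]]) -> float:
    """Median Jaccard over all pairs of independently sampled traces."""
    return median(jaccard(a, b) for a, b in combinations(traces, 2))


# Made-up wrong intermediate answers from three samples of one problem.
samples = [{"12", "18", "24"}, {"12", "18", "30"}, {"12", "18", "24"}]
overlap = median_pairwise_jaccard(samples)  # pairs score 0.5, 1.0, 0.5
```

Because the denominator is the union, a Jaccard of 0.71 is a somewhat stricter notion than "seventy percent of each trace's guesses are shared."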

11:54Juniper: Which costs real tokens, by the way. Traces that go through mid-guesses are more than twice as long — median around seven hundred tokens versus three hundred for the ones that just commit straight away. So deliberation is expensive, and it's only buying you something before the boundary.

12:12Finn: Okay. So we've got real reasoning before, inert reasoning after, and a sharp line between them. Now here's the result that I think people will actually repeat to each other. They went and counted the hedging words.

12:25Juniper: The "wait," the "but," the "let's check."

12:28Finn: All of it. They counted how often sentences start with those deliberation markers, before the boundary versus after. And the rates are essentially identical on both sides. For Qwen, roughly twelve percent of sentences start with "but" — at about the same rate before and after the point where "but" can no longer change anything.

12:49Juniper: So the model says "wait, let me reconsider" just as often after the moment when reconsidering is causally impossible.

12:57Finn: That's the smoking gun. The surface language of deliberation is completely decoupled from whether any deliberation is happening. The vocabulary of doubt keeps flowing long after the doubt is settled. If you were a human reading that transcript, you'd swear the model was still torn.
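Show-notes sketch: the counting behind this comparison is simple enough to show. The marker list below is taken from the words quoted in the conversation and is not the paper's full list, and the sentences are invented for the example.

```python
import re
from typing import List, Tuple

# Markers quoted in the episode; the paper's actual list may differ.
HEDGE_PATTERN = re.compile(r"(wait|but|let's check)\b", re.IGNORECASE)


def hedge_rate(sentences: List[str]) -> float:
    """Fraction of sentences that start with a deliberation marker."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if HEDGE_PATTERN.match(s.strip()))
    return hits / len(sentences)


def hedge_rates_by_side(sentences: List[str], boundary: int) -> Tuple[float, float]:
    """Return (rate before the commitment boundary, rate after it)."""
    return hedge_rate(sentences[:boundary]), hedge_rate(sentences[boundary:])


trace = [
    "Try x = 3.", "But that fails.", "Then x = 4.", "So the answer is 4.",
    "Wait, let me double-check.", "Plugging in gives 4.",
    "Butter is irrelevant here.", "Done.",
]
pre_rate, post_rate = hedge_rates_by_side(trace, boundary=4)
```

The word-boundary check keeps a sentence like "Butter is irrelevant here." from being counted as a "but" sentence. In this toy trace both sides come out at one sentence in four.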

13:15Juniper: And that's exactly the word the authors reach for, and it's worth thirty seconds because they chose it on purpose. Epiphenomenal. It's a loan-word from philosophy of mind. An epiphenomenon is something that's genuinely produced by a process but has no causal power back over it.

13:33Finn: The classic image being the steam whistle on a locomotive.

13:37Juniper: The engine makes the whistle blow. The whistle moves nothing. The post-commitment reasoning is the whistle — loud, elaborate, real, and powerless. Produced by the model's process, but exerting no force on the answer.

13:51Finn: Although — I'd push the metaphor one notch, because the breakdown is the interesting bit. The whistle at least tells the truth. It signals "the engine's running," which is real. The model's post-commitment hedging signals "I'm still deciding" — which is false. The train already left the station and the whistle's blowing "I haven't decided where to go."

14:15Juniper: So wait — are we saying the model is lying to us? Like it's decided, and it's faking the deliberation to look thorough?

14:23Finn: No — and this is the most important place not to over-anthropomorphize. There's no deception in any intentional sense. The model generates hedging text because that's what reasoning traces look like in its training data. It learned that "thinking out loud" includes a lot of "wait, let me check," so it produces that texture whether or not any checking is happening. It's not performing for us. It's just doing what the pattern says reasoning sounds like. And honestly that's the safety-relevant version of the finding. If it were deliberate deception, that's a different problem. This is subtler — the deliberation language is just decorative, baked in, and it'll fool a human monitor without anyone intending to fool anyone.

15:12Juniper: Right — a monitor reading the text would see a model apparently still working the problem, hedging, double-checking, when internally the thing's been decided for paragraphs. So if you wanted to catch a model committing to something dangerous, the words won't reliably tell you when it committed. Which sets up the last act, and it's the constructive one. Because if the words don't tell you, maybe the internal state does.

15:40Finn: This is your thread — the probe.

15:42Juniper: So computing that commitment boundary by truncation is great for science and useless in deployment. It takes up to n-plus-one forward passes — you basically run the model again for every sentence. Nobody's doing that live. So they ask: is the commitment a stable internal thing you could just read off in a single pass? A probe, if you haven't met the term, is a tiny classifier — sometimes barely more than a linear readout — that takes the big model's hidden activations as input and predicts some property. The point is diagnostic. If a simple probe can read property X off the internal state, then X is actually represented in there. It's a stethoscope, not a surgeon. It listens, it doesn't change anything.

16:29Finn: And what's it listening for here?

16:31Juniper: Which of three stages the model is in at each step — no guess yet, holding a guess that's different from the final answer, or committed to the final answer. They train a small attention-based probe that reads a sliding window of recent hidden states. And it's causal in the time-series sense — it only looks backward, at what's already been generated — which is exactly what lets you run it live, as a tripwire.

16:58Finn: Here's where I'd want to know if it actually generalizes, because a probe that only works on the data it was trained on is just memorizing surface quirks.

17:08Juniper: That's the load-bearing test, and it's the part that genuinely impressed me. They train the probe only on one math dataset — MATH-500. Then they point it at completely different things. AIME, which is harder competition math. ZebraLogic, which is logic puzzles. GPQA, graduate-level science multiple choice. Different domains entirely. And the detection rates are often above ninety percent — though it's uneven, dipping to the high sixties and seventies out of distribution, and on one Gemma configuration the in-distribution detection is only sixty-two percent. So the probe found something general — but hold onto that unevenness, because there's a caveat on the firing side too.

17:54Finn: So the body language reads the same across very different conversations — mostly. Which is the evidence that it's a real mechanism, not a quirk.

18:04Juniper: And then they wire it in as an early-exit trigger. When the probe says "final guess" for a few consecutive steps, you stop thinking and answer. The meter's been running while the cab sits at your destination — this is noticing you've arrived and getting out.
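Show-notes sketch: the trigger described here amounts to "exit once the probe has said final guess for several steps in a row." The stage label strings and the `patience` parameter are hypothetical names for this example, and the probe itself is not modeled; the function only consumes its per-step predictions.

```python
from typing import Iterable, Optional

# Hypothetical label names for the three stages the probe predicts.
NO_GUESS, MID_GUESS, FINAL_GUESS = "no_guess", "mid_guess", "final_guess"


def early_exit_step(stage_predictions: Iterable[str], patience: int = 3) -> Optional[int]:
    """Return the first step at which FINAL_GUESS has been predicted
    `patience` times in a row, or None if that never happens."""
    streak = 0
    for step, stage in enumerate(stage_predictions):
        streak = streak + 1 if stage == FINAL_GUESS else 0
        if streak >= patience:
            return step
    return None


predictions = [
    NO_GUESS, MID_GUESS, FINAL_GUESS, MID_GUESS,
    FINAL_GUESS, FINAL_GUESS, FINAL_GUESS, FINAL_GUESS,
]
exit_at = early_exit_step(predictions, patience=3)  # exits at step 6
```

Because the loop only looks at predictions already made, it can run alongside generation. A larger `patience` trades token savings for fewer premature exits, which is one plausible reading of the conservative versus aggressive settings discussed next.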

18:21Finn: Give me the numbers, because "early exit" papers live and die on the accuracy you give up.

18:27Juniper: Take GPT-oss on MATH-500 as the worked example. At a conservative setting, the early exit recovers ninety-eight percent of the full accuracy — seventy-eight percent versus eighty — while cutting twenty-six percent of the tokens. Push it more aggressively and you get eighty-nine percent of the accuracy — seventy-one versus eighty — saving thirty-nine percent of tokens. And across some model-dataset combinations the average savings hit up to fifty-five percent.

18:58Finn: And the comparison that makes it a real result?

19:02Juniper: The dumb baseline. The obvious thing you'd do instead is just chop every chain of thought at a fixed percentage — always cut at, say, the halfway mark. At a matched truncation level, that fixed cutoff drops accuracy by twenty-three points — from eighty percent down to fifty-seven.
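Show-notes sketch: the arithmetic behind the quoted figures, using only the GPT-oss on MATH-500 numbers stated in the conversation. The function name is just a label for this example.

```python
def accuracy_recovered(early_exit_acc: float, full_acc: float) -> float:
    """Share of full-length accuracy kept by an early-exit setting."""
    return early_exit_acc / full_acc


FULL_ACC = 80            # full chain of thought, percent
conservative = accuracy_recovered(78, FULL_ACC)  # 0.975, quoted as "ninety-eight percent"
aggressive = accuracy_recovered(71, FULL_ACC)    # 0.8875, quoted as "eighty-nine percent"
fixed_cutoff_drop = FULL_ACC - 57                # 23 points lost by the fixed cutoff
```

The recovery figures are ratios, while the fixed-cutoff figure is a difference in percentage points, so the two are not directly interchangeable.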

19:21Finn: Because a fixed cutoff exploits the average and fails on the specific trip. Some problems commit at fifteen percent, some at sixty. The smart exit adapts per trace. The dumb one gets out of the cab after four miles regardless of where you were going.

19:36Juniper: That's the taxi exactly. The probe beats the fixed cutoff at every operating point, which is the proof it's reading real per-trace structure and not just average statistics.

19:47Finn: Okay. So that's the paper at its most flattering. I want to spend a minute being the skeptic, because I think there's one crack here that genuinely matters, and the authors — to their real credit — flag it themselves.

20:00Juniper: Go for it, Finn. This is the one I want to hear you press on.

20:04Finn: It's the answer-forcing suffix. The entire measurement is: what would the model say if I forced it to answer right now, with greedy decoding. But that forcing is itself an intervention. The suffix — "therefore the answer is" — pushes the model toward committing. So "commitment" in this paper really means "the answer the model blurts out when interrupted." And that's a meaningfully weaker claim than "the model has made up its mind." Here's the gap: a step could be doing real work redistributing probability among candidate answers — genuinely shifting the model's internal odds — without ever flipping which answer is on top. Greedy decoding only sees the top answer. So all that under-the-hood reweighting would be completely invisible to this method, and it'd get labeled inert.

20:52Juniper: And yet the abstract uses language like the model has "already internally fixed" its answer.

20:58Finn: That's stronger than forced-greedy-decode strictly licenses. Now — I'll grant the perturbation experiment does a lot to rescue it. When you scramble the post-boundary numbers and ninety-five percent of answers survive, that's not just "the top answer didn't flip" — that's "you can physically damage this text and nothing happens." That upgrades it from "stable blurt" toward "genuine commitment." I take that point.

21:23Juniper: But you're not all the way there. And before you go on — there's that other crack on the probe, the one I flagged earlier. Because the early-exit story leans on the probe firing at the right moment.

21:36Finn: Right, the premature-firing problem. The whole value of an early exit is that it pulls the cord after the model's committed, not before. But out of distribution, the probe fires early in some settings as much as twenty or twenty-one percent of the time — that's the cab pulling over a few blocks short. And when it fires early, you're cutting off reasoning that was still doing real work, which is exactly the accuracy you can't afford to lose. So the headline "above ninety percent detection" is the best case, not the typical case.

22:09Juniper: So it's not a clean tripwire everywhere. The deployment story is real, but the error mode is the costly direction.

22:17Finn: Exactly. And that connects to the deeper point — "no measurable effect on the elicited final answer in a single pass" is not the same as "useless." That's the word "epiphenomenal" doing more work than the evidence carries. Those post-boundary tokens could be calibrating the model's internal confidence. They could matter for a follow-up turn in a conversation. And here's the one that really nags at me — the model generates that tail consistently, reliably, in volume. Something that gets produced that reliably is probably doing something during training, even if it's inert at inference. Evolution doesn't usually keep an organ that does nothing.

22:57Juniper: That's fair, and the authors say as much — they explicitly caution against reading "causally inert at inference" as "functionally useless." Though I'd add two more to your pile. One is the sample selection. They throw out problems the model gets right with no chain of thought at all, and they discard traces with first-token collisions — ambiguous cases — which is around nine percent on average but, weirdly, fifty-seven percent for Gemma on AIME.

23:25Finn: So the clean picture of epiphenomenal reasoning describes a filtered population. Problems where thinking helps and answers are cleanly distinguishable. The prevalence in the wild could look different.

23:37Juniper: And the other is that the probe is trained on labels the truncation method itself produced. So it inherits whatever bias is in that method. The out-of-distribution transfer is reassuring where it holds — but all four benchmarks are short-answer math, logic, and multiple choice. That's a narrow slice of what we'd call reasoning. No code, no open-ended writing, no agentic tool-use. And in those settings, "a single final answer" might not even be a meaningful thing to commit to.

24:07Finn: Which connects to a near-simultaneous paper they position against — Boppana and colleagues, the "reasoning theater" work. That group showed the final answer is decodable from the model's internals before it's ever written down. This paper's claim is strictly broader — not just that the answer's encoded early, but that the later text, including the apparent self-correction, is causally inert.

24:31Juniper: And they disagree on what drives the timing, which I find telling. Boppana's group found it tracks task difficulty. This paper finds model family matters more. Two teams looking at the same phenomenon from slightly different angles, landing on different stories about the cause. That's not a rivalry — that's a field figuring out something genuinely new in real time, and not converged yet.

24:56Finn: So where does that leave the safety hope? Because that's the part I keep circling back to.

25:01Juniper: I think it leaves it in a specific, sober place. The dream was that you could monitor a model by reading its chain of thought — watch the words for trouble. This says the words will show you a model that looks like it's still deliberating well after the decision is causally locked. So text-level monitoring isn't enough on its own. You'd want representation-level signals next to it — and the paper's own probes are a small proof that that's at least feasible, even if they're not yet reliable enough to trust blind.

25:33Finn: And the constructive direction they gesture at — and I think this is the right one — is training objectives that actually align the verbalized trace with the real answer-formation process. Make the words track the computation. Right now nothing forces them to.

25:50Juniper: That's the thing I'll carry out of this paper. We built these systems to think out loud partly so we could watch them think. And it turns out a lot of the thinking-out-loud is happening after the thinking is done.

26:03Finn: With the honest asterisk that "done" here means "done under forced answering." I still don't think the perturbation result fully closes that gap to "the model has made up its mind" — it gets close, closer than I expected going in, but the word "epiphenomenal" is carrying a connotation the single-pass evidence doesn't quite earn. That's the open question I'd want the follow-up work to settle.

26:27Juniper: And that's a good place to leave it — genuinely open. The phenomenon is robust, the causal evidence is strong, the efficiency payoff is real and deployable today, with the caveat that the probe still fires early often enough that you'd want to tune it carefully. What it all means about whether the model has "decided" — that's still being argued, including by us.

26:50Finn: If you want to dig in yourself, the paper and a few related reads are in the show notes — including the concurrent work we mentioned.

26:58Juniper: And if you want the full transcript with the jargon defined inline, plus the concept pages that link this episode to the other interpretability work we've covered, that's all on paperdive.ai.

27:10Finn: This has been AI Papers: A Deep Dive. Thanks for listening — and the next time a model tells you "wait, let me just double-check that," you might wonder whether it already knows.
