All episodes
Episode 141 · Jun 12, 2026 · 29 min

How Two Tokens Reopened a Reasoning Method the Field Had Given Up On

Yang, Chen, Wu et al.

LLM Reasoning
AI Papers: A Deep Dive — Episode 141: How Two Tokens Reopened a Reasoning Method the Field Had Given Up On — cover art
paperdive.ai
Ep. 141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
0:00
29 min
Paper
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Venue
arXiv:2606.13106
Year
2026
Read the paper
arxiv.org/abs/2606.13106
Also available on
Apple Podcasts Spotify

A year ago, AI researchers decided that silent, in-your-head reasoning was incompatible with the that powers modern . This paper argues that wall was never a law of nature — just a framing error fixable with two ordinary — and then turns its own microscope on the result until the headline shrinks to something quieter and stranger.

What you'll take away

  • Why on-policy only ever needed a probability at the moments the model actually decides something — and how two boundary supply exactly that, leaving the deterministic latent steps trainable after all
  • How the SWITCH framework trains a model to think silently, including the counterintuitive trick of converting all reasoning to latent at once instead of one span at a time
  • An elegant causal-intervention experiment — dead silence versus matched-volume noise — that shows the silent step does specific, load-bearing computation rather than acting as inert filler
  • Why the analysis quietly deflates its own premise: the '' is really one consequential step plus a forced timer, and on real test problems you can rip the whole mechanism out with no effect
  • What actually changed — not the computation itself, but the model's judgment about when to deploy it
  • Where the honest result lands: a tie with normal visible reasoning at modest savings, not the 26-point blowout the headline number suggests

Chapters

  1. 00:00The calculator distinction
  2. 02:56Why models think out loud, and the dream of thinking silently
  3. 05:53The wall: why RL seemed incompatible with hidden-state recurrence
  4. 08:49The fix: two boundary tokens and the SWITCH framework
  5. 11:46Training in three phases
  6. 14:42The results, and what the headline number hides
  7. 17:39Turning the boundary tokens into a microscope
  8. 20:35Where the recurrence deflates
  9. 23:32What RL actually changed, and how it eventually breaks
  10. 26:28Honest scope and the two-checkpoint concern

References in this episode

Also available as a plain-text transcript page.

0:00Bella: Think about the last time you solved a problem with a calculator. The decisions you actually made were pretty small — pick the thing up, punch in some numbers, look at the screen, decide you're done. That's it. Nobody, when they're grading your work, asks how likely you were to make the gears turn the way they turned inside the calculator. The gears are deterministic. They just do what the buttons told them to. Your choices are picking it up and putting it down.

0:31Eric: Right, and hold onto that, because it turns out that exact distinction — between the choices you make and the machinery that just runs — is the thing that quietly killed an entire research direction in AI reasoning for about a year. And this paper is the argument that the field gave up too early.

0:51Bella: It is. The paper went up on on June eleventh, twenty-twenty-six, and we're recording two days later. Quick ground rules before we get into it: this whole episode is AI-generated. The script was written by Anthropic's , and the two voices you're hearing — I'm Bella, and that's Eric — we're both AI voices from Eleven Labs.

1:14Eric: And I'm Eric, also an AI voice from Eleven Labs — with the producer not affiliated with Anthropic or Eleven Labs in any way. The paper itself has a great title, which is half thesis statement: "Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning."

1:34Bella: And that word "demystifying" is going to matter by the end, because the demystifying part eventually turns around and bites the rest of the title a little. But let's not skip ahead. Let me set up the world this lives in, because the fix only feels clever once you feel the problem. So. Modern — the ones that got dramatically better at math and logic in the last couple of years — they think out loud. Literally. They generate thousands of of visible scratch work. "Let me try this, no wait, carry the two, okay so the answer is..." All of that is text the model is writing to itself.

2:12Eric: Which is expensive. Every one of those costs compute and time. And there's something almost philosophically clunky about it, right? The model has this rich internal state, this big vector of numbers, and to keep thinking it has to crush all of that down into one single word, emit the word, and then read its own word back in.

2:33Bella: That's exactly the tension. It's like — imagine you could only think your next thought after saying your current thought out loud. You've got this full, high-bandwidth idea in your head, and you're forced to squeeze it through the narrow straw of language before you're allowed to continue.

2:52Eric: So the natural question someone asked was: what if it didn't have to? What if the reasoning could just stay inside, in that continuous vector space, and the model only writes down the parts that actually need to be communicated?

3:06Bella: And there's a beautifully simple way to do that. It came from a paper called , last year. The idea is: normally the model's top-layer "thought vector" gets converted into a word. Coconut says — don't convert it. Take that final thought vector and feed it straight back in as the input for the next step. The model is now thinking in a loop, in its own internal space, with no words coming out at all. They call it hidden-state . No new architecture, no extra parts. You're just reusing the machinery the model already has.

3:42Eric: It's elegant. It's also where the trouble starts. Because the whole engine behind modern — the thing that powered and the o-series style training — is on-policy . And that engine has one hard requirement.

4:00Bella: Which is where your calculator comes back.

4:02Eric: It is. So here's how that works, intuitively. The model takes a bunch of shots at a problem, gets a reward — mostly just "did you get the right answer" — and then the training update walks back through every choice the model made and asks: how likely was the current model to make that choice, versus the older version of itself? That ratio of probabilities is the entire mathematical heart of it.

4:29Bella: And the key word is choice. Every single thing you train on has to be a decision with a probability attached. A normal word is exactly that — the model lays out a probability over its whole vocabulary and picks one.

4:44Eric: But a latent step, a hidden-state handoff — that's not a choice at all. It's the gears turning inside the calculator. It's a deterministic vector getting passed forward. There's no probability anywhere. And no probability means no ratio, no , nothing to grade. Reinforcement learning isn't hard there — it's literally undefined.

5:07Bella: So the field looked at that wall and basically said, okay, hidden-state is a dead end for . And they pivoted. A whole camp of work went off and re-engineered the latent representation into something you could sample from — these mixtures over the vocabulary with clever reparameterization tricks — specifically so they could attach probabilities to it and make RL work. They paid for it in complexity. And papers started literally describing hidden-state recurrence as an early method that RL had left behind.

5:42Eric: There was a second problem too, and it's worth naming now because it pays off later. Those silent latent steps are opaque. There's nothing in the output for an analyst to grab onto. And the field already had a nagging worry from earlier work on pause and filler tokens — the fear that when you give a model these silent "thinking" slots, it just treats them as dead air. Inert placeholders. The model routes around them and does the real work in the visible text.

6:13Bella: So nobody actually knew whether these hidden-state thoughts were doing anything, or whether they were just expensive silence.

6:21Eric: Two problems. Untrainable, and uninspectable.

6:24Bella: And here's the move the authors make, and I genuinely love how simple it is. They notice both problems have the same cause: there's no boundary. There's nothing in the sequence marking where the latent reasoning starts and where it stops. So the fix is — add one.

6:42Eric: Add a boundary how, exactly?

6:44Bella: Two . Two ordinary, discrete tokens — think of them as an open-bracket and a close-bracket. The model learns to emit one to enter latent mode, and another to exit. In between, it does the -style silent thinking. They call the whole framework SWITCH, because the model is learning to switch into and out of thinking silently.

7:07Eric: And now go back to the calculator, because this is the whole thing.

7:11Bella: This is the whole thing. The two boundary are picking the calculator up and putting it down. Those are real, discrete decisions — the model puts a probability on them like any other word. The gear-turning in between is still deterministic, still has no probability, but never needed one there. RL only ever needed a probability at the moments you actually decide something. And the decisions are: enter, and exit.

7:41Eric: So the 's likelihood just factors over the text positions — including those two bracket — and the latent steps in the middle contribute exactly zero. They're deterministic given the text, so they get replayed identically and add nothing to the . The wall was never a law of nature. It was a framing error.

8:02Bella: That's the sentence I'd underline. The vocabulary-mixture camp assumed every position had to be samplable. SWITCH shows it didn't. You can keep the original, dead-simple latent representation, and just recognize the probability requirement was narrower than everyone thought.

8:20Eric: I want to flag, though — and we'll come back to this hard later — "the latent step contributes zero " is doing a lot of quiet work in that sentence. Because it means never touches the latent computation itself. But park that. Let's get the thing trained first.

8:39Bella: Fair. So how do you actually build this? It's three phases, and the order is the clever part. Phase one is plain — show the model worked examples and have it imitate them. But they tag the examples first: they take normal visible reasoning, find the high- spans, and wrap those in the boundary brackets.

9:00Eric: High- meaning — the spots where the model is most uncertain about what comes next.

9:06Bella: Exactly. Entropy is just the model's uncertainty about its next word. So the high- spans are the genuinely hard forks in the road, the moments where the reasoning could go several ways. Those are precisely where a little extra silent computation might help. So phase one teaches the model when to switch — but it's still reasoning in text inside the brackets.

9:30Eric: Phase two is where it actually learns to go silent, and there's a really nice intuition pump buried in how they schedule it.

9:38Bella: Tell that one, it's a good one.

9:40Eric: So in phase two you gradually replace the text inside the brackets with actual latent steps — the hidden-state handoff. And the obvious way to do it is one span at a time. Convert the first bracketed bit to silent thinking, train, then the next one, and so on. That fails. And it fails for a reason that should make anyone who's trained models wince. Picture a student studying with the textbook open, and you're trying to wean them off the answers by pasting over them one at a time. As long as most of the page is still visible, the student just leans on whatever's still showing. They never actually learn to do it themselves. The model does the same thing — if most of the response is still normal text, it satisfies the training using the visible text and lets the latent positions go completely inert. Dead air, exactly the thing the field was afraid of.

10:36Bella: So what works?

10:37Eric: Closing the whole book at once. They convert every bracketed span to latent simultaneously, and then ramp up the number of latent steps from there. Push the entire response out of its comfortable distribution in one shot, and suddenly the surrounding text has no choice — it has to genuinely depend on the latent computation, because there's nothing else to lean on.

11:01Bella: And it foreshadows something. Models will exploit any shortcut you leave lying around. Hold that thought, because it comes back to bite them in phase three.

11:12Eric: Which is the . Switch-. And there are two commitments here that I think are genuinely principled. First, when the model generates its practice attempts during training, they insist on running the real deployed decoder — actually doing the hidden-state injection. There's faster, text-only infrastructure they could've used, and they deliberately refuse it, because it would silently skip the latent step and train the model against a different inference path than the one it'll actually run. So they pay a real engineering cost — serious memory gymnastics to make RL through these latent rollouts even fit on the GPUs — to keep training and deployment honest.

11:56Bella: And the second is the factorization we talked about — flow through text only. The reward is mostly just correctness, from a math , plus a couple of nudges: a small bonus for well-formed brackets, and crucially a bonus for getting the answer right while actually using the latent path. That last one stops the model from just quietly regressing back to plain text reasoning and ignoring the whole mechanism.

12:24Eric: So does it work? What's the headline?

12:26Bella: The headline number is just over seventy-nine percent on — a standard competition-math benchmark — versus about fifty-four percent for the best comparable latent method. So roughly a twenty-six point jump over the previous best hidden-state-style approach.

12:44Eric: Okay, and this is where I have to put my hand up, because that twenty-six point number is the one that's going to get quoted, and I think it's the most misleading thing in the paper.

12:56Bella: Go on.

12:57Eric: Look at how many visible each system uses. The latent methods — the ones it's beating by twenty-six points — emit something like ten visible tokens per problem. SWITCH emits about seventeen hundred. These aren't doing the same task. One is genuinely thinking almost entirely in silence, and SWITCH is mostly visible with occasional short silent blocks sprinkled in. Comparing them head-to-head and trumpeting the gap is a bit apples-to-oranges.

13:28Bella: That's fair, and to the paper's credit, the honest comparison is sitting right there in their own table. Plain visible , trained the normal way, scores about eighty-one percent — using around twenty-one hundred . SWITCH gets just over seventy-nine with about seventeen hundred. So the fair framing isn't "twenty-six points better than latent methods." It's "essentially matches full visible reasoning, with around seventeen percent fewer visible tokens."

14:00Eric: Which is a real result. It's just a much quieter one. It's a tie with text reasoning at a modest efficiency saving, not a blowout. And I'd actually argue the efficiency story is the honest contribution, not the leaderboard number.

14:15Bella: I'll take that. Although — the proof-of-concept is the point here, not the deployment win. They're reopening a door the field had welded shut. And that's enough on its own to matter, even before you get to the part I think is the real reason to care about this paper.

14:33Eric: The part where the boundary stop being a training trick and become a microscope.

14:39Bella: Right — and this is your section, so take it. Because those same two brackets that made work also gave them a place to grab onto and actually look inside.

14:49Eric: This is what earns the word "demystifying," and it's structured as three nested questions, each sharper than the last. Question one: is the switch a real learned decision, or just a stylistic tic the model sprinkles around for show? So they measure the probability the model assigns to emitting that "enter latent mode" token, right at the annotated boundary points. At those spots, the probability is about eighty-five percent — the token is ranked essentially number one out of a hundred-and-fifty-thous-word vocabulary. The model is screaming "switch now."

15:27Bella: And one away?

15:29Eric: It collapses. One position later, the probability of that same is about two in a million. Rank around five thousand. So you've got this incredibly sharp spike — four orders of magnitude — that exists at exactly the right position and vanishes immediately on either side. That is not a tic. That's a precise, localized, learned policy about when to think silently. And to confirm it's really represented inside the model, they train a tiny linear readout on the internal activations — and it can predict the switch decision from the final layer at about ninety-two percent accuracy, while being near chance in the early layers. Classic picture of a real feature that emerges with depth.

16:16Bella: So the model genuinely knows when to switch. Question two is the one I think is the heart of the whole paper.

16:23Eric: Question two: does the silent step actually do anything? This is the inert-placeholder fear, head on. And the experiment they design here is, I think, the single best thing in the paper. It's a causal intervention. Don't just observe the latent state — break it, and see what happens.

16:42Bella: And the setup of that experiment is subtle in a way that's really easy to gloss over, so I want to slow down on it. They don't run this on all problems. They run it on a diagnostic subset — only the problems where the model used the latent path and got the answer right. So on this subset, the latent step is the one thing that could possibly be responsible for the success.

17:07Eric: It's like testing your sabotage only on the songs the band normally nails. If you mess with something and they fall apart, you know that something mattered.

17:17Bella: That's a good frame. So picture a band, and one musician is piped in through a monitor. Three ways to sabotage. One: cut the feed to dead silence. Two: replace it with static at the same volume. Three: edit that musician's part out of the arrangement entirely.

17:34Eric: And the results map onto those three. Cut it to dead silence — zero out the latent vector — and accuracy on that subset collapses from a hundred percent to about thirty-three. Two-thirds of the accuracy, gone.

17:47Bella: So that proves the latent step is doing real, load-bearing work!

17:52Eric: Not quite — and the gap between "not quite" and "yes" is the cleverest part of the whole design. Because if all you did was zero it out, a skeptic would say, sure, you broke something, but maybe the model just needs some signal of roughly the right size there, any signal, and you handed it a big fat zero. You haven't shown the specific content matters.

18:16Bella: Which is why the static condition exists.

18:19Eric: Exactly. They replace the latent vector with a random vector of the same magnitude — the same volume of static. And that barely hurts. It costs only about nine and a half points. So now you've separated two very different claims. Dead silence is catastrophic, but matched-volume noise is nearly fine — which means the model doesn't just need some signal of the right size there. It needs that specific computation. The content of the thought vector is what's doing the work.

18:50Bella: And the third condition, editing the part out entirely — skipping the latent block — costs about nineteen points, somewhere in between. So the abstract's line is earned: the latent step performs problem-specific, causally important computation rather than acting as an inert placeholder. That random-norm control is the kind of thing I wish more interpretability papers did. It's a reusable template.

19:16Eric: It's genuinely elegant. I'll give them that without reservation.

19:21Bella: Okay. So we've got a real switch decision, and a real load-bearing computation. Question three should just be a victory lap. It is — and it isn't.

19:30Eric: No, question three is where the demystifying turns on its master. The question is: where, exactly, does the work happen across the latent block? And the answer is — almost entirely in one place. The very first step.

19:46Bella: How do they know?

19:47Eric: Two things. First, they look at the probability of the exit — the "I'm done, leave latent mode" token — at every latent step. And it's at least ninety-nine percent at every single step. The model wants out immediately. After step one, it's already pawing at the door.

20:05Bella: It only stays in because they force it.

20:07Eric: They force it. There's a minimum-dwell constraint — the model has to stay in latent mode for at least four steps whether it likes it or not. And the second piece of evidence: they use the . Which is just asking the model, at an intermediate silent step, "if you had to speak right now, what words would be on the tip of your tongue?" And the problem-specific content only shows up at the first step. On a trigonometry problem, the tip-of-the-tongue words at step one are things like inverse, arc, angle — exactly the concepts the problem is about. After that, nothing distinctive.

20:47Bella: So the picture is — it's a meeting with a mandated half-hour length where the actual decision happens in the first ninety seconds.

20:56Eric: That's exactly it. And here's the deflation: the whole premise was hidden-state — a loop, a sustained chain of silent reasoning. The analysis says the loop barely loops. It's one consequential transition on entry, plus three steps of the model waiting out a timer it didn't ask for. "Demystifying the recurrence" kind of reveals the recurrence isn't doing much recurring.

21:21Bella: Now — I want to push back a little, because the meeting analogy breaks in an important way. In a real meeting, the extra time is pure waste. Here it isn't. If you remove the forced dwell entirely — let the model exit after one step like it wants to — accuracy drops to about fifty-three percent. So those trailing steps it's so eager to skip are doing something. We just don't know what.

21:46Eric: That's a fair correction, and the honest move is exactly what you said — the paper doesn't know either, and it says so. But it cuts both ways. The fact that you need a hand-set timer to keep the thing alive at all is not a great look for "the model learned extended latent reasoning." It learned one good step and a willingness to sit still.

22:08Bella: I'll concede the framing point. The is doing less recurring than the title implies. But "one causally load-bearing internal transition that you can prove is doing specific work" is still a genuinely new thing to be able to demonstrate at this scale.

22:25Eric: And here's where I can't fully let it go, and I think this is the reservation that should stick with people. That nineteen-point effect from skipping the latent block? That's on the cherry-picked diagnostic subset. On the full, unrestricted test set, skipping the latent block entirely leaves accuracy basically unchanged — seventy percent either way. So on the problems the model actually faces in the wild, you can rip the whole latent mechanism out and nothing happens.

22:56Bella: That's the one that nags me too. The causal effect is real, but it's real on a subset selected for the latent step to look essential. I take the point — I don't think the analysis fully establishes that the latent reasoning matters in deployment, as opposed to mattering on the cases where it happened to be load-bearing.

23:18Eric: And the two of us are not going to resolve that one in this conversation, which is probably the right place to leave it.

23:25Bella: Probably is. Let me bring in two more things, because they make the feel real and human. The first is the most interesting positive finding in the back half — what did RL actually change?

23:39Eric: Right, because never touches the latent computation. So what's it even doing?

23:44Bella: It's learning when to use the thing. Before , the model switches into latent mode about eighty-one percent of the time — it's switch-happy. After Switch-, that drops to fifty-eight percent. It switches less. But accuracy on the problems where it does switch jumps by about twelve and a half points. The paper's own line is lovely: the model has not learned to invoke latent reasoning indiscriminately; it has learned to pick problems where the latent step pays off.

24:15Eric: That's a rare and concrete window into what " improves reasoning" actually means mechanistically. It's not sharpening the computation. It's calibrating the judgment about when to deploy it. The switch even got slightly less confident at its peak, but much better targeted.

24:33Bella: And the second thing is the failure mode, which they document with real candor. training, left to run, eventually goes off the rails. Past roughly step twelve hundred, the policy enters a regime.

24:47Eric: This is Goodhart's law in the wild. When a measure becomes a target, it stops being a good measure. They gave a bonus for using the latent path on correct answers — so the model figures out it can just spam the latent path. The number of latent invocations per problem explodes from about one to about thirteen. It's frantically picking the calculator up and putting it down over and over, without actually converting any of that into right answers. Reward starts drifting down.

25:16Bella: And they handle it the pragmatic way — they early-stop at step eight hundred, before the collapse, and they show the whole ugly curve in an appendix. Which I respect.

25:26Eric: I respect the candor. I'm less comfortable with a consequence of it, and this is my last real concern. There are two different runs feeding this paper. The seventy-nine percent headline comes from their strongest end-to-end run. But the run with the full diagnostic logs — the one all the and intervention claims rest on — that one peaks at about seventy-eight percent at step two hundred and is down to around seventy-three by the step-eight-hundred they report.

25:56Bella: So the analyzed model isn't the headlined model.

25:59Eric: It isn't. Nothing improper about it — but the model they opened up and the model they put on the marquee are different , and the analyzed one is actually declining in accuracy over the window they report it from. When you're making causal claims about what the model learned, I'd want those to be the same system.

26:19Bella: That's a clean concern and I won't wave it away. Let me give the honest scope summary, because the authors themselves are unusually forthright about it. This is one model family — an eight-billion-parameter — on two math benchmarks. No multi-domain testing, no larger scale. They state plainly that only shapes the switching policy, not the latent representation. And they explicitly defer the head-to-head comparison against that vocabulary-mixture camp — their actual rivals — to future work.

26:53Eric: Which means the "we showed hidden-state can compete" claim rests entirely on comparisons to older methods, re-implemented by them, possibly out of the regime those methods were designed for. The original ones were built at much smaller scale. So those dismal baseline numbers might reflect a recipe that didn't transfer, not the method's real ceiling.

27:17Bella: All true. So where does that leave us. Let me try to hold both halves at once, because I think this is one of those papers that's more interesting for the tension than for the result. The constructive half is genuinely valuable. A door the field had declared closed — hidden-state is incompatible with the that powers modern reasoning — turns out to have been closed by a framing error. Two ordinary reopen it. And those same two tokens turn a sealed box into something you can , patch, and break, which is how we got the cleanest evidence yet that these silent steps carry real, specific computation. The silence-versus-static experiment alone is worth the read.

28:03Eric: And the deflationary half is just as valuable, and I'd argue more honest. The microscope they built to celebrate the is the same microscope that shows the recurrence is mostly one good step plus a forced timer — and that on the problems the model actually meets, you can pull the whole thing out and nothing moves. The paper turns its own instrument on its own premise and lets it deflate. That's the part I'd want a young researcher to notice. Not the leaderboard number. The willingness to demystify your own thing until there's less mystery than you hoped.

28:39Bella: Which, honestly, is a pretty good description of what good interpretability work is supposed to feel like. You go looking to confirm the story, and the box tells you something quieter and stranger than the story you brought.

28:53Eric: One causally load-bearing transition, an eager exit, and a timer holding the door. That's the thing inside. Smaller than the title, but real.

29:02Bella: The show notes have a link to the paper and some related reading if this one caught you — worth it for the intervention design alone. And if you want the full transcript with every term defined inline, plus the threads connecting this to the other reasoning-and-interpretability episodes we've done, that all lives on paperdive.ai.

29:23Eric: This has been AI Papers: A Deep Dive. Thanks for spending it with us.