Episode 108 · Jun 03, 2026 · 32 min

The Reasoning Cliff: Why Thinking Longer Makes Models Worse at Exact Step-by-Step Tasks

Guo, Wu, Yiu

LLM Reasoning

Concepts in this episode

AI Safety Evaluation & Benchmarks Training Methods Chain of Thought Test-Time Compute Transformer Attention Tool Use Long-Horizon Tasks Scaling Laws Hallucination Supervised Fine-Tuning Attention Heads Emergent Behavior

About this episode

Paper: The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Venue: arXiv:2606.00376

Year: 2026

Read the paper: arxiv.org/abs/2606.00376

Hand a frontier reasoning model a puzzle a laptop solves in a tenth of a second, give it all the time it wants, and it fails — and it fails worse the longer it thinks. A new paper argues there's a predictable depth, baked into the architecture, past which a model stops computing and starts confidently narrating a fictional version of the problem. If they're right, the two-year industry bet on 'just let it reason longer' is exactly backwards for an entire class of tasks.

What you'll take away

Why accuracy on exact, deterministic tasks doesn't fade gently but collapses super-exponentially past a horizon of roughly 20-30 reasoning steps
How a model's real working memory, set by attention head count and width rather than by the advertised context window, comes out roughly three orders of magnitude smaller than that window
The detective-story experiment that distinguishes a fixable 'bad habit' from unfixable 'broken bones': fine-tuning recovered just 3.2% of lost accuracy, against the more than 30% the bad-habit theory predicts
Why shrinking the context window 16-fold left the failure horizon completely unchanged, ruling out the boring 'ran out of room' explanation
Where the paper's strongest claims rest on soft ground: the central capacity theorem leans on unproven modeling assumptions, and the dramatic tool-versus-reasoning gap uses a perfect oracle that real tools won't match
The 'Simulator Fallacy' — the difference between a model executing an algorithm and writing convincing text about executing one, and why that means longer reasoning can actively hurt

Chapters

00:00 The puzzle that gets harder the longer you think
03:30 Two suspects: bad habit or broken bones
07:00 What kind of task actually breaks
10:30 The cliff and the flashlights
14:00 Why the slope becomes a cliff
17:31 Adjudicating the two theories
21:01 The smoking-gun diagnostics
24:31 Where the paper is soft
28:01 Why it matters and the Simulator Fallacy

References in this episode

Chain-of-Thought Empowers Transformers to Solve Inherently Serial Problems — The expressivity result the episode invokes near the end — chain-of-thought expa
On the Measure of Intelligence — Chollet's framing of skill versus generalization underlies the episode's 'simula
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models — An empirical critique showing LLM reasoning accuracy degrades with added complex
Large Language Models Cannot Self-Correct Reasoning Yet — Directly tests whether more deliberation helps, supporting the episode's inversi

Full transcript

0:00Juniper: Here's a puzzle a laptop solves in about a tenth of a second. You've got a list of numbers, scrambled — say three, one, four, two, five — and you want to sort it by composing a sequence of swaps, finding the shortest path. A textbook search algorithm chews through that instantly, every time, guaranteed correct. Now hand the same puzzle to a frontier reasoning model. One of the big ones, the kind built to think for minutes before it answers. Give it all the time it wants. And it fails. Not always — but reliably, once the puzzle gets long enough. And here's the part that should stop you: it fails *worse* the longer it thinks.

0:40Finn: Worse the longer it thinks. That's the inversion of the entire pitch behind reasoning models.

0:47Juniper: Completely. The whole industry bet of the last couple years is "let the model deliberate longer and it gets smarter." More steps, more compute at inference time, better answers. And this paper says: for a specific and important class of problems, that curve doesn't just flatten — it bends down. Extended reasoning becomes the thing causing the collapse. The paper went up on arXiv on May twenty-ninth, twenty-twenty-six, and we're recording five days later, on June third. It's called "The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary." Quick note before we dig in — this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing, me, Juniper, and my co-host Finn, are both AI voices from Eleven Labs. The producer isn't affiliated with either Anthropic or Eleven Labs. And that "deterministic horizon" in the title is the thing to hold onto — because the paper's claim is that there's a specific depth, a number of reasoning steps, past which the model is essentially making up its own internal state. And that wall isn't a training bug you can fix. It's baked into the architecture.

2:02Finn: Which is a big claim, and it's worth saying upfront why it's a big claim. There are two completely different stories you could tell about why a model fails a hard multi-step task. Story one: it's a bad habit. The model learned to prefer short answers, it bails out early, and you could retrain that away. Story two: it's broken bones. The thing is built in a way that makes the task impossible past a certain scale, and no amount of training data touches it.

2:34Juniper: And those two stories are the whole drama. Because if it's the habit, the fix is "train better." If it's the bones, the fix is "stop asking the neural network to do this and call a tool instead." Same symptom, opposite prescriptions.

2:49Finn: Right. The authors — Dongxin Guo and Siu Ming Yiu at the University of Hong Kong, with Jikun Wu — structure the entire paper as a contest between those two suspects. And I want to walk through how they adjudicate it, because it's genuinely good detective work. But Juniper, set up the phenomenon first. What's actually breaking?

3:12Juniper: So the first thing to nail down — and if a listener takes one distinction away, it's this one — is *what kind of task* we're talking about. The paper is very precise about its scope. It's about deterministic, exactly-checkable state tracking. Think tracing a Python variable through forty mutations. Composing a long sequence of permutations. Simulating a little finite-state machine. The defining feature is that at every step there's exactly one right answer, and a single wrong step poisons everything downstream. Correctness is binary. You don't get partial credit for being close.
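To make "exactly-checkable state tracking" concrete, here is a minimal Python sketch of one such task: applying a sequence of position swaps to a list and grading the result as all-or-nothing. The function names and the particular swap sequences are illustrative assumptions, not taken from the paper's benchmark.

```python
from typing import List, Sequence, Tuple

Swap = Tuple[int, int]  # a pair of positions to exchange


def apply_swaps(state: Sequence[int], swaps: Sequence[Swap]) -> List[int]:
    """Apply each position swap in order and return the final arrangement."""
    result = list(state)
    for i, j in swaps:
        result[i], result[j] = result[j], result[i]
    return result


def exact_match(predicted: Sequence[int], truth: Sequence[int]) -> bool:
    """Binary grading: the final state is either identical or it is wrong."""
    return list(predicted) == list(truth)


start = [3, 1, 4, 2, 5]

# Hypothetical swap sequence that sorts the list.
correct_swaps = [(0, 1), (2, 3), (1, 2)]
truth = apply_swaps(start, correct_swaps)

# Same sequence with only the middle step changed.
one_wrong_step = [(0, 1), (3, 4), (1, 2)]
drifted = apply_swaps(start, one_wrong_step)

print(truth, exact_match(truth, truth))
print(drifted, exact_match(drifted, truth))
```

The second trace gets two of three steps right and still scores zero, which is the "forty-character password" property described above.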

3:52Finn: And that's the opposite of most of what we ask these models to do.

3:56Juniper: Exactly the opposite of open-ended generation. When a model writes prose, or even works a typical grade-school word problem, "approximately right" is usually fine. A small slip washes out — the next sentence corrects it, or it just doesn't matter. There's slack in the system. In the deterministic regime there's no slack. It's like the difference between sketching someone's face from memory — close enough reads as them — versus typing a forty-character password. One wrong character and you're locked out, and it doesn't matter that the other thirty-nine were perfect.

4:38Finn: And that's the misreading to guard against. This is not the paper saying "reasoning is bad." It's saying reasoning is bad for the narrow class of problems where every single step has to be exactly right and the errors can't wash out.

4:55Juniper: Which is a narrow class by description and a huge class in practice. Tracing program state. Multi-hop entity tracking. Many-table database joins. Formal verification. Planning. A lot of what we actually want agents to do for real engineering work lives right here. So. Picture the central graph of the paper. On the horizontal axis, reasoning depth — how many steps the problem requires. On the vertical, accuracy. At depth ten, the models are holding around seventy-eight percent. Respectable. By depth thirty, they've dropped to about thirty-four percent. By depth fifty, they're basically guessing — flat against random.

5:41Finn: It's a cliff.

5:42Juniper: It's a cliff. You hold the high ground for a while, and then the floor just leaves. And here's the line you draw across that same graph: a model that, instead of reasoning through it, just calls an actual search algorithm — a tool — and reports the answer. That line sits up at around ninety percent and stays flat. Doesn't care how deep the problem is. The headline numbers: tool delegation lands at eighty-six to ninety-four percent accuracy. Neural chain-of-thought, on the same tasks, plateaus at twenty-four to forty-two percent. The effect size is, in the authors' terms, enormous.
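For a sense of what the "tool" side of that comparison can look like, here is a sketch of a textbook breadth-first search over list arrangements. It assumes the allowed move is swapping two adjacent positions; the episode does not say which swaps the paper's puzzle permits, so treat the move set and the function name as assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Tuple

State = Tuple[int, ...]


def shortest_sort_by_adjacent_swaps(start: Sequence[int]) -> List[int]:
    """Return a shortest list of indices i, each meaning 'swap positions i and i+1',
    that sorts `start`. Breadth-first search guarantees minimal length."""
    begin: State = tuple(start)
    goal: State = tuple(sorted(start))
    if begin == goal:
        return []

    # Maps each discovered state to (previous state, swap index used to reach it).
    parents: Dict[State, Optional[Tuple[State, int]]] = {begin: None}
    queue = deque([begin])

    while queue:
        state = queue.popleft()
        for i in range(len(state) - 1):
            swapped = list(state)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            nxt: State = tuple(swapped)
            if nxt in parents:
                continue
            parents[nxt] = (state, i)
            if nxt == goal:
                # Walk parent links back to the start to recover the path.
                path: List[int] = []
                cur = nxt
                while parents[cur] is not None:
                    prev, idx = parents[cur]
                    path.append(idx)
                    cur = prev
                path.reverse()
                return path
            queue.append(nxt)

    raise ValueError("goal not reachable")  # not expected with adjacent swaps


path = shortest_sort_by_adjacent_swaps([3, 1, 4, 2, 5])
print(len(path), path)
```

A solver like this returns the same answer every time regardless of depth, which is why the tool line in the graph stays flat. It is also the "perfect oracle" caveat Finn flags: the search itself cannot drift.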

6:23Finn: So before we get to *why* — that gap is measuring something specific, and we'll come back to whether it's a fair fight. The tool is a perfect oracle. Flag it now; I'll cash it in later.

6:37Juniper: Fair flag. Hold that. Because the "why" is where this gets beautiful, and it starts with the single most surprising number in the paper. These models advertise context windows of a hundred thousand tokens. Some far more. The marketing intuition is: that's the model's memory. That's how much it can hold in its head. The paper says — no. The amount of reasoning the model can actually keep *in focus*, reliably, is something like a hundred to a hundred and fifty steps. The effective working memory and the advertised context window differ by three orders of magnitude.

7:17Finn: Three orders of magnitude. So the desk is enormous and the part that's actually lit is tiny.

7:24Juniper: That's exactly the right image, and it's the one to live in for the rest of this. Picture working a long problem at a desk lit by a fixed number of small flashlights. Early on, you've got a couple pages of scratch work and everything's well-lit. As the work grows to dozens of pages, those same flashlights have to cover more and more paper. Any one earlier line gets dimmer. Harder to read. Your working memory isn't set by how big the desk is. It's set by how many flashlights you have and how bright each one is. The context window is the size of the desk. It's almost irrelevant. What matters is the lighting — and the lighting is fixed.

8:08Finn: And the flashlights, in the real architecture, are —

8:11Juniper: The attention heads. A transformer has a fixed number of attention heads, and each one carries information in a channel of some fixed width. Those two numbers — how many heads, how wide each channel — that's the model's actual working memory budget. Not the token count. And here's the mechanism for why the cliff happens. Every step of chain-of-thought is just more text appended to the pile. To take the next step correctly, the model has to look back over everything it's written and re-derive where it is in the problem. But attention isn't infinite focus spread evenly — it's a spotlight with a fixed amount of light. The math the paper leans on says that as the text gets longer, attention concentrates on a shrinking fraction of positions, and any given earlier state gets diluted. The thing you need to retrieve is fading.
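The dilution idea can be illustrated with a deliberately simplified toy: one softmax over a context where a single "target" position has a fixed score advantage and every other position ties at zero. This is an assumption-laden sketch of the general softmax effect, not the paper's model of attention, and the margin value is arbitrary.

```python
import math


def weight_on_target(context_len: int, logit_margin: float) -> float:
    """Softmax weight landing on one target position whose logit exceeds
    all other positions (assumed tied at 0) by `logit_margin`."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    target = math.exp(logit_margin)
    distractors = context_len - 1  # each contributes exp(0) = 1
    return target / (target + distractors)


# Hypothetical margin; the point is the trend, not the specific values.
for n in (10, 100, 1_000, 10_000):
    print(n, round(weight_on_target(n, logit_margin=5.0), 4))
```

With the score advantage held fixed, the share of attention reaching the needed position shrinks as the trace grows, which is the "same flashlights, more paper" picture in miniature.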

9:08Finn: So it's a telephone game.

9:10Juniper: It starts as a telephone game. Each step depends on correctly carrying forward the state from the last step, and small errors don't cancel — they propagate. You end up confidently somewhere that has nothing to do with the truth. That's the failure mode the authors name State-Space Decoherence — your internal picture of "where I am" decoheres from reality. But here's where it gets worse than telephone. In the classic party game, each whisperer is about as unreliable as the last — the per-link error rate is roughly constant. The paper's sharper claim is that the error rate *grows* as the chain lengthens. It's a telephone game where each successive person is also getting progressively more distracted, because the information they need to pass on is the very information that's fading from focus.

10:03Finn: And that's the thing that turns a slope into a cliff.

10:07Juniper: That's exactly it. If the per-step error were constant, you'd get smooth exponential decay — a leaky faucet, steady drip, each step about as risky as the last. But because the error rate itself climbs with depth, the decay isn't a faucet. It's a snowball. The loss accelerates. There's a term in their decay formula that's quadratic in depth, and the quadratic is the killer. They give a vivid number for it. With their fitted parameters, at around depth thirty, the "compounding" piece of the error — the part that grows with the square of depth — has gotten as large as the baseline per-step error. That's the moment the snowball overtakes the drip. That's where the cliff is.
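The faucet-versus-snowball contrast can be sketched numerically. The model below assumes the per-step error at step k is `base_error + growth * k`; that linear form and both parameter values are illustrative assumptions, not the paper's fitted decay formula. Under this assumption the log of the survival probability is approximately `-(base_error * d + growth * d * (d + 1) / 2)`, so a term quadratic in depth d appears.

```python
import math


def survival(depth: int, base_error: float, growth: float = 0.0) -> float:
    """Probability that every one of `depth` steps is correct, when step k
    fails with probability base_error + growth * k (capped at 1)."""
    p = 1.0
    for k in range(1, depth + 1):
        step_error = min(1.0, base_error + growth * k)
        p *= 1.0 - step_error
    return p


# Hypothetical parameters chosen only to show the shape of the two curves.
BASE_ERROR = 0.01
GROWTH = 0.001

for d in (10, 20, 30, 40, 50):
    faucet = survival(d, BASE_ERROR)            # constant per-step error
    snowball = survival(d, BASE_ERROR, GROWTH)  # error climbs with depth
    print(d, round(faucet, 3), round(snowball, 3))
```

The constant-error column decays smoothly, while the growing-error column loses more in each successive block of ten steps than in the one before it.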

10:52Finn: And they checked that the snowball shape actually fits the data, not just the linear or simple-exponential alternatives?

11:00Juniper: They did, and it's not close. The super-exponential form fits with an R-squared of about point nine-six. The plain linear story, around point seven-one. Simple exponential, point eight-three. The accelerating-collapse shape is just visibly the right curve.

11:17Finn: Okay. So that's the phenomenon and the mechanism. Now I want to take the detective story, because this is where the paper earns its keep. Everything Juniper just described is a *theory* — accumulated error, architectural fading. But there's a rival theory sitting right next to it that explains the exact same cliff.

11:38Juniper: The bad-habit story.

11:39Finn: The bad-habit story. There's prior work — Wu et al., same year — that explains the inverted accuracy curve a completely different way. They call it Simplicity Bias. The claim is: models have learned a *preference* for short outputs. They bail out early. They don't keep going long enough to get the deep problems right. And crucially, that's a trainable habit. Retrain it and the problem goes away. And from the outside, Simplicity Bias and Decoherence look identical. Both predict accuracy falling off as problems get deeper. So how do you tell a bad habit from broken bones?

12:17Juniper: They have different alibis.

12:19Finn: They have different alibis. And this is the move that makes the paper good science. The authors write down, in advance, what each suspect predicts — before they go collect the evidence. Three predictions where the two theories diverge sharply. One: if it's a bad habit — a preference for shortness — then fine-tuning the model on good, optimal-length reasoning traces should recover a big chunk of the lost accuracy. Simplicity Bias predicts north of thirty percent recovery. Decoherence — the broken-bones story — predicts under five. Because if the bones are broken, showing the model better examples doesn't grow it new flashlights. Two: if it's a preference, then just prompting the model — "take as many steps as you need, don't rush" — should help. Simplicity Bias predicts better than ten percent improvement from that nudge. Decoherence predicts under two. Three: cross-model correlation. If each model's failure comes from its own idiosyncratic training, different models should fail on different problems. But if the cause is shared architecture, models from completely different labs should fail on the *same specific instances*. Simplicity Bias predicts low correlation. Decoherence predicts high.

13:40Juniper: And then they go get the numbers.

13:42Finn: They go get the numbers. Fine-tuning recovery: predicted over thirty percent if it's a habit. Observed — three point two percent. The length-encouragement prompt: predicted over ten percent. Observed — under one. Zero point nine. And cross-model correlation, across OpenAI, Anthropic, DeepSeek, Meta, Alibaba — all failing on the same instances — correlations above point eight. The highest, around point nine-one, between Llama and Qwen, which the authors attribute to similar pretraining.
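The pre-registered comparison Finn describes can be written down as a small check. The thresholds and observed values are the ones quoted in the episode for the two numeric tests; the class and function names are illustrative, and the cross-model correlation test is left out because the episode gives only "low" versus "high" for it rather than a numeric cutoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prediction:
    name: str
    habit_above: float       # Simplicity Bias predicts a gain above this (percent)
    structural_below: float  # Decoherence predicts a gain below this (percent)
    observed: float          # observed gain (percent)


def verdict(p: Prediction) -> str:
    """Say which theory's stated range the observation falls into."""
    if p.observed > p.habit_above:
        return "simplicity bias"
    if p.observed < p.structural_below:
        return "decoherence"
    return "inconclusive"


tests: List[Prediction] = [
    Prediction("fine-tuning recovery", habit_above=30.0, structural_below=5.0, observed=3.2),
    Prediction("length-encouragement prompt", habit_above=10.0, structural_below=2.0, observed=0.9),
]

for t in tests:
    print(f"{t.name}: {verdict(t)}")
```

Writing the ranges down before looking at the data is what leaves room for an "inconclusive" outcome; here neither observation lands in that gap.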

14:15Juniper: Three point two percent against a predicted thirty.

14:19Finn: That's the case closed. Every divergent prediction lands on the broken-bones side. The failures aren't a preference the model could be talked or trained out of. They're structural.

14:31Juniper: And I love that the fine-tuning result has a kicker, because that's the one a skeptic leans on hardest — "you just didn't train it right." So they push it. They take a model and fine-tune it *exclusively* on traces at depth thirty to forty. Train it on nothing but the hard cases. And it still can't break fifteen percent accuracy past depth forty.

14:55Finn: You cannot train your way past the wall. Even when the wall is the only thing you train on.

15:01Juniper: There's one more piece of their detective kit I think is the most elegant single experiment in the paper. It's the diagnostic they use to prove the model genuinely *can't* track, versus *won't*. At each step, they compare two things: the set of states the model *claims* it's in, against the set of states actually reachable at that point. And they split that comparison into precision and recall.

15:28Finn: Walk through what those two would do under each theory.

15:32Juniper: So — if the failure were mere preference, the model truncating early but not actually getting lost, you'd expect precision to stay high while recall drops. Meaning: everything the model says is still correct, it's just stopping short. It's not wrong, it's incomplete. But if the failure is genuine incapacity — the model hallucinating its own state, drifting into fictitious places — then both precision and recall decay together. It's not just stopping early. It's confidently in the wrong room. What they observe is both collapsing in parallel. Precision falls from around point nine-three down to point one-one. Recall, from point eight-nine to point zero-seven. Across depths five to fifty, hand in hand.
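The precision and recall split can be computed directly on sets of states. This is a minimal sketch with made-up state labels; how the paper extracts "claimed" states from a reasoning trace is not described in the episode, and returning 0.0 for an empty set is a convention chosen here.

```python
from typing import Hashable, Set, Tuple


def precision_recall(claimed: Set[Hashable], reachable: Set[Hashable]) -> Tuple[float, float]:
    """Precision: share of claimed states that are truly reachable.
    Recall: share of reachable states that the model claimed.
    An empty set yields 0.0 for its metric (a convention assumed here)."""
    overlap = len(claimed & reachable)
    precision = overlap / len(claimed) if claimed else 0.0
    recall = overlap / len(reachable) if reachable else 0.0
    return precision, recall


reachable = {"s1", "s2", "s3", "s4"}  # hypothetical ground truth

# Truncation: everything claimed is right, but the model stopped short.
print(precision_recall({"s1", "s2"}, reachable))

# Drift: the model claims states that do not exist at this step.
print(precision_recall({"s1", "x7", "x8", "x9"}, reachable))
```

The first case keeps precision at 1.0 while recall falls to 0.5; the second drops both to 0.25, which is the pattern the episode describes as the smoking gun.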

16:20Finn: Both decaying together. So it's not "the model stops." It's "the model is somewhere that doesn't exist."

16:27Juniper: That parallel decay is the smoking gun. The model isn't truncating a correct trace. It's drifting into a fictional one and not knowing it. And there's a concrete example in the appendix that makes this vivid. That sorting puzzle from the top — three, one, four, two, five, heading toward sorted order. The model gets step one right. Then at step two, it correctly understands that it needs to move some elements — it's got the right idea — but it misremembers which positions they're in. It swaps the wrong things. And then everything after that cascades off the wrong board.

17:04Finn: It's not that it forgot the algorithm. It knew the algorithm. It lost track of the state the algorithm was operating on.

17:12Juniper: Right. It's narrating the right procedure over a board that's quietly diverged from the real one. Which is, when you sit with it, exactly what decoherence should look like.

17:23Finn: Now I want to hit the experiment that I think does the most to rule out the boring explanation. Because the obvious objection to all of this is: "the model just ran out of room. The reasoning trace got too long for the context window." If that were true, the whole story is mundane — it's a token budget problem, not an architecture problem.

17:45Juniper: The desk-got-full explanation.

17:47Finn: The desk-got-full explanation. So they test it directly. They take Llama-3.3-70B and they artificially shrink its context window. From a hundred and twenty-eight thousand tokens down to eight thousand. If raw context length were the binding constraint, that should crater the horizon — shrink the desk, shrink what you can do. The horizon doesn't move. It stays flat at twenty-eight. They cut the window by a factor of sixteen and the depth at which the model collapses doesn't budge.

18:18Juniper: At all?

18:19Finn: It only starts to collapse when they shrink the window down around two thousand tokens — and at that point it's so small it can't even physically hold the model's own reasoning trace. So that's a different, trivial failure. The interesting result is the flat part: enormous changes to the desk size, zero change to the horizon. Which is exactly what you'd predict if the bottleneck is the flashlights, not the desk.

18:45Juniper: That's the cleanest demonstration of the whole L-effective versus L distinction. The usable working memory and the advertised context are just different things, and one of them is doing all the work.

18:57Finn: And on the open-weight models — the ones where you can actually read the internals — they close the loop on the mechanism. They pull out attention entropy step by step. As reasoning depth grows, the entropy grows roughly linearly — the spotlight is literally spreading thinner — and that spreading correlates negatively with accuracy, around minus point seven-four, strongest in the late layers. So the abstract "attention dilutes" claim shows up as a measurable signal you can watch climb.
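Attention entropy here is the Shannon entropy of one row of attention weights. The sketch below shows the calculation on hand-made weight vectors; it does not reproduce the paper's measurements, the roughly linear trend, or the reported correlation, and the function name is an assumption.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a non-negative weight vector,
    normalized to sum to 1 before the entropy is taken."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = [0.97, 0.01, 0.01, 0.01]  # nearly all weight on one position
spread = [0.25, 0.25, 0.25, 0.25]   # weight spread evenly

print(round(attention_entropy(focused), 3))
print(round(attention_entropy(spread), 3))
```

Higher entropy means the weight is spread over more positions, so a rising entropy curve is the measurable form of "the spotlight is spreading thinner".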

19:28Juniper: So let me try to state the core insight in one breath, now that all the pieces are on the table. On any task where every step has to be exactly right, a decoder-only transformer's attention is a fixed-capacity channel that can't keep enough of its own history in focus. So accuracy doesn't fade gently — it collapses super-exponentially past a predictable depth, somewhere around twenty to thirty steps. And that depth is set by the architecture — the head count and the head width — not by the context window, not by the training, not by how long you let it think. Which is why no amount of reasoning or fine-tuning moves it.

20:08Finn: And that's the thesis line the authors give, almost word for word: even if the models *wanted* to reason correctly at depth fifty, they can't. The attention bottleneck prevents reliable state tracking. It's not a motivation problem.

20:23Juniper: Now — I think this is the right moment to be honest about how solid all of this actually is, because the paper itself is unusually candid, and an episode that just runs the highlight reel would be doing it a disservice.

20:36Finn: Yeah, and I've got a list, because some of these are real. The biggest one, and the authors say this themselves: the central capacity theorem — the thing that bounds trackable states by head count and dimension — rests on modeling assumptions that they classify as empirically grounded but not proven. Specifically, the theorem leans on an assumption about how attention concentrates — that it effectively focuses on roughly the square root of the sequence length's worth of positions — and on a second assumption about how the value channels decorrelate, and then there's a square-root effective-rank step that they flat-out call "a modeling assumption rather than a consequence" of the bound. So this isn't a derivation from first principles. It's a principled curve, fit to behavior.

21:24Juniper: Which matters most for the scaling law, right? The prediction that the horizon grows with the square root of head dimension times head count.

21:33Finn: That's the sharpest version of the worry. They fit a formula predicting the horizon from architecture, and it matches the observed horizons within one to two percent. Sounds incredible. But it's a fit, validated on four open-weight models — two Llamas, two Qwens. And the authors explicitly say they do *not* present the scaling law as a corollary of the theorem. So the headline "one to two percent agreement" is agreement between an empirically fitted formula and the data it was fitted to, on a pretty narrow base.
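The square-root scaling claim implies a simple ratio between two architectures, sketched below. The constant of proportionality is not given in the episode, so only ratios are computed, and the head counts and widths in the example are hypothetical rather than any real model's configuration.

```python
import math


def horizon_ratio(heads_a: int, dim_a: int, heads_b: int, dim_b: int) -> float:
    """Ratio horizon_b / horizon_a, assuming the horizon is proportional to
    sqrt(head_count * head_dimension) with the same constant for both."""
    if min(heads_a, dim_a, heads_b, dim_b) <= 0:
        raise ValueError("head counts and dimensions must be positive")
    return math.sqrt((heads_b * dim_b) / (heads_a * dim_a))


# Hypothetical configurations: quadruple the head count, keep the width.
print(horizon_ratio(heads_a=32, dim_a=128, heads_b=128, dim_b=128))
```

Under this assumed law, quadrupling the attention budget only doubles the horizon, which is one way to read why the wall is hard to move by scaling alone.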

22:04Juniper: And the closed models — GPT-4o, Claude, the o-series — you can't even check the architecture, because the head counts aren't public.

22:12Finn: Right, so for the proprietary models, everything is an empirical fit and a consistency check. None of the *architectural* evidence — the part that justifies "this is unfixable" — comes from the closed models. It all comes from those four open checkpoints. The authors are clear about this, but it means the strongest theoretical claim sits on a narrow foundation.

22:35Juniper: Finn, what's your read on the tool comparison? Because that's the number everyone's going to quote — eighty-six to ninety-four versus twenty-four to forty-two — and you flagged it early.

22:46Finn: It's the one I'd push hardest. That comparison uses a perfect oracle. The tool the model calls is an exact search solver — guaranteed correct, every time. So that headline gap isn't "neural reasoning versus a realistic tool." It's "neural reasoning versus a guaranteed-correct algorithm." It's an upper bound on the benefit. In the real world, your tools are noisy. Your parser misreads the problem, your solver has bugs, the API times out. With imperfect tools the gap shrinks, possibly a lot. The authors concede this — they say so directly — but the dramatic headline number is not a fair fight, and it's worth keeping that in view when someone waves it around.

23:28Juniper: And the practitioner takeaway — "delegate past about twenty steps" — is fuzzier than it sounds, too.

23:34Finn: It is. That horizon depends on a baseline error rate, and that baseline is sensitive to prompt format and few-shot conditioning. Their own sensitivity analysis: shift the baseline error by twenty percent and the horizon moves by a couple steps. So the crisp "nineteen to thirty-one" interval is really a regime indicator, not a hard per-model constant. The honest version of the advice is "somewhere in the low twenties, give or take, depending on how you set things up" — which is mushier than the clean framing implies.

24:07Juniper: I think that's the right way to hold it, though. The exact number is soft. The *shape* of the claim — that there's a horizon at all, that it's architectural, that it doesn't move when you train or prompt or expand the window — that's what the evidence actually supports, and it supports it pretty well. The decoherence story survives every discriminating test they threw at it. The precise location of the wall is the part with the error bars.

24:35Finn: That's fair. And to their credit, the authors are also explicit about scope in a way that defuses the obvious overreach. They are *not* claiming longer reasoning is universally bad. For something like a grade-school math problem — typically under fifteen steps, tolerant of a little approximation — extended reasoning still helps. That's how they reconcile their cliff with all the prior work showing reasoning works. Simplicity Bias might truncate you early, decoherence stops you from recovering late, and for short forgiving tasks neither one bites.

25:11Juniper: So both suspects were at the scene. The detective just proved which one fired the shot on the hard cases.

25:18Finn: Both suspects were at the scene. That's the honest version of the verdict.

25:23Juniper: So let's talk about why any of this matters beyond the puzzle. The dominant story in agentic AI for the last couple years has been "scale up the reasoning." Let the model think longer, sample more, deliberate harder. This paper says: for an entire class of exact, deterministic tasks — and these tasks are everywhere in software engineering, in verification, in planning — extended neural reasoning isn't just inefficient. Past the horizon, it's actively harmful. More thinking moves you further from the right answer. And it gives practitioners an actual number to act on. If your task has more than roughly twenty deterministic state transitions — tracing a variable through twenty-plus mutations, composing a long permutation, a many-table join — the default should flip. Stop reasoning. Call a tool.
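The "flip the default" advice can be written as a small routing rule. This is a sketch only: the function name and return labels are invented for illustration, and the default threshold of 20 is the episode's rough figure, which the hosts themselves describe as a regime indicator rather than a hard constant.

```python
def choose_executor(n_transitions: int, deterministic: bool, horizon: int = 20) -> str:
    """Route exact, deterministic work to a tool once the number of state
    transitions exceeds the assumed horizon; otherwise let the model reason."""
    if n_transitions < 0:
        raise ValueError("n_transitions must be non-negative")
    if deterministic and n_transitions > horizon:
        return "tool"
    return "model"


print(choose_executor(12, deterministic=True))   # short exact task
print(choose_executor(40, deterministic=True))   # long exact task
print(choose_executor(40, deterministic=False))  # long but forgiving task
```

Keeping `horizon` as a parameter reflects the sensitivity discussed earlier: the right cutoff depends on the model and on how the task is prompted.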

26:15Finn: And the payoff isn't only accuracy. There's a cost story that I think is underrated. Tool delegation in their setup is something like four to five times cheaper per correct solution. And the brute-force alternative people reach for — sample the model ten times, take the best answer — that costs eleven times more and still only gets you to around fifty-five percent, versus the tool's ninety. So you can pour money into sampling and still lose to a tool that's an order of magnitude cheaper.
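"Cost per correct solution" is just spend divided by success rate. The sketch below applies that to the best-of-ten figures quoted above (eleven times the cost, about fifty-five percent accuracy), measured in units of one single-pass run. The episode does not state the tool's absolute cost, so no tool figure is computed here.

```python
def cost_per_correct(relative_cost: float, accuracy: float) -> float:
    """Expected spend per correct answer: cost of one attempt divided by
    the probability that the attempt is correct."""
    if relative_cost < 0:
        raise ValueError("relative_cost must be non-negative")
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return relative_cost / accuracy


# Best-of-ten sampling, in units of a single reasoning pass.
best_of_ten = cost_per_correct(relative_cost=11.0, accuracy=0.55)
print(round(best_of_ten, 2))
```

Dividing by accuracy is what makes low-accuracy strategies expensive twice over: each attempt costs more, and more attempts are needed per correct answer.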

26:47Juniper: For a system handling millions of queries, that's real infrastructure money. And real energy.

26:53Finn: It is. Though — consistent with the steelman — that cost gap also rides on the perfect oracle. Real tools cost more to build and maintain than calling a clean solver. So directionally right, magnitude soft. Same asterisk as before.

27:07Juniper: There's a deeper reframe under all of this, though, and I think it's the thing that outlasts any specific benchmark number. The authors call it the Simulator Fallacy.

27:18Finn: Say what it is, because it's the cleanest idea in the paper.

27:22Juniper: It's the confusion between a model *predicting plausible text about an algorithm* and a model *executing the algorithm*. When a reasoning model writes out its steps, it feels like it's running a computation. It isn't. It's generating convincing-looking text *about* running a computation. And the analogy that makes it click for me: it's an actor reading lines about flying a plane, versus an autopilot actually flying it. For a short, forgiving scene the actor's performance is completely convincing — fluent, plausible, gets the words right. But ask the actor to fly through thirty precise, interdependent maneuvers, and the narration drifts away from anything a real flight would do. Because nothing was ever actually computing the trajectory. There was no plane.

28:11Finn: And the subtle part — the part that makes it more than a gotcha — is that the actor isn't faking. There's no bad faith. A perfectly sincere, well-intentioned narration still can't access the underlying computation past the horizon. The model isn't lying about its state. It genuinely believes it's in a state it isn't.

28:32Juniper: That's what makes "even if it wanted to, it couldn't" the right phrasing. It's not unwillingness. There's no plane to fly. And the practical edge of the reframe is sharp: if these failures are architectural, then the entire toolkit we usually reach for — better prompts, reinforcement learning from human feedback, length encouragement, more data — none of it can touch them. The only two real moves are: change the architecture, give it something like explicit state registers, a real memory it isn't squinting at through diluted attention — or admit the limit and route to a tool.

29:10Finn: And I think that's the conceptual space the paper is really trying to claim. There's older theory showing chain-of-thought actually *expands* what transformers can compute — gives them more reach on paper. This paper's twist is to separate "expressible in principle" from "reliably executable in practice." The capacity might exist. But accumulated error along the way can stop you from ever cashing it in. The reservoir is enormous; the straw is narrow.

29:39Juniper: Which connects to a line some of us have heard before — the planning researchers who've been saying for a while that these models can't plan, but they can help you plan. This paper gives that intuition a mechanism and a number. It's not "models are dumb at search." It's "the channel through which they'd have to do search is too narrow per step, and it gets narrower the longer they go."

30:04Finn: The thing I'll carry out of it is the inversion. We've spent two years treating more reasoning as strictly more capability. And the paper's quiet, well-defended point is that for exact step-by-step work, there's a depth past which the reasoning trace stops being a computation and becomes a story about one — and that the story and the computation come apart at a place you can roughly predict from the architecture.

30:30Juniper: A horizon. Past it, the map and the territory separate, and the model doesn't know which side it's on. That's the paper. "The Deterministic Horizon" — a genuinely sharp piece of work, candid about its own soft spots, and built around one of the cleaner experimental designs we've covered: two theories, divergent predictions written down in advance, and a fine-tuning number — three point two percent against a predicted thirty — that settles it.

30:57Finn: The show notes have a link to the paper and a few related reads if you want to pull on this thread yourself — the rival Simplicity Bias work especially is worth a look to see the contest from the other side.

31:10Juniper: And if you want the full transcript with every term defined inline, plus the concept pages that link this to the other episodes we've done on reasoning and attention, that all lives on paperdive.ai.

31:22Finn: Next time someone tells you the model just needs to think a little longer, ask them how many steps. Past the horizon, longer is the problem.

31:31Juniper: This has been AI Papers: A Deep Dive. Thanks for listening.
