Episode 163 · Jun 23, 2026 · 22 min

Why Training Only on Perfect Solutions Cripples a Model's Reasoning

Wei, Kim

LLM Post-training

Concepts in this episode

Training Methods, AI Alignment, RL for Reasoning, Supervised Fine-Tuning, Chain of Thought, Knowledge Distillation, Policy Gradient, Scaling Laws, Synthetic Data, Inference-Time Scaffolding, Trajectory Quality, Post-Training

About this episode

Paper

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

Venue

arXiv:2606.22938

Year

2026

Read the paper

arxiv.org/abs/2606.22938

Everyone assumes clean, flawless examples are the best reasoning data — and a new theory paper proves that intuition is backwards. By formalizing reasoning as path-finding through a maze, two researchers show imitation learning provably can't teach backtracking, while reinforcement learning learns it for free from the model's own failures. The result is a clean, exponential gap that reframes what 'high-quality reasoning data' even means.

What you'll take away

Why training on clean, backtracking-free solutions provably freezes a model's ability to retreat from dead ends — there's no gradient signal where there's no data
How modeling reasoning as path-finding through a maze turns 'backtracking' into something you can prove theorems about
The headline result: RL scales linearly with reasoning depth (W·K) while imitation blows up exponentially (W·L^K), from the identical starting model
Why bolting a clever search wrapper onto a weak imitation model helps a lot but still can't fully close the gap
The steelman critique: the central theorem is close to true by construction, and the exponential drama leans on a chosen graph topology and a deliberately pessimistic definition of SFT
The practical payoff — why distilling from an RL-trained model works precisely because you inherit its messy recoveries, not just its answers

Chapters

00:03 Is clean data secretly the problem?
01:28 Two ways to train, one key difference
03:41 Turning reasoning into a maze
06:04 No examples, no nudge
09:03 Linear versus falling off a cliff
10:15 How RL escapes the trap
13:12 Does it survive a real algorithm?
15:19 How true by construction is this?
18:39 The dead ends are the curriculum

References in this episode

Tree of Thoughts: Deliberate Problem Solving with Large Language Models — The search-scaffolding approach the episode critiques — the authors show externa
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — The 'folklore' the episode says this theory paper finally formalizes — a flagshi
Proximal Policy Optimization Algorithms — The actual RL algorithm the paper uses to confirm its toy-model predictions on a
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The reasoning paradigm the paper models as path-finding through a graph — useful

Full transcript

0:00Cassidy: Here's a claim that sounds backwards: showing an AI model only correct, clean solutions might be exactly what cripples its ability to reason. Not too few examples. Not bad examples. Perfect ones.

0:13Finn: Quick heads up before we get into it — this is an AI-generated explainer, both voices included.

0:19Cassidy: And the reason this is worth your time is that it's not a vibe. There's a new theory paper that proves it — a clean, provable, exponential gap between the two main ways we train reasoning models. By the end you'll understand exactly why training on flawless solutions hits a wall that no amount of more flawless solutions can climb.

0:40Finn: And the punchline is almost subversive. The thing that makes reinforcement learning better at reasoning isn't some deep generalization magic. It's the failures. RL learns from the model's own dead ends — and dead ends are precisely the data that imitation learning structurally cannot contain.

1:00Cassidy: This matters because the entire field is making bets right now on what "high-quality reasoning data" means and where to spend training compute. And this paper says the intuition a lot of people have — clean is good, messy is noise — might be exactly inverted for the part of reasoning that matters most.

1:19Finn: So let's set up the fight properly, because the whole result lives in the difference between two training methods.

1:27Cassidy: Right. Every modern language model starts the same way — pretrained on a huge pile of text until it has a rough feel for how the world fits together. Then you sharpen it for actual tasks. That sharpening step is called post-training, and there are two dominant flavors.

1:44Finn: The first is supervised fine-tuning — SFT. You show the model a ton of worked examples of the right behavior and have it imitate them. Think of a student copying out correct solutions until they can reproduce them. The second is RLVR — reinforcement learning with verifiable rewards. You let the model actually attempt the task, automatically check whether the final answer is right, and reward it when it is.

2:11Cassidy: And the one distinction to hold in your head for the whole episode: SFT learns from somebody else's clean successes. RLVR learns from the model's own attempts — including, crucially, the ones that fail.

2:24Finn: That word "including" is the entire paper.

2:28Cassidy: Now, everyone in the field has noticed the same thing empirically. If you want a model that genuinely reasons — works through a hard problem, hits a wall, backs up, tries something else — reinforcement fine-tuning seems to beat plain imitation. The big reasoning systems made that almost folklore. But folklore is all it was. We had comparisons and hand-wavy stories about generalization versus memorization. No proof. No mechanism.

2:56Finn: So these two authors, Stanley Wei at Princeton and Juno Kim at Berkeley, ask a much sharper question than "which is better." They ask: can we isolate one specific capability, prove that RL learns it, and prove that imitation cannot?

3:13Cassidy: And the capability they pick is backtracking. The human skill of realizing you've gone down a wrong path, returning to an earlier decision, and trying another. They argue that's the load-bearing skill in real reasoning — and they want to know which training method actually teaches it.

3:31Finn: To turn that into something you can prove theorems about, they need reasoning to become a concrete, measurable thing. So here's the move that makes the whole paper work.

3:43Cassidy: They model chain-of-thought reasoning as walking through a graph. Every reasoning state is a spot in the graph, every next step is an edge you can take, and solving the problem means finding a path to a goal. Once reasoning is path-finding, backtracking stops being a metaphor — it's literally "I walked the wrong way, now how many steps does it take me to get back out."

4:06Finn: And the specific graph they build is worth picturing, because one image carries the whole result. Cassidy, walk people through the corridors.

4:15Cassidy: Okay. Picture a single fork at the very start that splits into W different directions — call them branches. Only one branch leads home to the goal. Each branch is a long corridor, K steps deep. And here's the key detail: at every step of a corridor, there are L parallel lanes — L equivalent ways to move forward to the next checkpoint.

4:37Finn: So the model's job is: pick a branch at the fork, walk down the corridor, and if it turns out to be a dead end, walk all the way back out and try a different branch.

4:48Cassidy: Exactly. And one more constraint that makes this a genuine search and not a guided tour: the model is about as simple as it gets. At each step it only knows the edge it's currently on, and it picks the next edge. It doesn't even get to see where the goal is. It's searching blind — a verifier just stops it when it stumbles onto the target.

5:09Finn: Which is a real idealization, and we'll come back to it — the fact that the model can't see where it's going is going to matter when we judge how much this tells us about real reasoning. Flag that for later.

5:24Cassidy: Fair flag. Now, the authors prove a quick warm-up result first: ordinary pretraining on this graph converges to the true world model. Meaning, after pretraining, at any spot the model just spreads its probability evenly over the legal next moves. It's learned the structure of the maze, nothing more. That lets them start the SFT-versus-RL comparison from an identical, clean starting point — so any difference is purely about post-training, not a lucky or unlucky initialization.

5:55Finn: Good. So both methods start from the same model that knows the maze's layout but has no strategy. Now we train. And this is where I want to slow down, because the mechanism behind SFT's failure is one almost embarrassingly simple fact about how imitation learning works — and it pays off in that provable exponential gap we promised.

6:17Cassidy: Before you do — one word people need, because everything hinges on it. Gradient.

6:22Finn: Right. When a model learns, it's minimizing a loss — a number that measures how wrong it is. Learning is just repeatedly nudging the model's internal parameters in whatever direction makes that number go down. The gradient is that nudge — the size and direction of the correction the model gets each step. That's it. No examples, no nudge. Hold onto that.

6:46Cassidy: Perfect. Go.

6:47Finn: So here's the load-bearing fact, and it's so simple it almost feels like a trick. For imitation learning, the nudge a given situation produces is proportional to how often that situation shows up in the training data. If a situation never appears in the data, it produces no nudge — and the parameters governing it never move from where they started. Ever.

7:10Cassidy: Which sounds trivial.

7:12Finn: It sounds trivial. It's the entire beam holding up the result. Because now ask: what's in the SFT training data? Golden shortest paths. You pick the right branch, you march straight to the goal, you never reverse. A perfect solution, by definition, contains no backtracking.

7:31Cassidy: So the states that face backward — the "I need to retreat" situations — never appear in the data.

7:37Finn: Never once. So by that no-examples-no-nudge rule, the parameters governing every backward-facing state stay frozen at their pretrained setting forever. The model's only backtracking ability is whatever the uniform maze-layout knowledge happened to give it. SFT provably teaches it nothing about how to back up — not because it's a weak algorithm, but because the information simply isn't in the data.
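
A minimal Python sketch of the no-examples-no-nudge rule, assuming a tabular softmax policy with one row of logits per state (the paper's actual parameterization may differ). The state names, the two-action setup, and the tiny dataset are hypothetical and chosen only for illustration. It shows the cross-entropy gradient accumulating in proportion to how often a state appears in the data, and staying exactly zero for a backward state that never appears.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy_grad(logits, dataset):
    """Gradient of the mean negative log-likelihood w.r.t. each state's logits."""
    grad = {state: [0.0] * len(row) for state, row in logits.items()}
    for state, action in dataset:
        probs = softmax(logits[state])
        for a, p in enumerate(probs):
            target = 1.0 if a == action else 0.0
            grad[state][a] += (p - target) / len(dataset)
    return grad


# Hypothetical states: two forward-facing, one backward-facing.
# All logits start at zero, i.e. a uniform policy (the pretrained start).
logits = {
    "forward_1": [0.0, 0.0],
    "forward_2": [0.0, 0.0],
    "backward_1": [0.0, 0.0],
}

# Golden paths only: (state, correct action) pairs with no backward states.
golden_paths = [("forward_1", 0), ("forward_2", 0), ("forward_1", 0)]

grad = cross_entropy_grad(logits, golden_paths)
for state, g in grad.items():
    print(state, g)
```

Here `forward_1` appears twice as often as `forward_2` and receives twice the gradient, while `backward_1` receives none, so gradient descent on this loss leaves its logits at their starting values.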

8:03Cassidy: And that frozen ability turns out to be catastrophically bad. This is where the corridor picture pays off.

8:10Finn: Take it.

8:11Cassidy: So you're a backward-facing state, deep in a wrong corridor, trying to retreat. What are your options? Remember every step has L parallel lanes pointing forward, plus one path continuing your retreat. The frozen model spreads probability evenly. So at every single step of trying to back out, there are L lanes pulling you forward — back toward the dead end — and only one lane continuing your escape.

8:36Finn: It's swimming against a current.

8:38Cassidy: It's swimming upstream against a current that sweeps you back most of the time. Over a short corridor, maybe you make it out. But the difficulty compounds at every step. Over a corridor K steps deep, you essentially never escape quickly. And when the authors work out the math, the expected time to back out of a wrong branch blows up like L to the power K. Exponential in the depth of the reasoning chain.
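
The compounding can be checked on a toy version of the retreat. This is an assumption of the sketch, not the paper's exact construction: treat the wrong corridor as positions 0 (the fork) through K (the dead end), let the frozen model step toward the fork with probability 1/(L+1) and back toward the dead end with probability L/(L+1), and force a step back at the dead end itself. The function name is illustrative, and the recursion is the standard first-step analysis for a birth-death chain.

```python
from fractions import Fraction


def expected_escape_steps(depth: int, p_retreat: Fraction) -> Fraction:
    """Expected steps to walk from the dead end (position `depth`) to the fork (position 0).

    At interior positions the walker retreats with probability `p_retreat`
    and otherwise steps back toward the dead end. At the dead end the only
    available move is one step back.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0 < p_retreat <= 1:
        raise ValueError("p_retreat must be in (0, 1]")
    q_forward = 1 - p_retreat
    # gap = h(i) - h(i-1), where h(i) is the expected escape time from position i.
    gap = Fraction(1)  # at the dead end, i = depth
    total = gap
    for _ in range(depth - 1):
        # First-step analysis gives: gap(i) = (1 + q * gap(i + 1)) / p
        gap = (1 + q_forward * gap) / p_retreat
        total += gap
    return total


L = 5  # parallel forward lanes, assumed for illustration
frozen = Fraction(1, L + 1)  # uniform over L forward lanes plus one retreat
decisive = Fraction(1)       # always retreat

for K in (2, 3, 4):
    print(K, expected_escape_steps(K, frozen), expected_escape_steps(K, decisive))
```

With L = 5 the frozen policy needs 12, 73, and 384 expected steps at depths 2, 3, and 4, so each extra step of depth multiplies the cost by roughly L, while a policy that always retreats needs exactly K steps. The exact constant and exponent in the paper's bound depend on its graph, which this toy does not reproduce.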

9:05Finn: So now put the whole thing together and you get the headline of the paper — and this is the number to remember.

9:12Cassidy: After training, the RLVR model finds the goal in time that scales like W times K. Linear in the depth. The SFT model? W times L-to-the-K. Exponential in the depth.

9:23Finn: Say what that means concretely. Linear versus exponential.

9:28Cassidy: It means: with the RL model, if you double how deep the reasoning has to go, you roughly double the cost. With the imitation model, if you add just one to the depth, you multiply the cost by L. Same starting model, same maze, same amount of data — one of them gracefully scales and the other one falls off a cliff. That gap is the paper.

9:50Finn: And notice what it is and isn't. It's not that SFT is slow. It's that SFT never received a single bit of feedback about the one skill the deep version of the task requires. The wall isn't capacity. It's the structure of the training signal itself.

10:06Cassidy: So the obvious question is: what does RL do differently? Because it starts from the identical frozen model. Finn, why does reinforcement learning escape the trap that imitation can't?

10:18Finn: Because RL generates its own data. Early in training, the RL model picks wrong branches constantly. It gets stuck, it flails, it racks up long meandering failed attempts. And those failed attempts repeatedly walk through exactly the backward-facing dead-end states.

10:36Cassidy: The states SFT never sees.

10:37Finn: The states SFT never sees, RL visits over and over. And the RL learning rule has two ingredients worth naming. First, how often the model visits a given state during its own attempts — states it keeps landing in get adjusted more. Second, the advantage of each action — how much better one move is than the others at reaching the goal sooner.

11:00Cassidy: So the visit count plays the same role that data frequency plays under cross-entropy, but with the opposite effect on dead ends.

11:04Finn: That's the hinge of the whole thing, Cassidy. Under imitation, dead-end states get zero signal because they're absent from clean data. Under RL, dead-end states get more signal precisely because the model keeps failing into them. Failure isn't noise — failure is what generates the gradient on the exact states that need it.

11:26Cassidy: And the advantage term is where the length penalty comes in.

11:30Finn: Right. The reward is basically "did you reach the goal," minus a tiny penalty for wasting moves. That penalty creates pressure to stop flailing — to make backtracking decisive instead of wishy-washy. So the policy converges to a clean strategy: forward states go forward, and backward states fully commit to retreating. Now backing out of a wrong branch is linear in K, not exponential. And the whole search comes out to that W-times-K.
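
A small Python sketch of why the length penalty makes retreat the preferred action. It assumes a simplified setting that is not taken from the paper: the model sits at some depth in a wrong corridor, takes one action, and then retreats decisively. Since both choices eventually reach the goal, the goal reward cancels and only the per-step penalty differs. The function names and the penalty value are hypothetical.

```python
def steps_to_fork(depth: int, action: str) -> int:
    """Steps to get back to the fork from `depth`, taking `action` now and retreating afterwards."""
    if action == "retreat":
        return depth
    if action == "forward":
        # One wasted step deeper, then depth + 1 steps back out.
        return depth + 2
    raise ValueError(f"unknown action: {action}")


def retreat_advantage(depth: int, step_penalty: float) -> float:
    """Return of retreating minus return of stepping forward (goal reward cancels)."""
    return_retreat = -step_penalty * steps_to_fork(depth, "retreat")
    return_forward = -step_penalty * steps_to_fork(depth, "forward")
    return return_retreat - return_forward


print(retreat_advantage(depth=5, step_penalty=0.01))  # positive: retreat is better
print(retreat_advantage(depth=5, step_penalty=0.0))   # zero: no pressure to be decisive
```

In this toy the advantage of retreating is twice the step penalty at every depth, and it vanishes when the penalty is zero, which matches the episode's point that the penalty is what pushes the policy toward committed backtracking.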

11:59Cassidy: There's one detail in the RL convergence that I think is genuinely charming, because it shows the path to competence isn't a straight line.

12:08Finn: The confused middle.

12:09Cassidy: The confused middle. The authors prove — and see it in experiments — that early in RL training, some of the middle-depth backward states briefly move the wrong direction. For a window, the model actually gets worse at backtracking in the middle of the corridor before the whole system corrects and starts improving everywhere. RL's route to the right answer dips before it climbs. They're upfront that this non-monotone stretch is the messiest, most technical part of the proof.

12:41Finn: And honestly that's the part I trust most, because it's the part they didn't have to admit.

12:47Cassidy: So let's checkpoint, because we've covered the spine. Imitation on perfect solutions freezes the backtracking states — no dead ends in the data, no learning, exponential blowup. RL trains on its own failures, so those same states get the most signal, and it converges to linear. Now — does any of this survive contact with a real training algorithm?

13:10Finn: This is the question I had, because everything so far is on a deliberately simple model. So here's the reassuring part.

13:18Cassidy: They don't just run the toy. They confirm the predicted optimum with real machinery. They use PPO — an actual, standard reinforcement learning algorithm. They swap the simple policy for a single-layer transformer — which, fun detail, does slightly better than the toy because it's more expressive. And they test a non-symmetric graph with branches of different lengths. All of them converge to the predicted optimum.

13:44Finn: And the concrete number is satisfying. Fifteen branches, fifteen steps deep, five lanes each. Theory predicts the RL model converges to four times W times K — nine hundred steps. And it lands right there.

13:57Cassidy: Nine hundred, on the nose. There's something nice about a theory paper where the toy prediction and the real-RL experiment agree to the number.

14:07Finn: Now here's the result I find most counterintuitive — and it pushes back on a popular idea. A lot of people think you can fix a weak model with clever scaffolding at runtime — bolt an explicit search system on top that's smart about not revisiting places it's already been.

14:25Cassidy: The tree-of-thoughts, graph-of-thoughts style frameworks.

14:28Finn: Exactly. So the authors ask: what if you wrap the imitation model in a perfect search harness that refuses to ever revisit an edge? Does that close the gap? And the answer is — it helps enormously. It drops SFT from exponential all the way down to W times K times L. Polynomial now, not exponential.

14:47Cassidy: But.

14:48Finn: But it's still a full factor of L slower than the RL model. No matter how clever your search wrapper, you're stuck paying that extra factor, because the model underneath still never learned to back up. Scaffolding compensates for the missing skill. It doesn't replace having learned it.
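
To put the three regimes side by side, here is a short Python sketch that evaluates the scaling expressions quoted in the episode for the experiment's settings (W = 15, K = 15, L = 5). Constant factors are dropped, so these are orders of growth and not predicted step counts. The function and key names are illustrative.

```python
def search_costs(W: int, K: int, L: int) -> dict:
    """Order-of-growth search cost for each regime, with constants dropped."""
    return {
        "rlvr": W * K,                         # linear in depth
        "sft_with_search_wrapper": W * K * L,  # polynomial, one extra factor of L
        "sft": W * L ** K,                     # exponential in depth
    }


costs = search_costs(W=15, K=15, L=5)
for name, cost in costs.items():
    print(f"{name:>24}: {cost:,}")

# Effect of one extra step of depth on each regime.
deeper = search_costs(W=15, K=16, L=5)
print("rlvr grows by", deeper["rlvr"] - costs["rlvr"])
print("sft grows by a factor of", deeper["sft"] // costs["sft"])
```

Under these expressions, going from K = 15 to K = 16 adds W to the RL cost but multiplies the plain-SFT cost by L. The episode's quoted RL optimum for these settings, 4·W·K = 900, carries a constant factor that the expressions above omit.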

15:07Cassidy: That's the counterintuitive beat — external orchestration helps a lot and still can't fully substitute for backtracking baked into the weights.

15:16Finn: Okay. Now I want to be the skeptic, because this is a clean result and clean results deserve a hard look. And the sharpest objection is that the central theorem is close to true by construction.

15:29Cassidy: Say more.

15:30Finn: The paper defines SFT as training only on golden shortest paths, with explicitly zero backtracking and zero negative examples. Once you set it up that way, "backward states get no gradient" follows almost immediately from that basic cross-entropy fact we leaned on. So a skeptic can fairly say: this isn't so much a discovery as a careful formalization of an assumption baked into the premise. Real SFT datasets often do contain some imperfect or self-correcting traces. The paper's SFT is a deliberately pessimistic, strawman-clean version — and the dramatic exponential headline is strongest precisely against that strawman.

16:10Cassidy: That's a fair hit. Although the authors are upfront that the pessimism is the point.

16:15Finn: They are. But there's more, and it stacks. Remember that flag from earlier — the model can't see where it's going. It's a target-independent search; it doesn't condition on the problem it's solving at all. Real reasoning models condition heavily on the prompt and reason toward a specific goal. Whether these backtracking dynamics survive when the policy can actually see its target — that's just not addressed, and it's a real gap between this sandbox and chain-of-thought as actually practiced.

16:47Cassidy: And the exponential base itself is a modeling choice.

16:50Finn: That's the one I'd press hardest. The L parallel lanes are what manufacture the L-to-the-K blowup. That topology encodes the idea that backtracking faces lots of tempting forward U-turns. It's plausible for real search spaces — but the drama of the exponential depends on that specific structure. A different graph gives a milder separation. Plus the RL analysis leans on idealizations — population-level gradients, continuous-time dynamics, and a simplified version of the learning rule that uses only the sign of the gradient, not its real magnitude. The transformer experiment is reassuring, but it's a single-layer transformer started right at the toy policy. It's not a stress test of those idealizations.

17:36Cassidy: So where does that leave the contribution, honestly?

17:40Finn: I'd put it this way. The paper doesn't discover that RL beats imitation at reasoning — everybody already believed that. What it does is formalize one concrete instance of it and quantify the gap as exponential in reasoning depth, with a clean mechanism for why. That's genuinely valuable. But the word "provable" in the title is doing work against a clean toy and a pessimistic definition of the rival. The honest framing is: this turns a known intuition into a sharp, mechanistic theorem — not that it settles the real-world question.

18:14Cassidy: And I'll concede all of that. It is a toy, the SFT setup is deliberately pessimistic, and the exponential leans on the chosen topology. What I won't give up is the mechanism, because the mechanism is what makes the last result land — and the last result is the one practitioners should actually take home.

18:34Finn: The distillation fix.

18:35Cassidy: The distillation fix. Because here's the thing the whole paper has been building toward. SFT's failure was never about SFT being a bad algorithm. It was about the data containing no failures. So what happens if you change the data?

18:50Finn: Specifically — you take a good RL-trained model, let it generate its reasoning traces, and those traces include all the messy backtracking, the wrong turns and recoveries. Then you train a fresh base model by plain imitation on those traces.

19:05Cassidy: And now the backward states do appear in the training data. So by that exact same no-examples-no-nudge rule — which was working against imitation this whole time — the gradient finally flows to them. They get learned. And the fresh model recovers the efficient, linear-time backtracking. The cross-entropy rule that doomed clean-data SFT starts working for you the moment the data has dead ends in it.

19:30Finn: Which reframes what "high-quality reasoning data" even means. Quality isn't cleanliness. The polished, textbook-perfect worked solution — the thing you'd intuitively call the best training data — is exactly the wrong dataset, because it's been scrubbed of the one thing worth learning.

19:49Cassidy: A study guide that shows you the final proof teaches you less than a transcript of someone actually struggling — "I tried this, it broke, so I backed up and went the other way." The struggle is the lesson.

20:01Finn: And it gives a clean, theoretically grounded reason for something the field already does — distilling from reasoning models works, and it works because you're inheriting the recoveries, not just the answers. With the honest caveat: it still requires already having one good reasoner to generate those traces. It explains why distillation works. It doesn't tell you how to get the first reasoner without RL.

20:27Cassidy: So here's the big idea to walk away with, bigger than any single theorem. The value of reinforcement learning for reasoning isn't some mystical generalization. It's mechanical and almost mundane: RL trains on the model's own failures, and failure is the data imitation structurally can't contain. The dead ends are the curriculum. And once you see that, you stop asking for cleaner reasoning data and start asking whether your data has any dead ends in it at all.

20:56Finn: Which leaves a real question for where the field spends its effort. If the whole advantage of RL here is just exposure to recovery moves — then is the future about running expensive reinforcement learning everywhere, or is it about getting good at curating data that keeps the struggle in?

21:15Cassidy: So that's our question for you. If quality reasoning data is data with dead ends in it — should the field pour its compute into on-policy RL on every model, or into building datasets full of honest mistakes-and-recoveries and just distilling them in? Pick a side in the comments — we read them.

21:34Finn: If you want to go deeper, the full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, plus our weekly and monthly roundups.

21:48Cassidy: Quick housekeeping on the way out: this script was written by Anthropic's Claude Opus 4.8, Finn and I are both AI voices from ElevenLabs, and the producer isn't affiliated with either company. The paper is "Provable Benefits of RLVR over SFT for Reasoning Models," published June 22nd, 2026; we recorded this the very next day.

22:09Finn: The lesson holds for us too: the trick isn't showing the model a perfect path. It's letting it learn where the dead ends are. See you in the next one.