Literature review · 6 episodes

Reinforcement learning and reasoning training

RL edits a few tokens, mostly

The dominant story used to be that RL teaches reasoning models new moves the way RL taught AlphaGo. The evidence increasingly says otherwise, at least for math: RL-trained and base models agree on 97–99% of tokens, with disagreements concentrated at high-entropy 'fork' positions where the right choice was already in the base model's top five, and a tiny contrastive run reproduces full RL accuracy for around $25 E026. RL looks like calibration of an already-capable base model rather than the discovery of new strategies.
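The token-level comparison can be made concrete with a small sketch. It assumes both models are scored along the same prefix, so positions line up, and that the base model's next-token distribution is available at each position. The function names, the toy distributions, and the top-five cutoff as a parameter are illustrative assumptions, not the method used in E026.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a next-token distribution {token: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def fork_report(base_dists, base_tokens, rl_tokens, k=5):
    """Compare the tokens chosen by a base and an RL-tuned model, position by position.

    base_dists:  the base model's next-token distribution at each position
    base_tokens: the token the base model chose at each position
    rl_tokens:   the token the RL-tuned model chose at each position
    """
    assert len(base_dists) == len(base_tokens) == len(rl_tokens)
    disagreements = []
    for pos, (dist, b, r) in enumerate(zip(base_dists, base_tokens, rl_tokens)):
        if b != r:
            top_k = sorted(dist, key=dist.get, reverse=True)[:k]
            disagreements.append({
                "pos": pos,
                "base_entropy": entropy(dist),
                "rl_in_base_top_k": r in top_k,
            })
    agreement = 1 - len(disagreements) / len(base_tokens)
    return {"agreement": agreement, "disagreements": disagreements}

# Toy data (made up): three confident positions and one high-entropy fork.
base_dists = [
    {"x": 0.98, "y": 0.02},
    {"=": 0.99, "+": 0.01},
    {"so": 0.40, "but": 0.35, "wait": 0.25},  # the fork
    {"2": 0.97, "3": 0.03},
]
base_tokens = ["x", "=", "so", "2"]
rl_tokens = ["x", "=", "wait", "2"]

report = fork_report(base_dists, base_tokens, rl_tokens)
```

On this toy input the two models agree on three of four positions, and the single disagreement sits at the high-entropy position, where the RL model's choice was already among the base model's top candidates.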

The picture is task-dependent, though. On compositional multi-hop agent tasks, RL genuinely solves problems the base model cannot solve at any sampling budget, while supervised fine-tuning on the same data actively hurts and collapses strategy diversity E011. The reconciliation: RL reweights existing strategies, and whether that crosses into 'new capability' depends on whether the needed behavior was reachable in the base distribution at all.
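Whether a problem is solvable "at some sampling budget" is commonly measured with pass@k. The sketch below uses the standard unbiased estimator computed from n samples of which c are correct. That E011 used this exact estimator is an assumption, and the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from n samples with c correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: the base model never succeeds in 256 samples,
# while the RL-tuned model succeeds 12 times.
base_curve = [pass_at_k(256, 0, k) for k in (1, 8, 64, 256)]
rl_curve = [pass_at_k(256, 12, k) for k in (1, 8, 64, 256)]
```

A base curve that stays at zero as k grows is what "cannot solve at any sampling budget" looks like in practice, with the caveat that zero successes in n samples only bounds the true success rate rather than proving it is zero.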

Baselines were quietly broken

A sobering result for the entire reasoning-training literature: two silent bugs in widely-used training libraries — a misplaced branch in gradient accumulation and a mean-of-means loss aggregation error — had been deflating baselines across a subfield for over a year E009. Once corrected, plain SFT-then-RL beats every published mixed-policy method at roughly half the FLOPs. The structural lesson is that when a whole subfield's baselines flow through the same framework, independent replication becomes illusory and framework diversity is epistemic insurance.
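To see why a mean-of-means aggregation matters, here is a generic illustration. It is not the specific library bug from E009, only the general pattern: when micro-batches hold different numbers of tokens, averaging the per-micro-batch means gives a different loss than averaging over all tokens. The per-token losses below are made up.

```python
def token_mean(micro_batches):
    """Reference aggregation: average the loss over every token in the accumulated batch."""
    total = sum(sum(mb) for mb in micro_batches)
    count = sum(len(mb) for mb in micro_batches)
    return total / count

def mean_of_means(micro_batches):
    """Problematic pattern: average each micro-batch first, then average those averages."""
    per_batch = [sum(mb) / len(mb) for mb in micro_batches]
    return sum(per_batch) / len(per_batch)

# Two micro-batches of per-token losses: 8 easy tokens and 2 hard tokens.
micro_batches = [[1.0] * 8, [3.0] * 2]

reference = token_mean(micro_batches)   # (8 * 1.0 + 2 * 3.0) / 10
skewed = mean_of_means(micro_batches)   # (1.0 + 3.0) / 2
```

The short micro-batch counts as much as the long one under mean-of-means, so its two tokens get four times the per-token weight they should. The two aggregations coincide only when every micro-batch holds the same number of tokens, which is why this kind of error can pass unnoticed in tests with uniform sequence lengths.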

Process rewards may outrank answers

A quiet inversion is underway in how we reward reasoning. Outcome-only RL converges on 'premature confidence' as a local optimum, where the answer is fixed at the first token and the chain of thought is decorative — detectable cheaply by truncating the trace and watching whether confidence is flat-high from the start E081. Operationalizing metacognition as a process reward produces a 9B model that beats far larger ones, and the ablation flips conventional wisdom: removing the process reward hurts more than removing the correctness reward E079.
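The truncation check can be sketched as follows. The `answer_confidence` callable stands in for whatever returns the model's probability of its final answer given a partial trace. It is a hypothetical hook, and the thresholds and truncation fractions are assumptions chosen for illustration rather than values from E081.

```python
def confidence_curve(trace_tokens, answer_confidence,
                     fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Confidence in the final answer after seeing growing prefixes of the trace."""
    curve = []
    for f in fractions:
        prefix = trace_tokens[: int(len(trace_tokens) * f)]
        curve.append(answer_confidence(prefix))
    return curve

def is_premature(curve, high=0.9, max_spread=0.05):
    """Flag a trace whose confidence is already high with no trace and barely moves after."""
    return min(curve) >= high and (max(curve) - min(curve)) <= max_spread

trace = ["t"] * 8  # placeholder reasoning tokens

# Hypothetical scorers: one ignores the trace, one gains confidence as it reads.
decorative = lambda prefix: 0.95
working = lambda prefix: 0.2 + 0.75 * len(prefix) / 8

decorative_curve = confidence_curve(trace, decorative)
working_curve = confidence_curve(trace, working)
```

Here `is_premature(decorative_curve)` flags the trace whose confidence never depended on the reasoning, while the curve that climbs with the trace is left alone.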

The weak verifier is the field's hidden bottleneck. Letting a model learn good error-spotting with the answer key in hand and then taking the key away yields a critic that lifts an 8B model past a 235B one, and training inside the verification loop improves even the model's solo first attempts E099. The recurring caveat across all three results (E081, E079, E099): the reward signals are themselves LLM-generated, so whether models learn real metacognition or just its format remains open.

RL's invisible failure modes

Even when RL works, its standard dashboards lie. Entropy conflates within-prompt variation with dependence on the prompt, letting 'template collapse' (fluent reasoning that has quietly stopped reading the input) run undetected; filtering low-reward-variance prompts recovers a 16-point gain while cutting compute E010. Separately, task-completion RL silently makes agents worse at exploring unfamiliar environments, pushing error-recovery rates to zero, though a cheap interleaving fix exists E052. On the safety side, models can already reason about sabotaging their own RL training, and deterministic failure undermines GRPO more effectively than random failure because it collapses the advantage signal E007.
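Both the variance filter and the advantage collapse follow from how group-relative advantages are computed. A minimal sketch, assuming the common GRPO-style normalization in which each reward is standardized against the other samples for the same prompt; the epsilon, the threshold, and the reward values are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def keep_prompt(rewards, min_std=0.05):
    """Drop prompts whose sampled rewards barely vary: they carry almost no signal."""
    return pstdev(rewards) >= min_std

# Random failure: some samples in the group succeed, so advantages are nonzero.
random_failure = [1.0, 0.0, 0.0, 1.0]
# Deterministic failure: every sample fails, so every advantage is zero.
deterministic_failure = [0.0, 0.0, 0.0, 0.0]

adv_random = group_advantages(random_failure)
adv_deterministic = group_advantages(deterministic_failure)
```

When every sample in a group receives the same reward, every advantage is zero and the prompt contributes no gradient. That is why a policy that fails consistently starves the update more thoroughly than one that fails at random, and why prompts with near-zero reward variance can be filtered out to save compute.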

The constructive endpoint is that recipe, not scale, can carry olympiad-grade reasoning: a 30B open model matched top human USAMO scores using a reverse-perplexity curriculum and a two-stage verifiable-then-proof-quality reward progression E048.

Episodes anchoring this topic

What RL Actually Does to Language Models, at the Token Level
Showed RL on math edits 1–3% of tokens the base model already considered.
When RL Actually Teaches Agents Something New, And When It Doesn't
Demonstrated RL genuinely expands capability on compositional tasks where SFT regresses.
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
Traced a whole subfield's deflated baselines to two silent library bugs.
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Found process rewards can matter more than final-answer correctness.
When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence
Showed outcome-only RL converges on decorative chain-of-thought and how to detect it.
When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL
Identified template collapse, a failure entropy metrics cannot see.