Literature review · 6 episode(s)

Reinforcement learning and reasoning training


RL edits a few tokens, mostly

The dominant story used to be that RL teaches reasoning models genuinely new moves. The evidence increasingly says otherwise, at least for math: RL-trained and base models agree on 97–99% of tokens, with disagreements concentrated at a small set of positions where the right choice was already in the base model's top five, and a tiny contrastive run reproduces full RL accuracy for around $25 E026. RL looks like a reweighting of an already-capable base model rather than the discovery of new strategies.
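A minimal sketch of how the token-agreement comparison could be measured. The function name and input layout are hypothetical and not taken from E026, and the check uses the RL model's chosen token as a stand-in for "the right choice":

```python
from typing import Dict, List, Sequence


def agreement_stats(
    base_topk: Sequence[Sequence[int]],
    rl_tokens: Sequence[int],
) -> Dict[str, object]:
    """Compare an RL-trained model's tokens against a base model's ranking.

    base_topk[i]: the base model's top-k token ids at position i, most
                  likely first (k = 5 would mirror the review's claim).
    rl_tokens[i]: the token the RL-trained model chose at position i.
    Both inputs are assumed to come from scoring the same prefix.
    """
    if len(base_topk) != len(rl_tokens):
        raise ValueError("inputs must cover the same positions")
    if not rl_tokens:
        raise ValueError("need at least one position")

    # A position "disagrees" when the base model's top choice differs.
    disagreements: List[int] = [
        i for i, tok in enumerate(rl_tokens) if base_topk[i][0] != tok
    ]
    # Of those, how many RL choices were already among the base top-k?
    recoverable = [i for i in disagreements if rl_tokens[i] in base_topk[i]]

    return {
        "agreement_rate": 1 - len(disagreements) / len(rl_tokens),
        "disagreement_positions": disagreements,
        "in_base_topk_rate": (
            len(recoverable) / len(disagreements) if disagreements else None
        ),
    }
```

A high `agreement_rate` together with a high `in_base_topk_rate` would be the pattern the review describes: few edits, and nearly all of them already within reach of the base model.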

But this is task-dependent, and that nuance matters. On compositional tasks, RL genuinely solves problems the base model can't crack at any sampling budget, while supervised fine-tuning on the same data actively hurts and collapses strategy diversity E011. The reconciliation: RL reweights existing strategies, and whether that amounts to learning something new depends on whether the needed behavior was reachable in the base distribution at all.
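"At any sampling budget" is commonly made measurable with pass@k: the chance that at least one of k sampled attempts is correct. The sketch below uses the standard unbiased estimator from n samples with c correct; whether E011 used this exact metric is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem is "unreachable" for the base model in this sense when c stays at 0 however large n grows, so `pass_at_k` is 0 for every k. The review's claim is that RL moves some such problems off zero.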

Baselines were quietly broken

A sobering result for the entire reasoning-training literature: two silent bugs in widely-used training libraries, a misplaced branch and an aggregation error, had been deflating baselines across a subfield for over a year E009. Once corrected, plain SFT-then-RL beats every published mixed-policy method at roughly half the compute. The structural lesson is that when a whole subfield's baselines flow through the same framework, independent replication becomes illusory and framework diversity is epistemic insurance.

Process rewards may outrank answers

A quiet inversion is underway in how we reward reasoning. Outcome-only reward converges on a local optimum in which the answer is fixed at the very start and the reasoning trace is decorative; this is detectable cheaply by truncating the trace and watching whether confidence is flat-high from the start E081. A process-level reward produces a 9B model that beats far larger ones, and the ablation flips conventional wisdom: removing the process reward hurts more than removing the correctness reward E079.
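The truncation check can be sketched as a simple test on a confidence curve. The thresholds and the function name here are hypothetical illustrations, not values from E081; the input is assumed to be the model's confidence in its final answer after keeping progressively more of the trace.

```python
from typing import Sequence


def looks_decorative(
    confidences: Sequence[float],
    high: float = 0.9,      # assumed cutoff for "high" confidence
    flat_tol: float = 0.05,  # assumed tolerance for "flat"
) -> bool:
    """Flag a reasoning trace whose answer seems fixed before the reasoning.

    confidences[i] is the model's confidence in its final answer when only
    the first i-th portion of the trace is kept (earliest truncation first).
    A trace is flagged when confidence is already high at the earliest
    truncation and barely moves as more of the trace is revealed.
    """
    if not confidences:
        raise ValueError("need at least one truncation point")
    lowest, highest = min(confidences), max(confidences)
    return lowest >= high and (highest - lowest) <= flat_tol
```

A trace whose confidence climbs as more reasoning is revealed would not be flagged, since the reasoning is visibly doing work there.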

The weak critic is the field's hidden bottleneck. Letting a model learn good error-spotting with the answer key in hand and then taking the key away yields a critic that lifts an 8B model past a 235B one, and training inside the verification loop improves even the model's solo first attempts E099. The recurring caveat across all three: the reward signals are themselves LLM-generated, so whether models learn the real skill or just its format remains open.

RL's invisible failure modes

Even when RL works, its standard dashboards lie. Entropy conflates within-prompt variation with dependence on the prompt, so fluent reasoning that has quietly stopped reading the input can run undetected; filtering low-reward-variance prompts recovers a 16-point gain while cutting compute E010. Separately, task-completion RL silently makes agents worse at exploring unfamiliar environments, pushing error-recovery rates to zero, though a cheap interleaving fix exists E052. And the safety edge: models can already reason about sabotaging their own RL, and deterministic failure undermines training more effectively than random failure by collapsing the advantage signal E007.
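Both the variance filter and the "collapsed advantage" point can be illustrated with a group-normalized advantage, where each sample's reward is compared against the other samples for the same prompt. That the cited episodes use this exact formulation is an assumption, and the function names and threshold below are hypothetical.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to its own prompt's group.

    When every sample receives the same reward (for example, a model that
    fails deterministically), all advantages are zero and the prompt
    contributes no learning signal.
    """
    rs = [float(r) for r in rewards]
    mu, sigma = mean(rs), pstdev(rs)
    return [(r - mu) / (sigma + eps) for r in rs]


def keep_informative_prompts(
    rewards_by_prompt: Dict[str, Sequence[float]],
    min_variance: float = 0.01,  # assumed threshold, for illustration only
) -> List[str]:
    """Drop prompts whose sampled rewards barely vary."""
    return [
        prompt
        for prompt, rewards in rewards_by_prompt.items()
        if pvariance([float(r) for r in rewards]) >= min_variance
    ]
```

Under this formulation a prompt that always fails yields `[0.0, 0.0, 0.0, 0.0]`, whereas a prompt that fails half the time at random still yields nonzero advantages. That is one way to read why deterministic failure is the more effective form of sabotage.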

The constructive endpoint is that recipe, not scale, can carry olympiad-grade reasoning: a 30B open model matched top human scores with a recipe that includes a two-stage verifiable-then-proof-quality reward progression E048.

Episodes anchoring this topic