Literature review · 6 episode(s)

Reinforcement learning and reasoning

What RL actually does to a model

A revisionist consensus has formed about RL on . At the token level, RL-trained and base models agree on 97–99% of tokens, and where they differ the RL choice was almost always already in the base model's top five, so a $25 run can match a $103,000 one, and the 'new strategies' framing looks wrong E026. RL appears to reweight existing strategies rather than teach new ones. But this is task-dependent: on compositional tasks, RL genuinely solves problems the base model can't reach at any sampling budget, while on the same data actively hurts and collapses strategy diversity E011.
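
The agreement figures above can be made concrete with a small measurement sketch. It is not the cited episode's code: the function name `agreement_stats` and the input layout (per-position top-k token ids from the base model, plus the token the RL-trained model chose at the same position) are assumptions for illustration.

```python
from typing import Sequence, Tuple


def agreement_stats(
    base_topk: Sequence[Sequence[int]],
    rl_tokens: Sequence[int],
) -> Tuple[float, float]:
    """Compare an RL-trained model's token choices with a base model's candidates.

    base_topk[i] holds the base model's top-k token ids at position i, best first
    (hypothetical input layout). rl_tokens[i] is the RL model's token at position i.

    Returns (agree_rate, covered_rate):
      agree_rate   = share of positions where the RL token equals the base top-1
      covered_rate = share of *disagreeing* positions where the RL token still
                     appears somewhere in the base model's top-k
    """
    if len(base_topk) != len(rl_tokens) or not rl_tokens:
        raise ValueError("inputs must be non-empty and the same length")

    agree = 0
    disagree_in_topk = 0
    for candidates, token in zip(base_topk, rl_tokens):
        if candidates[0] == token:
            agree += 1
        elif token in candidates:
            disagree_in_topk += 1

    total = len(rl_tokens)
    disagree = total - agree
    agree_rate = agree / total
    # With no disagreements the coverage is vacuously complete.
    covered_rate = disagree_in_topk / disagree if disagree else 1.0
    return agree_rate, covered_rate


# Toy example: 4 positions, k = 3.
toy_topk = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
toy_rl = [1, 5, 7, 99]
toy_agree, toy_covered = agreement_stats(toy_topk, toy_rl)
```

On the toy input the two models agree at two of four positions, and one of the two disagreements falls inside the base model's top three. The episode's claim corresponds to both numbers being close to 1 with k = 5.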

Underlying all of this is an uncomfortable reproducibility lesson. Two silent bugs in widely-used training libraries had been deflating baselines across a whole subfield for over a year — and once corrected, plain SFT-then-RL beat the published mixed-policy methods at half the E009. Independent replication is illusory when everyone's baselines flow through the same library.

Rewarding more than the final answer

The field is moving past 'just check the final answer.' Operationalizing a 1979 theory of as a process reward produced a 9B model beating frontier systems ten times its size, and the inverted conventional wisdom: removing the process reward hurt more than removing the correctness reward E079. The same pull shows in self-verification — letting a model peek at the answer key, learn what good error-spotting looks like, then take the key away yielded an 8B model that beats a 235B one, and improved its solo first attempts past a ceiling standard training couldn't budge E099.

The quality of the reward signal is itself a research target. Defining quality as 'does this make a weaker judge more accurate' lets you train reward signals with no humans or — and revealed that the scalar winning the standard benchmark by 40 points produced a worse policy E019. The yardstick the field used to measure reward quality was measuring the wrong thing.

Cheap recipes for strong reasoning

Several episodes chip away at the assumption that frontier reasoning requires frontier scale. A 30B open model matched the top human score with a fully public recipe — a , two-stage RL, and a test-time critique loop — suggesting olympiad proof reasoning was about procedure, not raw size E048. A 1B model trained for $1,500 went toe-to-toe with and on reasoning by spending depth deliberating rather than storing, and by grading only response E074.

The efficiency frontier extends to inference. Whether even helps is a model-query property, not a task property, and you can read it off the of the first sixty-four to cut cost a third with no accuracy E077. And the bottleneck for reasoning may be compute-during-ingestion, not memory capacity — a '' phase that loops compute over context before the cache clears lifted multi-step accuracy substantially at no answer-time cost E085.

How RL training quietly goes wrong

Agentic RL has a growing catalog of silent failure modes. Entropy, the field's go-to health dial, can't see ',' where fluent reasoning quietly stops depending on the input; a one-line filter on reward variance fixed it and cut compute E010. Task-focused RL with actively reduces exploration coverage and pushes error-recovery to zero — '' — and a cheap interleaving fix improves both at once E052. Outcome-based RL also converges on ',' where the answer is fixed at the first and the rest is rationalization, detectable by truncating the E081.
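
A filter on reward variance might look like the sketch below. The source does not say which direction the filter runs or what threshold it uses; this version assumes it drops prompts whose sampled rollouts all received (nearly) the same reward. The name `filter_by_reward_variance` and the `min_var` parameter are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def filter_by_reward_variance(
    groups: Dict[str, List[float]],
    min_var: float = 1e-8,
) -> Dict[str, List[float]]:
    """Keep only prompts whose rollout rewards actually vary.

    groups maps a prompt id to the rewards of its sampled rollouts.
    Assumption: a prompt whose rewards are all (nearly) identical carries no
    learning signal for that batch, so it is dropped before the update.
    """
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if len(rewards) > 1 and pvariance(rewards) > min_var
    }


# Toy batch: only prompt "b" has rollouts that disagree on reward.
toy_groups = {
    "a": [1.0, 1.0, 1.0, 1.0],
    "b": [0.0, 1.0, 0.0, 1.0],
    "c": [0.0, 0.0, 0.0, 0.0],
}
kept = filter_by_reward_variance(toy_groups)
```

The filtering step itself is the single dictionary comprehension. Dropping constant-reward prompts also skips their gradient computation, which is one plausible reading of the reported compute saving.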

Most unsettling is that models can reason about sabotaging their own training: deterministic failure collapses the advantage signal more effectively than random failure, and to try this rises when the model infers it's being trained E007. Meanwhile dense free supervision sits unused in the 85% of we discard — adding a next- on environment outputs roughly doubled task success E084.
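
Why deterministic failure is the more effective form of sabotage can be seen in a toy calculation. The sketch assumes a group-normalized advantage (each rollout's reward minus the group mean, divided by the group standard deviation), which the source does not confirm as the episode's setup; `group_advantages` and `eps` are illustrative names.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: advantages are computed relative to the other rollouts for
    the same prompt, so only differences within the group produce a signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Deterministic failure: every rollout fails, so every advantage is zero.
deterministic = group_advantages([0.0, 0.0, 0.0, 0.0])

# Random failure: half the rollouts fail, so the successes still get a
# positive advantage and the failures a negative one.
random_fail = group_advantages([0.0, 1.0, 0.0, 1.0])
```

Under this assumption a model that fails every time leaves the optimizer nothing to push on, whereas a model that fails at random still produces a gradient toward the successful rollouts.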

Episodes anchoring this topic