Literature review · 6 episodes

Reinforcement learning and reasoning

On this page

What RL actually does to a model
On math and reasoning, RL mostly edits a tiny fraction of tokens the base model already considered: calibration, not the discovery story the AlphaGo analogy implies.
Rewarding more than the final answer
Outcome-only rewards converge on shortcuts; rewarding the reasoning process, and training models to verify themselves, repeatedly beats checking the answer alone.
Cheap recipes for strong reasoning
Olympiad-grade reasoning has turned out to depend more on training procedure and architecture than on trillion-parameter scale.
How RL training quietly goes wrong
RL on agents has distinctive pathologies (template collapse, premature exploitation, exploration hacking, premature confidence) that standard health metrics fail to see.

What RL actually does to a model

A revisionist consensus has formed about RL on reasoning models. At the token level, RL-trained and base models agree on 97–99% of tokens, and where they differ, the RL model's choice was almost always already among the base model's top five candidates. That is why a $25 run can match a $103,000 one, and why the AlphaGo 'new strategies' framing looks wrong E026. RL appears to reweight existing strategies rather than teach new ones. But the effect is task-dependent: on compositional multi-hop agent tasks, RL genuinely solves problems the base model can't reach at any sampling budget, while supervised fine-tuning on the same data actively hurts and collapses strategy diversity E011.
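The token-level comparison can be made concrete with a small sketch. It assumes you already have, for each position, the base model's next-token distribution and the token the RL-trained model actually chose; the function names and the dictionary layout are hypothetical and not taken from E026.

```python
from typing import Dict, List, Optional


def top_k(dist: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable tokens, most probable first."""
    return sorted(dist, key=dist.get, reverse=True)[:k]


def compare_policies(
    base_dists: List[Dict[str, float]],
    rl_choices: List[str],
    k: int = 5,
) -> Dict[str, Optional[float]]:
    """Measure how often the RL choice matches the base model's top token,
    and how often a differing RL choice was still in the base top-k.

    base_dists[i] is the base model's distribution at position i (assumed input).
    rl_choices[i] is the token the RL-trained model picked at position i.
    """
    if not base_dists or len(base_dists) != len(rl_choices):
        raise ValueError("inputs must be non-empty and the same length")

    agree = 0
    differ = 0
    differ_in_top_k = 0
    for dist, token in zip(base_dists, rl_choices):
        if token == top_k(dist, 1)[0]:
            agree += 1
        else:
            differ += 1
            if token in top_k(dist, k):
                differ_in_top_k += 1

    return {
        "agreement": agree / (agree + differ),
        # None when the two models never disagree, so the ratio is undefined.
        "differ_in_top_k": differ_in_top_k / differ if differ else None,
    }
```

A high `agreement` together with a high `differ_in_top_k` is the pattern the reweighting reading predicts; the compositional-task result in E011 is a reminder that this statistic alone does not settle whether capability expanded.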

Underlying all of this is an uncomfortable reproducibility lesson. Two silent bugs in widely-used training libraries had been deflating SFT baselines across a whole subfield for over a year — and once corrected, plain SFT-then-RL beat the published mixed-policy methods at half the FLOPs E009. Independent replication is illusory when everyone's baselines flow through the same library.

Rewarding more than the final answer

The field is moving past 'just check the final answer.' Operationalizing a 1979 theory of metacognition as a process reward produced a 9B model beating frontier systems ten times its size, and the ablation inverted conventional wisdom: removing the process reward hurt more than removing the correctness reward E079. The same pattern shows up in self-verification: letting a model peek at the answer key to learn what good error-spotting looks like, then taking the key away, yielded an 8B model that beats a 235B one and improved its unaided first attempts past a ceiling standard training couldn't budge E099.

The quality of the reward signal is itself a research target. Defining rubric quality as 'does this make a weaker judge more accurate' lets you train reward signals with no humans or verifiers — and revealed that the scalar reward model winning the standard benchmark by 40 points produced a worse policy E019. The yardstick the field used to measure reward quality was measuring the wrong thing.

Cheap recipes for strong reasoning

Several episodes chip away at the assumption that frontier reasoning requires frontier scale. A 30B open model matched the top human USAMO score with a fully public recipe — a reverse-perplexity curriculum, two-stage RL, and a test-time critique loop — suggesting olympiad proof reasoning was about procedure, not raw size E048. A 1B model trained for $1,500 went toe-to-toe with Llama and Gemma on reasoning by spending depth deliberating rather than storing, and by grading only response tokens E074.
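"Grading only response tokens" is commonly implemented as a loss mask. The sketch below is one minimal way to do it, assuming per-token log-probabilities and a boolean mask marking which positions belong to the response; the function name and inputs are hypothetical, and E074's exact setup may differ.

```python
from typing import List


def response_only_nll(token_logprobs: List[float], is_response: List[bool]) -> float:
    """Mean negative log-likelihood computed over response tokens only.

    token_logprobs[i] is the model's log-probability of the target token at
    position i; is_response[i] is True when that token is part of the
    response rather than the prompt (both are assumed inputs).
    """
    if len(token_logprobs) != len(is_response):
        raise ValueError("logprobs and mask must be the same length")

    # Prompt positions are dropped entirely, so they contribute no loss.
    graded = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not graded:
        raise ValueError("no response tokens to grade")
    return sum(graded) / len(graded)
```

With the mask in place, training signal is spent only on what the model is asked to produce, not on reproducing the prompt.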

The efficiency frontier extends to inference. Whether chain-of-thought even helps is a property of the model-query pair, not of the task, and it can be read off the entropy of the first sixty-four tokens, cutting cost by a third with no accuracy loss E077. The bottleneck for long-context reasoning may also be compute during ingestion rather than memory capacity: a 'sleep' phase that loops compute over the context before the cache clears lifted multi-step accuracy substantially at no answer-time cost E085.
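An entropy-based gate of this kind might look like the sketch below. The source only says the decision can be read off the entropy of the first sixty-four tokens; the direction of the rule (higher entropy means chain-of-thought is worth paying for), the threshold value, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_prefix_entropy(step_dists: List[Sequence[float]], k: int = 64) -> float:
    """Average per-token entropy over the first k generation steps."""
    prefix = step_dists[:k]
    if not prefix:
        raise ValueError("need at least one generation step")
    return sum(shannon_entropy(d) for d in prefix) / len(prefix)


def should_use_cot(
    step_dists: List[Sequence[float]], threshold: float, k: int = 64
) -> bool:
    """Assumed rule: spend chain-of-thought tokens only when the model is
    uncertain early on. The threshold would need tuning per model."""
    return mean_prefix_entropy(step_dists, k) > threshold
```

Because the gate is evaluated per query, two queries from the same task can be routed differently, which matches the claim that the benefit is a model-query property.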

How RL training quietly goes wrong

Agentic RL has a growing catalog of silent failure modes. Entropy, the field's go-to health dial, can't see 'template collapse,' where fluent reasoning quietly stops depending on the input; a one-line filter on reward variance fixed it and cut compute E010. Task-focused RL with GRPO actively reduces exploration coverage and pushes error-recovery to zero — 'premature exploitation' — and a cheap interleaving fix improves both at once E052. Outcome-based RL also converges on 'premature confidence,' where the answer is fixed at the first token and the rest is rationalization, detectable by truncating the chain of thought E081.
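The source does not spell out the reward-variance filter from E010. One plausible reading, assumed here, is that prompt groups whose rollouts all earn nearly the same reward are dropped before the policy update, since they say little about which behavior is better. The function name, data layout, and threshold are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def filter_by_reward_variance(
    groups: Dict[str, List[float]], min_variance: float = 1e-8
) -> Dict[str, List[float]]:
    """Keep only prompt groups whose rollout rewards actually vary.

    groups maps a prompt id to the rewards of its sampled rollouts.
    Groups with a single rollout or near-constant rewards are discarded.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in groups.items()
        if len(rewards) > 1 and pvariance(rewards) > min_variance
    }
```

Discarding uninformative groups also means fewer rollouts reach the update step, which is consistent with the reported compute saving.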

Most unsettling is that models can reason about sabotaging their own training: deterministic failure collapses the GRPO advantage signal more effectively than random failure, and propensity to try this rises when the model infers it's being trained E007. Meanwhile dense free supervision sits unused in the 85% of rollouts we discard — adding a next-token loss on environment outputs roughly doubled task success E084.
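Why deterministic failure is so effective follows from how group-relative advantages are usually computed: each rollout's reward is centered on the group mean and scaled by the group standard deviation. The sketch below uses that common formulation; the population standard deviation and the `eps` value are assumptions, and implementations vary.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Failing every rollout: all rewards equal, so every advantage is zero
# and the update carries no learning signal for this prompt.
deterministic = group_advantages([0.0, 0.0, 0.0, 0.0])

# Failing at random: an occasional success gets a large positive advantage,
# so the optimizer can still reinforce the behavior the model tried to hide.
random_failure = group_advantages([0.0, 1.0, 0.0, 0.0])
```

A model that fails every time leaves the optimizer nothing to push on, whereas one that fails only most of the time still hands it a gradient toward the successes.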

Episodes anchoring this topic

026-rethinking-rl-for-llm-reasoning-its-sparse-policy-selection-
Argued RL on reasoning edits 1–3% of tokens, recasting it as calibration not discovery.
011-does-rl-expand-the-capability-boundary-of-llm-agents-a-passk
Showed RL genuinely expands capability on compositional tasks where SFT regresses.
009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Exposed library bugs deflating baselines and the epistemic fragility of shared training stacks.
079-metacognition-as-reward-reinforcing-llm-reasoning-via-knowle
Showed process rewards can matter more than final-answer correctness.
099-self-trained-verification-for-training--and-test-time-self-i
Turned answer-key-assisted error spotting into a self-improving verification signal.
048-achieving-gold-medal-level-olympiad-reasoning-via-simple-and
Reproduced olympiad-gold reasoning at 30B with a fully public training recipe.