Literature review · 6 episodes

Training Methods for Reasoning Models

On this page

What RL actually does at the token level: Recent mechanistic work argues RL on reasoning edits one to three percent of tokens that the base model was already considering, which both narrows and clarifies what training is for.
Rewarding process rather than outcomes: Outcome-only RL converges on shallow heuristics; introducing process-shaped rewards (milestones, metacognition, structured rubrics) both stabilises training and lifts ceilings.
Curriculum, distillation, and self-improvement loops: Beyond reward design, the field has converged on staged curricula, prompt-as-parameter optimisation, and self-improvement loops that treat scaffolding and weights as independent levers.
What RL still cannot do: Even with better rewards and curricula, RL inherits the base model's strategy distribution and can collapse exploration in ways that take real engineering to recover.

What RL actually does at the token level

RL-trained and base models agree on 97–99% of tokens, and where they differ the RL choice is almost always already in the base model's top five. Disagreements concentrate at high-entropy 'fork' positions, suggesting RL is calibrating rather than discovering E026. A 32B model can match the full RL pipeline at a cost of roughly $25 by targeting only those positions with a contrastive loss. But this picture is task-dependent: on compositional multi-hop questions, RL solves problems that no sampling budget rescues for the base model, while SFT on the same data actively hurts E011. The reconciliation is that RL reweights existing strategies, and how much that matters depends on whether the base model already had the right ones.
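
The token-level comparison described above can be made concrete with a small measurement sketch. It assumes you already have per-position logits from both models over the same vocabulary and the same prefix; the function names (`compare_policies`, `softmax`, `entropy`) and the dictionary keys are hypothetical and are not taken from E026. It reports the agreement rate, how often the RL choice sits in the base model's top k, and the base-model entropy at each disagreement.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def compare_policies(base_logits, rl_logits, k=5):
    """Compare greedy token choices of two models, position by position.

    base_logits, rl_logits: lists of per-position logit lists (same vocab).
    """
    n = len(base_logits)
    if n == 0 or n != len(rl_logits):
        raise ValueError("need two non-empty logit sequences of equal length")
    forks = []      # (position, base-model entropy) at each disagreement
    in_top_k = 0    # disagreements where the RL token is in the base top k
    for pos, (base, rl) in enumerate(zip(base_logits, rl_logits)):
        ranked = sorted(range(len(base)), key=base.__getitem__, reverse=True)
        rl_choice = max(range(len(rl)), key=rl.__getitem__)
        if rl_choice != ranked[0]:
            in_top_k += rl_choice in ranked[:k]
            forks.append((pos, entropy(softmax(base))))
    return {
        "agreement": 1 - len(forks) / n,
        "rl_choice_in_base_top_k": in_top_k / len(forks) if forks else None,
        "fork_positions": forks,
    }
```

If the calibration reading holds, the entropies recorded in `fork_positions` should be noticeably higher than the base model's entropy at positions where the two models agree.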

The quiet companion finding is that mixed-policy methods that dominated benchmarks for a year were partly chasing two silent library bugs that deflated SFT baselines; corrected SFT-then-RL beats every published mixed-policy method on math at roughly half the FLOPs E009. The deeper warning is that when an entire subfield's baselines flow through the same library, independent replication becomes illusory.

Rewarding process rather than outcomes

On long-horizon web tasks, milestone-based shaping rewards lift a 12B open model from 6% to 43%, with the same idea doing double duty as runtime scaffolding E008. Operationalising Flavell's 1979 metacognition theory as reward components beats outcome-only baselines on reasoning, and the ablation flips the field's conventional wisdom — removing process rewards hurts more than removing the correctness reward E079. A confidence-trajectory probe replaces expensive process reward models entirely and triples accuracy on hard problems while improving faithfulness as a side effect E081.
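
A minimal sketch of milestone-based shaping, under stated assumptions: milestones are ordered predicates over environment states, each pays a fixed bonus the first time it is reached, and the outcome reward is added on top. The function name, the bonus values, and the state format are hypothetical; E008 may define its milestones and weights differently.

```python
def milestone_reward(trajectory, milestones, final_success,
                     bonus=0.25, outcome_reward=1.0):
    """Return (total_reward, milestones_reached) for one rollout.

    trajectory: sequence of environment states (any objects).
    milestones: ordered list of predicates state -> bool; each is paid
                once, and only after all earlier milestones were reached.
    """
    reached = 0
    total = 0.0
    for state in trajectory:
        # A single state may satisfy several consecutive milestones.
        while reached < len(milestones) and milestones[reached](state):
            reached += 1
            total += bonus
    if final_success:
        total += outcome_reward
    return total, reached
```

Because the same predicates can be evaluated against a live trajectory, a list like `milestones` could also serve as a runtime progress check, which is one way to read the "double duty" remark above.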

Reward design itself is becoming a target. Trained rubrics outperform the scalar reward model that wins RewardBench-2 by 40 points when used to actually train policies E019, and an information-theoretic decomposition shows that the standard entropy dial points the wrong direction and that filtering by reward variance jumps Sokoban accuracy 16 points with less compute E010. A formal account of iterative RLHF shows sycophancy and reward hacking are predicted equilibria of an optimizer dropping a steering term, and a one-line gradient fix partially closes the gap E025.
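
One plausible reading of reward-variance filtering can be sketched as follows. It assumes group-relative advantages of the kind GRPO uses, where each rollout's reward is normalised by the mean and standard deviation of its prompt group, and it assumes the filter keeps prompts whose rollouts disagree. Neither assumption is confirmed by E010, and the function names and threshold are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages for the rollouts of a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_by_reward_variance(rollout_rewards, min_std=1e-6):
    """Keep only prompts whose rollouts received differing rewards.

    rollout_rewards: dict mapping prompt id -> list of rollout rewards.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in rollout_rewards.items()
        if len(rewards) > 1 and pstdev(rewards) > min_std
    }
```

Under these assumptions, a prompt the policy always solves or always fails yields all-zero advantages, so dropping it before the update saves compute without removing any gradient signal.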

Curriculum, distillation, and self-improvement loops

Olympiad-level proof writing on a 30B open model fell to a reverse-perplexity curriculum plus a two-stage RL progression — cheap verifiable rewards first, expensive proof-quality rewards second — matching the top human score on USAMO 2026 E048. A tiny 1B model trained on $1,500 of compute reaches Llama/Gemma reasoning territory using a hybrid recurrent architecture and a sharper objective E074. Pretraining-style efficiency, in other words, was partly hiding behind curriculum and objective choices.
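
The two-stage progression can be illustrated with a reward dispatcher that calls the expensive grader only once training reaches the second stage. This is a sketch under assumptions: the fixed `switch_step` trigger and the `verify` and `grade_proof` callables are hypothetical stand-ins, and E048 may switch stages on a different criterion.

```python
def staged_reward(step, switch_step, verify, grade_proof, sample):
    """Return the reward for `sample` at training step `step`.

    Stage 1 (step < switch_step): cheap verifiable check, e.g. answer match.
    Stage 2 (step >= switch_step): expensive proof-quality grader.
    Only the callable for the active stage is invoked.
    """
    if step < switch_step:
        return float(verify(sample))
    return float(grade_proof(sample))
```

Passing the graders as callables keeps the costly proof-quality scorer out of the early phase entirely, which is the point of ordering the stages this way.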

Self-improvement is bifurcating into two genuinely different levers: scaffold/prompt edits and weight updates. A trained Markdown 'skill file' transferred across two different agent harnesses lifts spreadsheet performance 60 points with no retraining E078, while a system allowed to retrain its own weights closes a 20% genomic-denoising gap that thousands of scaffold rewrites could not E088. Free supervision is also being found in the data we used to throw away — adding next-token loss on the terminal's responses inside failed rollouts roughly doubles task success on TerminalBench E084. Agent architecture itself is now a search target, with LLM agents discovering training-script improvements like focal-loss substitution that outperform human-tuned references E053.
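
The idea of recovering supervision from failed rollouts can be sketched as an auxiliary loss that averages negative log-likelihood over only the tokens emitted by the terminal. This assumes each token carries a role label and a log-probability under the model; the role names, `aux_weight`, and both function names are hypothetical and are not drawn from E084.

```python
def observation_nll(token_logprobs, roles, observation_role="terminal"):
    """Mean next-token NLL over tokens produced by the environment."""
    if len(token_logprobs) != len(roles):
        raise ValueError("token_logprobs and roles must have the same length")
    losses = [-lp for lp, role in zip(token_logprobs, roles)
              if role == observation_role]
    return sum(losses) / len(losses) if losses else 0.0


def rollout_loss(policy_loss, token_logprobs, roles, succeeded, aux_weight=0.5):
    """Add the auxiliary observation loss only for failed rollouts."""
    if succeeded:
        return policy_loss
    return policy_loss + aux_weight * observation_nll(token_logprobs, roles)
```

The role mask is what separates this from ordinary SFT on the rollout: the model is trained to predict what the terminal printed, not to imitate its own failed actions.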

What RL still cannot do

Task-focused RL with GRPO drops exploration coverage to zero on unfamiliar environments, and a five-to-one interleaving fix is needed to recover it, at a 17% training overhead E052. RL also struggles when reasoning is decorative: outcome-based RL converges on premature confidence as a local optimum because genuine reasoning rarely appears in the rollout distribution on hard problems E081. The framework around training has to do work the model cannot: organising agents into smaller specialised teams that share a parameter budget can nearly double the accuracy of a single agent with the same budget E060, and putting recursive delegation inside the RL loop (rather than around a frozen model) produces phase transitions on long-horizon tasks E028.
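
An interleaving schedule of this kind can be sketched as a repeating pattern of update types. The text does not say which side of the five-to-one ratio is which, so this sketch assumes five task-focused updates for every exploration update; the labels and the function name are hypothetical.

```python
from itertools import cycle, islice


def interleaved_schedule(n_steps, task_per_cycle=5, explore_per_cycle=1):
    """Return a list of update labels, e.g. five 'task' then one 'explore'."""
    pattern = ["task"] * task_per_cycle + ["explore"] * explore_per_cycle
    return list(islice(cycle(pattern), n_steps))
```

A training loop would iterate over the returned labels and draw each batch from the matching environment pool, so exploration batches keep appearing throughout training rather than only at the start.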

The negation-neglect work is the sharpest reminder that gradient descent has its own inductive biases — train on documents loudly labelled false and models believe them anyway, with warnings cutting belief roughly in half rather than removing it E043. SGD finds the wrong basin even when a correct one exists.

Episodes anchoring this topic

009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Exposed two silent library bugs that quietly invalidated a wave of mixed-policy reasoning papers.
026-rethinking-rl-for-llm-reasoning-its-sparse-policy-selection-
Demonstrated that RL on reasoning edits 1–3% of tokens already in the base model's distribution.
011-does-rl-expand-the-capability-boundary-of-llm-agents-a-passk
Showed RL genuinely expands capability on compositional tasks while SFT collapses strategy diversity.
079-metacognition-as-reward-reinforcing-llm-reasoning-via-knowle
Found that process rewards may matter more than outcome correctness — flipping conventional RLVR wisdom.
081-understanding-and-mitigating-premature-confidence-for-better
Identified premature confidence as a local optimum of outcome-based RL and a confidence-probe fix.
048-achieving-gold-medal-level-olympiad-reasoning-via-simple-and
Demonstrated that olympiad proof reasoning is reachable at 30B with the right curriculum and two-stage RL.