Literature review · 6 episode(s)

Training Methods for Reasoning Models


What RL actually does at the token level

RL-trained and base models agree on 97–99% of tokens, and where they differ the RL choice is almost always already in the base model's top five. Disagreements concentrate at high-entropy positions, suggesting RL is calibrating rather than discovering E026. A 32B model can match the full RL pipeline at roughly $25 by targeting only those positions with a contrastive . But this picture is task-dependent: on compositional multi-hop questions, RL solves problems no sampling budget rescues for the base model while on the same data actively hurts E011. The reconciliation is that RL reweights existing strategies, and how much that matters depends on whether the base model already had the right ones.
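The token-level comparison described above can be sketched roughly as follows. This is a minimal illustration, not the method used in E026: the input format (one token-to-probability dict per position), the function names, and the use of the base model's Shannon entropy as the uncertainty measure are all assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def compare_policies(base_dists, rl_dists, k=5):
    """Compare two models position by position.

    base_dists, rl_dists: lists of dicts mapping token -> probability,
    one dict per position (hypothetical input format).
    """
    n = min(len(base_dists), len(rl_dists))
    if n == 0:
        raise ValueError("need at least one position to compare")

    agree = 0
    in_top_k = 0
    disagreements = []  # (position, base-model entropy at that position)
    for pos, (base, rl) in enumerate(zip(base_dists, rl_dists)):
        base_choice = max(base, key=base.get)
        rl_choice = max(rl, key=rl.get)
        if base_choice == rl_choice:
            agree += 1
            continue
        # Is the RL model's choice among the base model's k most likely tokens?
        top_k = sorted(base, key=base.get, reverse=True)[:k]
        if rl_choice in top_k:
            in_top_k += 1
        disagreements.append((pos, entropy(base.values())))

    return {
        "agreement_rate": agree / n,
        "rl_choice_in_base_top_k": (
            in_top_k / len(disagreements) if disagreements else 1.0
        ),
        "disagreements": disagreements,
    }
```

If the calibration reading holds, the entropies recorded in `disagreements` should sit well above the average entropy over all positions.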

The quiet companion finding is that the mixed-policy methods that dominated benchmarks for a year were partly chasing two silent library bugs that deflated baselines; with those bugs corrected, SFT-then-RL beats every published mixed-policy method on math at roughly half the FLOPs E009. The deeper warning is that when an entire subfield's baselines flow through the same library, independent replication becomes illusory.

Rewarding process rather than outcomes

On long-horizon web tasks, milestone-based shaping rewards lift a 12B open model from 6% to 43%, with the same idea doing double duty as runtime scaffolding E008. Operationalising 's 1979 theory as reward components beats outcome-only baselines on reasoning, and the ablation flips the field's conventional wisdom: removing process rewards hurts more than removing the correctness reward E079. A confidence- replaces expensive process entirely and triples accuracy on hard problems while improving as a side effect E081.
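To make milestone-based shaping concrete, here is a minimal sketch under stated assumptions. The milestone names, the bonus size, and the rule that each milestone pays out only once are illustrative choices, not details taken from E008.

```python
def milestone_rewards(events, milestones, bonus=0.1,
                      success_reward=1.0, succeeded=False):
    """Per-step rewards for one episode.

    events: sequence of event names observed at each step (hypothetical).
    milestones: names of intermediate goals that earn a shaping bonus.
    Each milestone pays `bonus` the first time it appears; the outcome
    reward is added to the final step if the task succeeded.
    """
    pending = set(milestones)
    rewards = []
    for event in events:
        if event in pending:
            pending.remove(event)  # pay each milestone at most once
            rewards.append(bonus)
        else:
            rewards.append(0.0)
    if rewards and succeeded:
        rewards[-1] += success_reward
    return rewards
```

Paying each milestone only once is one simple way to stop a policy from collecting the bonus repeatedly without making progress. The same milestone list could also be checked at inference time, which is one reading of the "runtime scaffolding" remark.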

Reward design itself is becoming a target. Trained outperform the scalar that wins by 40 points when used to actually train policies E019, and an information-theoretic decomposition shows that the standard dial points the wrong direction and that filtering by reward variance jumps accuracy 16 points with less compute E010. A formal account of iterative shows and are predicted equilibria of an optimizer dropping a term, and a one-line fix partially closes the gap E025.
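Filtering by reward variance can be sketched as below. The text does not say which direction the filter runs, so this example assumes prompts whose sampled rewards all agree are dropped, on the grounds that under group-normalised advantages such prompts carry no learning signal. The data layout and the threshold are hypothetical.

```python
from statistics import pvariance


def filter_by_reward_variance(rollout_rewards, min_variance=1e-6):
    """Keep prompts whose rollouts disagree about the reward.

    rollout_rewards: dict mapping prompt id -> list of rewards from
    several sampled rollouts of that prompt (hypothetical layout).
    """
    kept = {}
    for prompt_id, rewards in rollout_rewards.items():
        if len(rewards) < 2:
            continue  # variance is uninformative with a single sample
        if pvariance(rewards) > min_variance:
            kept[prompt_id] = rewards
    return kept
```

Prompts that are always solved or never solved are skipped, so compute goes to the prompts where the policy's outcomes still vary.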

Curriculum, distillation, and self-improvement loops

-level proof writing on a 30B open model fell to a plus a two-stage RL progression — cheap first, expensive proof-quality rewards second — matching the top human score on 2026 E048. A tiny 1B model trained on $1,500 of compute reaches / reasoning territory using a hybrid architecture and a sharper objective E074. Pretraining-style efficiency, in other words, was partly hiding behind curriculum and objective choices.

Self-improvement is bifurcating into two genuinely different levers: scaffold/prompt edits and weight updates. A trained Markdown 'skill file' transferred across two different lifts spreadsheet performance 60 points with no retraining E078, while a system allowed to retrain its own weights closes a 20% genomic-denoising gap that thousands of scaffold rewrites could not E088. Free supervision is also being found in the data we used to throw away: adding next- on the terminal's responses inside failed roughly doubles task success on E084. Agent architecture itself is now a search target, with LLM agents discovering training-script improvements like focal-loss substitution that outperform human-tuned references E053.
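For readers unfamiliar with the focal-loss substitution mentioned above, the standard form is FL(p) = -(1 - p)^gamma * log(p), where p is the probability assigned to the true class. The sketch below shows that formula next to plain cross-entropy. It illustrates the loss itself, not the specific training-script change found in E053, and gamma = 2.0 is a commonly used default, not a value from the source.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one example; p_true is the true-class probability."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma.

    Confident, correct examples (p_true near 1) are down-weighted, so
    training focuses on hard examples. gamma = 0 gives cross-entropy.
    """
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)
```

With gamma = 2, an example predicted at p = 0.9 contributes about one hundredth of its cross-entropy loss, while one at p = 0.1 keeps about 81% of it.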

What RL still cannot do

Task-focused RL with drops exploration coverage to zero on unfamiliar environments, and a five-to-one interleaving fix is needed to recover it at a 17% training overhead E052. RL also struggles when reasoning is decorative: outcome-based RL converges on as a local optimum because genuine reasoning rarely appears in the distribution on hard problems E081. The framework around training has to do work the model cannot: organising into smaller specialised teams sharing a parameter budget can nearly double the accuracy of one agent with the same budget E060, and putting recursive delegation inside the RL loop (rather than around a model) produces on long-horizon tasks E028.

The negation-neglect work is the sharpest reminder that gradient descent has its own inductive biases: train on documents loudly labelled false and models believe them anyway, with warnings cutting belief roughly in half rather than removing it E043. SGD finds the wrong even when a correct one exists.

Episodes anchoring this topic