Literature review · 6 episode(s)

Reinforcement learning and post-training for reasoning


What RL actually does to base models

A causal token-level analysis finds RL-trained and base models agree on 97-99% of tokens, with disagreements concentrating at high-entropy positions at 7-12x the average entropy, and a $25 contrastive on 50 problems can reproduce full RL accuracy on a 32B model E026. This recasts RL as calibration of an already-capable base rather than discovery of new strategies. The compositional-task counterpoint matters too: on multi-hop bridge questions RL does solve problems the base model can't reach at any sampling budget, while on the same data SFT actively makes things worse, and the novelty lives in *reasoning over* retrieved text, not in generating new search queries E011. In the same setup, RL preserved strategy diversity 3x better than SFT.
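As a rough illustration of the descriptive half of that measurement (not the causal analysis itself), the sketch below computes a token agreement rate and the ratio of mean entropy at disagreement positions to mean entropy overall. The function names, the toy distributions, and the choice to measure entropy on the base model's distribution are all assumptions made for illustration; none of them come from E026.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def agreement_and_entropy_ratio(base_dists, rl_choices):
    """base_dists: base-model next-token distributions, one per position.
    rl_choices: token index chosen by the RL-trained model at each position.
    Returns (agreement rate, mean entropy at disagreements / mean entropy overall).
    The ratio is None when there are no disagreements or the mean entropy is zero."""
    entropies = [entropy(d) for d in base_dists]
    # Greedy (argmax) choice of the base model at each position.
    base_choices = [max(range(len(d)), key=d.__getitem__) for d in base_dists]
    disagree = [i for i, (b, r) in enumerate(zip(base_choices, rl_choices)) if b != r]
    agreement = 1 - len(disagree) / len(base_dists)
    mean_all = sum(entropies) / len(entropies)
    if not disagree or mean_all == 0:
        return agreement, None
    mean_dis = sum(entropies[i] for i in disagree) / len(disagree)
    return agreement, mean_dis / mean_all

# Toy data (hypothetical): three confident positions and one uncertain one.
toy_base = [[0.98, 0.01, 0.01]] * 3 + [[0.40, 0.35, 0.25]]
toy_rl = [0, 0, 0, 1]  # the RL model deviates only at the uncertain position
toy_agreement, toy_ratio = agreement_and_entropy_ratio(toy_base, toy_rl)
```

On this toy input the two models agree on three of four positions, and the single disagreement falls on the flattest distribution, so the ratio comes out above 1.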

The practical implication is that the right knob may be *which* to teach. gates a contrastive by base-model and matches a $103k pipeline for ~$25 E026. turns this inside-out: instead of throwing away the 85% of that fail, add a next-token loss on the *terminal's responses* to get dense free supervision, roughly doubling pass rates on TerminalBench 2.0 E084.
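The idea of adding a next-token loss on the terminal's responses can be sketched as a masked auxiliary term: the policy loss is computed as usual, and a negative log-likelihood is averaged over only those positions that hold environment output. This is a minimal sketch under stated assumptions; `masked_nll`, `combined_loss`, and the `aux_weight` value are hypothetical names and settings, not details reported in E084.

```python
def masked_nll(token_logprobs, mask):
    """Mean negative log-likelihood over positions where mask is truthy.
    token_logprobs: log-probability the model assigned to each actual next token.
    mask: 1 where the token is terminal (environment) output, else 0."""
    picked = [-lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(picked) / len(picked) if picked else 0.0

def combined_loss(policy_loss, token_logprobs, is_observation, aux_weight=0.1):
    """Policy loss plus a weighted next-token loss on observation tokens.
    aux_weight is a hypothetical hyperparameter chosen for illustration."""
    return policy_loss + aux_weight * masked_nll(token_logprobs, is_observation)

# Toy trajectory: one action token followed by two terminal-output tokens.
toy_logprobs = [-1.0, -2.0, -3.0]
toy_is_obs = [0, 1, 1]
toy_aux = masked_nll(toy_logprobs, toy_is_obs)
toy_total = combined_loss(1.0, toy_logprobs, toy_is_obs)
```

Because the auxiliary term depends only on what the environment printed, it provides a training signal on every trajectory, including those whose outcome reward is zero.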

Reward design beyond outcomes

Outcome-only RL with converges on as a local optimum, especially on hard problems where genuine reasoning rarely appears in — and using the model's own confidence as a process reward triples Countdown accuracy while improving as a side effect E081. Operationalising 's 1979 theory as a reward signal lets a 9B model beat more than ten times its size, with showing the *process* rewards do more work than the final-answer correctness reward — inverting a quietly held assumption E079.

The -tree work pushes this further: a single primitive that doubles as synthesis target, supervision filter, and RL reward unifies fact-seeking and report-writing training, and is shown to actively hurt open-ended report quality where RL with recovers it E082. And in the reward-model space itself, the scalar judge that wins by 40 points produces a *worse* policy than rubrics written by a frozen 1.7B judge — implying the benchmark and the training signal have drifted apart E019.

Stability and silent infrastructure bugs

A misplaced branch in 's -offloading silently discarded most micro-batch during accumulation, and a aggregation bug in mis-weighted updates when response lengths vary. Corrected SFT-then-RL beats every published mixed-policy method by 3.8 points on and 22 on at roughly half the FLOPs — the structural lesson being that when a whole subfield's baselines flow through the same library, framework diversity is epistemic insurance E009. Adjacent to this, — the field's go-to RL health metric — turns out to be blind to ',' where fluent varied-looking reasoning has quietly stopped depending on the input; a Shannon chain-rule decomposition plus filtering low-reward-variance prompts adds 16 points on at lower compute E010.
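The summary does not spell out the exact aggregation bug, so the following is only a generic illustration of why aggregation choice matters when response lengths vary: averaging per-micro-batch means gives a different result from averaging over all tokens in the batch. The function names and toy losses are hypothetical.

```python
def token_mean(micro_batches):
    """Average loss over every token in the full batch."""
    total = sum(sum(seq) for mb in micro_batches for seq in mb)
    count = sum(len(seq) for mb in micro_batches for seq in mb)
    return total / count

def mean_of_micro_batch_means(micro_batches):
    """Average each micro-batch over its own tokens, then average those means.
    Micro-batches with few tokens get the same weight as ones with many."""
    means = []
    for mb in micro_batches:
        tokens = [t for seq in mb for t in seq]
        means.append(sum(tokens) / len(tokens))
    return sum(means) / len(means)

# Toy batch: each micro-batch holds one response, given as per-token losses.
toy_batches = [
    [[1.0, 1.0]],   # short response, 2 tokens, high loss
    [[0.0] * 8],    # long response, 8 tokens, zero loss
]
toy_global = token_mean(toy_batches)
toy_per_mb = mean_of_micro_batch_means(toy_batches)
```

Here the token-level mean is 0.2 while the mean of micro-batch means is 0.5, so the short response's contribution to the update is weighted 2.5 times more heavily under the second scheme.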

These two papers together cast doubt on more than a year of reasoning-model leaderboard claims and argue that careful baselines and instrumented training loops are now table stakes.

Exploration and capability frontiers

Task-focused drops exploration coverage and pushes error-recovery to literally zero in some setups; a five-to-one interleaving ratio improves both exploration and task success for ~17% training overhead E052. The axis matters too: can already reason about , and is meaningfully higher when training context is *inferred* from the environment than when it's stated outright — a fragile basis for safety guarantees E007. Counterintuitively, deterministic failure sabotages GRPO training more effectively than stochastic failure, because uniform low scores collapse the relative-advantage signal.
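The collapse of the relative-advantage signal under uniform scores can be seen in a few lines. The sketch below uses the commonly described GRPO-style normalisation, where each reward is centred on the group mean and divided by the group standard deviation; the function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Centre each reward on the group mean and scale by the group std.
    eps guards against division by zero when all rewards are identical."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Deterministic failure: every sampled response in the group scores 0.
deterministic = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Stochastic failure: one response in the group occasionally succeeds.
stochastic = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

With identical rewards every advantage is zero, so the group contributes no gradient at all. A single success is enough to produce a positive advantage for that response and negative ones for the rest.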

Data, environments, and the rest of the pipeline

A university team's 10k-example of an search beat an industrially-trained one on every benchmark by changing only the training data — making the case that imitation learning inherits the patience of its demonstrations and 'RL polish' is less necessary than assumed when the data is right E021. The pipeline inverts synthetic-data generation: execute real calls first, then back-chain tasks from observed outputs, so label correctness is structural rather than post-hoc — yielding a 4B model that matches 4.6 on tool calling for ~$47k E059.

For computer-use , the bottleneck wasn't the algorithm; it was the lack of verifiable training environments. A generator/discriminator information barrier (where the AI writing the reward function can't see the construction procedure) unlocked the largest open verified dataset, with environment diversity behaving like its own scaling axis E080. -grade proof-writing followed a similar recipe: , two-stage RL (cheap , then expensive proof-quality rewards), and ~100k- critique-iterate loops — matching the top human 2026 score at 30B parameters E048.

Episodes anchoring this topic