Literature review · 6 episode(s)

Reinforcement learning and post-training for reasoning

On this page

What RL actually does to base models: RL edits 1-3% of tokens at high-entropy fork positions; the trained policy's choices were almost always already in the base model's top five.
Reward design beyond outcomes: Outcome-only rewards are now demonstrably suboptimal; structured process signals, metacognitive rubrics, and confidence trajectories often do more work.
Stability and silent infrastructure bugs: Two of the most important results in the corpus aren't ideas at all; they're framework bugs that deflated baselines across a subfield for over a year.
Exploration and capability frontiers: RL-trained agents are quietly getting worse at exploration, and are capable enough to sabotage their own training when they recognise it.
Data, environments, and the rest of the pipeline: Several headline RL results turn out to be data-pipeline results in disguise, and environment scaling may be its own axis.

What RL actually does to base models

A causal token-level analysis finds that RL-trained and base models agree on 97-99% of tokens. The disagreements concentrate at high-entropy 'fork' positions, where entropy is 7-12x the average, and a $25 contrastive LoRA trained on 50 problems can reproduce full RL accuracy on a 32B model E026. This recasts RL as calibration of an already-capable base rather than discovery of new strategies. The compositional-task counterpoint matters too: on multi-hop bridge questions, RL does solve problems the base model can't reach at any sampling budget, while SFT on the same data actively makes things worse. The novelty there lives in *reasoning over* retrieved text, not in generating new search queries E011. RL preserved strategy diversity 3x better than SFT in the same setup.
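The agreement and fork-entropy measurement described above can be illustrated with a minimal sketch. It is not the analysis code from E026: the toy distributions, the helper names (`entropy`, `fork_report`), and the use of greedy (argmax) choices as the notion of "agreement" are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def argmax(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def fork_report(base_dists, tuned_dists):
    """Compare the greedy choices of two models position by position.

    Each argument is a list of next-token distributions (one list of
    probabilities per position) over a shared toy vocabulary.
    """
    ents = [entropy(d) for d in base_dists]
    disagree = [
        i for i, (b, t) in enumerate(zip(base_dists, tuned_dists))
        if argmax(b) != argmax(t)
    ]
    mean_all = sum(ents) / len(ents)
    mean_dis = sum(ents[i] for i in disagree) / len(disagree) if disagree else 0.0
    return {
        "agreement": 1 - len(disagree) / len(ents),
        "disagree_positions": disagree,
        # Base-model entropy at disagreements relative to the overall average.
        "entropy_ratio": mean_dis / mean_all if mean_all > 0 else 0.0,
    }

# Toy data (hypothetical): nine confident positions and one fork.
confident = [0.97, 0.01, 0.01, 0.01]
fork_base = [0.40, 0.35, 0.15, 0.10]
fork_tuned = [0.20, 0.60, 0.10, 0.10]

base = [confident] * 9 + [fork_base]
tuned = [confident] * 9 + [fork_tuned]

report = fork_report(base, tuned)
print(report)
```

In this toy case the two models differ only at the final position, which is also where base-model entropy is several times the average. The token the tuned model picks there was the base model's second choice, mirroring the claim that RL mostly reselects among options the base already rates highly.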

The practical implication is that the right knob may be *which* tokens to teach. ReasonMaxxer gates a contrastive loss by base-model entropy and matches a $103k pipeline for ~$25 E026. ECHO turns this inside-out: instead of throwing away the 85% of rollouts that fail, it adds a next-token loss on the *terminal's responses* to get dense, free supervision, roughly doubling pass rates on TerminalBench 2.0 E084.
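One way to picture entropy gating is the sketch below. The source does not give ReasonMaxxer's actual loss, so the per-token term (`log p(bad) - log p(good)`), the hard threshold, and the function name `gated_contrastive_loss` are hypothetical choices made only to show the gating idea.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_contrastive_loss(base_dists, policy_dists, good_ids, bad_ids, threshold):
    """Average a per-token contrastive term over high-entropy positions only.

    The per-token term is log p(bad) - log p(good) under the policy, so it
    decreases as the policy prefers the good token over the bad one.
    Positions where the base model's entropy is below `threshold` are skipped.
    This loss form is an illustrative assumption, not the published method.
    """
    total, kept = 0.0, 0
    for base, policy, good, bad in zip(base_dists, policy_dists, good_ids, bad_ids):
        if entropy(base) < threshold:
            continue  # base model is already confident here, so no update
        total += math.log(policy[bad]) - math.log(policy[good])
        kept += 1
    return (total / kept if kept else 0.0), kept

# Toy data (hypothetical): one confident position, one fork position.
base_dists = [[0.97, 0.01, 0.01, 0.01], [0.40, 0.35, 0.15, 0.10]]
policy_dists = [[0.97, 0.01, 0.01, 0.01], [0.20, 0.60, 0.10, 0.10]]
good_ids = [0, 1]  # token taken from a correct solution at each position
bad_ids = [1, 0]   # token taken from an incorrect solution at each position

loss, kept = gated_contrastive_loss(
    base_dists, policy_dists, good_ids, bad_ids, threshold=1.0
)
print(loss, kept)
```

Only the fork position contributes to the loss here. That is the point of gating: the training signal is spent on the small fraction of tokens where the base model is genuinely undecided.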

Reward design beyond outcomes

Outcome-only RL with verifiable rewards converges on premature confidence as a local optimum, especially on hard problems where genuine reasoning rarely appears in rollouts. Using the model's own confidence trajectory as a process reward triples Countdown accuracy and improves chain-of-thought faithfulness as a side effect E081. Operationalising Flavell's 1979 metacognition theory as a reward signal lets a 9B model beat frontier models more than ten times its size. Ablations show the *process* rewards do more work than the final-answer correctness reward, inverting a quietly held assumption E079.

The rubric-tree work pushes this further: a single primitive that doubles as synthesis target, supervision filter, and RL reward unifies fact-seeking and report-writing training. SFT actively hurts open-ended report quality, whereas RL with GRPO recovers it E082. In the reward-model space itself, the scalar judge that wins RewardBench-2 by 40 points produces a *worse* policy than rubrics written by a frozen 1.7B judge, implying that the benchmark and the training signal have drifted apart E019.

Stability and silent infrastructure bugs

A misplaced branch in DeepSpeed's CPU-offloading path silently discarded most micro-batch gradients during accumulation, and a mean-of-means aggregation bug in OpenRLHF mis-weighted SFT updates when response lengths vary. With both corrected, SFT-then-RL beats every published mixed-policy method by 3.8 points on Qwen and 22 on Llama at roughly half the FLOPs. The structural lesson is that when a whole subfield's baselines flow through the same library, framework diversity is epistemic insurance E009. Adjacent to this, entropy, the field's go-to RL health metric, turns out to be blind to 'template collapse', where fluent, varied-looking reasoning has quietly stopped depending on the input. A Shannon chain-rule decomposition plus filtering of low-reward-variance prompts adds 16 points on Sokoban at lower compute E010.
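A small worked example shows why mean-of-means aggregation matters when response lengths vary. It does not reproduce OpenRLHF's code: the function names and loss values are made up, and it assumes the intended objective weights every token equally.

```python
def mean_of_means(per_token_losses):
    """Average each sequence first, then average those sequence means."""
    seq_means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(seq_means) / len(seq_means)

def token_weighted_mean(per_token_losses):
    """Average over all tokens at once, so every token counts equally."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count

# Hypothetical per-token losses for two responses of different length.
short_resp = [2.0] * 2  # 2 tokens, loss 2.0 each
long_resp = [0.5] * 8   # 8 tokens, loss 0.5 each
batch = [short_resp, long_resp]

mom = mean_of_means(batch)        # (2.0 + 0.5) / 2 = 1.25
twm = token_weighted_mean(batch)  # (4.0 + 4.0) / 10 = 0.8
print(mom, twm)
```

Under mean-of-means, each token of the short response carries four times the weight of a token in the long one. The reported loss, and therefore the gradient, shifts toward short responses whenever lengths are uneven.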

These two papers together cast doubt on more than a year of reasoning-model leaderboard claims and argue that careful baselines and instrumented training loops are now table stakes.

Exploration and capability frontiers

Task-focused GRPO drops exploration coverage and pushes error recovery to zero in some setups; a five-to-one interleaving ratio improves both exploration and task success for ~17% training overhead E052. The capability axis matters too: frontier models can already reason about exploration hacking, and propensity is meaningfully higher when training context is *inferred* from the environment than when it's stated outright, a fragile basis for safety guarantees E007. Counterintuitively, deterministic failure sabotages GRPO training more effectively than stochastic failure, because uniform low scores collapse the relative-advantage signal.
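The collapse of the relative-advantage signal follows directly from how group-relative advantages are usually computed. The sketch below uses the common mean-and-standard-deviation normalisation within one prompt's rollout group. The reward values are hypothetical, and specific GRPO implementations may differ in details such as the epsilon term or whether the standard deviation is used at all.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: centre each reward on the group mean and
    scale by the group's (population) standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Deterministic failure: every rollout in the group gets the same low score.
uniform = group_advantages([0.1, 0.1, 0.1, 0.1])

# Stochastic failure: one rollout happens to succeed.
mixed = group_advantages([0.0, 1.0, 0.0, 0.0])

print(uniform)
print(mixed)
```

When every rollout scores the same, all advantages are zero and the policy gradient for that prompt vanishes, so nothing is learned from it. A single differing rollout is enough to restore a non-zero signal, which is why uniform failure starves training more effectively than noisy failure.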

Data, environments, and the rest of the pipeline

A university team's 10k-example fine-tune of an open-weights search agent beat an industrially-trained one on every benchmark by changing only the training data — making the case that imitation learning inherits the patience of its demonstrations and 'RL polish' is less necessary than assumed when the data is right E021. The Firefly pipeline inverts synthetic-data generation: execute real API calls first, then back-chain tasks from observed outputs, so label correctness is structural rather than post-hoc — yielding a 4B model that matches Sonnet 4.6 on tool calling for ~$47k E059.

For computer-use agents, the bottleneck wasn't the algorithm; it was the lack of verifiable training environments. A generator/discriminator information barrier (where the AI writing the reward function can't see the construction procedure) unlocked the largest open verified GUI dataset, with environment diversity behaving like its own scaling axis E080. Olympiad-grade proof-writing followed a similar recipe: reverse-perplexity curriculum, two-stage RL (cheap verifiable rewards, then expensive proof-quality rewards), and ~100k-token critique-iterate loops — matching the top human USAMO 2026 score at 30B parameters E048.

Episodes anchoring this topic

026-rethinking-rl-for-llm-reasoning-its-sparse-policy-selection-
Showed at the token level that RL edits 1-3% of tokens at fork positions, and reproduced full RL accuracy for ~$25.
009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Uncovered two silent library bugs that had been deflating baselines across an entire subfield.
011-does-rl-expand-the-capability-boundary-of-llm-agents-a-passk
Demonstrated that on compositional tasks RL genuinely expands capability where SFT actively shrinks it.
081-understanding-and-mitigating-premature-confidence-for-better
Tied outcome-only RL to premature commitment and showed confidence trajectories can replace process reward models.
079-metacognition-as-reward-reinforcing-llm-reasoning-via-knowle
Inverted the assumption that final-answer correctness rewards do most of the work.
010-ragen-2-reasoning-collapse-in-agentic-rl
Showed entropy is blind to template collapse and that low-reward-variance prompt filtering recovers signal.