Literature review · 6 episodes

Reinforcement learning and post-training for reasoning

On this page

What RL actually does: On math reasoning, RL edits 1–3% of tokens the base model was already considering; on compositional tasks it can teach genuinely new behaviours.
Reward signals and their failure modes: Outcome rewards alone can degrade reasoning, scalar reward models that win benchmarks can produce the worst policies, and process-style supervision is doing more work than the field assumed.
Small recipes beating big ones: Olympiad-level reasoning and frontier search-agent quality are now within reach of small open-source training runs when the recipe is right.
Agentic RL and skill distillation: Training is moving past frozen models with scaffolding toward weights, harnesses, and skill documents that are jointly optimised against task signal.

What RL actually does

RL-trained and base models agree on 97–99% of tokens, and the disagreements concentrate at high-entropy 'fork' positions where the base model was already uncertain. A contrastive LoRA on 50 problems can reproduce full RL accuracy for around $25 on a 32B model E026. The AlphaGo analogy, in which RL discovers strategies the starting policy never had, looks wrong here: on math, RL is calibration, not discovery. The picture changes on multi-hop bridge questions, where RL solves problems the base model can't reach at any sampling budget, while SFT on identical data actively destroys strategy diversity. The right summary is that RL's role is task-dependent E011. The bookkeeping around all this is fragile: two silent bugs in widely used training libraries deflated baselines across a subfield for over a year, and correcting them turned plain SFT-then-RL into the leading method E009.
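To make the agreement and 'fork' claims concrete, here is a minimal sketch of how one might measure them. It assumes you already have per-position next-token distributions from both models on the same prefix; the function names and the toy numbers are illustrative and do not come from E026.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def argmax(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")

def fork_report(base_dists, rl_dists):
    """Compare greedy token choices position by position.

    Returns (agreement_rate, mean base-model entropy where the models agree,
    mean base-model entropy where they disagree).
    """
    if len(base_dists) != len(rl_dists):
        raise ValueError("both models must be scored on the same positions")
    agree_entropy, fork_entropy = [], []
    for base, rl in zip(base_dists, rl_dists):
        bucket = agree_entropy if argmax(base) == argmax(rl) else fork_entropy
        bucket.append(entropy(base))
    rate = len(agree_entropy) / len(base_dists)
    return rate, mean(agree_entropy), mean(fork_entropy)

# Toy data (hypothetical): four positions over a three-token vocabulary.
base = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25], [0.90, 0.05, 0.05], [0.99, 0.005, 0.005]]
rl = [[0.98, 0.01, 0.01], [0.20, 0.70, 0.10], [0.95, 0.03, 0.02], [0.99, 0.005, 0.005]]

rate, h_agree, h_fork = fork_report(base, rl)
print(f"agreement={rate:.2f}  H(agree)={h_agree:.3f}  H(fork)={h_fork:.3f}")
```

In this toy case the single disagreement falls on the one position where the base distribution is nearly flat, which is the pattern the paragraph describes at scale.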

Reward signals and their failure modes

RLHF's true policy gradient has a steering term that standard PPO ignores, and influence functions show that the omission systematically pushes models into sycophancy; the deployable fix is one extra gradient evaluation per sample E025. Defining rubric quality by 'does this make a weak judge more accurate' lets a 1.7B model produce reward signals better than GPT-4.1, and reveals that the scalar reward model that wins RewardBench-2 by 40 points trains a policy that is 9 points worse E019. Operationalising Flavell's metacognitive theory as three process-style rewards lifts a 9B model past frontier models on reasoning, and the ablation flips conventional wisdom: removing process rewards hurts more than removing the correctness reward E079. Meanwhile, entropy, the standard health metric for RL, can't see template collapse: a Shannon decomposition and a Cauchy-Schwarz bound show why filtering low-reward-variance prompts gives a 16-point Sokoban gain with less compute E010.
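The reward-variance filter mentioned last can be sketched in a few lines. This is an illustration under assumptions, not the procedure from E010: the function name, the `min_variance` threshold, and the toy rewards are all hypothetical. The intuition is that a prompt whose sampled rollouts all receive the same reward offers nothing to contrast, so it is dropped before the update.

```python
from statistics import pvariance

def filter_by_reward_variance(rollout_rewards, min_variance=1e-6):
    """Keep prompts whose sampled rollouts disagree in reward.

    rollout_rewards: dict mapping a prompt id to the list of rewards obtained
    from several rollouts of that prompt.
    min_variance: hypothetical threshold; a real run would tune it.
    """
    kept = {}
    for prompt_id, rewards in rollout_rewards.items():
        # Population variance over this prompt's rollouts; needs at least two samples.
        if len(rewards) >= 2 and pvariance(rewards) > min_variance:
            kept[prompt_id] = rewards
    return kept

# Toy batch (hypothetical): only "b" has rollouts with differing rewards.
batch = {
    "a": [1.0, 1.0, 1.0, 1.0],  # always solved: zero variance
    "b": [0.0, 1.0, 0.0, 1.0],  # sometimes solved: informative
    "c": [0.0, 0.0, 0.0, 0.0],  # never solved: zero variance
}
kept = filter_by_reward_variance(batch)
print(sorted(kept))
```

Because the dropped prompts are excluded before any gradient computation, a filter of this kind also reduces the work per update, which fits the "less compute" part of the claim.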

Small recipes beating big ones

A 30B open model reached USAMO-gold-equivalent scores with a reverse-perplexity curriculum, a cheap-then-expensive two-stage RL progression, and a self-critique loop at inference E048. A university team trained an open search agent that beats Alibaba's industrial pipeline on every benchmark with one-third of the compute, by treating agent training as a data problem E021. A $1,500 pretraining recipe with a recurrent fast/slow architecture and prompt-masked loss matches Llama and Gemma on reasoning at 1B parameters E074. The unifying observation: the compute-to-capability ratio is not a law of nature, and architecture and data are now the contested axes.

Agentic RL and skill distillation

Recursive Agent Optimization puts delegation inside the RL loop instead of around a frozen model and trains a 30B agent to compete with Sonnet 4 on long-context work E028. Splitting a fixed parameter budget across three coupled agents nearly doubles single-agent accuracy on physics, with progressive growth turning otherwise-unworkable architectures into trainable ones E060. Two frozen models can be coupled through their hidden states with a tiny bridge that invents its own communication protocol from task loss alone, jumping a 0.5B model from 36% to 96% on arithmetic E040. And treating the prompt itself as an optimised parameter — with learning rates, validation gates, and rejected-edit buffers — produces transferable skill documents that move from one agent harness to another and lift small models disproportionately E078.
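The last idea, optimising the prompt as if it were a parameter, can be illustrated with a toy loop. Everything here is an assumption made for illustration rather than the method of E078: the skill document is a list of lines, the "learning rate" is modelled as a cap on edits tried per step, the validation gate accepts an edit only if a held-out score improves, and rejected edits go into a buffer so they are not proposed again. `propose` and `validate` are hypothetical stand-ins for an LLM editor and a task evaluation.

```python
def optimise_skill_doc(doc, propose, validate, steps=3, edits_per_step=1):
    """Greedy, gated editing of a skill document (a list of instruction lines).

    propose(doc, rejected) -> list of candidate lines to append (hypothetical).
    validate(doc) -> float score on held-out tasks (hypothetical).
    """
    doc = list(doc)
    best = validate(doc)
    rejected = []  # rejected-edit buffer
    for _ in range(steps):
        candidates = [
            edit for edit in propose(doc, rejected)
            if edit not in rejected and edit not in doc
        ]
        # "Learning rate" stand-in: try at most edits_per_step edits per step.
        for edit in candidates[:edits_per_step]:
            trial = doc + [edit]
            score = validate(trial)
            if score > best:  # validation gate
                doc, best = trial, score
            else:
                rejected.append(edit)
    return doc, best, rejected

# Toy stand-ins (hypothetical): two lines help, one only adds length.
USEFUL = {"check units", "verify final answer"}
POOL = ["be verbose", "check units", "verify final answer"]

def toy_propose(doc, rejected):
    return list(POOL)

def toy_validate(doc):
    # Reward useful lines, charge a small cost per line.
    return sum(line in USEFUL for line in doc) - 0.1 * len(doc)

final_doc, final_score, rejected_edits = optimise_skill_doc(
    ["solve step by step"], toy_propose, toy_validate
)
print(final_doc, rejected_edits)
```

Since the output is plain text rather than weights, the resulting document can be handed to a different harness unchanged, which is the property the paragraph highlights.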

Episodes anchoring this topic

026-rethinking-rl-for-llm-reasoning-its-sparse-policy-selection-
Showed that RL edits a tiny fraction of tokens the base model was already considering.
011-does-rl-expand-the-capability-boundary-of-llm-agents-a-passk
Demonstrated that RL teaches genuinely new behaviours on compositional tasks, unlike on math.
009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Exposed silent library bugs that distorted a subfield's baselines for over a year.
079-metacognition-as-reward-reinforcing-llm-reasoning-via-knowle
Inverted the field's assumption by showing process rewards matter more than outcome rewards.
025-explaining-and-preventing-alignment-collapse-in-iterative-rl
Derived the missing gradient term that predicts sycophancy as RL's myopic optimum.
010-ragen-2-reasoning-collapse-in-agentic-rl
Reframed RL health metrics around reward variance rather than entropy.