Reinforcement learning and post-training for reasoning
What RL actually does
RL-trained and base models agree on 97–99% of tokens, and their disagreements concentrate at high-entropy 'fork' positions where the base model was already uncertain. A contrastive LoRA on 50 problems can reproduce full RL accuracy for around $25 on a 32B model E026. The AlphaGo analogy, in which RL discovers strategies the starting policy never had, looks wrong here: on math, RL is calibration, not discovery. The picture changes on multi-hop bridge questions, where RL solves problems the base model can't reach at any sampling budget, while SFT on identical data actively destroys strategy diversity. The right summary is that RL's role is task-dependent E011. The bookkeeping around all this is fragile: two silent bugs in widely-used training libraries deflated baselines across a subfield for over a year, and correcting them turned plain SFT-then-RL into the leading method E009.
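To make the 'fork' claim concrete, here is a minimal sketch of how token agreement and base-model entropy could be measured side by side. It is not the method from E026: the function names, the dict-based distributions, and the toy numbers are all illustrative assumptions.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a {token: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def argmax_token(dist):
    """Greedy choice: the token with the highest probability."""
    return max(dist, key=dist.get)


def mean(values):
    return sum(values) / len(values) if values else float("nan")


def fork_report(base_dists, rl_dists):
    """Compare greedy choices position by position (hypothetical helper).

    Returns the agreement rate plus the base model's mean entropy at
    positions where the two models agree and where they disagree.
    """
    assert len(base_dists) == len(rl_dists)
    agree_entropy, disagree_entropy = [], []
    for base, rl in zip(base_dists, rl_dists):
        bucket = agree_entropy if argmax_token(base) == argmax_token(rl) else disagree_entropy
        bucket.append(entropy(base))
    return {
        "agreement_rate": len(agree_entropy) / len(base_dists),
        "mean_base_entropy_agree": mean(agree_entropy),
        "mean_base_entropy_disagree": mean(disagree_entropy),
    }


# Toy next-token distributions for four positions (made-up numbers).
base = [
    {"the": 0.97, "a": 0.03},
    {"so": 0.98, "thus": 0.02},
    {"add": 0.52, "factor": 0.48},  # a fork: the base model is nearly undecided
    {"2": 0.99, "3": 0.01},
]
rl = [
    {"the": 0.98, "a": 0.02},
    {"so": 0.99, "thus": 0.01},
    {"add": 0.10, "factor": 0.90},  # RL commits to the other branch
    {"2": 0.99, "3": 0.01},
]
report = fork_report(base, rl)
```

In this toy case the only disagreement falls at the position where the base distribution is closest to uniform, which is the pattern the paragraph describes.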
Reward signals and their failure modes
RLHF's true policy gradient has a steering term that standard PPO ignores, and influence functions show the omission systematically pushes models into sycophancy; the deployable fix is one extra gradient evaluation per sample E025. Defining rubric quality by 'does this make a weak judge more accurate' lets a 1.7B model produce reward signals better than GPT-4.1's, and reveals that the scalar reward model that wins RewardBench-2 by 40 points trains a policy 9 points worse E019. Operationalising Flavell's metacognitive theory as three process-style rewards lifts a 9B model past frontier models on reasoning, and the ablation flips conventional wisdom: removing process rewards hurts more than removing the correctness reward E079. Meanwhile, entropy, the standard health metric for RL, can't see template collapse: a Shannon decomposition and a Cauchy-Schwarz bound show why filtering low-reward-variance prompts gives a 16-point Sokoban gain with less compute E010.
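The reward-variance filter is simple to sketch. The version below is an illustration under stated assumptions, not the implementation from E010: the function name, the `min_variance` threshold, and the toy rewards are hypothetical. One intuition, assuming a group-relative baseline, is that a prompt whose rollouts all earn the same reward contributes no advantage signal, so dropping it saves compute without losing gradient.

```python
from statistics import pvariance


def filter_by_reward_variance(rollout_rewards, min_variance=0.01):
    """Keep only prompts whose sampled rollouts disagree on reward.

    rollout_rewards maps a prompt id to the rewards of its rollouts.
    min_variance is a hypothetical threshold chosen for this example.
    """
    kept = {}
    for prompt, rewards in rollout_rewards.items():
        # Population variance over this prompt's rollouts.
        if len(rewards) >= 2 and pvariance(rewards) >= min_variance:
            kept[prompt] = rewards
    return kept


# Toy batch: two prompts are uniformly failed or solved, one is mixed.
rollouts = {
    "puzzle_a": [0.0, 0.0, 0.0, 0.0],  # always fails: variance 0
    "puzzle_b": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: variance 0.25
    "puzzle_c": [1.0, 1.0, 1.0, 1.0],  # always solved: variance 0
}
kept = filter_by_reward_variance(rollouts)
```

Only `puzzle_b` survives the filter here, because it is the only prompt on which the policy's rollouts disagree.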
Small recipes beating big ones
A 30B open model reached USAMO-gold-equivalent scores with a reverse-perplexity curriculum, a cheap-then-expensive two-stage RL progression, and a self-critique loop at inference E048. A university team trained an open search agent to beat Alibaba's industrial pipeline on every benchmark using one-third of the compute by treating it as a data problem E021. A $1,500 pretraining recipe with a recurrent fast/slow architecture and prompt-masked loss matches Llama and Gemma on reasoning at 1B parameters E074. The unifying observation: the compute-to-capability ratio is not a law of nature, and architecture and data are now the contested axes.
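As a small illustration of a prompt-masked loss, the sketch below averages negative log-likelihood over completion tokens only. It assumes that 'prompt-masked' means prompt tokens are excluded from the loss; the function name and the numbers are hypothetical and are not taken from E074.

```python
def prompt_masked_nll(token_logprobs, is_prompt):
    """Mean negative log-likelihood over non-prompt tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_prompt: True where the token belongs to the prompt and is masked out.
    """
    assert len(token_logprobs) == len(is_prompt)
    completion = [lp for lp, masked in zip(token_logprobs, is_prompt) if not masked]
    if not completion:
        raise ValueError("no completion tokens to score")
    return -sum(completion) / len(completion)


# Toy sequence: two prompt tokens followed by three completion tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
is_prompt = [True, True, False, False, False]

masked_loss = prompt_masked_nll(logprobs, is_prompt)  # 0.5
unmasked_loss = -sum(logprobs) / len(logprobs)        # 1.0
```

With the mask applied, the poorly predicted prompt tokens no longer dominate the average, so the gradient is spent on the tokens the model is expected to produce.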
Agentic RL and skill distillation
Recursive Agent Optimization puts delegation inside the RL loop instead of around a frozen model and trains a 30B agent to compete with Sonnet 4 on long-context work E028. Splitting a fixed parameter budget across three coupled agents nearly doubles single-agent accuracy on physics, with progressive growth turning otherwise-unworkable architectures into trainable ones E060. Two frozen models can be coupled through their hidden states with a tiny bridge that invents its own communication protocol from task loss alone, jumping a 0.5B model from 36% to 96% on arithmetic E040. And treating the prompt itself as an optimised parameter — with learning rates, validation gates, and rejected-edit buffers — produces transferable skill documents that move from one agent harness to another and lift small models disproportionately E078.
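One way to picture the prompt-as-parameter idea is a greedy edit loop with the three ingredients named above. This is a sketch under loud assumptions, not the procedure from E078: 'learning rate' is interpreted here as the maximum number of edits applied per step, an edit is simply a line appended to the skill document, and `optimise_skill_doc`, `toy_propose`, and `toy_score` are hypothetical names.

```python
def optimise_skill_doc(doc, propose_edits, score, steps=5, learning_rate=1):
    """Greedy edit loop over a skill document (a list of instruction lines).

    learning_rate: assumed to mean the max number of edits applied per step.
    Validation gate: a candidate is kept only if its score strictly improves.
    Rejected-edit buffer: failed edits are remembered and passed back to the
    proposer so it can avoid suggesting them again.
    """
    best_score = score(doc)
    rejected = []
    for _ in range(steps):
        edits = propose_edits(doc, rejected)[:learning_rate]
        if not edits:
            break
        candidate = doc + edits
        candidate_score = score(candidate)
        if candidate_score > best_score:  # validation gate
            doc, best_score = candidate, candidate_score
        else:
            rejected.extend(edits)  # remember what did not help
    return doc, best_score, rejected


# Toy stand-ins: in practice the proposer would be a model and the score
# would come from a held-out validation set.
HELPFUL = {"check units", "verify the final answer"}
POOL = ["always answer 42", "check units", "verify the final answer"]


def toy_score(doc):
    """+1 for each helpful line, -1 for each unhelpful one."""
    return sum(1 if line in HELPFUL else -1 for line in doc)


def toy_propose(doc, rejected):
    """Offer pool lines that are neither already present nor rejected."""
    return [line for line in POOL if line not in doc and line not in rejected]


final_doc, final_score, rejected_edits = optimise_skill_doc([], toy_propose, toy_score)
```

In this toy run the unhelpful line is tried once, fails the validation gate, and lands in the rejected buffer, while the two helpful lines are accepted one step at a time.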
Episodes anchoring this topic
- 026-rethinking-rl-for-llm-reasoning-its-sparse-policy-selection-
Showed that RL edits a tiny fraction of tokens the base model was already considering.
- 011-does-rl-expand-the-capability-boundary-of-llm-agents-a-passk
Demonstrated that RL teaches genuinely new behaviours on compositional tasks, unlike on math.
- 009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Exposed silent library bugs that distorted a subfield's baselines for over a year.
- 079-metacognition-as-reward-reinforcing-llm-reasoning-via-knowle
Inverted the field's assumption by showing process rewards matter more than outcome rewards.
- 025-explaining-and-preventing-alignment-collapse-in-iterative-rl
Derived the missing gradient term that predicts sycophancy as RL's myopic optimum.
- 010-ragen-2-reasoning-collapse-in-agentic-rl
Reframed RL health metrics around reward variance rather than entropy.