Literature review · 6 episode(s)

Reinforcement learning and post-training for reasoning

What RL actually does

RL-trained and base models agree on 97–99% of tokens, and disagreements concentrate at high-entropy positions where the base model was already uncertain; a contrastive on 50 problems can reproduce full RL accuracy for around $25 on a 32B model E026. The analogy looks wrong: on math, RL is calibration, not discovery. But the picture changes on multi-hop bridge questions, where RL solves problems the base model can't reach at any sampling budget while on identical data actively destroys strategy diversity, so the right summary is that RL's role is task-dependent E011. The bookkeeping around all this is fragile: two silent bugs in widely used training libraries deflated baselines across a subfield for over a year, and correcting them turned plain SFT-then-RL into the leading method E009.
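The agreement claim above can be made concrete with a small measurement sketch. It assumes agreement means matching greedy next-token choices under teacher forcing, and that uncertainty is the Shannon entropy of the base model's next-token distribution; both are assumptions for illustration, not necessarily how E026 defines them. The function names and the `(positions, vocab)` logit layout are hypothetical.

```python
import numpy as np

def token_entropy(logits):
    """Per-position Shannon entropy (nats) of softmax(logits); logits is (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(logp) * logp).sum(axis=-1)

def agreement_report(base_logits, rl_logits):
    """Compare greedy choices of two models scored on the same token prefix."""
    base_logits = np.asarray(base_logits, dtype=float)
    rl_logits = np.asarray(rl_logits, dtype=float)
    agree = base_logits.argmax(axis=-1) == rl_logits.argmax(axis=-1)
    ent = token_entropy(base_logits)
    nan = float("nan")
    return {
        "agreement_rate": float(agree.mean()),
        # Mean base-model entropy where the two models pick the same token.
        "base_entropy_agree": float(ent[agree].mean()) if agree.any() else nan,
        # Mean base-model entropy where they differ.
        "base_entropy_disagree": float(ent[~agree].mean()) if (~agree).any() else nan,
    }
```

If the pattern described above holds, `base_entropy_disagree` should come out clearly higher than `base_entropy_agree` on real model outputs.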

Reward signals and their failure modes

's true has a term standard ignores, and show the omission systematically pushes models into — the deployable fix is one extra evaluation per sample E025. Defining quality by 'does this make a weak judge more accurate' lets a 1.7B model produce reward signals better than , and reveals that the scalar that wins by 40 points trains a 9-point-worse policy E019. Operationalising 's theory as three process-style rewards lifts a 9B model past on reasoning — and the flips conventional wisdom: removing process rewards hurts more than removing the correctness reward E079. Meanwhile, as the standard health metric for RL can't see — a Shannon decomposition and a bound show why filtering low-reward-variance prompts gives a 16-point gain at less compute E010.
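The prompt-filtering idea in the last sentence can be sketched as follows. This is a minimal illustration, not the procedure from E010: it assumes several rollouts are sampled per prompt, each with a scalar reward, and it uses a group-normalised advantage (a common baseline in group-relative policy optimisation, swapped in here as an assumption) to show why zero-variance prompts contribute no learning signal. The names `filter_by_reward_variance`, `group_advantages`, and `min_variance` are hypothetical.

```python
import numpy as np

def group_advantages(rewards):
    """Group-normalised advantages: (r - mean) / (std + eps) within one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def filter_by_reward_variance(rewards_by_prompt, min_variance=1e-6):
    """Keep only prompts whose rollout rewards vary by more than min_variance.

    rewards_by_prompt maps a prompt id to the rewards of its sampled rollouts.
    Prompts where every rollout scores the same (all solved or all failed)
    yield all-zero advantages, so rollouts spent on them are wasted compute.
    """
    kept = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        r = np.asarray(rewards, dtype=float)
        if r.size > 1 and r.var() > min_variance:
            kept[prompt_id] = r
    return kept
```

With binary rewards the per-prompt variance is p(1 - p) for pass rate p, so the filter amounts to dropping prompts the current policy always solves or never solves.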

Small recipes beating big ones

A 30B open model reached -gold-equivalent scores with a , a cheap-then-expensive two-stage RL progression, and a self-critique loop at inference E048. A university team trained an open search to beat Alibaba's industrial pipeline on every benchmark using one-third of the compute by treating it as a data problem E021. A $1,500 recipe with a fast/slow architecture and prompt-masked matches and on reasoning at 1B parameters E074. The unifying observation: the compute-to- ratio is not a law of nature, and architecture and data are now the contested axes.
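For readers unfamiliar with prompt masking, here is a generic sketch of the idea: the training loss is averaged over response tokens only, so the model is not penalised for failing to predict the prompt. It is a standard illustration, not the recipe from E074, and the function name and `loss_mask` convention (1 for response tokens, 0 for prompt tokens) are assumptions.

```python
import numpy as np

def prompt_masked_nll(logits, targets, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits:    (T, V) next-token scores
    targets:   (T,) integer token ids to predict
    loss_mask: (T,) 1 for response tokens, 0 for prompt tokens
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    mask = np.asarray(loss_mask, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    token_nll = -logp[np.arange(len(targets)), targets]
    # Guard against an all-zero mask to avoid dividing by zero.
    return float((token_nll * mask).sum() / max(mask.sum(), 1.0))
```

Setting `loss_mask` to all ones recovers the ordinary unmasked loss, which makes the effect of masking easy to compare on the same batch.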

Agentic RL and skill distillation

puts delegation inside the RL loop instead of around a frozen model and trains a 30B to compete with 4 on long-context work E028. Splitting a fixed parameter budget across three coupled agents nearly doubles single-agent accuracy on physics, with turning otherwise-unworkable architectures into trainable ones E060. Two frozen models can be coupled through their with a tiny bridge that invents its own communication protocol from task alone, jumping a 0.5B model from 36% to 96% on arithmetic E040. And treating the prompt itself as an optimised parameter — with learning rates, validation gates, and rejected-edit buffers — produces transferable skill documents that move from one to another and lift small models disproportionately E078.

Episodes anchoring this topic