The Free Step-Level Grader Hiding in Every RL Training Run
About this episode
The trick that lets a language model double as its own reward model was supposed to die the moment models became agents that browse, call tools, and send irreversible emails. This paper argues it never died — researchers were just reading the wrong number off it, and the fix is one subtraction. The payoff is a step-level grader you already own that beats trained reward models and, on one split, beats Claude as a judge.
What you'll take away
- Why step-level scoring is blocked for agents — you can't Monte Carlo irreversible actions, hand-labeling is prohibitive, and dedicated PRMs don't transfer across tasks
- How the 'progress advantage' falls out for free: log-ratio of the trained model's action probability to its pre-RL reference recovers the optimal advantage
- The one subtraction (Q minus V) that makes the old reward-recovery trick survive in stochastic agent environments where it should have broken
- Why subtracting the reference turns a fluency judge into an expertise judge — rare tool-call syntax stops being penalized
- The numbers: ~11-16 point gains in test-time scaling, 0.87 vs Claude's 0.62 AUROC on airline customer service, all for ~46 GPU-hours on one A100
- The honest catch: the theory is exact only for an optimal RL policy, the method picks the best aggregation per task, and some headline AUROCs come from 50-100 trajectories with unreported error bars
Chapters
- 01:33 Why grading an agent is so hard
- 03:39 Grading on a curve, mathematically
- 05:22 The number already in your pipeline
- 07:19 The trick that died for agents
- 11:37 When confidence punishes the right answer
- 14:04 Does it actually win on real tasks?
- 17:31 The soft spot they admit to
- 20:16 The byproduct labs throw away
References in this episode
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — The 'secretly a reward model' result the episode builds on and pushes past — its
- Let's Verify Step by Step — The canonical process reward model paper, grounding the episode's framing of why
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations — The Monte-Carlo-rollout approach to estimating step quality that the episode inv
- Proximal Policy Optimization Algorithms — The clipping-based RL recipe the episode notes enforces an 'implicit leash,' exp
Full transcript
0:00Bella: Here's a result this field has quoted for years: your language model is secretly a reward model. Train a model with reinforcement learning, and the trained model quietly contains a grader you can read back out — no separate scorer needed. And the moment people tried to use that trick on AI agents — the ones that browse, call tools, run code, send emails — it broke. This paper argues it never actually broke. The grader was there the whole time. Researchers were just reading the wrong number off it. Quick heads up before we go further — this is an AI-made explainer, both voices included.
0:40Tyler: And the payoff is almost rude. A signal that falls out of standard RL training — no extra labeling, no dedicated model — beats reward models built and trained specifically for the job. And in one test we'll get to, it beats Claude as a judge by a margin that, on paper, should not be possible.
0:59Bella: By the end you'll understand exactly where that free signal comes from, why everyone assumed it died when models became agents, and the one-line fix that brought it back. And this matters because we are handing real actions to these agents right now — and we mostly can't tell, step by step, whether an agent is making progress or quietly wandering toward a mistake. That blind spot is the whole game for picking good runs and catching failures before they land.
1:30Tyler: So let me name the puzzle before the fix, because the fix only feels clever once you feel the wall. Why is grading an agent step by step supposed to be so hard?
1:41Bella: Right — and it's worth being precise about the wall. You've got two ways to score behavior. An outcome reward model gives you one number for the whole run: did the agent ultimately succeed? Over a hundred steps, that's nearly useless — it can't tell you which step was the good one. What you actually want is a process reward model, a PRM, that scores each step: was this move progress? In math reasoning, people build those by brute force. Take a step, roll forward from it a hundred times, see how often you end up correct. That estimates the step's quality. But an agent that just deleted a file, or sent an email, or charged a credit card — it can't rewind and re-roll. The world has moved on and it won't reset.
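The brute-force recipe Bella describes can be sketched on a toy problem. This is a minimal illustration, not the setup from any of the referenced papers: the random walk, the function names `rollout_succeeds` and `mc_step_value`, and the goal and fail thresholds are all hypothetical. The point to notice is that the estimate depends on restarting from the same position many times, which is exactly what an agent that has already sent an email cannot do.

```python
import random

def rollout_succeeds(position, rng, goal=3, fail=-3):
    """Play a toy random walk from `position` until it hits goal or fail."""
    while fail < position < goal:
        position += rng.choice((-1, 1))
    return position >= goal

def mc_step_value(position, n_rollouts=100, seed=0):
    """Estimate step quality as the fraction of rollouts that end in success."""
    rng = random.Random(seed)
    # Every rollout restarts from the same position: the environment must be resettable.
    wins = sum(rollout_succeeds(position, rng) for _ in range(n_rollouts))
    return wins / n_rollouts

good_step = mc_step_value(position=2)   # one move away from the goal
bad_step = mc_step_value(position=-2)   # one move away from failure
print(good_step, bad_step)
```

The step that lands closer to the goal gets the higher estimate, but only because the walk could be replayed a hundred times from the same state.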
2:28Tyler: And that's the part that kills every standard recipe at once. You can't Monte Carlo because actions are irreversible. You can't hand-label, because annotating step quality across millions of agent trajectories is prohibitively expensive. And the dedicated PRMs people do train notoriously fail to transfer — train one on shopping, it falls apart on customer service.
2:52Bella: Which gives you this genuinely cruel irony the authors state outright: the agents that most need step-level evaluation are exactly the ones for which building a step-level scorer is least feasible. The harder the agent's world, the more blind you are.
3:09Tyler: So the question the paper asks is almost greedy. Can you get step-level scoring for an agent for free — no annotation, no re-rolling, no dedicated reward model? And the answer is that the signal is already sitting in your training pipeline, a byproduct of the RL post-training every modern model already goes through.
3:30Bella: To see what that byproduct is, you need one idea from reinforcement learning — and it's the conceptual heart of the whole paper. Tyler, you want to take the advantage function, because everything hinges on it.
3:44Tyler: Happily. Think about grading students. The naive way is raw score — an eighty beats a seventy. But a seventy on a brutal exam where the class averaged fifty reflects more skill than a ninety on one where everyone got ninety-five. So you grade on a curve: how much did this student beat the average, on this particular test? That curve is the advantage function. In RL you've got two quantities. The value of a situation is "how well do I expect to do from here if I just play normally." The Q-value is "how well do I do if I take this specific action, then play normally." The advantage is just the second minus the first — how much better was this exact move than my typical move in this exact situation.
4:38Bella: And the reason that subtraction matters so much is that raw reward smears two different things together. How hard is the situation, and how good was the choice. Advantage strips out the difficulty and keeps only the decision quality — which is precisely what you want when you're comparing steps inside one trajectory, or comparing runs that hit different conditions.
5:05Tyler: Exactly. Reward says "you scored a sixty." Advantage says "a sixty here was actually excellent, because this was a nightmare problem." For credit assignment, the curve is the right number — not the raw score.
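Tyler's curve analogy is just a subtraction, and it can be written down directly. In this small sketch the class average stands in for the value V and the student's score stands in for the Q-value; the function name `advantage` is chosen here for illustration.

```python
def advantage(q_value, state_value):
    """Advantage = value of this specific action minus the typical value here."""
    return q_value - state_value

# The exam analogy: class average plays the role of V, the student's score plays Q.
hard_exam = advantage(q_value=70, state_value=50)
easy_exam = advantage(q_value=90, state_value=95)
print(hard_exam, easy_exam)
```

The seventy on the brutal exam comes out ahead of the ninety on the easy one, which is the ordering raw scores get wrong.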
5:21Bella: So now the reveal. The authors prove that for any model trained with the standard RL recipe, this optimal advantage — the curve, the thing you wanted — is recovered exactly by comparing two probabilities. You take the trained model's probability of an action, divide by the original pre-RL model's probability of that same action, take the log, scale it. That number is the advantage. They call it the progress advantage.
5:53Tyler: And to make sure that lands — the "original pre-RL model" has a name. When labs do this RL stage, they put the model on a leash: get better at the goal, but don't drift too far from where you started. "Where you started" is the reference policy — usually the base or supervised checkpoint. The leash is what keeps the model from becoming a stranger. And it turns out that leash is exactly what makes the math come out clean.
6:22Bella: So picture the overview figure. On the left, RL training spits out a pair — the trained model and its reference. On the right, you take any trajectory, and for every single action you compute that log-ratio, and out pops a number that says: is this step making progress? No reward model. No labels. Two checkpoints you already had.
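The computation on the right of that figure can be sketched in a few lines. This assumes you have already obtained per-token log-probabilities for the same action from both checkpoints; how you extract them is outside the sketch. The function name `progress_advantage`, the scale `beta`, and the example numbers are hypothetical. Summing the per-token differences gives the log-ratio for the whole action, since an action's probability is the product of its token probabilities.

```python
def progress_advantage(logp_trained, logp_reference, beta=1.0):
    """Per-token beta * (log pi_trained - log pi_reference)."""
    if len(logp_trained) != len(logp_reference):
        raise ValueError("both models must score the same tokens")
    return [beta * (lt - lr) for lt, lr in zip(logp_trained, logp_reference)]

# Hypothetical log-probabilities for the three tokens of one action.
trained = [-0.2, -0.5, -0.1]
reference = [-1.2, -0.5, -2.1]

per_token = progress_advantage(trained, reference, beta=0.5)
step_score = sum(per_token)  # log-ratio of the full action, scaled by beta
print(per_token, step_score)
```

A positive step score means the trained model favors this action more than its reference does; a negative one means RL training pushed probability away from it.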
6:45Tyler: It's like discovering the receipts you kept for taxes are also a complete, itemized record of your spending — a budgeting tool you already owned and never opened. Except — and I want to plant this now, because it matters later — the math holds exactly for the optimal RL policy. The checkpoints you actually download are approximations of that. How big the gap is between "optimal" and "what's on the shelf" is the soft spot in this whole story, and we'll come back to it.
7:17Bella: Fair flag. But before the soft spot, there's a hard problem we skated past — because this exact trick already existed, and it was supposed to be dead for agents. So why isn't it?
7:30Tyler: That's the real story, and it's the densest stretch of the paper — but it comes down to one subtraction, and that subtraction kills a problem that should have made this whole thing impossible for agents. Let me build it. The "secretly a reward model" lineage — the DPO line of work — showed that the log-ratio recovers the reward. But that derivation quietly assumed a deterministic world. Deterministic means your action fully determines what happens next. Plain text generation is like this — you append a token, you get a longer string, nothing surprises you.
8:09Bella: And the reason determinism made the old trick work is a bookkeeping thing, right? Walk me through why summing it up cancels so cleanly.
8:18Tyler: Picture tracking your net worth turn by turn in a game where every gain or loss is purely the result of your own move. You add up "how much did each move change my position," and all the intermediate bookkeeping cancels down the chain — the total just reflects your decisions. That clean cancellation, that's what mathematicians call a telescoping sum, and it's why the log-ratio equals the reward in the deterministic world. Now change the game. After each of your moves, a random event also hits your account — a market swing you didn't cause. An agent's world is this second game. You call a tool and it errors. You search and get something unexpected. A user replies in a way you couldn't predict. That's a stochastic environment — something external and unpredictable enters after you act.
9:13Bella: And in that game the cancellation falls apart.
9:16Tyler: It does. Every step now leaves behind a leftover term — the gap between where you are and the average of where the randomness might throw you next. Those leftovers don't telescope away, and they depend on the value function, which you cannot read off the two models. So recovering the reward genuinely becomes impossible. Your decision's worth is tangled up with the environment's luck, and you can't separate skill from luck using probabilities alone.
9:48Bella: So this is the moment the old trick dies for agents. And the fix is — don't chase the reward.
9:54Tyler: That's the pivot, and it's beautiful. Stop trying to recover the reward, and recover the advantage instead. Remember the advantage is a difference — Q minus V. Both of those quantities absorb the exact same random future term. So when you subtract them, the luck cancels identically. You're left with pure skill — the action quality — without ever modeling the environment's randomness. The reward-recovery worked by algebraic luck and luck ran out. The advantage isolates the right thing by definition.
10:29Bella: So the headline isn't really "free reward model." It's that switching the target from reward to advantage is what survives the jump from text into the messy world agents actually live in. The spirit of the old result holds; the quantity changes.
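The argument in this stretch can be written out under one assumption the episode does not state explicitly: the standard KL-regularized ("soft") RL objective with leash strength $\beta$ and reference policy $\pi_{\mathrm{ref}}$. The symbols $Q^*$, $V^*$, $A^*$, $P$, and $r$ are notation chosen here, not necessarily the paper's. Under that assumption, the per-step identity holds in any environment, while the trajectory sum collapses to the reward only when transitions are deterministic.

```latex
\begin{align}
Q^*(s,a) &= r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V^*(s')\big], \\
V^*(s) &= \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big(Q^*(s,a)/\beta\big), \\
\pi^*(a \mid s) &= \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\Big(\big(Q^*(s,a) - V^*(s)\big)/\beta\Big), \\
\beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} &= Q^*(s,a) - V^*(s) = A^*(s,a).
\end{align}

% Summing along a realized trajectory s_0, a_0, ..., s_T with V^*(s_T) = 0:
\begin{align}
\sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
&= \sum_{t=0}^{T-1} r(s_t,a_t) \;-\; V^*(s_0)
\;+\; \sum_{t=0}^{T-1} \Big( \mathbb{E}_{s' \sim P(\cdot \mid s_t,a_t)}\big[V^*(s')\big] - V^*(s_{t+1}) \Big).
\end{align}
```

In a deterministic environment the expectation equals $V^*(s_{t+1})$, the last sum vanishes, and the log-ratios recover the total reward up to the constant $V^*(s_0)$. In a stochastic environment each step leaves the leftover term Tyler describes, which depends on $V^*$ and cannot be read off the two policies. The fourth line needs no such cancellation, which is why the advantage survives where the reward does not.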
10:47Tyler: That's the contribution in one breath. And one footnote so nobody emails us: this isn't limited to recipes with an explicit leash. The clipping-based methods — the ones that just constrain how far each update can move — turn out to enforce a leash implicitly too. So progress advantage covers essentially all the mainstream RL recipes, not a niche one.
11:12Bella: So far: agents need step-level scoring, every normal way of getting it is blocked, and it turns out the score was hiding in the gap between your trained model and where it started. Now — why does reading that gap beat just asking the model how confident it is? Because that's the comparison that makes this visceral.
11:34Tyler: This is the single best example in the paper, and it's worth slowing down for. Bella, take the flight case — it's yours.
11:43Bella: So. An agent is handling a customer service request — cancel my flight. And the correct move, per airline policy, is to refuse: it's basic economy, no insurance, booked more than twenty-four hours ago. The agent gets it right. It declines, in the proper format, citing the right constraints. Now here's what the screen shows. If you score that agent by raw probability — just how confident the model is in its own words — it penalizes the correct answer. Why? Because the right answer is full of rare text. Tool-call syntax is weirder than plain English. Domain phrases like "change of plan" and "business class" are uncommon. So a confidence-only judge sees rare tokens and says: low probability, this looks wrong.
12:34Tyler: Which is exactly backwards. The rare strings are the right answer. The model sounds "uncertain" only because expertise doesn't look like generic fluent chatter.
12:45Bella: And watch what the progress advantage does to the same trajectory. You subtract the reference model's probability. The reference also finds tool syntax rare — so that "this is just unusual text" penalty cancels right out. What survives is only: how much more likely is the trained, goal-directed model to say this than the baseline? And suddenly every one of those correct-but-rare moves flips to positive. The reference offset turns a fluency judge into an expertise judge.
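The flip Bella describes can be reproduced with made-up numbers. In this sketch the two candidate replies, their summed log-probabilities, and the helper names are all hypothetical and chosen only to show the mechanism: the correct reply is rare under both models, so raw confidence ranks it last, while the log-ratio cancels the shared rarity and ranks it first.

```python
# Hypothetical summed log-probabilities for two candidate replies to the
# "cancel my flight" request. Numbers are illustrative only.
candidates = {
    "fluent_but_wrong": {"trained": -6.0, "reference": -5.0},
    "rare_but_right": {"trained": -14.0, "reference": -22.0},
}

def confidence_score(c):
    """Fluency judge: how likely the trained model finds its own text."""
    return c["trained"]

def log_ratio_score(c):
    """Expertise judge: how much more likely than under the reference."""
    return c["trained"] - c["reference"]

by_confidence = max(candidates, key=lambda k: confidence_score(candidates[k]))
by_log_ratio = max(candidates, key=lambda k: log_ratio_score(candidates[k]))
print(by_confidence, by_log_ratio)
```

The two judges see the same trained-model probabilities; only the subtraction of the reference changes which reply wins.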
13:19Tyler: It's the difference between grading a job interview by how smoothly someone talks, versus grading by how much more likely an expert is to say this than a random fluent person. The first punishes precise jargon. The second rewards it — because the jargon is the signature of knowing what you're doing.
13:40Bella: And the paper backs that intuition with a clean ablation. Use only the trained model's probability, or only the reference's — each lands a mediocre score, ranking around two-point-three on average. The ratio of the two ranks one-point-four. The subtraction isn't decoration; it's where the signal actually lives.
14:02Tyler: So the mechanism is settled. The obvious question is whether a quantity this clean actually wins on real tasks, against people who spent real effort building scorers. Does it hold up?
14:15Bella: Let's take it one application at a time, because they tested three. First, test-time scaling — the most practical one. Generate eight candidate runs, score them, pick the best, measure success. If the progress advantage is a good judge, picking with it should beat a single greedy attempt and beat trained reward models. And it does. On a small Gemma model, success goes from thirty-three percent with greedy decoding up to thirty-nine with progress-advantage selection. On a mid-size Qwen, fifty-five up to sixty-two. The best trained reward model they compared against — a model twice the size — barely moves the needle off greedy, sitting around thirty-four and fifty-five. Across four benchmarks the margin is roughly eleven to sixteen points over baselines.
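The selection procedure in this application is plain best-of-N: score every candidate run, keep the top one. The sketch below uses three candidates instead of eight to stay short. The per-step scores are invented, and averaging them into a trajectory score is an assumption made here; the episode later notes that the aggregation choice matters and has no universal winner.

```python
def best_of_n(candidates, score_trajectory):
    """Return the candidate with the highest trajectory-level score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_trajectory)

def mean_step_score(candidate):
    """Hypothetical aggregation: average the per-step scores."""
    steps = candidate["step_scores"]
    return sum(steps) / len(steps)

# Invented per-step progress-advantage scores for three candidate runs.
candidates = [
    {"id": "run_0", "step_scores": [0.2, -0.1, 0.3]},
    {"id": "run_1", "step_scores": [0.6, 0.4, 0.5]},
    {"id": "run_2", "step_scores": [1.0, -1.5, 0.1]},
]

chosen = best_of_n(candidates, mean_step_score)
print(chosen["id"])
```

Swapping `mean_step_score` for a trained reward model's score gives the baseline this selection is compared against.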
15:05Tyler: And the framing I can't get past — on the shopping benchmark it outperforms a PRM that was trained directly on that exact task. A training-free signal, beating the model built specifically for the job.
15:19Bella: Right. And to be honest about the ceiling: if an oracle could always pick the best of the eight, you'd get about forty-five and sixty-seven percent. So progress advantage closes roughly half of the gap between one shot and the best-possible pick, six of twelve points on Gemma and seven of twelve on Qwen, but not all of it.
15:36Tyler: Second application is the one that made me do a double take. Uncertainty quantification — can the score predict, in advance, whether a run will succeed or fail? That's your runtime safety monitor, the thing that flags a trajectory before it goes off a cliff.
15:54Bella: And this is where the Claude comparison lives. Set it up for me — why is that result supposed to be surprising?
16:02Tyler: Because the contender is a log-ratio of two open checkpoints, and the champion is Claude Sonnet, a frontier proprietary model, used as a judge. On the airline customer-service task, the progress advantage hits an AUROC of about zero-point-eight-seven. Claude sits at about zero-point-six-two. Zero-point-five is a coin flip. So the free signal isn't edging Claude out — it's in a different weight class on that split.
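For readers who want the metric pinned down: AUROC is the probability that a randomly chosen successful run gets a higher score than a randomly chosen failed one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name and the example scores are illustrative and not from the paper.

```python
def auroc(scores, labels):
    """Probability that a random success outscores a random failure (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative cases: a perfect judge, an uninformative one, and a mixed one.
perfect = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
coin_flip = auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
mixed = auroc([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])
print(perfect, coin_flip, mixed)
```

A judge that assigns every run the same score lands on the coin-flip value, which is the baseline both figures in this comparison are measured against.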
16:31Bella: And it generalizes in a way I didn't expect. The Gemma model's checkpoint pair can score another model's trajectories. It predicts the bigger Qwen models' success at around zero-point-seven-five and zero-point-seven-three — far above the confidence-only baselines down in the coin-flip range. So you can bolt it on as an external monitor for a model that isn't even the one you trained.
16:58Tyler: Third application, quickly, because the pattern's clear by now — failure attribution. In a multi-agent system that failed, which step caused it? Progress advantage finds the culprit step nearly as well as a method that was trained specifically to do failure attribution. Same theme: no task-specific training, rivaling the thing built for the task.
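The episode does not say how the culprit step is picked from the scores. One plausible reading, offered here only as an assumption, is to flag the step with the lowest progress advantage. The function name and the scores below are hypothetical.

```python
def most_suspect_step(step_scores):
    """Index of the step with the lowest score (first one on ties)."""
    if not step_scores:
        raise ValueError("trajectory has no steps")
    return min(range(len(step_scores)), key=step_scores.__getitem__)

# Invented per-step scores for one failed multi-agent trajectory.
scores = [0.4, 0.1, -2.3, 0.2, -0.5]
culprit = most_suspect_step(scores)
print(culprit)
```

Other rules, such as the first step to drop below a threshold, would be equally easy to try and may behave differently on long trajectories.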
17:22Bella: And the cost footnote is almost funny. Reproducing all three applications takes about forty-six GPU-hours on a single A100. The whole memory cost is just holding two copies of the model. That's the entire apparatus.
17:36Tyler: So this all sounds close to too good. Let me be the skeptic, because the paper is genuinely strong and it's strongest exactly where I want to push. Go back to the flag I planted. The theory is exact for the optimal RL policy. But the checkpoints you download are approximations — trained with configs you don't know, maybe under-trained, maybe over-regularized. The authors concede this directly: the assumption that a public model is near-optimal is, in their words, barely falsifiable, because you usually don't know the training setup. So the clean "exactly equals the advantage" — that's the theory. What's actually being computed on real checkpoints is an approximation of unknown quality, and they call for controlled experiments to check it that they didn't run.
18:28Bella: That's fair, and I'd add the wins aren't uniform. The headline is an average.
18:33Tyler: Right, and the average hides the weak case. On the retail customer-service split, progress advantage is merely competitive — Claude scores around zero-point-eight-five and zero-point-nine on the two newest models, versus progress advantage's roughly zero-point-six-nine. So "surpasses dedicated reward models" is true on average and false in specific places. And two more things bug me. They pick the best aggregation — how you roll per-token scores into a step score — per task, and they admit it meaningfully changes results with no universal winner. Choosing that against the benchmark risks tuning to it. And some of these AUROC numbers come from fifty or a hundred trajectories. Differences of a few points on fifty samples carry wide error bars they don't report.
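The aggregation concern is easy to see on a toy case. The sketch below rolls the same per-token scores into a step score three ways; the three aggregators and the numbers are chosen here for illustration and are not necessarily the ones the authors compared. The two steps swap rank depending on the rule, which is why picking the rule per task against the benchmark is a real degree of freedom.

```python
# Three hypothetical ways to turn per-token scores into one step score.
AGGREGATORS = {
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
    "min": min,
}

step_a = [0.5, 0.5, 0.5, 0.5]   # four mildly positive tokens
step_b = [1.6, -0.4]            # one strong token, one negative token

# Which step each aggregation rule prefers.
preferred = {
    name: "a" if agg(step_a) > agg(step_b) else "b"
    for name, agg in AGGREGATORS.items()
}
print(preferred)
```

Summing favors longer actions, averaging removes the length effect, and taking the minimum punishes a single bad token, so none of them is neutral.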
19:25Bella: So the steelman is: the exact-recovery framing is cleaner than what's running in practice, the evaluation has a degree of freedom pointed at the benchmark, and the small sets make the sharpest numbers noisier than they look.
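The small-sample worry can be made concrete with a percentile bootstrap, one common way to put an interval around an AUROC. Everything in this sketch is an assumption made for illustration: the synthetic fifty-trajectory dataset, the function names, and the choice of bootstrap over other interval methods. It is not a reanalysis of the paper's data.

```python
import random

def auroc(scores, labels):
    """Probability that a random success outscores a random failure (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_interval(scores, labels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling trajectories."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample lost one class, so AUROC is undefined
        estimates.append(auroc([scores[i] for i in idx], ys))
    if not estimates:
        raise ValueError("no usable resamples")
    estimates.sort()
    low = estimates[int((alpha / 2) * len(estimates))]
    high = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return low, high

# Synthetic data: 50 trajectories, half successes, with a noisy score.
data_rng = random.Random(1)
labels = [i % 2 for i in range(50)]
scores = [0.5 * y + data_rng.gauss(0.0, 1.0) for y in labels]

low, high = bootstrap_auroc_interval(scores, labels)
print(auroc(scores, labels), low, high)
```

On a set this small the interval tends to span far more than a few points, which is the reason gaps of that size between methods are hard to read without reported error bars.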
19:40Tyler: That's the honest version. I don't think it sinks the paper — the mechanism is sound and the flight example is real signal, not a fluke. But "free lunch that beats everything" should be "promising free signal that beats trained baselines on average, in a regime we can't fully verify." That's still a strong result. It's just not magic.
20:02Bella: And I'll concede the part I can't wave away — the optimality assumption is structurally unfalsifiable with closed training configs, and they say so. The cleanest version of this story requires labs to release more than they do. Which is actually the bigger idea to end on. The real result here isn't a new scoring method. It's a reframing: RL post-training isn't only "make the model better." It's also, secretly, "build a step-level evaluator for free." The artifact labs produce and mostly discard — the base-and-trained checkpoint pair — has value nobody designed it to have.
20:39Tyler: And that only pays off if the artifacts are public. The method physically cannot run if you can't see the reference model. So this lands as a quiet argument for open development: the byproducts you keep confidential might be the most useful thing you made — like a budgeting record you locked in a drawer and never opened.
21:00Bella: So here's the question for you. If a free step-level grader really is falling out of every RL run, should the field lean into that — squeeze more out of the checkpoints we already produce — or is leaning on a signal that's only exact for an optimal policy we can't verify a shortcut that'll bite us? Tell us which side you land on.
21:21Tyler: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with the related papers, like the DPO line this builds on, grouped by theme, plus our weekly and monthly roundups.
21:36Bella: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Tyler and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents," published June 24th, 2026; we recorded this the next day, June 25th.
21:59Tyler: The grader was in the drawer the whole time. The trick is knowing it's there — and being honest about how clearly you can actually read it.