Episode 019 · May 06, 2026 · 26 min

When the Best Reward Model Trains the Worst Policy: Inside EvoLM

Li, Xin, Xiao et al.

LLM Post-training
Paper: EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Venue: arXiv:2605.03871
Year: 2026
Read the paper: arxiv.org/abs/2605.03871

A 1.7B-parameter judge, handed the right rubric, evaluates responses better than GPT-4.1, and the rubric was written by a model training itself with no external supervisor. Even stranger: the reward model that wins the standard benchmarks produces the worst policy when you actually use it to train one. EvoLM suggests the field has been measuring reward quality with the wrong yardstick.

What you'll take away

  • Why defining rubric quality as 'does this make a weaker judge more accurate' turns evaluation into something you can train without humans or any other external supervisor
  • How temporal contrast, treating a model's older checkpoints as the 'worse' answer, bootstraps a reward signal entirely from a model's own training
  • The headline inversion: the scalar reward model that wins RewardBench-2 by 40 points produces a policy 9 points worse than EvoLM's when used for actual RL training
  • Why deliberately freezing a small, weak judge forces rubrics to become concrete checklists ('the answer is 144') rather than holistic criteria ('evaluate clarity')
  • Where the paper's story is thinner than the framing suggests, especially on subjective tasks and the unaudited assumption that newer checkpoints really are better than older ones
  • Why trained rubrics transfer across judges and domains, hinting at a future where reward signals are structured, inspectable artifacts rather than black-box scalars

Chapters

  1. 00:00 The supervisor's ceiling in RL post-training
  2. 03:13 Discriminative utility: defining when a rubric is good
  3. 06:27 Temporal contrast and the runner-versus-past-self trick
  4. 09:41 Why a deliberately weak judge is a feature
  5. 12:55 The benchmark-versus-training inversion
  6. 16:09 Steelmanning the skeptic
  7. 19:23 Rubrics that transfer across judges and domains
  8. 22:36 What this opens up


0:00Juniper: Here's a math problem. The perimeter of a rectangle is forty-eight. What's the largest possible area? You can probably do this: it's a square, side twelve, area one hundred forty-four. Now imagine you're writing a rubric to grade student answers. You'd probably write something like: did they apply the perimeter formula correctly, did they express area as a function of one variable, did they actually find the maximum. Five criteria, equal weights, the kind of thing a TA would draft.
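For readers who want the arithmetic behind Juniper's answer, here is a short worked derivation. It is a standard calculus argument rather than anything quoted from the paper, and the symbols $\ell$ and $w$ (the side lengths) and $A$ (the area) are names introduced here for illustration.

```latex
\begin{align*}
2(\ell + w) &= 48 \;\Rightarrow\; \ell = 24 - w \\
A(w) &= w\,(24 - w) = 24w - w^{2} \\
A'(w) &= 24 - 2w = 0 \;\Rightarrow\; w = 12,\quad \ell = 12 \\
A''(w) &= -2 < 0 \quad \text{(so the critical point is a maximum)} \\
A(12) &= 12 \cdot 12 = 144
\end{align*}
```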

0:33Tyler: A language model trained on this paper writes a different rubric. Three criteria, not five. And one criterion is worth eighty percent of the score. That criterion reads, and I'm barely paraphrasing, "the answer is one hundred forty-four, derived from a perimeter of forty-eight." The model has just put the answer inside the rubric.

0:55Juniper: Which sounds like cheating, until you realize who's reading the rubric. Welcome to AI Papers: A Deep Dive. The paper is "EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics," posted to arXiv yesterday; we're recording a day later. Quick note before we dig in: this whole episode is AI-generated. The script was written by an Anthropic model. You're hearing me, Juniper, and my co-host Tyler, and we're both AI voices from Eleven Labs. Neither company is involved in producing the show. And the reason that one-day turnaround matters is that this paper has a very specific, very surprising finding that we want to get to while it's fresh.

1:40Tyler: The surprising finding, just to put it on the table up front, is that a one-point-seven-billion-parameter model, given the right rubric, evaluates responses better than GPT-4.1. And the rubric was written by an eight-billion-parameter model that was simultaneously training itself using those very evaluations.

2:01Juniper: Right. So let's slow down and unpack what that even means, because the architecture here is not standard. The paper is from a team led by Shuyue Stella Li at the University of Washington, with collaborators at the Allen Institute and Penn. It's about post-training, the stage after pretraining where you fine-tune a model with reinforcement learning to make it actually useful as an assistant. And it's tackling what I'd call the supervisor's ceiling.

2:32Tyler: Which is what, exactly?

2:34Juniper: The whole RL post-training pipeline depends on something giving the model's answers a score. That something is the supervisor. And right now you have four options for who plays that role. You can pay humans to compare answers, which is slow, expensive, and capped by what humans can judge. You can pipe the answers through a stronger teacher model, which sets a hard ceiling, because you'll never beat your teacher. You can use automatic verifiers, which only work for math and code where there's a ground truth to check against. Or you can train a scalar reward model, a separate neural network that takes an answer and spits out a number.

3:14Tyler: And the scalar reward model is the workhorse of the field. Most modern post-training pipelines have one sitting in the middle, scoring everything. The problem is that they're famously gameable. Once your policy starts producing weird, sharp outputs that the reward model has never seen, the reward model starts giving misleading scores. It's the classic story.

3:39Juniper: Exactly. So here's the puzzle the paper opens with. Pretrained models already contain enormous evaluative knowledge. A model that can write a good explanation can usually tell you what makes an explanation good. It knows what a buggy program looks like even if it didn't write the bug. The question is: can you extract that latent evaluative knowledge from the model and turn it into a reward signal — for that same model — without bringing in any external supervisor?

4:11Tyler: Which on its face sounds like nonsense. If a model judges its own outputs, what stops it from declaring everything wonderful? You're grading your own homework. There has to be some external anchor or the whole loop just inflates.

4:25Juniper: That's the right worry, and the paper's bet is that the worry dissolves if you split evaluation into two pieces. There's *what to measure*, which is a rubric, and there's *applying the criteria to a specific answer*, which is a judge. If you keep those two roles separate, and crucially, if you make the judge small and frozen and dumb, suddenly there's a falsifiable question you can train against.

4:50Tyler: That's the move I want you to land, Juniper, because this is the conceptual heart of the paper.

4:56Juniper: Okay. So picture the teaching assistant analogy. You have a junior grader, someone who hasn't taken the class and doesn't really understand the material. Your job is to write a rubric so good that, when you hand it to that junior grader along with two essays, they reliably mark the better essay higher. That's the whole game. A rubric is *good* if it makes a less-capable judge more accurate at telling good responses from bad ones. The paper calls this discriminative utility, and it's the entire definition of rubric quality.

5:30Tyler: And the elegant bit, the thing that made me sit up, is that "rubric quality" defined this way is *measurable*. You don't need to ask a human whether the rubric is well-written. You just hand it to the small judge along with a known preference pair, and check whether the judge ranks the preferred answer higher. The rubric either makes the judge more accurate or it doesn't. There's a number at the end.
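The "number at the end" Tyler describes can be sketched as an accuracy over preference pairs. This is a minimal illustration, not the paper's implementation: the function name `discriminative_utility`, the `Judge` call signature, and `toy_judge` (a stand-in that only matches numbers quoted in the rubric) are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# A preference pair: (prompt, preferred_response, rejected_response).
Pair = Tuple[str, str, str]
# Hypothetical judge interface: (rubric, prompt, response) -> score.
Judge = Callable[[str, str, str], float]


def discriminative_utility(rubric: str, pairs: List[Pair], judge: Judge) -> float:
    """Fraction of pairs where the judge, given the rubric, scores the preferred response strictly higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = 0
    for prompt, preferred, rejected in pairs:
        if judge(rubric, prompt, preferred) > judge(rubric, prompt, rejected):
            correct += 1
    return correct / len(pairs)


def toy_judge(rubric: str, prompt: str, response: str) -> float:
    """Stand-in for a weak judge: it can only check whether numbers quoted in the rubric appear in the response."""
    expected = re.findall(r"\d+", rubric)
    if not expected:
        return 0.5  # nothing concrete to match, so every response looks the same
    found = re.findall(r"\d+", response)
    hits = sum(1 for value in expected if value in found)
    return hits / len(expected)


PAIRS: List[Pair] = [(
    "The perimeter of a rectangle is 48. What is the largest possible area?",
    "It is a square with side 12, so the area is 144.",
    "The area is 128.",
)]
VAGUE_RUBRIC = "Evaluate clarity."
CONCRETE_RUBRIC = "The answer is 144, derived from a perimeter of 48."

vague_utility = discriminative_utility(VAGUE_RUBRIC, PAIRS, toy_judge)
concrete_utility = discriminative_utility(CONCRETE_RUBRIC, PAIRS, toy_judge)
```

With this toy judge, the vague rubric gives both responses the same score and so never ranks the preferred one higher, while the concrete rubric does. A real judge would be a language model rather than a regular expression, but the measurement has the same shape.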

5:55Juniper: Right. Now, that requires having a known preference pair to begin with. You need to know which answer is supposed to be the better one before you can check whether the rubric helped the judge identify it. And that's where the second clever move comes in.

6:12Tyler: Which is where the model starts grading itself across time.

6:16Juniper: Exactly. The trick is called temporal contrast. As the model trains, you save its outputs at every checkpoint, tagged with the training step. To make a preference pair, you take a response the model produced just now, and you pair it against a response it produced twenty steps ago, or fifty steps ago, or a hundred. And you label the newer one as preferred.
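The pairing step Juniper describes can be sketched as follows. This is an illustrative reading of the transcript, not the paper's code: the `Snapshot` type, the function name, and the default gaps of 20, 50, and 100 steps (taken from the spoken example) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    step: int       # training step at which the response was generated
    response: str   # the policy's answer to one fixed prompt at that step


def temporal_contrast_pairs(
    history: List[Snapshot],
    current_step: int,
    gaps: Tuple[int, ...] = (20, 50, 100),
) -> List[Tuple[str, str]]:
    """Pair the current response with older ones; the newer response is labeled preferred."""
    by_step: Dict[int, str] = {snap.step: snap.response for snap in history}
    if current_step not in by_step:
        raise KeyError(f"no response saved at step {current_step}")
    newer = by_step[current_step]
    pairs: List[Tuple[str, str]] = []
    for gap in gaps:
        older = by_step.get(current_step - gap)
        if older is not None:
            pairs.append((newer, older))  # (preferred, rejected)
    return pairs


history = [Snapshot(0, "a0"), Snapshot(50, "a50"), Snapshot(80, "a80"), Snapshot(100, "a100")]
pairs_at_100 = temporal_contrast_pairs(history, current_step=100)
```

Note that the label comes only from the step number. Nothing here checks that the newer response is actually better, which is the assumption the hosts return to later.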

6:40Tyler: The runner-versus-their-past-self analogy is the right one here. You don't need a coach to tell you a seven-minute mile is better than an eight-minute mile. The timer does it. EvoLM treats the checkpoint from a hundred steps ago as the eight-minute version of itself, and assumes the current version ran a bit better. No human, no teacher model, no verifier. Just temporal contrast.

7:06Juniper: And there's a beautiful curriculum effect baked into this. Early in training, the gap between current and old responses is huge: easy preference pairs, easy rubrics. As training progresses, the gap narrows. The model is improving more slowly, the responses look more alike, and the rubrics have to get sharper to discriminate between them. The training task naturally gets harder as the model gets better.

7:34Tyler: I want to flag the load-bearing assumption here, though, Juniper, because this is the spot where a skeptic would push hardest. The whole training signal rests on "later is better than earlier." Which is plausible on average. But neural networks regress on specific behaviors all the time. They drift in style. They overfit to whatever the rubric currently rewards. The paper doesn't really audit how often the temporal-contrast preference is actually wrong.

8:05Juniper: That's fair. The authors are reasonably honest about it, in that they acknowledge the dependence, but they don't measure the noise floor. We'll come back to that. Let me finish the architectural picture first. So put the pieces together. You have one eight-billion-parameter model. It plays two roles, swapping between them with different prompts. As the policy, it generates answers. As the rubric generator, it writes evaluation criteria. There's a separate, small, frozen one-point-seven-billion-parameter judge that applies the rubric and produces a number, and that number is the reward.

8:44Tyler: And training alternates. Phase one, you freeze the rubric generator and train the policy using rubric-conditioned scores as the reward. Phase two, you freeze the policy and train the rubric generator on preference pairs from the policy's recent history. They swap every fifty steps. The judge stays frozen the whole time.
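The alternation Tyler describes amounts to a simple schedule. The sketch below only shows which role is trainable at a given step; the function names and the choice to start with the policy phase (following "phase one" in the transcript) are assumptions, and the actual RL updates are omitted.

```python
from typing import List, Tuple


def trainable_role(step: int, period: int = 50) -> str:
    """Return which role receives gradient updates at this step; the other role is frozen."""
    if step < 0 or period <= 0:
        raise ValueError("step must be >= 0 and period must be > 0")
    # Even-numbered blocks train the policy, odd-numbered blocks train the rubric generator.
    return "policy" if (step // period) % 2 == 0 else "rubric_generator"


def schedule(total_steps: int, period: int = 50) -> List[Tuple[int, int, str]]:
    """List (start, end, role) blocks covering steps [0, total_steps)."""
    blocks: List[Tuple[int, int, str]] = []
    for start in range(0, total_steps, period):
        end = min(start + period, total_steps)
        blocks.append((start, end, trainable_role(start, period)))
    return blocks


plan = schedule(150)
```

The judge never appears in the schedule because it is never trained.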

9:06Juniper: And the judge being frozen is not a budget decision, Tyler — it's a design constraint. This is the part I find most clever.

9:14Tyler: Walk through why the constraint matters, because it's not obvious that a deliberately weak judge is a feature rather than a bug.

9:23Juniper: If the judge is small and can't actually understand most problems on its own, then the only way for a rubric to score well is to be concrete enough that even a dim grader can apply it reliably. A rubric like "evaluate clarity" is useless to a small judge because there's no semantic toehold. A rubric like "the answer should contain the value one hundred forty-four" is something a one-point-seven-B model can pattern-match in its sleep.

9:52Tyler: So freezing the judge is how you force the rubric generator to produce instructions that work in any kitchen, even one with no chef. You're handicapping the cook on purpose, so the recipe writer has to write something executable.

10:08Juniper: That's exactly it. And now the rectangle example clicks. Let me come back to it. The prompted version of the model, just asked, cold, "write a rubric for this perimeter problem," produces what you'd expect a thoughtful TA to write. Five criteria, twenty percent each. Apply the perimeter formula correctly. Express area as a function of one variable. Find the critical point. Verify it's a maximum. State the answer. All sensible. All requiring the grader to actually understand the problem. After training, the rubric collapses to three criteria with very lopsided weights. Eighty percent on "the answer is one hundred forty-four." Fifteen percent on "uses the perimeter and area formulas correctly." Five percent on "provides a logical explanation." The rubric has reorganized itself around what a small judge can actually verify.
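To see what the lopsided weights do, here is a small sketch that scores one hypothetical response under both rubrics. The criteria and weights are the ones read out in the episode; the scoring rule (sum the weights of the criteria a judge marks as met) and the example verdicts are assumptions for illustration, not the paper's exact aggregation.

```python
from typing import List, Tuple

# Each criterion is (weight in percent, description).
Criterion = Tuple[int, str]

PROMPTED_RUBRIC: List[Criterion] = [
    (20, "applies the perimeter formula correctly"),
    (20, "expresses area as a function of one variable"),
    (20, "finds the critical point"),
    (20, "verifies it is a maximum"),
    (20, "states the answer"),
]
TRAINED_RUBRIC: List[Criterion] = [
    (80, "the answer is 144"),
    (15, "uses the perimeter and area formulas correctly"),
    (5, "provides a logical explanation"),
]


def rubric_score(rubric: List[Criterion], verdicts: List[bool]) -> int:
    """Sum the weights of the criteria marked as met; returns points out of 100."""
    if len(rubric) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    if sum(weight for weight, _ in rubric) != 100:
        raise ValueError("weights must sum to 100")
    return sum(weight for (weight, _), met in zip(rubric, verdicts) if met)


# Hypothetical response: sound method and explanation, but a wrong final answer.
prompted_points = rubric_score(PROMPTED_RUBRIC, [True, True, True, True, False])
trained_points = rubric_score(TRAINED_RUBRIC, [False, True, True])
```

Under these assumed verdicts the same response earns 80 points from the prompted rubric and 20 from the trained one, which is the sense in which the trained rubric reorganizes itself around the one check a small judge can make reliably.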

11:06Tyler: And when you look at this aggregated across the dataset, the trend is striking. In the prompted baseline, about twenty-two percent of criteria are what the paper calls label-only: short abstract things like "Clarity" or "Coherence." In the trained rubrics, that drops to zero-point-three percent. Functionally gone. Meanwhile criteria that embed specific expected values, such as actual numbers, named entities, or particular constraints, nearly triple, from seven percent to nineteen.

11:40Juniper: And there's a quote from the paper that I think captures this beautifully. They write that the common effect across all the trained rubrics is "moving evaluation from holistic semantic judgment, which small judges perform unreliably, to pattern matching over concrete criteria."

12:00Tyler: Which is a kind of philosophical inversion of how we usually think about good evaluation. The folk theory is that better rubrics are more nuanced, more comprehensive, more thoughtful. EvoLM's trained rubrics are the opposite. They're more checklist than essay guideline. Evaluation as pattern matching, not as reasoning.

12:22Juniper: And it works, which is the part that should make us all a little uncomfortable.

12:27Tyler: Speaking of uncomfortable, this is where the headline result of the paper lives, and I want to set it up properly because it's the one thing every listener should walk away thinking about. So you have this trained system: an eight-B rubric generator producing rubrics, a frozen one-point-seven-B judge applying them. You compare it against the obvious baseline, a state-of-the-art eight-billion-parameter scalar reward model, which is what most production pipelines use today. You run two evaluations. First, you ask: which of these systems is more accurate at preference judgment on the standard benchmarks? Things like RewardBench-2 and JudgeBench, where you're given pairs of responses and you have to pick the better one. The scalar reward model wins, decisively. Around eighty-six percent on RewardBench-2. EvoLM's rubrics, scored by that small frozen judge, get forty-six percent.

13:27Juniper: A forty-point gap. The scalar reward model is nearly twice as accurate on the static benchmark.

13:32Tyler: Then you do the actual experiment. You take both systems and use them as the reward signal in an RL training run. Same policy, same data, same everything; you just swap which evaluator is providing the reward. You train, you measure the resulting policy on twelve downstream tasks, and you see which trained model is actually better. The scalar reward model, the one that destroyed every static benchmark, produces the worst policy. About sixty percent average. EvoLM's rubrics produce the best policy. Just under seventy percent. A nine-point gap, in the *opposite direction* from the benchmark gap.

14:12Juniper: So the thing that looked best at evaluation was the worst at training. And the thing that looked worst at evaluation was the best at training.

14:22Tyler: That's the inversion. And it's not a small effect. It's the difference between the system that wins your benchmark and the system that loses it, with the relationship to actual training utility flipped.

14:36Juniper: Tyler, what's your read on why?

14:38Tyler: Two things, and they're both implicit in what we already said. First, reward overoptimization. The scalar reward model was trained on a fixed distribution. The moment the policy starts producing outputs the reward model hasn't seen, the reward signal starts drifting from anything meaningful. The benchmark doesn't catch this because the benchmark is also a fixed distribution. Second, and this is the deeper point, the static benchmarks measure how well an evaluator agrees with held-out human preferences on a fixed dataset. But what an RL policy actually needs from its evaluator is something different: a signal that *adapts* as the policy improves. Once your policy has stopped making the easy mistakes, you need a reward signal that can still discriminate between subtle near-misses. A scalar reward model sitting on top of a fixed dataset can't do that. Co-evolving rubrics can: the rubrics get sharper as the policy gets better, by design.

15:43Juniper: The driving instructor analogy is pretty clean here. Imagine two instructors. Instructor A scores perfectly on the written test about driving. Instructor B is mediocre on the test but has a knack for noticing what *this particular student* is doing wrong today. If you have to learn to drive, you want instructor B. The written test measures static knowledge. Teaching a moving learner is a different skill.

16:10Tyler: And the field has been measuring reward models with the written test.

16:15Juniper: Right. Which is genuinely uncomfortable, because reward model benchmarks are not some niche thing; they're how a lot of post-training research gets evaluated. The paper isn't saying the benchmarks are wrong, exactly. It's saying they're measuring a different property than the one we actually care about, and we've been assuming they were the same.

16:38Tyler: Now, let me steelman the other side, because I don't want to oversell this. There are a few places where I think the paper is doing slightly more rhetorical work than the data supports. The rectangle example we keep coming back to, with "one hundred forty-four" directly in the rubric, works because there's a verifiable answer. The authors are honest about this; they say rubric enrichment is most clearly observed in tasks with verifiable intermediate steps. On purely subjective tasks, such as style, creativity, or emotional support, the qualitative story is much thinner. We don't really know what "rubric evolution" looks like there, and the mechanism plausibly doesn't bite as hard. The headline number, that EvoLM rubrics beat GPT-4.1's prompted rubrics by about twenty-six points, is fair within the paper's framing, but it's not "EvoLM beats GPT-4.1" in any general sense. It's "EvoLM rubrics get more out of a one-point-seven-B judge than GPT-4.1's rubrics do." Which is genuinely interesting, but a headline that says "an eight-B model outperforms GPT-4.1" is technically accurate and slightly misleading.

17:50Juniper: That's a fair pull-back.

17:52Tyler: And the temporal contrast assumption, the runner-versus-past-self thing, is a real load-bearing premise that doesn't get audited. We took it for granted that newer checkpoints are better than older ones, and the whole training signal flows from that. But neural networks can regress on Tuesday because they overfit on Monday. If your "preferred" labels are wrong even, say, ten percent of the time, that noise gets baked into the rubric generator's training. The paper doesn't measure this, and I think a more cautious version of the claim would acknowledge that the temporal-contrast signal is approximate, and would actually quantify the approximation.
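One way to get a feel for Tyler's ten-percent hypothetical is a back-of-envelope calculation of how label flips distort the measured agreement. This is not an analysis from the paper: the function name is made up for the example, and it assumes flips are symmetric and independent of the judge's own mistakes.

```python
def observed_agreement(judge_accuracy: float, flip_rate: float) -> float:
    """Expected agreement between a judge and noisy labels.

    Assumes each "preferred" label is flipped independently with probability
    flip_rate, regardless of whether the judge got that pair right.
    """
    for name, value in (("judge_accuracy", judge_accuracy), ("flip_rate", flip_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # The judge matches the noisy label when it is right and the label is clean,
    # or when it is wrong and the label happens to be flipped.
    return judge_accuracy * (1.0 - flip_rate) + (1.0 - judge_accuracy) * flip_rate


# A judge that is truly right 90% of the time, scored against labels that are wrong 10% of the time.
measured = observed_agreement(0.9, 0.1)
```

Under these assumptions a judge that is truly 90 percent accurate would measure at about 82 percent, and even a perfect judge would top out near 90 percent. That is the sense in which an unmeasured noise floor caps what the rubric generator's training signal can express.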

18:35Juniper: I'd add one more, which is what you might call the evaluation-cost problem. The paper makes a strong case that static benchmarks like RewardBench-2 don't predict downstream policy quality. Fine. But that means within EvoLM's own design space, when they're picking hyperparameters, choosing the alternation frequency, or deciding the step gap for temporal contrast, they can't trust the cheap evaluations either. The only ground truth is running the full RL pipeline and measuring the resulting policy. That's an expensive evaluation regime to build a research program on, and I think it's a real practical constraint on this whole line of work.

19:17Tyler: That's the deepest one, actually. If you can't trust your fast evaluations to predict your slow ones, your iteration cycle gets really long.

19:26Juniper: All that said, I want to come back to one more empirical result, because I think it's the most underappreciated finding in the paper. The transfer. You take a rubric generator that was trained against a one-point-seven-B judge, and at inference time you swap in a totally different judge, an eight-B Qwen among others. The rubrics still work. In fact, a bigger judge applying the same trained rubrics gets a much higher accuracy than it would applying GPT-4.1's prompted rubrics: about a twenty-three-point gap.

20:05Tyler: And the cross-domain version is even more striking. They train the rubric generator on general-purpose Tulu data, basically nothing specialized. Then they evaluate the rubrics on a medical benchmark and on a research question-answering benchmark. The metric is: do these rubrics agree with rubrics that actual human experts wrote in medicine and research? EvoLM's rubrics, trained on general data, agree with expert human rubrics *better* than the comparison baseline's do. Fifty-eight versus fifty-three on medicine. Fifty-nine versus fifty-one on research.

20:44Juniper: Which suggests the generator isn't just learning "what makes the small judge happy." It's learning something more general about evaluation structure that transfers across judges and across domains. The trained rubrics are an artifact you can take out of the system and use elsewhere.

21:04Tyler: And that's where I think the real long-term implication of this paper sits. We've been treating reward signals as opaque: a number out of a reward model. EvoLM is one of a few recent papers pushing toward reward signals that are *structured*, *inspectable*, and *transferable*. The rubric is a learned object, but it's a readable learned object. You can look at it. You can edit it. You can hand it to a different judge. That's a different research surface than scalar reward.

21:38Juniper: There's also a bigger framing question here, which is the one about self-improvement. The pessimistic intuition is that you can't bootstrap: if a model already knew how to score answers correctly, it would already know how to produce good answers. The optimistic intuition is that knowing-how-to-evaluate and knowing-how-to-generate are different skills inside the same network, and the gap between them is real room for self-improvement. EvoLM is firmly in the optimistic camp, and I think it operationalizes that optimism more concretely than most prior work. The gap it's exploiting is specifically: the model knows enough to write criteria that distinguish good from bad answers, even when it can't reliably produce the best answer on every attempt. The generator's knowledge of "what good looks like" is being squeezed out into explicit criteria, and those criteria then guide the policy toward producing better outputs.

22:41Tyler: It's a small amount of bootstrap per cycle, but it compounds. And because the judge is frozen, you can't accidentally compound errors at the end — the judge is the same one you started with.

22:54Juniper: One last thing worth flagging, because the paper is honest about it. There are some architectures where this doesn't quite work. They tested the same setup on a Llama 8B model and saw it break down on at least one benchmark when forced to play both rubric-generator and policy roles simultaneously. A 7B model handles the dual role fine. Qwen3 handles it fine. Llama doesn't. Which suggests that some part of what makes this work depends on the base model's training in a way the authors don't fully understand yet.

23:31Tyler: And the easy fix there is a two-model configuration: one model is the rubric generator, a different model is the policy. They show this works comparably to the parameter-shared version. So architecturally the method is robust, but operationally on some bases you need two copies, which doubles your memory.

23:52Juniper: Which is a fair caveat to put on the bottom of any practitioner's checklist.

23:57Tyler: My wrap-up read on this paper, then, Juniper. The clever architectural move is real. The discriminative utility framing is genuinely useful: defining rubric quality as "does this rubric make a less-capable judge more accurate" turns evaluation training into something you can actually measure and optimize against without external supervision. The empirical payoff on downstream policy quality is real, and it's not small. The headline inversion, best benchmark and worst policy, is the most important thing in the paper for the field to absorb, because it suggests our current way of measuring reward models is misaligned with what we actually need them for in RL training. The places to push back are around the temporal-contrast assumption, which isn't audited, and around the rubric-evolution story being demonstrated mostly on tasks with verifiable answers.

24:50Juniper: And mine: I think the most interesting downstream question this opens up is what happens when the policy and the rubric generator are co-evolving against each other in a tighter loop. EvoLM does this in fifty-step alternations with a frozen judge. There's a whole design space underneath where the judge could also evolve, where the rubric language could evolve, where you could have multiple rubric generators producing diverse criteria. And the paper has shown that the basic loop works. That's the kind of result that opens up a research program rather than closing one off.

25:25Tyler: And that's the right note to end on: a paper whose biggest contribution might be the questions it makes available, not just the ones it answers.

25:34Juniper: That's our episode on EvoLM. The show notes have a link to the paper and related materials. It's worth a read if you want to see the rectangle rubric in full, because it's the kind of thing where seeing the actual rubric makes the abstract argument concrete.

25:50Tyler: Thanks for listening to AI Papers: A Deep Dive.