Episode 019 · May 06, 2026 · 26 min

When the Best Reward Model Trains the Worst Policy: Inside EvoLM

Li, Xin, Xiao et al.

LLM Post-training

Concepts in this episode

Training Methods · Evaluation & Benchmarks · AI Alignment · Rubric Generation · Reward Model · RL Post-Training · Temporal Contrast · LLM-as-Judge · Reward Hacking · Reward Overoptimization · Self-Play / Self-Evolution · RewardBench · Supervised Fine-Tuning

About this episode

Paper

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

Venue

arXiv:2605.03871

Year

2026

Read the paper

arxiv.org/abs/2605.03871

A 1.7B-parameter judge, handed the right rubric, evaluates responses better than GPT-4.1 — and the rubric was written by a model training itself with no external supervisor. Even stranger: the reward model that wins the standard benchmarks produces the worst policy when you actually use it to train one. EvoLM suggests the field has been measuring reward quality with the wrong yardstick.

What you'll take away

Why defining rubric quality as 'does this make a weaker judge more accurate' turns evaluation into something you can train without humans, GPT-4, or verifiers
How temporal contrast — treating a model's older checkpoints as the 'worse' answer — bootstraps a reward signal entirely from a model's own training trajectory
The headline inversion: the scalar reward model that wins RewardBench-2 by 40 points produces a policy 9 points worse than EvoLM's rubrics when used for actual RL training
Why deliberately freezing a small, weak judge forces rubrics to become concrete checklists ('the answer is 144') rather than holistic criteria ('evaluate clarity')
Where the paper's story is thinner than the framing suggests — especially on subjective tasks and the unaudited assumption that newer checkpoints really are better than older ones
Why trained rubrics transfer across judges and domains, hinting at a future where reward signals are structured, inspectable artifacts rather than black-box scalars

Chapters

00:00 The supervisor's ceiling in RL post-training
03:13 Discriminative utility: defining when a rubric is good
06:27 Temporal contrast and the runner-versus-past-self trick
09:41 Why a deliberately weak judge is a feature
12:55 The benchmark-versus-training inversion
16:09 Steelmanning the skeptic
19:23 Rubrics that transfer across judges and domains
22:36 What this opens up

References in this episode

Constitutional AI: Harmlessness from AI Feedback — An earlier and influential approach to using model-generated criteria as a train
Scaling Laws for Reward Model Overoptimization — Gao, Schulman, and Hilton's systematic study of how scalar reward models break d
Self-Rewarding Language Models — Yuan et al.'s LLM-as-a-judge self-improvement loop, a natural counterpoint to Ev
RewardBench: Evaluating Reward Models for Language Modeling — The benchmark whose predictive validity the episode questions — worth reading to

Full transcript

0:00Juniper: Here's a math problem. The perimeter of a rectangle is forty-eight. What's the largest possible area? You can probably do this — it's a square, side twelve, area one hundred forty-four. Now imagine you're writing a rubric to grade student answers. You'd probably write something like: did they apply the perimeter formula correctly, did they express area as a function of one variable, did they actually find the maximum. Five criteria, equal weight, the kind of thing a TA would draft.
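
For reference, the arithmetic behind the rectangle example can be written out as a short derivation. This is the standard textbook argument, not something quoted from the paper; the symbols $\ell$ and $w$ for the side lengths and $A$ for the area are names assumed here for illustration.

```latex
\begin{aligned}
2(\ell + w) &= 48 \;\Rightarrow\; \ell = 24 - w, \\
A(w) &= w\,(24 - w) = 24w - w^{2}, \\
A'(w) &= 24 - 2w = 0 \;\Rightarrow\; w = 12,\quad \ell = 24 - 12 = 12, \\
A''(w) &= -2 < 0 \quad \text{(so the critical point is a maximum)}, \\
A(12) &= 12 \cdot 12 = 144.
\end{aligned}
```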

0:33Tyler: A language model trained on this paper writes a different rubric. Three criteria, not five. And one criterion is worth eighty percent of the score. That criterion reads — and I'm barely paraphrasing — "the answer is one hundred forty-four, derived from a perimeter of forty-eight." The rubric has just put the answer inside the rubric.

0:55Juniper: Which sounds like cheating, until you realize who's reading the rubric. Welcome to AI Papers: A Deep Dive. The paper is "EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics," posted to arXiv yesterday; we're recording a day later. Quick note before we dig in: this whole episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. You're hearing me, Juniper, and my co-host Tyler, and we're both AI voices from ElevenLabs. Neither company is involved in producing the show. And the reason that one-day turnaround matters is that this paper has a very specific, very surprising finding that we want to get to while it's fresh.

1:40Tyler: The surprising finding, just to put it on the table up front, is that a one-point-seven-billion-parameter model, given the right rubric, evaluates responses better than GPT-4.1. And the rubric was written by an eight-billion-parameter model that was simultaneously training itself using those very evaluations.

2:01Juniper: Right. So let's slow down and unpack what that even means, because the architecture here is not standard. The paper is from a team led by Shuyue Stella Li at the University of Washington, with collaborators at the Allen Institute and Penn. It's about post-training, the stage after pretraining where you fine-tune a model with reinforcement learning to make it actually useful as an assistant. And it's tackling what I'd call the supervisor's ceiling.

2:32Tyler: Which is what, exactly?

2:34Juniper: The whole RL post-training pipeline depends on something giving the model's answers a score. That something is the supervisor. And right now you have four options for who plays that role. You can pay humans to compare answers — slow, expensive, and capped by what humans can judge. You can pipe the answers through GPT-4 — a hard ceiling, because you'll never beat your teacher. You can use automatic verifiers, which only works for math and code where there's a ground truth to check against. Or you can train a scalar reward model — a separate neural network that takes an answer and spits out a number.

3:14Tyler: And the scalar reward model is the workhorse of the field. Most modern post-training pipelines have one sitting in the middle, scoring everything. The problem is that they're famously gameable. Once your policy starts producing weird, sharp outputs that the reward model has never seen, the reward model starts giving misleading scores. It's the classic reward-hacking story.

3:39Juniper: Exactly. So here's the puzzle the paper opens with. Pretrained models already contain enormous evaluative knowledge. A model that can write a good explanation can usually tell you what makes an explanation good. It knows what a buggy program looks like even if it didn't write the bug. The question is: can you extract that latent evaluative knowledge from the model and turn it into a reward signal — for that same model — without bringing in any external supervisor?

4:11Tyler: Which on its face sounds like nonsense. If a model judges its own outputs, what stops it from declaring everything wonderful? You're grading your own homework. There has to be some external anchor or the whole loop just inflates.

4:25Juniper: That's the right worry, and the paper's bet is that the worry dissolves if you split evaluation into two pieces. There's *what to measure* — that's a rubric — and there's *applying the criteria to a specific answer* — that's a judge. If you keep those two roles separate, and crucially, if you make the judge small and frozen and dumb, suddenly there's a falsifiable question you can train against.

4:50Tyler: That's the move I want you to land, Juniper, because this is the conceptual heart of the paper.

4:56Juniper: Okay. So picture the teaching assistant analogy. You have a junior grader — someone who hasn't taken the class, doesn't really understand the material. Your job is to write a rubric so good that, when you hand it to that junior grader along with two essays, they reliably mark the better essay higher. That's the whole game. A rubric is *good* if it makes a less-capable judge more accurate at telling good responses from bad ones. The paper calls this discriminative utility, and it's the entire definition of rubric quality.

5:30Tyler: And the elegant bit — this is what made me sit up — is that "rubric quality" defined this way is *measurable*. You don't need to ask a human whether the rubric is well-written. You just hand it to the small judge along with a known preference pair, and check whether the judge ranks the preferred answer higher. The rubric either makes the judge more accurate or it doesn't. There's a number at the end.
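
The measurement Tyler describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `discriminative_utility`, `toy_judge`, and the example pairs are hypothetical names and data, the toy judge is a substring matcher standing in for the small frozen model, and a single rubric is reused across pairs that share a prompt.

```python
from typing import Callable, List, Tuple

# (prompt, preferred_response, rejected_response)
PreferencePair = Tuple[str, str, str]
# judge(rubric, prompt, response) -> numeric score
Judge = Callable[[str, str, str], float]


def discriminative_utility(rubric: str, pairs: List[PreferencePair], judge: Judge) -> float:
    """Fraction of pairs where the judge, given the rubric, scores the
    preferred response strictly higher than the rejected one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for prompt, preferred, rejected in pairs
        if judge(rubric, prompt, preferred) > judge(rubric, prompt, rejected)
    )
    return correct / len(pairs)


def toy_judge(rubric: str, prompt: str, response: str) -> float:
    """Hypothetical stand-in for a weak judge: counts how many rubric
    lines appear verbatim in the response. It has no understanding of
    the problem, so only concrete criteria can help it."""
    criteria = [line.strip() for line in rubric.splitlines() if line.strip()]
    return float(sum(c in response for c in criteria))


question = "What is the largest area of a rectangle with perimeter 48?"
pairs = [
    (question, "It is a square of side 12, so the area is 144.", "The area is 128."),
    (question, "Side 12 each, giving 144.", "I think it is about 100."),
]

vague_utility = discriminative_utility("Clarity", pairs, toy_judge)
concrete_utility = discriminative_utility("144", pairs, toy_judge)
print(vague_utility, concrete_utility)
```

With this toy judge, the abstract label gives it nothing to match, so both responses tie and the rubric earns no credit, while the concrete criterion separates every pair.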

5:55Juniper: Right. Now, that requires having a known preference pair to begin with — you need to know which answer is supposed to be the better one, before you can check whether the rubric helped the judge identify it. And that's where the second clever move comes in.

6:12Tyler: Which is where the model starts grading itself across time.

6:16Juniper: Exactly. The trick is called temporal contrast. As the model trains, you save its outputs at every checkpoint, tagged with the training step. To make a preference pair, you take a response the model produced just now, and you pair it against a response it produced twenty steps ago, or fifty steps ago, or a hundred. And you label the newer one as preferred.

6:40Tyler: The runner-versus-their-past-self analogy is the right one here. You don't need a coach to tell you a seven-minute mile is better than an eight-minute mile. The timer does it. EvoLM treats the checkpoint from a hundred steps ago as the eight-minute version of itself, and assumes the current version ran a bit better. No human, no GPT-4, no verifier. Just temporal trajectory.
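
A rough sketch of how temporal-contrast pairs might be assembled from saved outputs follows. The `Sample` record, the function name, and the default gaps of 20, 50, and 100 steps (taken from the examples mentioned in the conversation) are assumptions for illustration; the paper's actual data structures are not described in this episode. Skipping identical responses is also a choice made here, not a documented detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    step: int       # training step at which the response was generated
    prompt: str
    response: str


def temporal_contrast_pairs(
    history: List[Sample],
    current_step: int,
    gaps: Sequence[int] = (20, 50, 100),
) -> List[Tuple[str, str, str]]:
    """Build (prompt, preferred, rejected) triples by pairing each response
    from current_step with the same prompt's response from an older step.

    The newer response is labeled preferred. That label is an assumption
    ("later is better") and can be wrong for individual pairs.
    """
    by_key: Dict[Tuple[int, str], str] = {
        (s.step, s.prompt): s.response for s in history
    }
    pairs: List[Tuple[str, str, str]] = []
    for (step, prompt), newer in by_key.items():
        if step != current_step:
            continue
        for gap in gaps:
            older = by_key.get((current_step - gap, prompt))
            # Skip missing checkpoints and identical outputs (no contrast).
            if older is not None and older != newer:
                pairs.append((prompt, newer, older))
    return pairs


history = [
    Sample(50, "p", "draft at step 50"),
    Sample(80, "p", "draft at step 80"),
    Sample(100, "p", "draft at step 100"),
]
pairs = temporal_contrast_pairs(history, current_step=100)
print(pairs)
```

Here the step-100 response is paired against the step-80 and step-50 responses; the 100-step gap finds no saved output at step 0 and is skipped.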

7:06Juniper: And there's a beautiful curriculum effect baked into this. Early in training, the gap between current and old responses is huge — easy preference pairs, easy rubrics. As training progresses, the gap narrows. The model is improving more slowly, the responses look more alike, and the rubrics have to get sharper to discriminate between them. The training task naturally gets harder as the model gets better.

7:34Tyler: I want to flag the load-bearing assumption here, though, Juniper, because this is the spot where a skeptic would push hardest. The whole training signal rests on "later is better than earlier." Which is plausible on average. But neural networks regress on specific behaviors all the time. They drift in style. They overfit to whatever the rubric currently rewards. The paper doesn't really audit how often the temporal-contrast preference is actually wrong.

8:05Juniper: That's fair. The authors are reasonably honest about it — they acknowledge the dependence — but they don't measure the noise floor. We'll come back to that. Let me finish the architectural picture first. So put the pieces together. You have one eight-billion-parameter model. It plays two roles, swapping between them with different prompts. As the policy, it generates answers. As the rubric generator, it writes evaluation criteria. There's a separate, small, frozen one-point-seven-billion-parameter judge that applies the rubric and produces a number — that number is the reward.

8:44Tyler: And training alternates. Phase one, you freeze the rubric generator and train the policy using rubric-conditioned scores as the reward. Phase two, you freeze the policy and train the rubric generator on preference pairs from the policy's recent history. They swap every fifty steps. The judge stays frozen the whole time.
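
The alternation can be pictured as a simple schedule. The sketch below only tracks which component would receive updates at each step; it assumes the policy phase comes first and that phases switch every 50 steps, as described above. The function names are hypothetical and no actual training happens.

```python
def active_phase(step: int, period: int = 50) -> str:
    """Return which role is being trained at a given step.

    Even-numbered blocks of `period` steps train the policy (rubric
    generator frozen); odd-numbered blocks train the rubric generator
    (policy frozen). Starting with the policy is an assumption.
    """
    if step < 0 or period <= 0:
        raise ValueError("step must be >= 0 and period must be > 0")
    return "policy" if (step // period) % 2 == 0 else "rubric_generator"


def count_updates(total_steps: int, period: int = 50) -> dict:
    """Count how many update steps each component would receive."""
    updates = {"policy": 0, "rubric_generator": 0, "judge": 0}
    for step in range(total_steps):
        updates[active_phase(step, period)] += 1
        # The judge is frozen, so its counter is never incremented.
    return updates


schedule = count_updates(200)
print(schedule)
```

Because both roles are played by the same eight-billion-parameter model under different prompts, "freezing" a role here means not optimizing against that role's objective during the phase, while the judge's weights never change at all.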

9:06Juniper: And the judge being frozen is not a budget decision, Tyler — it's a design constraint. This is the part I find most clever.

9:14Tyler: Walk through why the constraint matters, because it's not obvious that a deliberately weak judge is a feature rather than a bug.

9:23Juniper: If the judge is small and can't actually understand most problems on its own, then the only way for a rubric to score well is to be concrete enough that even a dim grader can apply it reliably. A rubric like "evaluate clarity" is useless to a small judge — there's no semantic toehold. A rubric like "the answer should contain the value one hundred forty-four" is something a one-point-seven-B model can pattern-match in its sleep.

9:52Tyler: So freezing the judge is how you force the rubric generator to produce instructions that work in any kitchen — even one with no chef. You're handicapping the cook on purpose, so the recipe writer has to write something executable.

10:08Juniper: That's exactly it. And now the rectangle example clicks. Let me come back to it. The prompted version of the model — just asked, cold, "write a rubric for this perimeter problem" — produces what you'd expect a thoughtful TA to write. Five criteria, twenty percent each. Apply the perimeter formula correctly. Express area as a function of one variable. Find the critical point. Verify it's a maximum. State the answer. All sensible. All requiring the grader to actually understand the problem. After training, the rubric collapses to three criteria with very lopsided weights. Eighty percent on "the answer is one hundred forty-four." Fifteen percent on "uses the perimeter and area formulas correctly." Five percent on "provides a logical explanation." The rubric has reorganized itself around what a small judge can actually verify.
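
To make the lopsided weighting concrete, here is a small sketch of how a weighted rubric could be turned into a single reward. The dictionary layout is a hypothetical schema, not the paper's actual rubric format, and the per-criterion verdicts are supplied by hand where a judge model would normally produce them. The weights are the 80/15/5 split quoted above.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical rubric schema; weights are percentages from the episode.
trained_rubric: List[Dict[str, Union[str, int]]] = [
    {"criterion": "The answer is 144", "weight": 80},
    {"criterion": "Uses the perimeter and area formulas correctly", "weight": 15},
    {"criterion": "Provides a logical explanation", "weight": 5},
]


def rubric_score(rubric: List[Dict[str, Union[str, int]]], verdicts: Sequence[bool]) -> float:
    """Weighted fraction of criteria the judge marked as satisfied.

    `verdicts[i]` is the judge's pass/fail decision for `rubric[i]`.
    The result is normalized to the range [0, 1].
    """
    if len(rubric) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    total = sum(int(item["weight"]) for item in rubric)
    earned = sum(int(item["weight"]) for item, ok in zip(rubric, verdicts) if ok)
    return earned / total


right_answer_only = rubric_score(trained_rubric, [True, False, False])
good_method_wrong_answer = rubric_score(trained_rubric, [False, True, True])
print(right_answer_only, good_method_wrong_answer)
```

Under these weights, a response with sound method but the wrong final value scores 0.2, while a bare correct answer scores 0.8. That asymmetry is what lets a weak judge separate responses by checking one concrete value.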

11:06Tyler: And when you look at this aggregated across the dataset, the trend is striking. In the prompted baseline, about twenty-two percent of criteria are what the paper calls label-only — short abstract things like "Clarity" or "Coherence." In the trained rubrics, that drops to zero-point-three percent. Functionally gone. Meanwhile criteria that embed specific expected values — actual numbers, named entities, particular constraints — nearly triple, from seven percent to nineteen.

11:40Juniper: And there's a quote from the paper that I think captures this beautifully. They write that the common effect across all the trained rubrics is "moving evaluation from holistic semantic judgment, which small judges perform unreliably, to pattern matching over concrete criteria."

12:00Tyler: Which is a kind of philosophical inversion of how we usually think about good evaluation. The folk theory is that better rubrics are more nuanced, more comprehensive, more thoughtful. EvoLM's trained rubrics are the opposite. They're more checklist than essay guideline. Evaluation as pattern matching, not as reasoning.

12:22Juniper: And it works, which is the part that should make us all a little uncomfortable.

12:27Tyler: Speaking of uncomfortable, this is where the headline result of the paper lives, and I want to set it up properly because it's the one thing every listener should walk away thinking about. So you have this trained system: an eight-B rubric generator producing rubrics, a frozen one-point-seven-B judge applying them. You compare it against the obvious baseline — a state-of-the-art eight-billion-parameter scalar reward model, which is what most production pipelines use today. You run two evaluations. First, you ask: which of these systems is more accurate at preference judgment on the standard benchmarks? Things like RewardBench-2 and JudgeBench, where you're given pairs of responses and you have to pick the better one. The scalar reward model wins, decisively. Around eighty-six percent on RewardBench-2. EvoLM's rubrics, scored by that small frozen judge, get forty-six percent.

13:27Juniper: A forty-point gap. The scalar reward model is nearly twice as accurate on the static benchmark.

13:32Tyler: Then you do the actual experiment. You take both systems and use them as the reward signal in an RL training run. Same policy, same data, same everything — just swap which evaluator is providing the reward. You train, you measure the resulting policy on twelve downstream tasks, and you see which trained model is actually better. The scalar reward model — the one that destroyed every static benchmark — produces the worst policy. About sixty percent average. EvoLM's rubrics produce the best policy. Just under seventy percent. A nine-point gap, in the *opposite direction* from the benchmark gap.

14:12Juniper: So the thing that looked best at evaluation was the worst at training. And the thing that looked worst at evaluation was the best at training.

14:22Tyler: That's the inversion. And it's not a small effect. It's the difference between the system that wins your benchmark and the system that loses it, with the relationship to actual training utility flipped.

14:36Juniper: Tyler, what's your read on why?

14:38Tyler: Two things, and they're both implicit in what we already said. First, reward overoptimization. The scalar reward model was trained on a fixed distribution. The moment the policy starts producing outputs the reward model hasn't seen, the reward signal starts drifting from anything meaningful. The benchmark doesn't catch this because the benchmark is also a fixed distribution. Second — and this is the deeper point — the static benchmarks measure how well an evaluator agrees with held-out human preferences on a fixed dataset. But what an RL policy actually needs from its evaluator is something different: a signal that *adapts* as the policy improves. Once your policy has stopped making the easy mistakes, you need a reward signal that can still discriminate between subtle near-misses. A scalar reward model sitting on top of a fixed dataset can't do that. Co-evolving rubrics can — the rubrics get sharper as the policy gets better, by design.

15:43Juniper: The driving instructor analogy is pretty clean here. Imagine two instructors. Instructor A scores perfectly on the written test about driving. Instructor B is mediocre on the test but has a knack for noticing what *this particular student* is doing wrong today. If you have to learn to drive, you want instructor B. The written test measures static knowledge. Teaching a moving learner is a different skill.

16:10Tyler: And the field has been measuring reward models with the written test.

16:15Juniper: Right. Which is genuinely uncomfortable, because reward model benchmarks are not some niche thing — they're how a lot of post-training research gets evaluated. The paper isn't saying the benchmarks are wrong, exactly. It's saying they're measuring a different property than the one we actually care about, and we've been assuming they were the same.

16:38Tyler: Now, let me steelman the other side, because I don't want to oversell this. There are a few places where I think the paper is doing slightly more rhetorical work than the data supports. The rectangle example we keep coming back to — embedding "one hundred forty-four" directly in the rubric — works because there's a verifiable answer. The authors are honest about this; they say rubric enrichment is most clearly observed in tasks with verifiable intermediate steps. On purely subjective tasks — style, creativity, emotional support — the qualitative story is much thinner. We don't really know what "rubric evolution" looks like there, and the mechanism plausibly doesn't bite as hard. The headline RewardBench number — that EvoLM rubrics beat GPT-4.1's prompted rubrics by about twenty-six points — is fair within the paper's framing, but it's not "EvoLM beats GPT-4.1" in any general sense. It's "EvoLM rubrics get more out of a one-point-seven-B judge than GPT-4.1's rubrics do." Which is genuinely interesting, but a headline that says "an eight-B model outperforms GPT-4.1" is technically accurate and slightly misleading.

17:50Juniper: That's a fair pull-back.

17:52Tyler: And the temporal contrast assumption — the runner-versus-past-self thing — is a real load-bearing premise that doesn't get audited. We took it for granted that newer checkpoints are better than older ones, and the whole training signal flows from that. But neural networks can regress on Tuesday because they overfit on Monday. If your "preferred" labels are wrong even, say, ten percent of the time, that noise gets baked into the rubric generator's training. The paper doesn't measure this, and I think a more cautious version of the claim would acknowledge that the temporal-contrast signal is approximate, and would actually quantify the approximation.
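
The effect of mislabeled pairs can be stated as a short formula. This is a generic label-noise identity under assumptions introduced here, not a result from the paper: each temporal-contrast label is flipped independently with probability $\varepsilon$, a rubric's true pairwise accuracy is $a$, and $a_{\text{obs}}$ is the agreement with the noisy labels that training actually sees. The paper does not report $\varepsilon$; the ten percent figure is Tyler's hypothetical.

```latex
\begin{aligned}
a_{\text{obs}} &= a\,(1 - \varepsilon) + (1 - a)\,\varepsilon
               = \varepsilon + (1 - 2\varepsilon)\,a, \\[4pt]
\varepsilon = 0.10,\ a = 1.0 &\;\Rightarrow\; a_{\text{obs}} = 0.90, \\
\varepsilon = 0.10,\ a = 0.5 &\;\Rightarrow\; a_{\text{obs}} = 0.50.
\end{aligned}
```

Under these assumptions, ten percent label noise caps the measurable accuracy of even a perfect rubric at ninety percent and shrinks the gap between any two rubrics by a factor of $1 - 2\varepsilon = 0.8$. If the noise is correlated with what the rubric currently rewards, the distortion could be worse than this symmetric case suggests.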

18:35Juniper: I'd add one more, which is what you might call the evaluation-cost problem. The paper makes a strong case that static benchmarks like RewardBench don't predict downstream policy quality. Fine. But that means within EvoLM's own design space — when they're picking hyperparameters, choosing the alternation frequency, deciding the step gap for temporal contrast — they can't trust the cheap evaluations either. The only ground truth is running the full RL pipeline and measuring the resulting policy. That's an expensive evaluation regime to build a research program on, and I think it's a real practical constraint on this whole line of work.

19:17Tyler: That's the deepest one, actually. If you can't trust your fast evaluations to predict your slow ones, your iteration cycle gets really long.

19:26Juniper: All that said, I want to come back to one more empirical result, because I think it's the most underappreciated finding in the paper. The rubrics transfer. You take a rubric generator that was trained against a one-point-seven-B Qwen judge, and at inference time you swap in a totally different judge: an eight-B Qwen, an OLMo model, a Mistral. The rubrics still work. In fact, a bigger judge applying the same trained rubrics gets a much higher accuracy than it would applying GPT-4.1's prompted rubrics, about a twenty-three-point gap on RewardBench-2.

20:05Tyler: And the cross-domain version is even more striking. They train the rubric generator on general-purpose Tulu data — basically nothing specialized. Then they evaluate the rubrics on HealthBench, which is medical, and on a research question-answering benchmark. The metric is: do these rubrics agree with rubrics that actual human experts wrote in medicine and research? EvoLM's rubrics, trained on general data, agree with expert human rubrics *better* than GPT-4.1 does. Fifty-eight versus fifty-three on medicine. Fifty-nine versus fifty-one on research.

20:44Juniper: Which suggests the rubric generator isn't just learning "what makes the small judge happy." It's learning something more general about evaluation structure that transfers across judges and across domains. The trained rubrics are an artifact you can take out of the system and use elsewhere.

21:04Tyler: And that's where I think the real long-term implication of this paper sits. We've been treating reward signals as opaque — a number out of a black box. EvoLM is one of a few recent papers pushing toward reward signals that are *structured*, *inspectable*, and *transferable*. The rubric is a learned object, but it's a readable learned object. You can look at it. You can edit it. You can hand it to a different evaluator. That's a different research surface than scalar reward.

21:38Juniper: There's also a bigger framing question here, which is the one about self-improvement. The pessimistic intuition is that you can't bootstrap — if a model already knew how to score answers correctly, it would already know how to produce good answers. The optimistic intuition is that knowing-how-to-evaluate and knowing-how-to-generate are different skills inside the same network, and the gap between them is real room for self-improvement. EvoLM is firmly in the optimistic camp, and I think it operationalizes that optimism more concretely than most prior work. The gap it's exploiting is specifically: the model knows enough to write criteria that distinguish good from bad answers, even when it can't reliably produce the best answer on every attempt. The rubric generator's knowledge of "what good looks like" is being squeezed out into explicit criteria, and those criteria then guide the policy toward producing better outputs.

22:41Tyler: It's a small amount of bootstrap per cycle, but it compounds. And because the judge is frozen, you can't accidentally compound errors at the evaluator end — the judge is the same one you started with.

22:54Juniper: One last thing worth flagging, because the paper is honest about it. There are some architectures where this doesn't quite work. They tested the same setup on Llama-3.1-8B and saw mode collapse on at least one benchmark: the model breaks down when it's forced to play both rubric-generator and policy roles simultaneously. OLMo-3-7B handles the dual role fine. Qwen3 handles it fine. Llama doesn't. Which suggests that some part of what makes this work depends on the base model's training in a way the authors don't fully understand yet.

23:31Tyler: And the easy fix there is a two-model configuration — one model is the rubric generator, a different model is the policy. They show this works comparably to the parameter-shared version. So architecturally the method is robust, but operationally on some bases you need two copies, which doubles your memory.

23:52Juniper: Which is a fair caveat to put on the bottom of any practitioner's checklist.

23:57Tyler: My wrap-up read on this paper, then, Juniper. The clever architectural move is real. The discriminative utility framing is genuinely useful — defining rubric quality as "does this rubric make a less-capable judge more accurate" turns evaluation training into something you can actually measure and optimize against without external supervision. The empirical payoff on downstream policy quality is real, and it's not small. The headline inversion — best benchmark, worst policy — is the most important thing in the paper for the field to absorb, because it suggests our current way of measuring reward models is misaligned with what we actually need them for in RL training. The places to push back are around the temporal-contrast assumption, which isn't audited, and around the rubric-evolution story being demonstrated mostly on tasks with verifiable answers.

24:50Juniper: And mine: I think the most interesting downstream question this opens up is what happens when the evaluator and the generator are co-evolving against each other in a tighter loop. EvoLM does this in fifty-step alternations with a frozen judge. There's a whole design space underneath where the judge could also evolve, where the rubric language could evolve, where you could have multiple rubric generators producing diverse criteria — and the paper has shown that the basic loop works. That's the kind of result that opens up a research program rather than closing one off.

25:25Tyler: And that's the right note to end on — a paper whose biggest contribution might be the questions it makes available, not just the ones it answers.

25:34Juniper: That's our episode on EvoLM. The show notes have a link to the paper and related materials — worth a read if you want to see the rectangle rubric in full, because it's the kind of thing where seeing the actual JSON makes the abstract argument concrete.

25:50Tyler: Thanks for listening to AI Papers: A Deep Dive.
