Episode 010 · May 02, 2026 · 22 min

When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL

Wang, Gui, Jin et al.

Training Stability

Concepts in this episode

Training Methods, AI Alignment, Evaluation & Benchmarks, Reinforcement Learning, Reasoning Collapse, Chain of Thought, Mutual Information, Reward Variance, SNR-Aware Filtering, Entropy Regularization, KL Divergence, Rollout Sampling, Agentic RL

About this episode

Paper: RAGEN-2: Reasoning Collapse in Agentic RL
Venue: arXiv:2604.06268
Year: 2026
Read the paper: arxiv.org/abs/2604.06268

A new paper argues that the standard health metric for RL training of language models — entropy — can't see one of its most damaging failure modes. Models can produce fluent, varied-looking reasoning that has quietly stopped depending on the input at all, and the field's go-to dial points the wrong direction. The fix is a one-line change to the training loop that, on average, uses less compute and gets better results.

What you'll take away

Why entropy conflates two independent axes of diversity — variation within a prompt and dependence on the prompt — and how that lets 'template collapse' run undetected
How a Shannon chain-rule decomposition turns the missing axis into a measurable quantity, and how 'cross-scoring' rollouts against other prompts in the batch makes it concrete
The Cauchy-Schwarz bound that mathematically caps the task gradient by the square root of reward variance — meaning low-variance prompts force regularizers to dominate the update
Why simply filtering out low-reward-variance prompts produced a 16-point absolute gain on Sokoban with PPO while cutting per-step compute by 26-41%
Where the method's gains are uneven, where the mutual-information proxy may be miscalibrated, and why the filter could risk a slow-motion exploration collapse
Why this reframes KL penalties and entropy bonuses as regulating the wrong axis: controlling noise instead of amplifying weak task signal

Chapters

00:00 A failure mode the metrics can't see
02:48 Two axes of diversity, not one
05:36 Cross-scoring and retrieval accuracy
08:24 Why entropy points the wrong way
11:13 The mechanism: low signal, fixed noise
14:01 The fix: filter on reward variance
16:49 Where the result holds up and where it doesn't
19:37 What this changes about RL training

References in this episode

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning — The original RAGEN paper from the same group, providing the agentic RL framework
The Curse of Recursion: Training on Generated Data Makes Models Forget — The canonical model-collapse paper that the episode explicitly invokes as a cous
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — A high-profile example of the GRPO-style RL-on-reasoning pipeline whose entropy-
The Curious Case of Neural Text Degeneration — Introduces nucleus sampling, the adaptive-threshold idea the episode points to a

Full transcript

0:00Jessica: Here's a strange situation. You're training a language model agent with reinforcement learning. You're watching the standard health metrics. Reward is climbing. Entropy — the measure of how varied the model's reasoning is — is holding steady. By every dial you have, things are fine. And then someone actually reads the model's chains of thought across different problems, and they're all the same. Different words. Same skeleton. The reasoning has detached from the question entirely.

0:31Tyler: And the unsettling part is that this can run for a long time without anyone noticing. The paper we're digging into is called "RAGEN-2: Reasoning Collapse in Agentic RL." It was posted to arXiv in early April, and we're recording a few weeks later. Quick note before we go further: this whole episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Tyler, and that was Jessica. We are both AI voices from ElevenLabs. The producer of this show isn't affiliated with either company. With that out of the way, what makes the paper interesting is that the failure mode it identifies is provably invisible to the metrics most labs use. Your monitoring tells you the plane's fine while the autopilot has stopped reading the instruments.

1:17Jessica: That's the puzzle the paper opens with. And the move they make to crack it is, on paper, embarrassingly small — a textbook information-theory identity that's been hiding in plain sight in this literature for years. Once you write it down, the failure mode is almost forced. There has to be a regime where the model looks healthy and isn't, and someone just had to look for it.

1:41Tyler: So let's unpack that. What is entropy actually measuring, and what does it miss?

1:46Jessica: Right. Entropy in this setting is a number that tells you how varied the model's outputs are. Sample the model a few times on the same prompt, look at the distribution of what comes out — is it always the same thing, or is there spread? High entropy means high spread, which has been read as: the model is still exploring, still thinking, hasn't collapsed onto a single answer. Low entropy means the opposite. It's been the standard health check for RL training of language models for years. And the trap is that it's only one number, and diversity of reasoning actually has two independent axes that this single number conflates. Imagine you're a teacher grading a class on essays about different historical events. You measure essay quality by a single dial: vocabulary diversity. One student writes wildly varied prose — every essay flows differently, big vocabulary range. But it turns out every essay they write, regardless of the prompt, is secretly about the French Revolution. Another student writes in a steadier voice, but each essay is genuinely about the event you assigned. The vocabulary-diversity dial says the first student is doing great. What you actually wanted was two measurements. Variety within an essay. And dependence on the topic.

3:06Tyler: That's the conflation. And it maps onto the paper's frame really cleanly. There's variety within a prompt — does the model give different answers when you sample it twice on the same input? — and there's variety across prompts — does the reasoning actually look different when the prompt is different? Entropy just adds those together. The information theory underneath is just Shannon's chain rule: total entropy splits into two pieces. The within-input piece, and the cross-input piece, which is mutual information between the input and the model's output. Standard metrics see only the within-input piece. Mutual information is the missing axis.
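For readers who want the identity written out, the decomposition described here can be stated as follows. The notation is an assumption for illustration and is not taken from the paper: X is the prompt and Z is the sampled reasoning trace.

```latex
H(Z) \;=\; \underbrace{H(Z \mid X)}_{\text{within-input diversity}} \;+\; \underbrace{I(X; Z)}_{\text{cross-input dependence}}
```

Per-prompt sampling entropy is an estimate of the first term only. Template collapse, in this notation, is the regime where H(Z | X) stays high while I(X; Z) falls toward zero.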

3:47Jessica: And once you have both axes, you get a clean two-by-two of training regimes. The good corner is high on both: the model varies its reasoning within a problem and the reasoning is genuinely different across problems. The bad corner is low on both — total collapse, model just outputs the same thing always. The interesting failure is the corner where within-input diversity is high but mutual information is low. Reasoning that varies fluently — but doesn't depend on the input. The authors call that template collapse. And the entropy metric the field has been relying on cannot distinguish that corner from the everything-is-fine corner.

4:32Tyler: Which is striking. Because a huge amount of RL stability work has been calibrated on entropy.

4:38Jessica: A nice analogy I keep coming back to is a customer-service chatbot that has learned a single empathic-sounding script. "I understand your frustration, let me look into that, here's what I can do." Whether you're asking about a refund, a shipping delay, or a billing error, you get the same script with different filler. The wording varies. The response doesn't actually engage with your specific problem. From a fluency check, it looks fine. From the user's perspective, it's useless. That's the failure mode the paper is trying to detect. And the detection problem is non-trivial. You can't directly compute mutual information on token sequences — there's no closed form. So how do they actually measure it? The trick is what they call cross-scoring. Imagine you've just done a training batch — say sixty-four prompts, eight rollouts per prompt. For every reasoning trace in that batch, you ask: how plausible does this trace look under each of the sixty-four prompts? Just compute the model's likelihood of that trace conditional on each prompt. You end up with a big matrix of scores. If the reasoning is genuinely input-driven, then a trace from prompt seventeen should score way higher under prompt seventeen than under any other prompt. The model should "know" which prompt it's responding to. If template collapse has set in, the trace looks about as plausible under any prompt — because it's not actually responding to the specific prompt anymore. The gap collapses.

6:13Tyler: And from that they get a really vivid, concrete diagnostic: retrieval accuracy. OK, it goes like this. Given a reasoning trace and the batch of sixty-four prompts, can you figure out which prompt produced it, just by picking the one the model thinks fits best? In a healthy run, you can do this near-perfectly. Under collapse, your retrieval accuracy drops to chance, which with sixty-four prompts is about one-point-six percent. The model is producing fluent, varied-looking reasoning, and you cannot tell from the reasoning which question it was answering. One-point-six percent. Statistical noise.
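The retrieval check can be sketched in a few lines. This is a minimal illustration and not the paper's code: the score-matrix layout, the function names, and the toy numbers are all assumptions. It takes `scores[i][j]` to be the model's log-likelihood of trace `i` conditioned on prompt `j`, and `source_prompt[i]` to be the index of the prompt that actually produced trace `i`.

```python
def retrieval_accuracy(scores, source_prompt):
    """Fraction of traces whose best-scoring prompt is the prompt that produced them.

    scores[i][j]: log-likelihood of trace i under prompt j (hypothetical layout).
    source_prompt[i]: index of the prompt that generated trace i.
    """
    hits = 0
    for row, true_j in zip(scores, source_prompt):
        best_j = max(range(len(row)), key=row.__getitem__)  # argmax over prompts
        hits += int(best_j == true_j)
    return hits / len(scores)


def chance_level(num_prompts):
    """Accuracy expected when traces carry no information about their prompt."""
    return 1.0 / num_prompts


# Toy data: 3 prompts, one trace per prompt (made-up numbers).
healthy = [[-1.0, -9.0, -8.0],
           [-7.0, -2.0, -9.0],
           [-8.0, -9.0, -1.5]]
# Collapsed: every trace scores the same way no matter which prompt produced it.
collapsed = [[-3.0, -3.1, -2.9],
             [-3.0, -3.1, -2.9],
             [-3.0, -3.1, -2.9]]
source = [0, 1, 2]

healthy_acc = retrieval_accuracy(healthy, source)
collapsed_acc = retrieval_accuracy(collapsed, source)
chance_64 = chance_level(64)  # 1/64 = 0.015625, the "about 1.6 percent" figure
```

In the toy data the healthy matrix is retrieved perfectly, while the collapsed matrix lands at one in three, which is chance for three prompts.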

6:55Jessica: There's also a continuous version they prefer for tracking training — a smoothed estimate of that same gap between matched score and average score, normalized for stability. They call it the mutual-information Z-score with exponential moving average. And the headline empirical result is that this proxy correlates positively with final task performance — Spearman around plus zero-point-three-nine. Entropy correlates negatively. About minus zero-point-one-four.
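The episode does not give the exact formula for the continuous proxy, so the following is only one plausible reading of "the gap between matched score and average score, normalized for stability," with an exponential moving average on top. The function and class names, the `decay` parameter, and the choice to normalize by the per-trace standard deviation are assumptions.

```python
import statistics


def mi_zscore(scores, source_prompt, eps=1e-8):
    """Mean over traces of (matched score - mean score) / std of that trace's scores.

    scores[i][j]: log-likelihood of trace i under prompt j (hypothetical layout).
    """
    z_values = []
    for row, true_j in zip(scores, source_prompt):
        mean = statistics.fmean(row)
        std = statistics.pstdev(row)
        z_values.append((row[true_j] - mean) / (std + eps))
    return statistics.fmean(z_values)


class EMA:
    """Exponential moving average for smoothing a per-step metric."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * x
        return self.value


# Toy score matrices (made-up numbers), one trace per prompt.
healthy = [[-1.0, -9.0, -8.0],
           [-7.0, -2.0, -9.0],
           [-8.0, -9.0, -1.5]]
collapsed = [[-3.0, -3.1, -2.9],
             [-3.0, -3.1, -2.9],
             [-3.0, -3.1, -2.9]]
source = [0, 1, 2]

z_healthy = mi_zscore(healthy, source)      # clearly positive
z_collapsed = mi_zscore(collapsed, source)  # approximately zero

tracker = EMA(decay=0.5)
smoothed = [tracker.update(z) for z in (z_healthy, z_collapsed)]
```

A falling smoothed value would be the early-warning signal described above; the decay rate trades responsiveness against noise.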

7:26Tyler: That's the reversal that should make people uncomfortable. Entropy doesn't just fail to predict performance. It points the wrong way. If you were tuning your training to keep entropy healthy, you might be optimizing in exactly the wrong direction. Now, fair caveat — plus zero-point-three-nine is a moderate correlation, not overwhelming. But the sign flip alone is the news. The community-standard health check is anti-correlated with the thing you actually care about.

7:59Jessica: And there's one more property of the mutual-information proxy that makes it operationally useful. It degrades before task performance does. By the time the success rate starts dropping, the proxy has been quietly falling for a while. So it's an early-warning signal — you can catch the collapse mid-formation rather than after the agent has already gotten worse. There's a small detail in the paper that I think makes the failure mode visceral. The runs that exhibit template collapse maintain near-perfect format compliance the whole time. The model is producing perfectly well-formed reasoning blocks — open-tag, content, close-tag, action — every single rollout. So you can't shortcut the diagnostic with a syntax check. The wrapping is fine. The interior has gone generic.

8:51Tyler: OK. So that's the diagnostic. The other half of the paper is — why does this even happen? What is it about RL training on language models that produces this specific failure mode? And the answer is satisfying because the math turns out to say exactly what you'd hope it would. Jessica, stay with me on this — I want to land it as one piece. The way modern RL on language models works: you don't update on a single rollout. You sample several rollouts from the same prompt, look at how the rewards differ across those rollouts, and the differences are what drive the update. If one rollout got reward zero-point-nine and another got zero-point-one, that contrast tells the model something — do more of what looked like the high one, less of the other. But what if all your rollouts on a prompt got nearly the same reward? Say they all got zero-point-five, or all got zero. Then there's no contrast. The task signal on that prompt is, in a real and quantifiable sense, nearly zero.
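The "no contrast, no signal" point can be made concrete with a mean-baseline advantage, the kind of group-relative baseline used by GRPO-style methods. This is a simplified sketch; other algorithms estimate advantages differently, and the function name is hypothetical.

```python
import statistics


def group_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its prompt's group."""
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


# Rollouts on one prompt with real contrast between outcomes.
contrast = group_advantages([0.9, 0.1, 0.5, 0.5])  # roughly [0.4, -0.4, 0.0, 0.0]

# Rollouts that all earned the same reward: every advantage is exactly zero,
# so this prompt contributes nothing to the task gradient.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])
```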

9:57Jessica: And meanwhile, the regularizers don't care.

10:00Tyler: Right. The regularizers — the KL divergence penalty pulling you toward a reference model, the entropy bonus keeping you from collapsing — are computed without reference to reward at all. They're input-agnostic by construction. They don't shrink when your reward signal shrinks. They stay exactly the same size. So the picture is: when reward variance on a prompt is high, the task gradient is loud, and it dominates. When reward variance on a prompt is low, the task gradient goes quiet — and the regularizers are now the entire update. The model is being pulled toward generic, input-agnostic patterns by gradients that, by design, don't know or care what the input is. The analogy I keep coming back to is tuning a radio. The station you want is faint. There's some baseline static from your radio's own circuits that never goes away. When the station is broadcasting clearly, you hear through the static. When the station goes quiet, all you hear is static — and now your tuning knob is responding to the static instead of the music. Reward variance is how loud the station is broadcasting on a given prompt. When it goes quiet, you tune to noise.

11:17Jessica: And what's nice is that the paper makes that intuition into an actual mathematical bound. Using a Cauchy-Schwarz argument — which is just a couple of lines — they show that the size of the task gradient on a given prompt is capped above by the square root of the reward variance on that prompt. It's not a tendency. It's a ceiling. Low reward variance mathematically forces the task gradient to be small. There's no escaping it with better hyperparameters.
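One standard way to write such a bound is sketched below. It assumes a policy-gradient estimator with a mean-reward baseline; the notation is an assumption for illustration, and the paper's exact statement and constants may differ.

```latex
g(x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\big(R(x,y) - \bar{R}(x)\big)\,\nabla_\theta \log \pi_\theta(y \mid x)\right],
\qquad \bar{R}(x) \;=\; \mathbb{E}_{y}\big[R(x,y)\big]

\|g(x)\|
\;\le\; \mathbb{E}_{y}\!\left[\,\big|R(x,y) - \bar{R}(x)\big|\;\big\|\nabla_\theta \log \pi_\theta(y \mid x)\big\|\,\right]
\;\le\; \sqrt{\operatorname{Var}_{y}\!\big[R(x,y)\big]}\;\sqrt{\mathbb{E}_{y}\!\left[\big\|\nabla_\theta \log \pi_\theta(y \mid x)\big\|^{2}\right]}
```

The first step moves the norm inside the expectation, and the second is Cauchy-Schwarz. The second factor does not depend on the reward, so as the reward variance on a prompt goes to zero, the task gradient on that prompt must go to zero with it.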

11:49Tyler: Which is the move that turns a story into a mechanism. And then they go check the actual gradients in their training runs. They sort prompts into six buckets by reward variance. In each bucket, they measure the magnitude of the task gradient and the magnitude of the regularization gradient. The picture is exactly what the bound predicts. Task gradient norm climbs monotonically with reward variance. Regularization gradient norm is flat across all six buckets. In the lowest-variance bucket, the regularization gradient is essentially the entire update.
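The same pattern can be reproduced on a toy problem. The sketch below is not the paper's experiment: it uses a four-action softmax policy with made-up logits and rewards, where the task gradient and an entropy-bonus gradient can be computed exactly, and it checks the task gradient against a Cauchy-Schwarz cap of the form sqrt(reward variance) times the root mean squared score norm.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_gradient_norm(probs, rewards):
    # Exact gradient of expected reward w.r.t. the logits: g_k = pi_k * (r_k - E[r]).
    mean_r = sum(p * r for p, r in zip(probs, rewards))
    return math.sqrt(sum((p * (r - mean_r)) ** 2 for p, r in zip(probs, rewards)))


def cauchy_schwarz_cap(probs, rewards):
    # sqrt(Var[r]) * sqrt(E||grad log pi||^2). For softmax logits the second
    # factor equals sqrt(1 - sum_k pi_k^2).
    mean_r = sum(p * r for p, r in zip(probs, rewards))
    var_r = sum(p * (r - mean_r) ** 2 for p, r in zip(probs, rewards))
    return math.sqrt(var_r) * math.sqrt(1.0 - sum(p * p for p in probs))


def entropy_bonus_gradient_norm(probs):
    # Gradient of policy entropy w.r.t. the logits: -pi_k * (log pi_k + H).
    # No reward appears anywhere in this expression.
    entropy = -sum(p * math.log(p) for p in probs)
    return math.sqrt(sum((p * (math.log(p) + entropy)) ** 2 for p in probs))


probs = softmax([1.0, 0.0, 0.0, -1.0])  # made-up logits
spreads = [0.0, 0.1, 0.4]               # how far rewards deviate from 0.5

task_norms, caps = [], []
for s in spreads:
    rewards = [0.5 + s, 0.5 - s, 0.5, 0.5]
    task_norms.append(task_gradient_norm(probs, rewards))
    caps.append(cauchy_schwarz_cap(probs, rewards))

# The regularizer's gradient is identical for every reward spread.
regularizer_norm = entropy_bonus_gradient_norm(probs)
```

As the reward spread shrinks, the task gradient norm shrinks with it and stays under the cap, while the regularizer's gradient norm does not move. At zero spread the regularizer is the whole update.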

12:26Jessica: That's the mechanism in one sentence. Low signal, fixed noise, noise wins.

12:31Tyler: And the fix follows so directly from the diagnosis that it almost feels too simple. Rank prompts by reward variance. Throw away the prompts where rollouts all got similar rewards. Only update on the high-variance ones. They use an adaptive threshold — same idea as nucleus sampling, but applied to prompts ranked by variance instead of tokens ranked by probability. They call it SNR-aware filtering.

12:57Jessica: It's the teacher who only spends class time on problems where the students disagreed. If everyone got the same answer — right or wrong — there's nothing to learn from discussing it. The cases where rollouts converge on the same reward are the cases where you have nothing to teach the model on this update, because there's no contrast to learn from. So skip them.
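A minimal sketch of a filter in this spirit follows. The episode only says the threshold works like nucleus sampling applied to prompts ranked by variance, so the rule below is an assumption: keep the highest-variance prompts until they account for a fraction `top_p` of the batch's total reward variance. The function name, the `top_p` parameter, and the toy batch are hypothetical.

```python
import statistics


def snr_filter(rewards_per_prompt, top_p=0.9):
    """Return indices of prompts to keep for the update.

    rewards_per_prompt[i]: list of rollout rewards for prompt i.
    Keeps the smallest set of highest-variance prompts whose summed variance
    reaches top_p of the batch total (assumed rule, by analogy with top-p sampling).
    """
    variances = [statistics.pvariance(r) for r in rewards_per_prompt]
    total = sum(variances)
    if total == 0.0:
        return []  # no contrast anywhere in the batch

    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        if variances[i] == 0.0:
            break  # remaining prompts carry no contrast at all
        kept.append(i)
        mass += variances[i]
        if mass >= top_p * total:
            break
    return sorted(kept)


# Toy batch: 4 prompts, 4 rollouts each, binary rewards (made-up numbers).
batch = [
    [1.0, 0.0, 1.0, 0.0],  # variance 0.25: rollouts disagree
    [0.0, 0.0, 0.0, 0.0],  # variance 0: always fails
    [1.0, 1.0, 1.0, 1.0],  # variance 0: always succeeds
    [1.0, 0.0, 0.0, 0.0],  # variance 0.1875
]
kept = snr_filter(batch, top_p=0.9)  # keeps prompts 0 and 3
kept_ratio = len(kept) / len(batch)
```

Note that both the always-fail and the always-succeed prompt are dropped, which matches the "right or wrong" point in the classroom analogy. Tracking `kept_ratio` over training gives the kind of keep-rate curve discussed later in the episode.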

13:21Tyler: And what makes the paper convincing isn't just that the filter works on average. It's the quartile ablation, which is the cleanest causal experiment in here.

13:32Jessica: I want to walk through this one carefully because it's the move that takes the argument from correlation to causation. They take a training setup. Sort all the prompts by reward variance. Divide into four quartiles — lowest-variance quartile, second, third, highest. Then train four separate models from the same starting point, each updating only on its own quartile. Same training budget. Same number of prompts. Same everything except which slice of the variance distribution you saw. Performance climbs monotonically from lowest quartile to highest. On the Sokoban environment, the lowest-variance quartile gets you about eleven percent task performance. The highest-variance quartile gets you about twenty-one percent. And the mutual information metric tracks the same way — the prompts you trained on actually shaped how input-driven the model's reasoning ended up.

14:30Tyler: That's the experiment that closes the loop. It's not just that high-variance prompts correlate with better outcomes. It's that if you literally only train on high-variance prompts, you get better outcomes. The signal really is in the variance, not in the count.

14:47Jessica: And the single most striking number in the main results table, for me, is on Sokoban with PPO. Baseline gets you to twelve-point-nine percent task success. Adding the filter takes you to twenty-eight-point-nine. That's a sixteen-point absolute jump from a one-line change to the training loop. And the filter reduces per-step compute by something like twenty-six to forty-one percent. Because you're doing fewer updates. You're literally doing less work and getting more performance.

15:20Tyler: Which is the kind of result that should make any practitioner suspicious. Free lunches don't usually exist in this field. So let me push on it. There are a few things worth flagging. The first is the modest correlation magnitudes we mentioned earlier. The mutual-information proxy correlates plus zero-point-three-nine with task performance. Entropy correlates minus zero-point-one-four. The sign flip is genuinely interesting. But plus zero-point-three-nine is not "this metric explains performance." It's "this is a meaningfully better signal than what we had." The paper at one point describes it as roughly twice as reliable as entropy, which in a comparison-of-magnitudes sense is true but understates that the proxy alone explains a fraction of the variance.

16:13Jessica: That's fair, Tyler. And the gains across the experimental matrix are uneven. The headline cells are dramatic: sixteen points on Sokoban with PPO, and on the vision-language version of FrozenLake there's a gain of nearly sixty points. But there are also cells where the filter does nothing, or slightly hurts. Some of the Qwen zero-point-five-billion cells go negative. And the FrozenLake vision-language number, that big plus-fifty-nine, is off a baseline of about nineteen-and-a-half percent. Small absolute changes look enormous when the baseline is that low.

16:51Tyler: The paper doesn't dwell on the rougher cells, and a careful practitioner shouldn't read "consistently improves" as "you'll see big gains on your task." The variance across settings is large enough that you can't predict the magnitude of benefit on a new problem.

17:09Jessica: There's a more subtle worry about the diagnostic itself worth voicing too. The mutual-information proxy is a self-likelihood game. It measures whether the model itself thinks a given reasoning trace fits a given prompt. If the model's beliefs about which reasoning fits which prompt are themselves miscalibrated — which can happen in early training, or under collapse itself — the proxy could lag the true phenomenon. It's not validated against an external held-out judge. So there's a question of whether you're measuring the thing you really want, or a clever internal-consistency check.

17:49Tyler: And there's a longer-horizon worry about the fix itself. Filtering throws out data. As the model gets better, more prompts become low-variance — because the model is solving them consistently. That means the filter becomes more selective over time. The paper does show a "kept ratio" plot where the keep rate shrinks as training progresses. The optimistic read is that the filter is adapting — focusing on the prompts that still have something to teach. The pessimistic read is that this is a slow-motion exploration collapse, where the model eventually starves itself of informative signal because everything looks low-variance once it's good enough. The paper doesn't fully resolve which read is right.

18:37Jessica: And the authors are upfront about a related caveat — in environments with very high stochasticity, where rewards are inherently noisy regardless of effort, reward variance stops being a useful signal proxy. They show this on the eighty-to-one-hundred-percent noise version of FrozenLake, where the filter loses its advantage. Which I actually think is to their credit. The mechanism predicts exactly when the fix should fail, and then it fails there.

19:07Tyler: One more pushback worth voicing: the Cauchy-Schwarz bound is an upper bound. It says low reward variance forces a weak task gradient. It doesn't say high reward variance guarantees a strong one. In their settings the bound looks roughly tight, which is why the empirical buckets line up with the prediction. But the theory itself doesn't force tightness.

19:29Jessica: One connection the paper flags that I find genuinely useful, Tyler — it reframes what regularizers like KL divergence and entropy bonuses are doing. The dominant view has been that these are noise-control mechanisms — they keep the model from drifting. The paper's framing is that they're operating on the wrong axis. They mostly move within-input diversity without moving cross-input dependence. They don't fix template collapse because template collapse is happening on a different axis from the one they regulate.

20:03Tyler: Which means a lot of the careful tuning of KL coefficients and entropy bonuses that the field has spent years on is — by this paper's lights — addressing one symptom while a different problem runs alongside it.

20:17Jessica: There's a connection to the broader model-collapse literature that's worth surfacing too. Work on what happens when generative models are trained recursively on their own outputs across generations. Distributions narrow. Outputs become generic. The pattern this paper describes is a cousin of that, in a different setting. Same shape: a system narrowing onto an input-agnostic mode. Different mechanism: instead of recursive training data, it's the inner dynamics of policy gradient updates under low signal.

20:50Tyler: And the move that generalizes beyond this paper is the methodological one. Take a single-number health metric. Decompose it into two information-theoretic components. Notice that one component can fail silently while the other looks fine. That move applies anywhere you've been measuring diversity of model outputs with one number. It's a useful pattern to have, even if you never train an RL agent.

21:18Jessica: For practitioners, the immediate message is small and concrete. Add the mutual-information proxy to your monitoring. It's free — it reuses rollouts you already did, no extra inference, no extra model. And try the filter. It's a one-line change to your training loop that, on average, removes compute and improves results.

21:39Tyler: And the conceptual takeaway is the one that'll stick longer. The field has spent years on noise-control. This paper points out that you also need signal-amplification — because when the task signal is weak, no amount of careful noise control prevents the noise from running the show.

21:59Jessica: This episode was produced on May second, twenty-twenty-six. The paper is from a group at Northwestern, Stanford, Microsoft, Imperial, and a few other places. The first author is Zihan Wang, with Manling Li as senior author and a long list of collaborators including Yejin Choi and Fei-Fei Li.

22:18Tyler: Show notes have a link to the paper and related materials. Worth a read if this episode caught you. Thanks for listening to AI Papers: A Deep Dive.
