Episode 009 · May 02, 2026 · 23 min

How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Limozin, Durech, Hoefler et al.

LLM Post-training
Paper: SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Venue: arXiv:2604.23747
Year: 2026
Read the paper: arxiv.org/abs/2604.23747

An ETH Zurich group sat down to reproduce the hot new mixed-policy methods for training language models to reason, and found their plain baseline beating the published baselines by five points. Pulling the thread led to two silent bugs in widely used training libraries that had been deflating baselines across an entire subfield for over a year, and to an uncomfortable question about how much of recent benchmark progress is real.

What you'll take away

  • How a misplaced branch in DeepSpeed's CPU-offloading code silently discarded most micro-batch gradients during accumulation, shrinking effective training signal without any warning
  • Why a 'mean-of-means' loss aggregation bug systematically mis-weights updates when response lengths vary, and how it migrated from pretraining code where it was harmless
  • A clean four-number staircase that attributes a five-point baseline gap almost entirely to the optimizer bug, with the loss bug contributing under a point
  • Why corrected SFT-then-RL beats every published mixed-policy method on math benchmarks, by 3.8 points on a model with built-in math priors and a striking 22 points on Llama 3.1 8B, at roughly half the FLOPs
  • The structural lesson: when a whole subfield's baselines flow through the same library, independent replication becomes illusory, and framework diversity functions as epistemic insurance
  • Where the paper's claims have real edges — single-seed reproductions, math-only benchmarks, and the open question of whether mixed-policy methods could still help on top of a properly trained model

Chapters

  1. 00:00 The reproduction that wouldn't reproduce
  2. 02:52 The DeepSpeed gradient accumulation bug
  3. 05:45 Why only the baselines were sick
  4. 08:38 The mean-of-means loss bug
  5. 11:30 The four-number staircase
  6. 14:23 Corrected baselines flip the field's conclusions
  7. 17:16 Where the paper's claims have edges
  8. 20:08 The structural lesson about shared infrastructure

0:00 Hope: Picture this. A research group at ETH Zurich sits down to reproduce a hot new method for training language models to reason. They build the comparison cleanly (same model, same data, same evaluation) and their plain-vanilla baseline keeps beating the supposedly weaker baseline reported in the published paper. Not by a fraction of a point. By five points. Six on some benchmarks. They pull the thread, and what comes out the other end is a silent bug in a widely used training framework that has been quietly invalidating an entire wave of published comparisons for over a year.

0:41 Tyler: Posted to arXiv at the end of April, recorded about a week later. What you're hearing is AI-generated: Hope and I are AI voices from Eleven Labs, and the script comes from one of Anthropic's models. Neither company is involved in producing the show. The paper is "SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning," from Limozin, Durech, Hoefler, Schlag, and Pyatkin at ETH Zurich and the Allen Institute. And it is, mechanically, a debugging story, but its actual subject is how shared infrastructure can quietly warp the conclusions of a whole subfield.

1:22 Hope: Let me set up the fight, because the bug only matters once you see what it was distorting. The standard recipe for teaching a language model to reason, the one used last year, the one most open-source pipelines follow, has two stages. First, supervised fine-tuning, or SFT, where you show the model worked-out solutions and have it imitate them token by token. Then reinforcement learning, where you let the model generate its own attempts, score them against the right answer, and nudge the model toward attempts that worked. Two stages, in order. Boring. Stable. Effective. Over the past year, a small wave of papers pushed back on that recipe. LUFFY, SRFT, and several others: methods that said, don't separate the stages, blend them. Mix expert demonstrations with the model's own attempts inside a single training loop. The intuition was reasonable. Pure RL struggles when the model rarely produces a correct answer, and pure imitation just teaches imitation. So mix the signals, and you should get the best of both worlds. Each of these papers reported beating the boring baseline by a meaningful margin. The field was starting to treat mixed-policy methods as the new state of the art.

2:47 Tyler: And this is where the ETH group started. They weren't trying to debunk anything. They were trying to reproduce the methods to build on them. But they ran their SFT baseline using one of the standard frameworks, and they ran a separate SFT baseline using a different framework: same data, same model, same hyperparameters. The two should have agreed. They didn't. The first scored about 48 on average across math benchmarks. The second scored about 54. Five-and-a-half points just from switching libraries.

3:23 Hope: Right, and it's worth pausing on what that gap means. These are not different methods. This is the same training procedure, on the same data, expected to produce the same model. A five-point gap from infrastructure alone is enormous. The cleaner of the two implementations, the one scoring 54, was already beating the published mixed-policy methods. Which raised an uncomfortable question: maybe the wins those papers were claiming weren't really wins. So the team went hunting. The first bug they found is in DeepSpeed, the library that handles memory optimization for huge models. To run a 7-billion-parameter model on academic hardware, you need tricks, and one of the standard tricks is gradient accumulation. Your ideal batch is bigger than your GPU memory allows, so you process the batch in chunks, called micro-batches. You sum up the gradients from each chunk. Then, after you've seen all of them, you take a single optimizer step using the accumulated total. Mathematically, this should be identical to processing the full batch at once. It's the standard workaround when you're memory-constrained.
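
The equivalence Hope describes can be checked on a toy problem. The sketch below is only an illustration, not code from the paper or from DeepSpeed: the scalar model, the made-up data, and the helper name `grad_sum` are all assumptions chosen to keep the arithmetic exact.

```python
# Toy model: one scalar weight w, loss = sum over examples of (w * x - y)^2.
def grad_sum(w, examples):
    """Gradient of the summed squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)

w = 0.5
batch = [(float(i), 2.0 * i + 1.0) for i in range(1, 17)]  # 16 made-up examples

# Full-batch gradient, computed in a single pass.
full_grad = grad_sum(w, batch)

# The same batch split into 8 micro-batches of 2, gradients summed chunk by chunk.
micro_batches = [batch[i:i + 2] for i in range(0, len(batch), 2)]
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += grad_sum(w, micro_batch)

# The two totals agree, so one optimizer step on `accumulated`
# is the same step as one on `full_grad`.
print(full_grad, accumulated)
```

Summing works here because the loss is a sum over examples. If the loss is a mean, each micro-batch gradient also has to be scaled so the accumulated total matches the full-batch mean.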

4:42 Tyler: And the bug is, it turns out it isn't identical.

4:45 Hope: It's not identical, in a very specific way. DeepSpeed has a feature called CPU offloading. You keep the optimizer's bookkeeping in regular system RAM instead of GPU memory, because there's just not enough VRAM. To do an update, the gradients have to be copied from the GPU over to the CPU. And the copy step lives inside a chunk of code that was supposed to run after every micro-batch. But there's a misplaced branch in the code structure: the copy only fires when the micro-batch counter equals zero. In other words, only the first micro-batch's gradients ever get sent over. The rest accumulate their gradients on the GPU correctly, and then, when the optimizer reaches over from the CPU, it only sees the first chunk's contribution. The other seven, or fifteen, or thirty-one chunks are silently discarded.
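
A toy simulation makes the failure mode concrete. This is a hypothetical sketch of the behavior as described in the episode, not DeepSpeed's actual code: `run_accumulation`, the `copy_only_first` flag, and the two running totals standing in for GPU and CPU buffers are all invented for illustration.

```python
def grad_sum(w, examples):
    """Gradient of the summed squared error (w * x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)

def run_accumulation(micro_batches, w, copy_only_first):
    """Return (gradient accumulated on the 'GPU', gradient the 'CPU' optimizer sees)."""
    gpu_total = 0.0
    cpu_total = 0.0
    for index, micro_batch in enumerate(micro_batches):
        g = grad_sum(w, micro_batch)
        gpu_total += g                      # accumulation on the GPU is always correct
        if not copy_only_first or index == 0:
            cpu_total += g                  # the buggy path copies only micro-batch 0
    return gpu_total, cpu_total

# Eight identical micro-batches, so each contributes the same gradient.
micro_batches = [[(1.0, 3.0), (2.0, 5.0)] for _ in range(8)]

healthy_gpu, healthy_cpu = run_accumulation(micro_batches, 0.5, copy_only_first=False)
buggy_gpu, buggy_cpu = run_accumulation(micro_batches, 0.5, copy_only_first=True)

print(healthy_cpu)            # -168.0
print(buggy_cpu)              # -21.0
print(buggy_cpu / buggy_gpu)  # 0.125
```

Nothing in the buggy path raises an error. The optimizer simply steps on one eighth of the gradient, which is why the only visible symptom is a smaller gradient norm.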

5:43 Tyler: Here's the analogy I keep coming back to. You're trying to weigh a load that's too heavy for your scale, so you split it into eight portions and plan to add the readings. The scale reads each portion correctly. But the logbook the accountant looks at only ever records the first reading. You thought you'd weighed the whole load. You actually weighed one-eighth of it, and crucially, you don't get a warning. The numbers downstream just look like the numbers from a smaller batch. Training keeps going. Loss keeps going down. Nothing screams.

6:20 Hope: That's exactly right, Tyler, and the symptom on the gradient-norm plot is consistent. The buggy run shows substantially smaller gradient norms throughout training, because there's just less signal flowing into each step. The bug was introduced in DeepSpeed in September of 2024, in a change that was meant to be a refactor. It sat in production for over a year. And because DeepSpeed is the engine under three of the most popular training libraries, every academic group running SFT through any of those libraries with CPU offloading turned on inherited the same silent failure.

7:03 Tyler: Hope, before you move to the second bug, I want to make sure the listener has the geometry of this in their head. Because the asymmetry is what makes the whole story work.

7:14 Hope: Please, set it up.

7:16 Tyler: The mixed-policy methods, LUFFY, SRFT, all of them, were implemented in a different framework called verl. Verl doesn't use DeepSpeed for its optimizer. It uses a different memory-sharding approach, which doesn't have this bug. So the new methods were running on healthy infrastructure. But the SFT baselines they compared themselves against were built using libraries that all run on top of DeepSpeed. The new methods were healthy. The baseline was sick. And every published comparison was structured around that asymmetry without anyone realizing it.

7:56 Hope: It's like two runners lining up for a race, and only one of them has their shoelaces tied together. The race results say "new runner faster." But the laces were tied inside the shoe, where no one looks.

8:11 Tyler: Right. And the wins were real, in the sense that the numbers in the tables were what the runs produced. They just weren't measuring what everyone thought they were measuring.

8:23 Hope: Okay, second bug. Smaller in magnitude, but interesting because it's a different class of error. This one lives in the SFT training code itself, in how it computes the loss when training is distributed across multiple GPUs. Each GPU computes a local average loss across its mini-batch. Then those local averages get averaged together across GPUs. Mean of means. That's the bug.

8:49 Tyler: And mean of means doesn't equal the true mean.

8:53 Hope: Not when the chunks have different sizes. Here's the school analogy. You want the average test score across a whole school. The right way is to add up every student's score and divide by the total number of students. The wrong way is to compute each classroom's average and then average those classroom averages. If classes have different sizes, the small ones get over-weighted. A class of five averaging eighty percent gets the same vote as a class of fifty averaging sixty percent. The honest average across all fifty-five students is closer to sixty-two. The mean of means says seventy. It's just wrong. In SFT, the "students" are response tokens, the tokens the model is actually being graded on, after the prompt. And mini-batches contain wildly different numbers of response tokens, because prompts and responses are different lengths. So mean-of-means systematically mis-weights every step of training. The fix is unglamorous: sum the tokens, sum the losses, divide once at the end.
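
The classroom arithmetic is short enough to check directly. The snippet uses only the numbers from the analogy; the variable names are illustrative.

```python
# (number of students, class average) for the two classrooms in the analogy.
classes = [(5, 80.0), (50, 60.0)]

# Wrong way: average the classroom averages, one vote per classroom.
mean_of_means = sum(avg for _, avg in classes) / len(classes)

# Right way: total points divided by total students.
total_points = sum(n * avg for n, avg in classes)
total_students = sum(n for n, _ in classes)
true_mean = total_points / total_students

print(mean_of_means)        # 70.0
print(round(true_mean, 1))  # 61.8
```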

10:05 Tyler: And the lineage of this bug is interesting. It came from pretraining code, where it didn't matter. Pretraining packs data so every batch has exactly the same number of active tokens, and mean of means happens to equal the true mean. The same code got copy-pasted into SFT codebases, and SFT doesn't pack the same way. So the equivalence quietly broke. There were even two separate disclosures about this same class of bug in mainstream code in late 2024: Daniel Han at Unsloth, and the Hugging Face team, both flagged versions of it. This paper's contribution is showing it's still alive in widely used SFT libraries and that it materially shifts the SFT-for-reasoning numbers.
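
Why the same aggregation is harmless under packing but wrong for SFT can be sketched with per-GPU lists of token losses. This is a minimal illustration under stated assumptions: the loss values are made up, and the function names `mean_of_means` and `token_weighted_mean` are hypothetical rather than taken from any library.

```python
def mean_of_means(per_gpu_losses):
    """Average each GPU's token losses locally, then average the local averages."""
    local_means = [sum(losses) / len(losses) for losses in per_gpu_losses]
    return sum(local_means) / len(local_means)

def token_weighted_mean(per_gpu_losses):
    """Sum all token losses, sum all token counts, divide once at the end."""
    total_loss = sum(sum(losses) for losses in per_gpu_losses)
    total_tokens = sum(len(losses) for losses in per_gpu_losses)
    return total_loss / total_tokens

# Pretraining-style packing: every GPU holds the same number of active tokens.
packed = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]

# SFT-style batches: one GPU has 1 response token, the other has 7.
ragged = [[4.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]

print(mean_of_means(packed), token_weighted_mean(packed))  # 2.25 2.25
print(mean_of_means(ragged), token_weighted_mean(ragged))  # 2.5 1.375
```

In the ragged case the single token on the first GPU carries as much weight as the seven tokens on the second, which is the over-weighting of short responses that Hope describes.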

10:56 Hope: And now we get to my favorite table in the paper, the staircase. The authors set up a controlled comparison where they isolate each bug's contribution. Start with the buggy baseline at 48.3 average. Fix only the loss aggregation bug, the mean of means, and you get 49.1. Less than a point of improvement. Then start over and fix only the optimizer bug, leave the loss bug in place, and you get 53.4. Five points. Then fix both, and you land at 54.0, which lines up almost exactly with the independently implemented baseline at 53.8. Four numbers. They tell the whole story. The optimizer bug accounts for nearly the entire gap. The loss bug is real but small. And the patched pipeline matches the clean pipeline, which closes the diagnosis.
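
The staircase can be read as simple differences. The scores below are the ones quoted in the episode; the dictionary keys and the "share" calculation are an illustrative reading added here, not a decomposition the paper is confirmed to report.

```python
# Average benchmark scores as quoted in the episode.
scores = {
    "buggy": 48.3,
    "loss_fix_only": 49.1,
    "optimizer_fix_only": 53.4,
    "both_fixes": 54.0,
    "independent_framework": 53.8,
}

loss_gain = round(scores["loss_fix_only"] - scores["buggy"], 1)
optimizer_gain = round(scores["optimizer_fix_only"] - scores["buggy"], 1)
total_gain = round(scores["both_fixes"] - scores["buggy"], 1)
residual = round(scores["both_fixes"] - scores["independent_framework"], 1)

print(loss_gain)       # 0.8
print(optimizer_gain)  # 5.1
print(total_gain)      # 5.7
print(residual)        # 0.2

# Naive share of the total gain recovered by the optimizer fix alone.
optimizer_share = optimizer_gain / total_gain
print(round(optimizer_share, 2))  # 0.89
```

The two single-fix gains add up to 5.9 rather than 5.7, so the fixes are not perfectly additive, and the share is best read as approximate.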

11:53 Tyler: That four-number staircase is one of the most informative experimental designs I've seen in a while. It's not flashy, but it ties each number in the headline result to a specific code change, which is the kind of attribution most ML papers gesture at and never actually deliver.

12:14 Hope: And then the obvious next step. Take the corrected baseline, run a standard RL stage on top of it, and compare to the published mixed-policy methods on their own turf. On the first model, the corrected SFT alone hits 52.2, already beating two of the mixed-policy methods, at 46.3 and 48.8. Add the RL stage, and you get 57.0. That beats the strongest mixed-policy method by 3.8 points.

12:45 Tyler: And then there's Llama, which is a different kind of result entirely.

12:50 Hope: The Llama story is wild. On Llama three-point-one eight B, which is a base model with weak math priors, so bootstrapping matters even more, corrected SFT alone reaches 33.9. The best published mixed-policy result on Llama was 21.5. The corrected baseline beat it by twelve points before the RL stage even started. Run the full SFT-then-RL pipeline, and you land at 43.7. Two mixed-policy methods on Llama come in at 14.4 and 15.6. The gap to the best published number is 22.2 points.

13:25 Tyler: Twenty-two points is a number that should make anyone sit up. A pipeline bug producing a twenty-two-point swing on a benchmark family is, well, it's almost too clean. The authors give a plausible explanation: Llama doesn't know much math out of the box, so a weak SFT stage cripples it more dramatically than it cripples a model that already has math priors baked in. Mixed-policy methods on Llama are essentially trying to do bootstrapping and refinement at the same time on a model that can't bootstrap, and the demonstration signal is too sparse to lift it. There's a figure in the paper that makes this visceral. The training reward curves for the mixed-policy methods on Llama stay below thirty percent for the entire 500-step run. They never get off the ground. The corrected SFT-then-RL starts at sixty percent and climbs from there.

14:30 Hope: That image is the one I'd put on the cover. The mixed-policy methods, after 500 steps of training, still haven't reached the level the standard pipeline starts at.

14:43 Tyler: And it's not just better, it's cheaper. The authors run a truncated version of their pipeline, with only fifty RL steps instead of the standard 500. Ten times shorter. That truncated pipeline still beats every mixed-policy method on the in-distribution math benchmarks. And the FLOP count is roughly half what one of the mixed-policy methods uses, and less than half what another uses. So the boring recipe is faster, cheaper, and stronger.

15:11 Hope: There's a nice secondary finding tucked into the paper that I want to flag, because it tells you something about how subtle these effects can be. On Llama, prior works claimed the model couldn't follow the standard prompt, the long structured one with explicit reasoning instructions. So they used a simplified prompt. The ETH group finds that this was actually an artifact of undertrained SFT. Once SFT is done correctly, Llama follows the full prompt fine. A "model limitation" turned out to be a training pipeline limitation. Same family of error: the infrastructure was hiding something, and the field had built a small architectural workaround for what was really a plumbing issue.

15:59 Tyler: Hope, this is where I want to push on the steelman, because the paper makes some strong claims and the listener should know where the edges are. There are several places where a careful reader should hesitate before generalizing too far.

16:15 Hope: Go ahead.

16:15 Tyler: First, this is one dataset, one benchmark family, two model sizes. All math. All on a specific 46-thousand-example training set. It's possible that mixed-policy methods have advantages that show up at larger scale, on harder problems, or in domains where the bootstrapping story breaks down differently. The authors are explicit about this and don't oversell. Second, and this one is the most important: the mixed-policy methods position themselves as single-stage alternatives to SFT-then-RL. The paper invalidates the published comparisons, which is a real and important result. But it doesn't rule out the possibility that a mixed-policy stage applied on top of a properly trained SFT model could add value. That experiment hasn't been run. The authors say so directly. So the precise claim is "the published comparisons are wrong," not "mixed-policy is fundamentally a bad idea." Third, the authors' own SFT uses tuned hyperparameters. The original baselines, in some cases, used worse ones. There's an example with SRFT where switching its SFT baseline to a different set of hyperparameters jumps the baseline by five-and-a-half points and shrinks the apparent SRFT advantage from 7.6 points to 2.1. That's a different story than the bug story: it's "the baselines were sandbagged by suboptimal hyperparameter choice as well as by bugs." Fair point in the paper's favor. But it does mean a skeptic could ask how much of the corrected baseline's strength is just careful tuning that the original baselines simply didn't get. And fourth, the reproductions of mixed-policy methods are single-seed. The authors run their own SFT and SFT-then-RL with three seeds and report standard deviations, which is responsible. The mixed-policy reproductions, LUFFY's included, are not. The gaps are large enough that variance is unlikely to flip the ordering, but a more defensible comparison would multi-seed both sides.

18:23 Hope: All of those are fair, Tyler, and I think the authors hold up well under each of them. They acknowledge every one in the limitations section. The frame I'd offer is that this isn't a paper claiming mixed-policy methods don't work. It's a paper claiming the evidence cited in their favor doesn't actually support what it was taken to support. Those are two very different claims, and the authors are careful to make the narrower one.

18:49 Tyler: Right. And the narrower claim is more interesting anyway, because the broader implication is what the paper is really after.

18:57 Hope: Which is?

18:58 Tyler: Two bugs in two widely shared open-source frameworks were enough to systematically warp the conclusions of at least five published papers in the same subfield. Not because anyone was cheating. Not because of cherry-picking. Not because of bad faith. Because everyone's baselines flowed through the same broken plumbing. The authors put it bluntly: silent bugs in widely used pipelines were sufficient to systematically deflate baselines across multiple independent studies.

19:27 Hope: It's a structural failure mode. The kind that's almost impossible to catch from inside the system. Five different research groups can each "independently replicate" a result, and if they're all running through the same library, their independence is illusory at the level that matters. They're not testing whether the result is real. They're testing whether the library is consistent with itself.

19:51 Tyler: There's an analogy I want to offer here, even though it's a little forced. Imagine a neighborhood where every house gets its water from the same main pipe, and the pipe has a slow contaminant leak. Every household has a water-quality gauge, and every gauge reads normal, because every gauge in the neighborhood is calibrated against samples drawn from the same contaminated source. Independent measurement, same reading, false consensus. The only way to detect the problem is to bring water in from a different pipe.

20:22 Hope: And in the ML version, the "different pipe" was just a researcher who happened to use a different framework for the baseline than for the new method. That's it. The authors didn't build a new diagnostic tool. They didn't develop new theory. They re-implemented one experiment in a second library and noticed the readings disagreed.

20:44 Tyler: Which is the methodological argument the paper is making, even when it's not making it explicitly. Framework diversity is a kind of epistemic insurance. If a subfield concentrates all its empirical work on a single training stack, a single bug in that stack becomes invisible consensus. The fix isn't to demand bug-free libraries. That's not realistic. The fix is to keep enough diversity in the infrastructure that disagreements between implementations show up as disagreements in numbers, which can then be investigated.

21:18 Hope: And the paper closes on the implication that bites hardest. The CPU-offloading bug doesn't only affect SFT for reasoning. It affects any training run that uses DeepSpeed with CPU offloading and gradient accumulation, which is most memory-constrained academic work. Which means the same silent failure has been operating in other subfields for over a year, on baselines that nobody re-implemented in a different framework. We don't know what those subfields looked like with a healthy baseline. The authors don't claim to know. But the question is now sitting in plain sight.

21:56 Tyler: That's the takeaway I want listeners to carry. The narrow finding is that SFT-then-RL beats mixed-policy methods for math reasoning, by margins large enough that the field's conclusions on this specific question should flip. The broader finding is that we should be more nervous about benchmark-driven progress in regimes where everyone's baseline runs through the same library. Not paralyzed, but more nervous than we have been.

22:25 Hope: And a small hopeful note, Tyler. The bug was caught. Not because of formal incentives (there's no conference reward for finding a bug in a shared library that affects other people's papers) but because a research group that wanted to build on a method took the time to actually reproduce its baseline cleanly. That's the version of the field doing its job. It's slower. It's less glamorous. It produced this paper.

22:52 Tyler: Show notes have a link to the paper and related materials, if this episode caught you and you want to read the actual bug diff.

23:00 Hope: This is AI Papers: A Deep Dive. Thanks for listening.