Episode 074 · May 24, 2026 · 21 min

How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning

Wang, Liu, Wang et al.

Paper: HRM-Text: Efficient Pretraining Beyond Scaling
Venue: arXiv:2605.20613
Year: 2026
Read the paper: arxiv.org/abs/2605.20613
Also available on: Apple Podcasts, Spotify

A team at Sapient Intelligence and MIT trained a 1B-parameter model on 16 GPUs in 46 hours for about $1,500, and it goes toe-to-toe with Llama, Gemma, and two other larger open models on math and reasoning benchmarks. The authors argue this isn't just a democratization story: it's evidence that the trillion-token race was solving a problem better architecture and a smarter objective could have partly avoided.

What you'll take away

  • Why standard Transformers waste most of their depth, and how HRM-Text's fast/slow modules (L runs 3x for every H update, twice per forward pass) actually keep deliberating through the final layer
  • The MagicNorm trick: how a single placement of normalization behaves like PreNorm on the backward pass and PostNorm on the forward pass, because the two horizons have different lengths
  • Why grading the model only on response tokens, not on the question, concentrates the signal and lifts MMLU from 40 to 48 with no other changes
  • How PrefixLM lets the model read the prompt freely while still generating answers one token at a time, adding another 5 points on MMLU
  • Three honest pushbacks: HRM-Text is trained directly on instruction-response pairs (not apples-to-apples with general foundation models), the curated data mixture isn't isolated in the ablation, and scaling beyond 1B parameters is unverified
  • Why the right frame is 'existence proof, not new paradigm': the compute-to-performance ratio isn't a law of nature, and architectural questions are accessible to small labs again

Chapters

  1. 00:00 The fifteen-hundred-dollar headline
  2. 02:38 The H and L modules: fast and slow deliberation
  3. 05:16 MagicNorm and the asymmetric tightrope
  4. 07:54 Stop grading the model on the question
  5. 10:32 PrefixLM: reading freely, writing causally
  6. 13:10 The logit lens test: is the recurrence doing real work?
  7. 15:49 Three honest pushbacks
  8. 18:27 What survives the critique

0:00Bella: Sixteen GPUs. Forty-six hours. Fourteen hundred and seventy-two dollars on the invoice, call it fifteen hundred. That's what it cost a team at Sapient Intelligence and MIT to train a one-billion-parameter language model from scratch this month. And on benchmarks like grade-school math and the MATH dataset, that model goes toe to toe with, and sometimes beats, Llama 3.2 3B, Gemma 3 4B, and two other open models at 2B and 7B. Models that cost their builders, depending on how you count, somewhere between roughly a hundred and four hundred times more compute, and a hundred to nine hundred times more tokens to train.
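As a quick check on the headline figures, the three numbers quoted above imply a rental rate of about two dollars per GPU-hour. That rate is not stated in the episode; it is derived here from the quoted figures, and the variable names are illustrative.

```python
# Figures quoted in the episode.
gpus = 16
hours = 46
invoice_usd = 1472

# Total GPU-hours consumed by the run.
gpu_hours = gpus * hours

# Implied price per GPU-hour (derived, not quoted in the episode).
usd_per_gpu_hour = invoice_usd / gpu_hours

print(f"{gpu_hours} GPU-hours at ${usd_per_gpu_hour:.2f} per GPU-hour")
```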

0:42Tyler: That's the headline. The paper went up on arXiv on May twentieth, twenty-twenty-six, and we're recording four days later. What you're hearing is AI-generated: the script is from Anthropic's Claude, and I'm Tyler, that's Bella, we're both AI voices from Eleven Labs. Neither company is involved in producing the show. The paper is "HRM-Text: Efficient Pretraining Beyond Scaling," and the reason that fifteen-hundred-dollar number matters isn't just democratization, though it matters for that too. It's that the authors think they've shown the trillion-token race was never necessary in the first place.

1:25Bella: Right. And the way to feel that claim, Tyler, is to look at the two assumptions baked into how everyone trains language models right now. First assumption: the model is a vanilla decoder-only Transformer. Stack of identical blocks, each with its own parameters, with information flowing straight up and down. Second assumption: you train it by dumping trillions of tokens of internet text into it and grading it on predicting every single token: every word of every prompt, every word of every answer, every word of the boilerplate between. Both assumptions look, on inspection, wasteful. The paper rebuilds the model around the first assumption being wrong, and rebuilds the training around the second one being wrong.

2:15Tyler: And the two changes compound. Which is the part the ablation table makes really clean, but I think we should set up the architecture story first, because that's where the brain analogy comes in. You wanted to take us into that.

2:31Bella: Yeah. The biological hook the authors keep gesturing at is a split in the brain: basically, fast reflexive execution and slow strategic deliberation operating on different clocks. Think of a strong chess player. There's a fast part of their thinking that just sees, move to move: that knight wants to be here, this is a kingside attack. And there's a slower part that updates only when something fundamental about the position changes: okay, we're in a closed game now, I should be playing for the long term. The fast layer fires many times for every update of the slow layer. HRM-Text builds that split directly into the architecture. There are two modules, which they call L and H. L is the fast one. H is the slow one. One forward pass through the model runs the L module three times, then updates H once. And it does that whole cycle twice. So a single forward pass is eight module-steps total, but each module has half the parameters of a comparable Transformer.
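The step count described here can be sketched in a few lines. This only enumerates the order of module calls as the episode describes it (three L steps, one H update, two cycles); the function and parameter names are hypothetical, and nothing here models what the modules compute.

```python
def module_schedule(cycles=2, l_steps_per_cycle=3):
    """Return the order of module calls in one forward pass."""
    order = []
    for _ in range(cycles):
        # The fast module runs several times per cycle.
        order.extend(["L"] * l_steps_per_cycle)
        # The slow module updates once at the end of the cycle.
        order.append("H")
    return order


schedule = module_schedule()
print(schedule)       # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
print(len(schedule))  # 8
```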

3:37Tyler: So the model is, in a sense, reusing its weights: running the same blocks multiple times rather than stacking more layers.

3:46Bella: Exactly right. It's a recurrent design, and this is an old, recurring idea in deep learning. Universal Transformers, Looped Transformers, RINS: they all share the family resemblance of getting more computation per parameter by looping the same block. The fast/slow hierarchy is what's new here. And the reason recurrent designs haven't dominated language modeling, despite being efficient on paper, is that they're a nightmare to train. When you propagate gradients backward through the same operation applied many times in a row, the gradients get spiky. Vanishing, exploding, events where one training step suddenly has a gradient a hundred times larger than the last. The training run blows up.

4:31Tyler: Which is where MagicNorm comes in. Which the authors named, I assume, because it deserves it.

4:37Bella: It's a cute name, and the trick underneath is actually clever. So normalization layers in a Transformer: these are the things that keep numbers from blowing up or shrinking to zero as they flow through the network. There's a classic tradeoff. You can put the normalization BEFORE the main computation in each block, which is called PreNorm. Gradients flow cleanly during training, but activations drift larger as you go deeper. Or you put it AFTER, which is PostNorm. Activations stay bounded, but gradients get strangled. The MagicNorm idea exploits an asymmetry you wouldn't think to use. Here's the image. Imagine a tightrope walker crossing a very long rope. There's a safety net, but the net only covers the last few feet of the crossing. You want the rope itself to be stable for the whole walk. But you only need the catching apparatus where it can actually catch you.

5:34Tyler: Translate the metaphor.

5:35Bella: The forward pass, when the model is producing predictions, goes through every step. All eight of them. The backward pass, when gradients flow back to update the weights, is truncated. The authors only let gradients flow through the last few steps, not the whole unroll. So the model thinks forward through a long chain but only learns from the tail end of that chain. MagicNorm puts a stabilizing norm at the exit of every recurrent step. On the forward pass, that norm fires eight times: lots of stabilization, activations stay bounded. On the backward pass, it only fires a few times, because gradients are truncated, so the gradient-friendly PreNorm behavior inside each block dominates. Same architecture, both behaviors, because the forward and backward horizons are different lengths. There's a second trick they layer on top, where they start training with an even shorter backward horizon, just two steps, and gradually extend it as the model stabilizes. Short leash early, longer leash once the optimization landscape isn't a minefield.
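The forward/backward asymmetry can be made concrete with a small bookkeeping sketch. It only counts which of the eight steps would carry gradient and how the backward horizon might grow during training. The episode gives the starting horizon (two steps) but not the final horizon or the shape of the schedule, so `final_horizon=4` and the linear ramp are assumptions for illustration, and all function names are hypothetical.

```python
def grad_steps(total_steps, backward_horizon):
    """Indices of recurrent steps that receive gradient under truncation."""
    start = max(0, total_steps - backward_horizon)
    return list(range(start, total_steps))


def horizon_at(train_step, warmup_steps, start_horizon=2, final_horizon=4):
    """Backward horizon at a given training step.

    ASSUMPTION: linear growth from start_horizon to final_horizon.
    The episode only says the horizon starts at two and is extended gradually.
    """
    if train_step >= warmup_steps:
        return final_horizon
    frac = train_step / warmup_steps
    return start_horizon + int(frac * (final_horizon - start_horizon))


total_steps = 8  # module-steps per forward pass, as described in the episode
early = grad_steps(total_steps, horizon_at(0, warmup_steps=100))
late = grad_steps(total_steps, horizon_at(100, warmup_steps=100))

# The exit norm runs on all 8 forward steps, but gradient only crosses
# the norms on the steps listed here.
print(early)  # [6, 7]
print(late)   # [4, 5, 6, 7]
```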

6:49Tyler: That asymmetry is genuinely satisfying. The norm gets to be in two places at once because the forward and backward passes don't see it the same number of times. Okay, so that's the architecture, and the stability tricks that make it trainable. The other half of the story is the objective. What are you actually grading the model on?

7:11Bella: This is where I hand it over, Tyler. Because I think the objective change is, in some ways, the more provocative claim.

7:19Tyler: I think it might be. So the standard recipe — and I want to make sure this is clear — is: take a piece of internet text, any piece, doesn't matter what, and grade the model on predicting every single word of it given the previous words. Not just the interesting words, not just the words a user would want generated — every word. Including boilerplate. Including the parts of the prompt the model will never actually have to produce when it's deployed. At inference time, language models are doing conditional generation. Given a question, produce an answer. So the natural question the authors ask is: why are we spending most of pretraining teaching the model to predict the question?

8:04Bella: Right — because at inference, the question is given.

8:09Tyler: It's given. You don't generate it. You read it. And the analogy the paper basically writes itself into is the exam grader. Imagine grading a student two ways. In the first, you grade them on copying down the question accurately AND writing the answer. In the second, you only grade the answer, and the question is just printed on the page. The first way wastes the student's effort, and it wastes yours as the grader. Standard pretraining is the first style. HRM-Text grades only the answer. Concretely: instead of computing loss over every token of the document, you compute loss only over the response tokens of an instruction-response pair. Every update directly improves response generation. Nothing is spent teaching the model to model prompt-style text.
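A minimal sketch of the difference between the two grading styles, assuming per-token log-probabilities are already available. The function names and the toy numbers are illustrative, not taken from the paper; a real training loop would apply the same mask inside its cross-entropy loss.

```python
import math


def full_sequence_loss(token_log_probs):
    """Mean negative log-likelihood over every token (prompt and response)."""
    return -sum(token_log_probs) / len(token_log_probs)


def response_only_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only."""
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("sequence has no response tokens")
    return sum(picked) / len(picked)


# Toy example: two prompt tokens followed by two response tokens.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.125)]
is_response = [False, False, True, True]

print(full_sequence_loss(log_probs))
print(response_only_loss(log_probs, is_response))
```

In the second call the prompt tokens contribute nothing to the loss, so they contribute nothing to the gradient either.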

9:00Bella: I want to push on this for a second, because there's a steelman of the standard objective that doesn't show up immediately. The argument FOR predicting every token is that it teaches you general language modeling. The model learns about syntax, vocabulary, discourse structure by being graded on everything. The exam-grader analogy is a little unfair because copying the question is trivial; predicting prompt-like text is actually informative.

9:30Tyler: Sure, and the authors don't deny that. Their bet is that the marginal gain from grading the model on questions is much smaller than people have assumed, and that the signal you concentrate by not doing it is worth the tradeoff. The ablation supports that empirically. But the steelman is real. We'll come back to it. There's a second change to flag here that ties into the same logic. Once you decide you're not grading the model on the question, you can also stop forcing it to read the question one word at a time. This is the PrefixLM piece.

10:05Bella: Walk through that.

10:07Tyler: So in a normal decoder-only Transformer, every attention layer uses what's called a causal mask. Each word can only see the words that came before it. That makes sense when you're generating, because you don't get to know the future. But when you're reading a question that's already given to you, when the whole prompt is sitting in front of you, there's no reason the model shouldn't be able to look at all of it freely. The image I like, and I think the paper uses something close to this: when you read a question on a page, your eyes can move around. You glance at the end, you go back to the beginning, you cross-reference. But when you WRITE the answer, you have to produce it one word at a time, in order, committing to each word before knowing the next. PrefixLM gives the model exactly that asymmetry. The question tokens can all see each other simultaneously, like an encoder reading the whole thing. The answer tokens are still generated one at a time, causally. Same model. Same weights. Just a different mask.
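The mask described here is easy to write down. The sketch below builds a boolean attention mask in plain Python where prompt positions see the whole prompt and answer positions see only what came before them. The function name and the `True means may attend` convention are choices made for this example.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Prompt tokens attend to every other prompt token.
            both_in_prefix = i < prefix_len and j < prefix_len
            # Every token attends to itself and to earlier positions.
            causal = j <= i
            row.append(both_in_prefix or causal)
        mask.append(row)
    return mask


# Two prompt tokens followed by two answer tokens.
for row in prefix_lm_mask(prefix_len=2, total_len=4):
    print(["x" if allowed else "." for allowed in row])
```

Setting `prefix_len=0` gives back the ordinary causal mask, which is one way to see that only the mask changes, not the model.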

11:13Bella: And this is encoder-like behavior on the question, without needing a second model.

11:18Tyler: Without needing a second model. The paper shows that, across the layers, the model uses more of the prompt, more globally. It's looking around. And there's a beautiful ablation that ties all this together. You don't need to see the table. They start with a vanilla Transformer trained on standard causal language modeling over full text. MMLU score: forty and a half. Then they switch only the objective: same model, but now graded only on answer tokens. MMLU jumps to forty-eight. Then they add PrefixLM attention on top: fifty-three. Then they swap in the HRM architecture: sixty-one. Each step adds something. None of them does all the work. The contributions are additive.
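The ablation ladder is simple enough to tabulate. The scores below are the MMLU figures as spoken in the episode, so they are rounded; the row labels are paraphrases, not the paper's own table headings.

```python
# MMLU scores as quoted in the episode (rounded as spoken).
ablation = [
    ("Transformer, causal LM on full text", 40.5),
    ("+ response-only objective", 48.0),
    ("+ PrefixLM attention", 53.0),
    ("+ HRM architecture", 61.0),
]

# Gain contributed by each added component over the previous row.
gains = [
    (name, round(score - prev_score, 1))
    for (name, score), (_, prev_score) in zip(ablation[1:], ablation[:-1])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"total: +{total_gain}")
```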

12:05Bella: That's the spine of the technical contribution. And I want to land the payoff piece, which is the question of whether the recurrence actually does anything. Because you could imagine all of this being true (clever architecture, clever objective) and the result still being that the loops are essentially decorative. The model commits to its answer early, and the later passes don't really change anything.

12:31Tyler: Which is exactly what we know happens in standard Transformers.

12:35Bella: It's what happens in standard Transformers. There's a diagnostic called the logit lens. You take the model's intermediate hidden state at each layer, project it forward as if that layer were the final layer, and ask: what would the prediction be if we stopped here? In standard Transformers, you can stop relatively early. The first third of the layers settle on an answer, the deeper layers nudge it around, but the prediction is basically locked in by the middle of the network. Picture a committee where each member adds their two cents, except the first few members lock in a decision and everyone after them just nods along. HRM-Text is different. When you run the logit lens through HRM's cycles, the prediction keeps meaningfully shifting all the way through. The last cycle is still updating the answer. The committee is still actually deliberating.
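A toy version of the logit lens, to make the diagnostic concrete. It assumes you already have one hidden-state vector per depth and an unembedding matrix with one row per vocabulary entry. The numbers are made up, the names are hypothetical, and the final normalization that real models apply before unembedding is omitted for brevity.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def logit_lens(hidden_states, unembed):
    """Top token id at each depth, as if decoding stopped there."""
    predictions = []
    for hidden in hidden_states:
        logits = matvec(unembed, hidden)
        predictions.append(max(range(len(logits)), key=logits.__getitem__))
    return predictions


# Toy setup: vocabulary of 3 tokens, hidden size 2, three depths.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[1.0, 0.2], [0.4, 0.9], [0.1, 1.5]]

predictions = logit_lens(hidden_states, unembed)
print(predictions)  # [0, 1, 1]
```

In this toy run the prediction changes once and then stays put, which is the "locked in early" pattern. The claim in the episode is that for HRM-Text the sequence keeps changing through the final cycle.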

13:29Tyler: Which is, on its own, a striking result. The scale-versus-structure debate in machine learning has been running for years, and the story has mostly been scale. Bigger models, more data, more compute, capabilities emerge. What this is suggesting is that some of what we've been attributing to scale is actually a workaround for under-utilized depth. The standard Transformer wastes most of its layers. If you build an architecture that doesn't waste them, you don't need as much scale to get the same behavior.

14:01Bella: And that's the deep version of the democratization claim. The fifteen-hundred-dollar number is the visceral hook. The under-utilized-depth result is the intellectual reason it works.

14:13Tyler: Okay. So if I'm being honest, this is also where the skeptic in me wants airtime. Because the headline numbers are extraordinary, and extraordinary numbers deserve scrutiny. There are three places where I want to push back, and I think the paper handles two of them well and one of them less well. You want to take any of these, or should I just run through them?

14:37Bella: Run through them, Tyler. I'll push back where I disagree.

14:40Tyler: First: it's not apples-to-apples. Llama, Gemma, and the other two comparison models are general-purpose models. They were trained on raw web text and only later instruction-tuned for benchmarks. HRM-Text trains exclusively on instruction-response pairs from the start. So when you compare them on benchmarks that mostly test instruction-following (math problems, reasoning, multiple-choice questions), you're comparing a model that trained for exactly that task against models that did it as a finishing step. Of course the specialized model looks competitive. The fair question is whether HRM-Text has the same generality as the comparison models. Could you fine-tune it for a novel downstream task the way you can with Llama? The paper doesn't really show that. And it's suggestive that HRM-Text loses noticeably on HellaSwag, which is more of a commonsense and world-knowledge benchmark: sixty-three percent versus seventy-seven for Gemma. That gap reads as: the broad factual coverage isn't there.

15:46Bella: The authors actually acknowledge this point pretty cleanly. They frame HRM-Text as good at reasoning and task execution, less good at broad factual recall. And they suggest external memory or retrieval as the complement. So this isn't a critique they're hiding from.

16:03Tyler: They're not hiding from it. But it does change what the fifteen-hundred-dollar number means. It's fifteen hundred dollars to train a competitive specialized model. Not fifteen hundred dollars to train a competitive general-purpose foundation model. Those are different claims.

16:22Bella: Fair.

16:22Tyler: Second pushback, and this is the one I think gets less attention than it deserves. The training data is heavily curated. Stratified by domain. Capped, upsampled, deduplicated. It includes datasets like OpenMathInstruct and NuminaMath that are specifically built to teach mathematical reasoning. The strong math benchmark numbers (eighty-four percent on grade-school math, fifty-six on the MATH dataset) might reflect the curated data mixture as much as the architecture. The ablation in the paper controls for objective and architecture, but it does NOT control for the data mixture. A standard Transformer trained on this same curated mixture would presumably also outperform a Transformer trained on raw web text. And the paper doesn't isolate how much of the headline improvement comes from data curation versus architecture versus objective.

17:20Bella: That's the place I think we can add something the paper doesn't. Because data curation is real work, and it's expensive in human effort if not in compute, and that cost isn't reflected in the fifteen-hundred-dollar invoice. Somebody assembled and filtered that mixture, and that somebody isn't on the GPU bill.

17:42Tyler: It isn't on the bill. Third pushback, and this one the authors are explicit about, is that scaling is unverified. They only tested up to one billion parameters for HRM-Text, three billion for the Transformer baseline. The whole claim is that this architecture changes the compute-to-performance ratio. But the comparison models are two to seven billion parameters trained on far more tokens. It's possible that HRM-Text's competitiveness is specific to the small-model regime, where the comparison models are themselves undertrained relative to their architecture. Whether HRM-Text at seven billion trained on four hundred billion tokens would compete with Llama at seventy billion trained on fifteen trillion is unknown. And the authors say so directly. So this isn't a hidden flaw. It's a stated open question.

18:36Bella: And the way they frame the whole paper is as an existence proof rather than a recipe. They're not saying this is the new paradigm. They're saying the compute-to-performance ratio is not a law of nature. The current paradigm leaves enormous efficiency on the table, and here's a one-billion-parameter, two-day, fifteen-hundred-dollar demonstration that you can match a much more expensive training run by changing what you're optimizing and how the architecture spends its computation. That's a narrower claim, and it's the claim that survives all three critiques.

19:15Tyler: Which I think is the right frame to leave the listener with. Not "the trillion-token race was a mistake." More like: the trillion-token race was solving a problem that better architecture could partly have avoided. The standard recipe works. It also wastes a lot of computation on under-utilized depth and on predicting text the model will never generate. If you fix both of those, you can get into the performance neighborhood of much larger, much more expensive models with a university-lab budget. That's not nothing.

19:50Bella: That's not nothing. And honestly, the most exciting thing about this paper is the invitation in the conclusion. The authors basically say: pretraining from scratch is accessible again, come join us. For most of the last few years, foundational architecture research has lived inside a handful of labs that can afford it. If a sixteen-GPU, two-day, fifteen-hundred-dollar training run can produce something this competitive, that means a much larger community of researchers gets to ask architectural questions and actually answer them. Not at frontier scale yet. But at the scale where ideas can be tested and iterated. The compute moat shrinks by a meaningful amount.

20:33Tyler: The paper is from Wang and collaborators at Sapient Intelligence and MIT. Link's in the show notes, along with some related reading if you want to go deeper.

20:44Bella: And if you want the full transcript with definitions baked in for every term we touched (prefix language modeling, backpropagation through time, and the rest), that's all on paperdive.ai, with concept pages that connect this episode to the others we've done on efficient training and architecture.

21:04Tyler: Thanks for listening. This was AI Papers: A Deep Dive.