When the Iteration Teaches the Model to Skip the Iteration
About this episode
Three frontier language models score zero on hard Sudoku. A 27-million parameter model solves 91%. But the real surprise in this paper isn't the benchmark — it's that the iterative refinement procedure the model is trained with quietly disappears at inference, absorbed into a single forward pass. We work through how that happens, and what it might mean.
What you'll take away
- Why looped Transformers have been fragile to train, and how reframing the loop as a fixed-point problem gives you constant training memory regardless of iteration depth
- The implicit differentiation trick that makes this work — and the cheaper one-step approximation that holds up in language modeling but breaks in the reasoning regime
- How Attractor Models build a new Pareto frontier on language modeling, matching a 1.3B Transformer at 770M parameters
- Why TRM collapses from 75% to 0% on Sudoku when you scale it from 7M to 27M parameters — and why the Attractor Model at the same scale doesn't
- Equilibrium internalization: the trained backbone learns to put its first guess at the fixed point, making the refinement module obsolete at inference — an emergent self-distillation nobody designed in
- The 'implicit gradient barrier' argument for why this training is structurally more stable than fixed-depth looped training, and where that argument is intuition rather than proof
Chapters
- 00:00 The problem with one-forward-pass reasoning
- 03:18 Don't unroll the loop, solve for where it ends
- 06:37 Implicit differentiation and constant-memory training
- 09:56 Two design choices that make it work
- 13:15 Language modeling results
- 16:34 The Sudoku result and what it actually means
- 18:40 Equilibrium internalization
- 22:06 The implicit gradient barrier
- 26:31 Where the paper reaches and what to watch
References in this episode
- Deep Equilibrium Models — The 2019 Bai et al. paper that introduced fixed-point equilibrium networks with
- Hierarchical Reasoning Model (HRM) — One of the tiny-reasoner architectures whose Sudoku and maze results set up the
- Less is More: Recursive Reasoning with Tiny Networks (TRM) — The 7M-parameter recursive reasoner that beats frontier LLMs on Sudoku-Extreme b
- Looped Transformers for Length Generalization — A representative entry in the looped-Transformer literature whose training fragi
Full transcript
0:00Tyler: Three of the most capable language models on the planet — DeepSeek R1, Claude 3.7 Sonnet, and OpenAI's o3-mini — were handed a stack of Sudoku puzzles. The hard kind. And they solved exactly zero of them. Not "fewer than expected." Zero percent. Then a twenty-seven-million parameter model — about a ten-thousandth the size of any of those frontier systems — solved roughly ninety-one percent of the same puzzles.
0:29Juniper: And the architecture that lets that happen is what we're working through today. The paper is "Solve the Loop: Attractor Models for Language and Reasoning," by Jacob Fein-Ashley and Paria Rashidinejad at the University of Southern California. It went up on arXiv on May twelfth, twenty-twenty-six, and we recorded one day later. Quick note on what you're hearing: this episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Juniper, Tyler is here with me, and we're both AI voices from Eleven Labs. The producer isn't affiliated with Anthropic or Eleven Labs. And the reason a one-day-old paper is worth jumping on is that the Sudoku result Tyler just teased is actually the easier half of what's going on.
1:21Tyler: Easier half. That's a setup.
1:24Juniper: It is. The Sudoku number is the hook — it's vivid, it lands instantly. But the architectural idea behind it shows up just as strongly on plain language modeling, and there's a phenomenon buried in the training dynamics that I think is the most interesting thing in the paper. So we'll work toward that.
1:46Tyler: Alright. Walk me through the puzzle. What problem is this paper actually trying to solve?
1:52Juniper: It starts with something everyone listening already knows in their gut. When a Transformer predicts a token, it does the same amount of work for every token. Whether it's predicting the word "the" in a sentence about your grocery list, or the next move in a chess position that took a grandmaster forty minutes to evaluate — same number of layers, same FLOPs, same one forward pass. The model can't think harder about the harder token.
2:22Tyler: Which is weird, because thinking harder is most of what reasoning is.
2:26Juniper: Right. And the field has known this for a long time, so people have tried two paths. One is chain-of-thought — make the model write out its thinking as tokens, so each "thought" is just another output. That works, but it forces the thinking to be discrete and visible, even when the real refinement is happening in continuous latent space. The other path is looped Transformers. You take one set of layer weights and just apply them over and over to the same internal state, refining it. That's iteration in latent space — no need to verbalize.
3:03Tyler: And that path has been the fragile one.
3:06Juniper: Painfully fragile. The core issue is training. To compute gradients through a loop you've run, say, two hundred and fifty-six times, you have to remember every intermediate state — so memory grows linearly with the loop depth. Run it two hundred and fifty-six times, you need two hundred and fifty-six times the memory of a single pass. There's a result from a few years back where the authors report a looped model that ends up burning the FLOPs of a feed-forward model ten times its size, just to train. And inference is rigid — you commit to a loop depth at training time. You can't decide later, "this token is harder, let me iterate more."
3:49Tyler: And there's the weirder failure too. The tiny reasoning models that get worse when you scale them up.
3:56Juniper: That's the one that really doesn't fit into a clean scaling story. There's a line of work on tiny recursive reasoners — HRM, TRM — that hit really impressive scores on tasks like Sudoku and maze-solving with just a few million parameters. TRM at seven million parameters gets about seventy-five percent on Sudoku-Extreme. Then you take the same architecture and scale it to twenty-seven million parameters. It collapses. Zero percent. The bigger model is catastrophically worse than the smaller one. That is not how machine learning is supposed to work.
4:34Tyler: So you have this whole subfield where the thing that should help — more capacity — actively breaks the model. That's the backdrop.
4:43Juniper: That's the backdrop. So the question is: can you get the benefit of latent iterative refinement — the model thinking longer about harder tokens, in its own internal space — without paying the memory cost, the training fragility, and this strange scaling collapse?
5:01Tyler: And the answer this paper proposes is — don't unroll the loop. Solve for where the loop would end up.
5:08Juniper: That's the reframe. And it comes out of an empirical observation made by a different group, Blayney and colleagues, earlier this year. They looked inside trained looped language models and noticed something. For most tokens, the loop is converging. By iteration four or five, the internal state has stopped changing meaningfully. The loop isn't computing some long trajectory — it's grinding toward a fixed point. Most of the iterations are just confirming the answer the loop has already found.
5:41Tyler: So the loop is doing fixed-point iteration whether anyone planned for that or not.
5:47Juniper: Exactly. And once you see that, the whole paper basically asks the obvious next question: if the loop is solving a fixed-point problem, why are we training it as a fixed-depth unroll? Why not just solve the fixed-point problem directly?
6:03Tyler: Let me make sure I have the fixed-point intuition right, because this is the skeleton of everything. A fixed point of some function is a value where, if you feed it in, you get the same thing out. It's a place the iteration is being pulled toward — an attractor in the state space.
6:21Juniper: That's it. And there's a clean mental picture for why iteration finds it. Imagine you're zooming in on a digital map, and each step magnifies what's at the center by a factor of two. After ten zooms, you've magnified a thousandfold. Whatever was near the center is now filling the screen — your remaining error has been cut by a constant fraction every step. That's geometric convergence, and it's the math under all of this. Once you accept that the iteration is heading to a specific destination, you can start asking: why are we taking the steps at all? Why not just solve for the destination?
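A minimal sketch of the fixed-point picture described above, in plain Python. The map `toy_map` and its constants are assumptions chosen for illustration, not anything from the paper. Its slope of 0.5 means each step halves the remaining distance to the fixed point at z = 2, which is the geometric convergence the zoom analogy is pointing at.

```python
def fixed_point_iterate(f, z0, tol=1e-10, max_steps=100):
    """Apply f repeatedly until one step moves the state by less than tol."""
    z = z0
    step_sizes = []
    for _ in range(max_steps):
        z_next = f(z)
        step_sizes.append(abs(z_next - z))
        z = z_next
        if step_sizes[-1] < tol:
            break
    return z, step_sizes


def toy_map(z):
    # Hypothetical contractive map with slope 0.5; its fixed point is z = 2.
    return 0.5 * z + 1.0


z_star, step_sizes = fixed_point_iterate(toy_map, z0=0.0)

# Ratio of consecutive step sizes: constant for a geometrically converging loop.
ratios = [later / earlier for earlier, later in zip(step_sizes, step_sizes[1:])]

print(f"fixed point ~ {z_star:.6f} after {len(step_sizes)} steps")
print("shrink factor per step:", sorted(set(round(r, 12) for r in ratios)))
```

The stopping rule is also a small preview of adaptive depth: the loop ends when the state stops moving, not after a preset number of steps.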
7:00Tyler: Right — and the practical version of that question is the implicit differentiation move. This is the part I want to make sure we actually unpack, because it's where the magic happens and also where it's easy to wave through.
7:14Juniper: This is the part the paper is really built on. So here's the situation. You've got a refinement function — the attractor module — and a starting guess, and you want to find the point where the function gives back its own input. That's your answer. Now, normally, to train a neural network, you compute gradients by running the network forward, remembering everything, then walking backward through it. If your network is a loop run a hundred times, you remember a hundred states. The implicit function theorem gives you another option. It says: if your answer satisfies an equation — in this case, "refining the answer gives back the answer" — then you can differentiate the equation itself. You can ask "how does this answer shift if I nudge the parameters of the network?" and get a clean linear equation that you solve once at the very end. You never need to remember the trajectory. You never need to remember how you got there. You only need to know that you arrived.
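To make the implicit function theorem step concrete, here is a scalar sketch under stated assumptions. The refinement function `refine(z, w, x) = tanh(w*z + x)` and the values of `w` and `x` are hypothetical stand-ins, not the paper's attractor module. If z* satisfies z* = f(z*, w), differentiating that equation gives dz*/dw = (∂f/∂w) / (1 − ∂f/∂z), evaluated only at the endpoint. The code checks that formula against a finite-difference estimate obtained by re-solving the fixed point.

```python
import math


def refine(z, w, x):
    # Hypothetical one-parameter refinement step (an assumption for illustration).
    return math.tanh(w * z + x)


def solve_fixed_point(w, x, z0=0.0, steps=200):
    """Find z* with z* = refine(z*, w, x) by plain iteration."""
    z = z0
    for _ in range(steps):
        z = refine(z, w, x)
    return z


def implicit_grad_w(z_star, w, x):
    """dz*/dw from the equilibrium condition alone; no trajectory is stored."""
    s = 1.0 - math.tanh(w * z_star + x) ** 2  # tanh'(u) at the fixed point
    df_dw = s * z_star                        # partial of refine w.r.t. w
    df_dz = s * w                             # partial of refine w.r.t. z
    return df_dw / (1.0 - df_dz)


w, x = 0.5, 0.3
z_star = solve_fixed_point(w, x)
g_implicit = implicit_grad_w(z_star, w, x)

# Reference: central finite difference through two fresh solves.
eps = 1e-6
g_numeric = (solve_fixed_point(w + eps, x) - solve_fixed_point(w - eps, x)) / (2 * eps)

print(f"implicit: {g_implicit:.8f}  finite-difference: {g_numeric:.8f}")
```

Note that `implicit_grad_w` never sees the 200 intermediate states. That is the scalar version of the constant-memory property: only the endpoint is needed.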
8:17Tyler: That's the recipe analogy, right? You don't trace every step of baking the cake. You just take the finished cake, perturb the sugar a little, taste the difference, and infer the dependence from the endpoint.
8:30Juniper: That's the spirit. And the practical consequence is dramatic. Memory becomes constant in the number of iterations. The solver can run for ten steps or two hundred and fifty-six — your GPU memory footprint doesn't budge. The paper has a figure where, at two hundred and fifty-six iterations, the strongest competing looped model just runs out of memory and crashes. The Attractor Model sits at about four gigabytes, flat, no matter how many iterations the solver decides to take.
9:01Tyler: Constant memory is the headline of the trick, but I want to flag something here for the listener, Juniper, because this is where the magic-wand temptation lives. The implicit gradient technically still requires solving a linear system involving a big matrix inverse. The paper uses a one-step approximation that just drops the inverse entirely. And the ablation on that is sharp — dropping the inverse costs basically nothing in language modeling. Something like a tenth of a perplexity point. But it cuts training memory by a factor of five, and step time by almost three. So the practical version of "implicit differentiation" in this paper is even cheaper than the textbook version. It's the bare minimum that still works.
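The episode does not spell out the exact form of the paper's one-step approximation, so the sketch below assumes the simplest reading of "drops the inverse": replace the inverse of (identity minus Jacobian) with the identity. The 2-by-2 linear map `W`, the bias `b`, and the squared-error loss are all illustrative assumptions. For a contractive map the approximate gradient points in nearly the same direction as the exact one, which is consistent with the small perplexity cost reported for language modeling.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical contractive refinement map f(z) = W z + b (spectral norm of W is about 0.22).
W = [[0.2, 0.1], [-0.1, 0.2]]
b = [1.0, -0.5]
target = [0.0, 0.0]

# Forward: iterate to the fixed point z* = W z* + b.
z = [0.0, 0.0]
for _ in range(100):
    z = [wz + bi for wz, bi in zip(matvec(W, z), b)]

# Gradient of the loss 0.5 * ||z* - target||^2 with respect to z*.
dL_dz = [zi - ti for zi, ti in zip(z, target)]

# Exact implicit gradient w.r.t. b solves g = W^T g + dL/dz,
# i.e. g = (I - W^T)^{-1} dL/dz, computed here by iterating that equation.
Wt = transpose(W)
g_exact = list(dL_dz)
for _ in range(100):
    g_exact = [wg + v for wg, v in zip(matvec(Wt, g_exact), dL_dz)]

# One-step approximation (assumed form): treat the inverse as the identity.
g_one_step = list(dL_dz)

print(f"cosine(exact, one-step) = {cosine(g_exact, g_one_step):.4f}")
```

The closer the Jacobian's largest singular value gets to one, the more the two gradients diverge, so this toy says nothing about regimes where the map is only weakly contractive.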
9:49Juniper: And that "still works" is doing a lot of work. We'll come back to it when we get to the reasoning regime, because over there, it does *not* still work, and they have to spend more.
10:01Tyler: Worth flagging. So we have the reframe — solve for the fixed point, not the trajectory. We have the implicit differentiation move that makes training memory constant. What else?
10:13Juniper: Two design choices that the authors stress, because they're what makes this work where prior attempts didn't. The closest ancestor of this paper is Deep Equilibrium Models — DEQs — from Bai and colleagues back in twenty-nineteen. DEQs had the same idea: find a fixed point, differentiate through it. But DEQs didn't really scale to language modeling. Training was unstable, the number of iterations needed kept ballooning as training progressed, and quality lagged. The Attractor Models paper diagnoses why. DEQs put the equilibrium in a hidden state — some internal scratch space — and they initialize the solver from zero. Then there's a separate decoder head that turns the equilibrium into an answer. So the solver starts from nothing and has to converge to something meaningful inside an abstract internal coordinate system. This paper does two things differently. First, the equilibrium lives in *output embedding* space — the space the model would already use to make a prediction. The fixed point you're solving for *is* the thing that gets decoded. Second, the solver doesn't start from zero. A normal Transformer backbone — a big one, full attention, full capacity — does one forward pass and produces an initial guess. That initial guess is what the solver starts from. The attractor module's job is just to refine.
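The two design choices can be sketched as a toy forward pass. Everything named here is a hypothetical stand-in: `toy_refine` plays the attractor module, `backbone_draft` plays the output of one backbone forward pass, and `toy_decode` plays the unembedding. The point is only the shape of the computation: the solver iterates in the same space that gets decoded, and it starts from a draft instead of from zero.

```python
def solve_from(refine, start, tol=1e-8, max_steps=64):
    """Iterate z <- refine(z) from `start`; stop as soon as the state settles."""
    z = list(start)
    for step in range(1, max_steps + 1):
        z_next = refine(z)
        change = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if change < tol:
            return z, step
    return z, max_steps


ANSWER = [1.0, -2.0]  # hypothetical fixed point in output-embedding space


def toy_refine(z):
    # Stand-in for the attractor module: moves halfway toward ANSWER per call.
    return [0.5 * zi + 0.5 * ai for zi, ai in zip(z, ANSWER)]


def toy_decode(z, vocab):
    # Stand-in for the unembedding: token whose embedding has the largest dot product.
    return max(vocab, key=lambda tok: sum(a * b for a, b in zip(vocab[tok], z)))


vocab = {"yes": [1.0, -2.0], "no": [-1.0, 2.0]}
zero_start = [0.0, 0.0]        # DEQ-style initialization
backbone_draft = [0.9, -1.8]   # pretend output of one backbone forward pass

z_cold, steps_cold = solve_from(toy_refine, zero_start)
z_warm, steps_warm = solve_from(toy_refine, backbone_draft)

print("steps from zero:", steps_cold, "| steps from draft:", steps_warm)
print("decoded:", toy_decode(z_cold, vocab), toy_decode(z_warm, vocab))
```

Both starts reach the same answer in this toy; the draft just gets there in fewer steps, and every intermediate state is already something the decoder can read.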
11:42Tyler: So instead of asking a small refinement network to hallucinate an answer from scratch, you ask a big Transformer to produce a draft and let the refiner clean it up.
11:53Juniper: That framing turns out to be enormously consequential. The Transformer is doing what it's good at — broad pattern matching, parallel context absorption — and the small attractor module is doing the iterative refinement piece, the part Transformers can't natively do. They're playing complementary roles. And because the draft already lives in output-embedding space, the refinement is moving inside the space of meaningful answers from the very first step.
12:23Tyler: Juniper, what does that actually buy you on benchmarks? The architecture is elegant; let's see if it delivers.
12:30Juniper: That's your thread to run.
12:32Tyler: It is. So on the language modeling side, the comparison they care about is against a model called Parcae — that's the current best stable looped language model, also a twenty-twenty-six paper. Both architectures get trained on the same data budget, same optimizer, same schedule, three sizes: one hundred and forty million parameters, three hundred and seventy million, seven hundred and seventy million. Only the architecture changes. Two numbers worth holding onto. First, the Pareto frontier on perplexity versus training compute. At every size, Attractor Models form a new frontier — lower perplexity for less training compute than either a parameter-matched Transformer or a parameter-matched Parcae. The headline comparison: their seven-hundred-seventy-million-parameter model matches or beats a one-point-three billion parameter Transformer trained on twice as many tokens. That's a real efficiency win. And the training cost is roughly a quarter to a third lower than Parcae across the board — because the solver converges adaptively and stops when it's done, instead of always running for the full unroll depth.
13:42Juniper: And the second number?
13:43Tyler: On downstream evaluations — Lambada and the CORE benchmark suite — the paper reports relative improvements up to nearly twenty percent over a parameter-matched Transformer baseline, with Attractor coming out ahead of Parcae on the head-to-head comparisons too — smaller margins there, but consistent. And almost half — like forty-six percent — reduction in Lambada perplexity at the smallest scale. Those gains compound differently than the raw perplexity gap suggests, because perplexity at the bottom of the loss curve is a tough thing to move.
14:17Juniper: So the architecture is doing real work on plain language modeling. It isn't just a reasoning trick.
14:24Tyler: That's the part that surprised me, honestly. Going in, I thought of this as a clever architecture for a niche — recurrent computation. But the Pareto curve at language modeling scale — that's not a niche claim. That's a claim about pretraining.
14:41Juniper: OK. Now the reasoning regime. This is where the cold open lives.
14:46Tyler: Right. So in the reasoning setup, the architecture is different in some ways and the same in others. They take their fixed-point framing and apply it to tiny models — around twenty-seven million parameters — trained on about a thousand examples of really hard puzzles. Sudoku-Extreme and Maze-Hard.
15:08Juniper: A thousand examples. Not a thousand puzzles per training step — a thousand puzzles, full stop.
15:14Tyler: That's the data budget. Tiny model, tiny dataset, hard task. And what they're testing is whether the fixed-point formulation gives the model the right kind of inductive bias for algorithmic computation — for the grind-it-out-step-by-step thinking that hard Sudoku actually requires. The comparison is twofold. On one side, frontier language models — DeepSeek R1, Claude 3.7, o3-mini. On the other side, the specialized tiny-reasoner family: HRM at seven million parameters, TRM at seven million, and then the natural question — what happens when you scale TRM up to twenty-seven million, matching the Attractor Model. The numbers. The frontier LLMs all score zero percent on Sudoku-Extreme and zero on Maze-Hard. TRM at seven million parameters gets seventy-five percent on Sudoku, eighty-five on Maze. TRM scaled to twenty-seven million collapses — zero and zero. The Attractor Model at twenty-seven million scores about ninety-one percent on Sudoku and about ninety-three on Maze.
16:26Juniper: Same architecture family, same scale, opposite trajectory.
16:31Tyler: That contrast is what makes this result more than a benchmark win. It's diagnostic. There's something about the fixed-point formulation that regularizes whatever was about to collapse in TRM at the larger scale. It's not just that they did better — they did better on the exact scale point where the competitor failed.
16:52Juniper: And the comparison to the frontier LLMs — I want to be careful here, because I think it's the most striking framing in the paper but also the one with the most footnotes. Tyler, what's your read on that?
17:06Tyler: My read is the zero scores are real but they don't mean what the headline implies. Claude and o3-mini weren't trained on Sudoku. They're not formatted for it. They're being asked to do an algorithm in their head, in token space, on a problem class their training distribution barely touches. The zero is real, but it's not "frontier models are dumb." It's "frontier models can't do this kind of computation in this format." The fair fight is TRM versus Attractor Model at twenty-seven million, and that fight is also a clean win for the paper; it's just less cinematic than "GPT can't do Sudoku."
17:46Juniper: And there's an honesty moment in the paper here worth surfacing. At the seven-million-parameter scale, Attractor Models actually *lose* to TRM. They get something like fifty-four percent on Sudoku versus TRM's seventy-five. The win is only at twenty-seven million.
18:04Tyler: Right. And the authors put that in the paper. They don't hide it. So the honest picture is: Attractor Models are better than TRM at the specific scale where TRM catastrophically fails. That's a narrower story than the headline, and I think it's a fair one. The collapse of TRM at scale is the real phenomenon. The Attractor Model just doesn't collapse.
18:27Juniper: OK. Now I want to spend time on the thing in the paper that I find the most interesting intellectually — and the authors flag it themselves as the most surprising thing they found.
18:40Tyler: This is the internalization piece.
18:43Juniper: This is the internalization piece. So remember the architecture. A big Transformer backbone produces an initial guess. The small attractor module refines that guess iteratively until it stops changing, and that's the answer. Here's what they noticed in the trained models. The initial guess from the backbone — the thing that's supposed to be a crude proposal that needs cleaning up — turns out, after training, to already be at the fixed point. Or close enough to it that running the solver does basically nothing. Their seven-hundred-seventy-million parameter language model, at inference time, needs essentially zero solver iterations. You can take the backbone's first guess, decode it directly through the unembedding, skip the attractor module entirely — and the quality is the same as if you'd run eight refinement steps.
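One way to check a claim like this, not necessarily how the authors measured it, is to ask how far a single refinement step moves the backbone's draft. The sketch below assumes a toy refinement map and two made-up drafts. A relative residual near zero means the draft is already essentially a fixed point, so running the solver would change almost nothing.

```python
import math


def relative_residual(refine, z):
    """||refine(z) - z|| / ||z||: near zero means z is (almost) a fixed point."""
    moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(refine(z), z)))
    size = math.sqrt(sum(a * a for a in z))
    return moved / size


FIXED_POINT = [1.0, -2.0]  # hypothetical equilibrium of the toy map


def toy_refine(z):
    # Stand-in for the attractor module: moves halfway toward FIXED_POINT.
    return [0.5 * zi + 0.5 * fi for zi, fi in zip(z, FIXED_POINT)]


early_draft = [0.2, -0.5]      # hypothetical backbone output early in training
late_draft = [0.999, -1.999]   # hypothetical backbone output after training

early = relative_residual(toy_refine, early_draft)
late = relative_residual(toy_refine, late_draft)

print(f"early residual: {early:.3f}  late residual: {late:.6f}")
```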
19:38Tyler: So the iterative refinement procedure they trained the whole architecture to do — at inference, you don't actually need it.
19:46Juniper: You don't need it. The backbone learned to produce, in one forward pass, the answer that the iteration would have converged to. The attractor module, in some sense, made itself obsolete.
19:58Tyler: That's wild. And it's the opposite of what you'd naively expect — you'd expect the model to lean on the refiner, because the refiner is what's actually solving the problem during training.
20:11Juniper: Exactly. But what seems to happen is the reverse. The refinement procedure acts as a kind of moving training target for the backbone. The backbone keeps getting gradient signal that says, "your initial guess should be at the place where the iteration would have ended up." And the backbone, which is the high-capacity component, eventually learns to put its initial guess right there. The iteration trained the Transformer to skip the iteration. The image the paper offers — they project the model's internal state during inference onto two dimensions using PCA, and watch it evolve over sixteen iteration steps. In Parcae, the looped baseline, the state keeps wandering through the embedding space, never quite settling. In the Attractor Model, by iteration eight, the state has collapsed onto a single point. In some cases, by iteration one, it's already converged.
21:05Tyler: A student who absorbed the proof technique so well they stopped needing to do the proof.
21:10Juniper: That's the analogy. And what makes this exciting beyond just "here's a new architecture" is the general recipe it points at. Train your model with an expensive iterative procedure embedded in the architecture, and the model absorbs that procedure into its forward pass. The iteration is a teacher. The model is the student. The training does a kind of self-distillation, automatically.
21:34Tyler: And nobody designed that in. It fell out of the training dynamics.
21:39Juniper: That's what gets me. It's emergent. The authors didn't add a loss term that said "make the backbone match the fixed point." They just trained the whole thing end-to-end with the implicit gradient, and the backbone discovered, on its own, that it could save the attractor module some work — then save it more — then save it almost all of the work.
22:01Tyler: I want to spend a minute on one more piece of the technical side, because the paper has a really nice theoretical observation that explains *why* this training is stable when looped training isn't. They call it the implicit gradient barrier.
22:17Juniper: That argument is — yeah, that's worth slowing down on. Here's the setup. The implicit gradient — the thing that flows backward through the equilibrium condition — involves a matrix inverse. Specifically, you need to invert "the identity minus the Jacobian of the refinement step." If the refinement step is contractive — if every iteration pulls things closer together — then that inverse is well-behaved, and the gradient is finite. But suppose during training, the refinement step starts drifting away from contractive. Suppose an eigenvalue of the Jacobian creeps up toward one. The inverse starts blowing up. The gradient gets enormous. And gradient descent — by its own dynamics — can't take a smooth step in a direction where the gradient is exploding to infinity.
23:09Tyler: So the math is policing itself.
23:11Juniper: The math is policing itself. The implicit gradient creates a kind of fence around the contractive regime. If training tries to drift toward dynamics that wouldn't converge, the gradient itself blows up and prevents the step. Gradient descent is structurally confined to stable, convergent iterations. Now, fixed-loop training — the looped Transformer way — has no such fence. The gradient is finite no matter what trajectory the loop traces, even if that trajectory is wildly unstable, even if running one extra iteration would crash everything. So a looped model can absolutely learn weights that produce a great answer at exactly the trained depth, and produce garbage one step deeper. There's nothing structurally protecting against that.
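A scalar sketch of the contrast drawn above, under the assumption of a linear refinement step z ← lam·z + b (a toy, not the paper's module). Here the Jacobian is just `lam`. The implicit gradient of the fixed point with respect to `b` is 1 / (1 − lam), which grows without bound as `lam` approaches one. The gradient through a fixed-depth unroll is a finite sum of powers of `lam`, so it stays finite even for a `lam` at which the loop would diverge.

```python
def implicit_gain(lam):
    """dz*/db for the fixed point of z = lam * z + b: the scalar (1 - J)^(-1)."""
    return 1.0 / (1.0 - lam)


def unrolled_gain(lam, depth):
    """dz_K/db after `depth` steps of z <- lam * z + b, starting from z_0 = 0."""
    return sum(lam ** k for k in range(depth))


# As lam creeps toward one, the implicit gradient blows up;
# the depth-8 unrolled gradient barely notices.
for lam in (0.5, 0.9, 0.99, 0.999):
    print(f"lam={lam}: implicit {implicit_gain(lam):10.1f}   unrolled(8) {unrolled_gain(lam, 8):.3f}")

# A non-contractive loop (lam > 1) still yields a finite unrolled gradient.
print("lam=1.1, unrolled(8):", round(unrolled_gain(1.1, 8), 3))
```

This only illustrates the mechanism. Whether large gradients actually keep an optimizer inside the contractive region is the part the speakers describe as a plausibility argument.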
24:01Tyler: And the claim is that this is part of why looped training is fragile and Attractor training isn't.
24:07Juniper: That's the claim. It's a plausibility argument rather than a guarantee — the authors acknowledge it. In practice the "barrier" isn't a literal infinity; it's a region of really large gradients that the optimizer steers around. But the spirit of the argument is that the implicit formulation has a built-in stability prior the explicit formulation doesn't.
24:32Tyler: I want to push on this with my critique hat on, because this is one of the places the paper is operating on intuition rather than proof. The theory says the refinement map needs to be locally contractive for any of this to work. They don't enforce that during training. They argue the implicit gradient creates a soft incentive — the barrier story you just told. But they also note that DEQ, the predecessor, had to add explicit Jacobian regularization to stay stable. This paper doesn't add it, and the authors say they generally observe contractive behavior empirically, but they don't measure it carefully or characterize when it might fail.
25:17Juniper: That's a fair flag.
25:18Tyler: It's not a deal-breaker — the empirical results are strong enough that something is working. But "we observe it" is different from "we proved it." Scale this architecture up another five times, do you still observe it? We don't know yet.
25:35Juniper: While we're on the critique side, what else are you flagging?
25:39Tyler: A couple of things. The two experimental regimes don't actually share a method end-to-end. In the language modeling experiments, the backbone is a big Transformer and the backward pass uses the cheap one-step approximation. In the reasoning experiments, there's no separate Transformer backbone — initialization comes from a deep-supervision scheme borrowed from TRM — and the backward pass uses a more expensive technique called phantom gradients. The cheap trick that works at language model scale doesn't work in the tiny-data regime. They had to fall back to something more careful. So when you read the paper as one unified contribution, you're actually reading two distinct training procedures that share only the fixed-point framing. A skeptic could argue that "Attractor Models" is the architectural diagram, but the actual training recipe is two different recipes for two different regimes. The unified story is real, but it's narrower than the architecture diagram suggests.
26:40Juniper: And the moving-baselines point — Parcae, HRM, TRM are all very recent. Months old, some of them.
26:46Tyler: Right. The looped LM comparison is against Parcae, itself a twenty-twenty-six preprint. The reasoning baselines are from late twenty-twenty-five. The whole field is moving fast enough that "Pareto frontier improvement over current stable looped models" is a claim with a short half-life. Six months from now there will be a new strongest looped LM, and the comparison will need to be redone. That's not a knock — it's just the state of the field.
27:14Juniper: The other thing I'd flag, and the authors acknowledge it themselves, is that equilibrium internalization is observed, not explained. They give an informal argument in the appendix for why it should happen when the backbone has more capacity than the attractor — basically, the big network swallows the small network's job because it can — but they don't prove it. They don't characterize when it would fail. It's the most interesting claim in the paper and also the one with the least theoretical scaffolding.
27:46Tyler: Which is — actually, that's a nice piece of intellectual honesty on their part. The phenomenon is the kind of thing you'd be tempted to oversell, because it makes your method sound like it does even more cool stuff for free. They flag it as a phenomenon to study, not a proven feature.
28:04Juniper: OK, let me pull this together. The paper does three things. It reframes recurrent computation from "unroll the loop" to "solve for where the loop is heading." That reframe alone gives you constant training memory and adaptive inference depth, the two things whose absence has plagued looped models for years. It validates the reframe on two genuinely different regimes: large-scale language modeling, where it builds a new Pareto frontier, and tiny-data hard reasoning, where it solves problems that frontier language models can't touch and that the previous specialized champion, once scaled up, collapses on. And it surfaces a phenomenon, equilibrium internalization, where the model trained with an iterative refinement procedure absorbs that procedure into its forward pass, so the iteration becomes unnecessary at inference. That last piece is the most exciting, because it suggests a general recipe: bake an expensive teacher into your architecture during training, and the model learns to be its own teacher.
29:06Tyler: And the thing I'll be watching for is whether the internalization story generalizes. If you can train a model with any expensive structured computation embedded in the forward pass — not just fixed-point iteration, anything where there's a "correct answer" the architecture grinds toward — and the model learns to skip the grinding, that's a recipe for training models that are more capable than their inference budget would suggest. That's the bigger idea hiding in this paper.
29:35Juniper: A model that gets a tutor during training and doesn't need the tutor at test time.
29:40Tyler: That's worth thinking about beyond this one architecture.
29:44Juniper: Paper's linked in the show notes, along with some further reading if you want to go deeper. This has been AI Papers: A Deep Dive. Thanks for listening.