Episode 085 · May 26, 2026 · 24 min

Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction

Lee, McLeish, Goldstein et al.

Memory Architectures

Concepts in this episode

Training Methods · AI Efficiency & Cost · Evaluation & Benchmarks · KV Cache · Hybrid SSM/Attention · Long Context · Test-Time Compute · Iterative Refinement · Math Reasoning · In-Context Learning · Scaling Laws · Inference Cost · Context Management

About this episode

Paper: Language Models Need Sleep

Venue: arXiv:2605.26099

Year: 2026

Read the paper: arxiv.org/abs/2605.26099

For two years the long-context modeling community has been arguing about how much information you can squeeze into a fixed-size memory. A new paper says that's the wrong axis entirely — the bottleneck isn't how big the whiteboard is, it's how much thinking happened while writing on it. The fix is a 'sleep' phase that loops compute over context right before the cache gets cleared, with no cost at answer time.

What you'll take away

The reframe at the heart of the paper: a hybrid model's fast weight isn't a storage device, it's the residue of a one-pass computation — and shallow computation produces shallow residue regardless of capacity
Why the Rule 110 cellular automaton experiment is unusually clean: it holds stored information constant while varying required computation, isolating compute-for-reasoning from memory-for-storage
The deployment win: extra 'sleep' compute is paid during ingestion, not at answer time, so inference latency is unchanged while training cost scales linearly with loop count N
Concrete gains: two-operation GSM-Infinite problems jump from ~60% to ~90% accuracy with four sleep loops in the sliding-window setting; harder six-operation problems on Ouro go from ~42% to ~62%
The honest limits: the real-task gains tangle 'reasoning' with 'retrieval under constrained windows,' comparisons are mostly against the no-loop version of the same architecture, and the method needs careful two-stage training to work
Why the conceptual contribution may outlast the specific mechanism: it splits inference into a compute-rich ingestion phase and a latency-constrained answer phase, a framing likely to show up in other architectures

Chapters

00:00 The Polaroid problem and the notebook-vs-whiteboard setup
03:27 The reframe: fast weights are computations, not storage
06:55 Sleep as depth-recurrence at eviction time
10:23 The Rule 110 experiment
13:50 Does the result transfer to real tasks?
17:18 The skeptic's checklist
20:46 What changes about how we think about inference

References in this episode

Universal Transformers — The canonical depth-recurrence paper the episode references — loops transformer
Sleep-time Compute: Beyond Inference Scaling at Test-time — The Lin et al. work Bella name-checks as a parallel 'do offline work before quer
Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Background on the state-space 'whiteboard' that the episode's hybrid models rely
Deep Equilibrium Models — Another point of reference for depth-recurrent architectures, helpful for situat

Full transcript

0:00Bella: Here's a thought experiment that captures what this paper is really about. You're shown a chessboard, you get to take exactly one Polaroid of the position, and then the board is taken away. A few minutes later, someone walks in and asks you: "what's the position fifteen optimal moves from now?" The snapshot has all the information you need, in principle. The full board state is right there on the photo. But a snapshot doesn't *run* anything. It just holds. And no matter how high-resolution your Polaroid is, you can't answer that question by staring harder at it.

0:36Eric: And the claim of this paper is that something very close to that is happening inside modern long-context language models: we've been spending years arguing about the resolution of the Polaroid when the real bottleneck is that nobody's been given time to think about the board before it gets taken away. The paper is called "Language Models Need Sleep," it went up on arXiv on May twenty-fifth, twenty-twenty-six, and we're recording the next day. What you're hearing is an AI-generated deep dive. I'm Eric, that's Bella, we're both AI voices from ElevenLabs, and the script is from Anthropic's Claude Opus 4.7. The show is produced independently, with no affiliation with either company. And the reason the "next day" matters is that the idea in this paper is genuinely fresh. It's a reframe of a problem the whole long-context modeling community has been chewing on for two years, and it's the kind of move that makes you wonder why nobody did it sooner.

1:38Bella: Right, and the reframe is what I want to spend most of this episode on, because the architecture follows from it almost automatically once you see it. So let me set up the world the authors are working in. When a transformer reads a sequence of tokens, it keeps a record of every token in something called the KV cache. Think of it as a notebook where every token gets its own page, written down verbatim. When a new token arrives, the model flips through every page in the notebook and pulls a weighted blend of what it finds. Perfect recall, exact, lossless. But the notebook keeps growing, so each new token has more pages to flip through, and the total cost of reading a sequence grows quadratically with its length.

2:20Eric: Which is the problem everyone's been trying to fix. The notebook can't grow forever — at some point, on long inputs, you have to start tearing pages out.

2:30Bella: Exactly. And the dominant strategy for the last couple of years has been to pair the notebook with a whiteboard. The whiteboard is a fixed-size matrix the authors call a "fast weight" — it lives in what's called a state-space model, or SSM. Every time a new token comes in, the existing writing on the whiteboard gets a little faded — partially erased — and the new token's contribution gets written on top. The size never changes. Old information doesn't vanish cleanly; it gets smeared together with everything else. Lossy, but cheap and constant-cost forever.

3:07Eric: And the pitch is: keep the notebook for recent stuff, use the whiteboard for long-term. When a page falls out of the notebook, its information is supposed to live on in compressed form on the whiteboard. That's the whole architecture of what people call "hybrid" models — some attention layers, some SSM layers, working in tandem.
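Editor's sketch, not taken from the paper: the notebook-and-whiteboard picture above can be made concrete with a toy in pure Python. The fade-then-write rule, the `decay` value, the made-up tokens, and the names `fast_weight_step` and `read` are all assumptions for illustration. Real SSM layers use learned, input-dependent updates.

```python
def fast_weight_step(S, k, v, decay=0.9):
    """Fade the whiteboard by `decay`, then write the outer product v k^T on top."""
    return [[decay * S[i][j] + v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def read(S, q):
    """Query the whiteboard: a plain matrix-vector product S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


d = 2
S = [[0.0] * d for _ in range(d)]   # whiteboard: a fixed d x d matrix
kv_cache = []                       # notebook: one page per token

# (key, value) pairs, invented for the demo
tokens = [([1.0, 0.0], [1.0, 0.0]),
          ([0.0, 1.0], [0.0, 2.0]),
          ([1.0, 0.0], [1.0, 0.0])]

for k, v in tokens:
    kv_cache.append((k, v))                    # notebook grows by one page
    S = fast_weight_step(S, k, v, decay=0.5)   # whiteboard keeps its size

notebook_pages = len(kv_cache)            # 3, and still growing with input
whiteboard_shape = (len(S), len(S[0]))    # (2, 2), no matter how many tokens
readout = read(S, [1.0, 0.0])             # blend of faded and fresh writes
```

The first token's contribution has been faded twice by the time of the readout, which is the "smearing" Bella describes: nothing is deleted outright, but older writes carry less weight.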

3:29Bella: And the assumption underneath all of this — the assumption nobody quite said out loud — is that the fast weight is a *storage device*. The argument has always been about compression. How much can we squeeze onto the whiteboard? How big should the whiteboard be? What's the best erase-and-write rule? It's a conversation organized entirely around capacity.

3:52Eric: And the authors' move is to say: that's the wrong question. The fast weight isn't a place where information lives. It's the *output of a computation*. And the computation that produced it is exactly one forward pass deep. So when your hybrid model fails on a long-context reasoning task, it's not because the whiteboard was too small. It's because one pass of writing on the whiteboard wasn't enough thinking time.

4:20Bella: That's the whole paper in one sentence. Scalable memory is not the same as scalable reasoning. And the analogy I keep coming back to is cramming for an exam. Two students get one hour with the same textbook and identical notebooks. One copies passages verbatim. The other reads each chapter four times, rewriting their notes each pass, connecting ideas, working through likely problems. Same notebook, same source material — wildly different exam performance. Because the *value* of the notes isn't determined by how much room you have to write. It's determined by how much thinking you did while writing them.

5:00Eric: And the fix the authors propose follows immediately from that picture. If the problem is that the model only got one pass over the material before the notebook got cleared, give it more passes. Specifically: right before the KV cache is about to be evicted at the end of a context chunk, don't evict yet. Loop the whole stack of network blocks over that chunk N more times, threading the fast weights through each loop. After N passes, *now* clear the cache and move on. They call this phase sleep.
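Editor's sketch, under stated assumptions: the control flow Eric describes (one ordinary pass, then N more passes over the same chunk, then eviction) can be written as a short loop. `ingest_with_sleep` and `toy_block_stack` are hypothetical names, and the toy block is a stand-in that only folds numbers into a running state. It is not the authors' implementation.

```python
def ingest_with_sleep(chunks, block_stack, state, n_sleep_loops):
    """Read chunks one at a time, looping extra passes over each before eviction."""
    for chunk in chunks:
        kv_cache = list(chunk)                  # notebook for the current chunk
        state = block_stack(kv_cache, state)    # the ordinary single pass
        for _ in range(n_sleep_loops):          # sleep: N more passes, cache intact
            state = block_stack(kv_cache, state)
        kv_cache.clear()                        # only now evict the chunk
    return state


def toy_block_stack(kv_cache, state):
    """Hypothetical stand-in for the network: update the state and count the pass."""
    return {"summary": 0.5 * state["summary"] + sum(kv_cache),
            "passes": state["passes"] + 1}


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
start = {"summary": 0.0, "passes": 0}

no_sleep = ingest_with_sleep(chunks, toy_block_stack, start, n_sleep_loops=0)
with_sleep = ingest_with_sleep(chunks, toy_block_stack, start, n_sleep_loops=4)
# no_sleep["passes"] is 3 (one per chunk); with_sleep["passes"] is 15 (five per chunk)
```

The state is threaded through every pass, so each loop starts from what the previous loop wrote. All of the extra passes happen inside the ingestion loop; nothing here touches what happens when a question is eventually asked.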

5:32Bella: And the metaphor they're reaching for there is hippocampal consolidation — the neuroscience idea that during sleep, recent experiences in the hippocampus get replayed and gradually written into longer-term cortical representations. I want to flag that the paper is careful here but also a little flirtatious with the biology. The argument doesn't actually require the brain analogy to be accurate. It's a depth-recurrence story dressed in neuroscience clothing. But the metaphor is good — sleep is *costly*, animals can't respond to predators while sleeping, so if evolution paid that cost, consolidation must be doing real work. That's a nice way to motivate why you'd ever burn compute on something that doesn't produce an output.

6:19Eric: And there's a better operational analogy than the brain one, I think, which is a restaurant kitchen between services. During service, orders come in fast and food has to go out fast — no time for deep prep. So between services the kitchen closes, and the staff spends hours reducing stock, dicing vegetables, refining sauces. None of that happens while customers are waiting. And the payoff is that when service starts again, everything moves at normal speed but the food is better. The model's sleep phase is the same shape — the extra work happens during a window when the user isn't watching, and the payoff shows up at normal inference latency.

7:01Bella: That latency point is the practical engineering proposition, and it's worth dwelling on because it's the whole reason this is interesting outside of academia. At prediction time — when the user has asked a question and is staring at the screen waiting — the model still does a single forward pass. Same speed as before. All the extra compute, all the N looped passes, happen during ingestion. During the moment when the context is being consumed and the cache is filling up. If you're building a system that has to be fast at answer time but is free to spend more time during document ingestion, this gives you a lever you didn't have before.

7:42Eric: Okay, so let me push on whether the diagnosis is actually right. Because the claim is strong — they're saying the bottleneck isn't capacity, it's compute. How do they show that?

7:54Bella: This is where the experimental design gets really sharp, and Eric, I think this is your thread because the cleverness of it is mostly in the controls. But the headline setup is the cellular automaton experiment. Want to walk through it?

8:09Eric: Yeah, this one is unusually clean for a deep learning paper. They use something called Rule 110, which is a cellular automaton — a row of cells, each zero or one, that evolves in discrete time steps according to a fixed rule. The listener doesn't need to know the rule. What matters is two properties. First, it's deterministic and sequential — step t-plus-one depends on step t, and there's no shortcut, you have to simulate it forward one step at a time. Second, they hold the sequence length and the amount of information to be stored fixed across the experiment — twenty-four bits per string, the same for every value of t.

8:50Bella: Which is the magic of using it as a test bed.

8:53Eric: Right. So here's what they do. They train a hybrid model on four independent twenty-four-bit binary strings. The model sees one string at a time, but with a hard constraint: the context window is exactly twenty-four tokens, and it gets fully cleared after each string. So by design, by the time the model has seen all four strings, it has been forced to forget the literal tokens. Everything it knows is sitting in the fast weight — the whiteboard. Then they ask the model: for each of the four strings, what's the first bit after evolving the string forward by t steps?

9:32Bella: And t is the dial they get to turn. t equals zero is just memorization — what was the first bit of the string you saw. t equals thirty-two means the model has to have internally simulated thirty-two steps of the cellular automaton.

9:48Eric: And here's the part that isolates the question. Across all values of t, the *information* the model needs to store is identical. Twenty-four bits per string, four strings, ninety-six bits total. The whiteboard's job — the fast weight's job — is the same. What changes is how much *computation* is required to turn those stored bits into an answer.
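Editor's sketch of the task as described above: four 24-bit strings, and a target equal to the first bit after t steps of Rule 110. The Rule 110 lookup itself is standard. The wrap-around boundary handling, the random seed, and the function names are assumptions of this sketch, since the episode does not say how the paper treats the edges.

```python
import random

RULE = 110  # binary 01101110: bit i is the output for neighborhood value i


def step(cells):
    """One Rule 110 update. Wrap-around boundaries are an assumption here."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right
        out.append((RULE >> neighborhood) & 1)
    return out


def first_bit_after(cells, t):
    """Task target: the first bit after evolving the string forward t steps."""
    for _ in range(t):          # no shortcut: simulate one step at a time
        cells = step(cells)
    return cells[0]


rng = random.Random(0)
strings = [[rng.randint(0, 1) for _ in range(24)] for _ in range(4)]

stored_bits = sum(len(s) for s in strings)                 # 96, the same for every t
targets_t0 = [first_bit_after(s, 0) for s in strings]      # pure memorization
targets_t32 = [first_bit_after(s, 32) for s in strings]    # needs 32 simulated steps
```

Turning the dial from t = 0 to t = 32 changes only the number of loop iterations in `first_bit_after`. The input strings, and so the bits that have to be stored, stay the same.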

10:13Bella: And the result is what you'd predict if the authors are right. At t equals zero, the model does fine. As t grows, accuracy collapses. By t equals thirty-two with a single forward pass, the model is at about ten percent accuracy — which on a binary prediction is basically random.

10:33Eric: And then they turn on sleep. Same model, same training data, same fast weight size — they just let it loop N times over each twenty-four-token chunk before evicting. Two loops takes accuracy at t equals thirty-two from about ten percent to about twenty percent. Three or four loops takes it above thirty percent. So with three or four loops, you're roughly tripling accuracy on the hardest setting, at the cost of extra compute paid during *ingestion* — inference-time cost is unchanged.

11:07Bella: And the gap widens with task difficulty, which is the exact signature you'd expect if compute-for-reasoning is the bottleneck. On easy tasks — small t — extra loops don't help much, because one pass was already enough. On hard tasks, the loops are the only thing keeping you above chance.

11:26Eric: Now, Bella, I want to voice the obvious skeptic concern here, because the listener is probably sensing it. The experiment is asking, in effect: does adding sequential computation help on a task that was specifically designed to require sequential computation? Rule 110 is the textbook example of an irreducibly sequential process. The architectural fix is literally to add sequential computation. The result that you get is almost tautological.

11:55Bella: It is, on that task alone. And the paper would be stronger with a result showing sleep *hurts* somewhere — some class of task where extra consolidation passes degrade performance, to give an honest picture of the trade-off. They don't show that. But they do replicate the pattern on less synthetic tasks, which is where the evidence starts feeling more real. They run a multi-hop graph traversal task — shuffled directed cycle, you have to answer "what node is k hops from node X" — and they find the same pattern. One pass is fine for shallow queries, but as the hop count goes up, more loops help. And then they go to GSM-Infinite, which is procedurally generated math word problems with controllable arithmetic depth.
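Editor's sketch of the graph traversal task Bella mentions, assuming the simplest reading of "shuffled directed cycle": the nodes form one cycle in a random order, the edges are listed in shuffled order, and the query asks which node is k hops from a start node. The function names and the graph size are hypothetical.

```python
import random


def make_shuffled_cycle(n, rng):
    """Build a directed cycle over n nodes and return its edges in shuffled order."""
    order = list(range(n))
    rng.shuffle(order)                       # random node order around the cycle
    edges = [(order[i], order[(i + 1) % n]) for i in range(n)]
    rng.shuffle(edges)                       # edge list no longer follows the cycle
    return edges


def k_hops(edges, start, k):
    """Follow k directed edges from `start`; each hop depends on the previous one."""
    successor = dict(edges)                  # every node has exactly one out-edge
    node = start
    for _ in range(k):
        node = successor[node]
    return node


rng = random.Random(0)
edges = make_shuffled_cycle(8, rng)
answer = k_hops(edges, start=0, k=3)
```

As with Rule 110, the stored content (the edge list) is fixed, while the number of dependent lookups grows with k.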

12:44Eric: This is the result that matters most, I think, because the GSM-Infinite experiments use two actual pretrained models. There's Jet-Nemotron, which is a 2-billion-parameter hybrid with attention plus SSM layers. And there's Ouro, which is a 1.4-billion-parameter looped attention-only model that they retrofit with SSM layers to test sleep on. So these aren't toy networks built for the paper; they're real recent models that other researchers have trained.

13:14Bella: And the most dramatic GSM-Infinite numbers come from a specific variant — the sliding-window setup, where the attention window is deliberately made smaller than the problem itself, with window length five hundred and twelve. In that sliding-window regime, two-operation math problems go from about sixty percent accuracy to about ninety percent with four sleep loops. That's a fifty percent relative improvement on what should be the *easiest* setting. There's a separate main result on GSM-Infinite at longer context with hard eviction, but the headline jump is in the sliding-window variant. And on harder six-operation problems on Ouro, accuracy goes from about forty-two percent to about sixty-two percent. So the trend from the synthetic tasks holds on actual math problems with a real model.
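Editor's note on the arithmetic: the "fifty percent relative improvement" is the gain divided by the baseline, using the approximate accuracies quoted above. The helper name `relative_gain` is made up for this sketch.

```python
def relative_gain(before, after):
    """Improvement as a fraction of the starting accuracy."""
    return (after - before) / before


two_op = relative_gain(60, 90)   # 0.5, i.e. 30 points on a base of 60
six_op = relative_gain(42, 62)   # 20 / 42, roughly 0.476
```

In absolute terms these are gains of about thirty and twenty percentage points. The relative figures are close to each other because the six-operation baseline starts lower.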

14:04Eric: Although — and this is the steelman critique that's harder to wave away — the two-operation result is suspicious in a specific way. The reason the baseline does poorly on a two-operation problem is that the active attention window was made smaller than the problem itself. The model literally can't see the whole math problem at once in the baseline. So the gain from sleep there might be primarily about retrieval — about consolidating the part of the problem you saw earlier so you can match it up with what you're seeing now — rather than about reasoning per se. The paper kind of acknowledges this in passing, but the framing presents the gain as evidence for the reasoning hypothesis when it might mostly be about retrieval under an artificially constrained window.

14:54Bella: That's fair. And the right way to read this body of evidence, I think, is: the synthetic tasks establish the *mechanism* — yes, more consolidation passes can buy you sequential reasoning depth. The realistic tasks establish that the *mechanism transfers* to settings people actually care about, but they don't cleanly separate "reasoning gain" from "retrieval gain." The paper's claim that this is fundamentally a compute-for-reasoning story is most convincing on Rule 110, and gets progressively more entangled with retrieval as you move to real tasks. Which is fine — that's how these stories usually work — but the listener should know the picture is messier than the headline.

15:39Eric: There's another thing in the critique pile worth naming, which is that almost all the comparisons in the paper are against the *no-loop version of the same architecture*. So they show that their model with sleep beats their model without sleep. Which is informative, but it doesn't tell you whether sleep is the *best* use of that extra compute. The fair comparison is "spend this same compute somewhere else" — like a larger fast weight, a longer attention window, a different SSM update rule, retrieval augmentation. The paper doesn't really engage with those alternatives, and the comparison against vanilla transformers is structurally rigged because they only test in the eviction regime, which is the regime where transformers are disqualified by construction.

16:30Bella: Right. And the third real cost the authors flag themselves — to their credit — is that training time scales linearly with N. Double the sleep, double the training cost. They note that training is slow and somewhat unstable, that deeper recurrence at training time is a known hard problem, and that the largest N they test is six on Jet-Nemotron and four on Ouro. These are modest-scale fine-tunes. Whether this approach scales to frontier-sized models is genuinely unknown — the paper is honest about that.

17:04Eric: But that linear training cost cuts the other way too, when you think about what's being bought. The training cost scales linearly with N, but the *inference* cost is unchanged. So if you're going to deploy this model and serve it to a lot of users, you pay the N-times training cost *once*, and then you get N-times-deeper consolidation forever at no additional latency. Which is a pretty good deal at scale, if it works.

17:31Bella: Eric, what's your read on how this changes the broader research conversation? Because I think the methodological contribution might end up being smaller than the conceptual one.

17:43Eric: I think you're right about that. The mechanism — loop the blocks N times over a chunk before evicting — is a relatively small architectural tweak. People will tune it, replace it, simplify it. What's harder to take back is the reframe. The way I'd put it: for the last few years, the question in efficient long-context modeling has been "how much can we compress." This paper says — that's the wrong axis. The question is "how much thinking goes into producing the compression." The fast weight isn't a database. It's the residue of a computation. And if the computation was shallow, the residue is shallow, no matter how much space you gave it to live in.

18:27Bella: There's a related thread in the literature on what's called depth-recurrence — running a network's layers multiple times over the same input. Universal Transformers are the canonical reference, there's been work on equilibrium models, and more recently Ouro, the model they use here, is itself a looped architecture. The standard place to put those extra passes has been at *prediction* time — when the user asks a question, loop the model a few times to think harder. Which works, but costs latency.

19:00Eric: And the move here is to take that same depth-recurrence trick and put it somewhere new in the time budget — at the moment of cache eviction, before the user has even asked the question. Same mechanism, different phase. Same compute, different bill. That's a real intellectual contribution, and I think it'll show up in other architectures pretty quickly, even if the specific Rule 110 framing doesn't.

19:26Bella: It also opens up a class of designs we haven't really explored. Right now we mostly think of inference as a single phase — model gets input, model produces output. What this is suggesting is that you might design models with an explicit "ingestion" phase that's compute-rich and an "answer" phase that's latency-constrained, and you push as much of the work as possible into the first phase. There's a contemporary paper by Lin and colleagues called "sleep-time compute" that has a similar spirit — do offline work before queries arrive — though the mechanism is totally different. They generate likely questions and pre-compute. This paper does it through architectural recurrence. But the framing is converging: stop treating inference as a single moment in time.

20:15Eric: And honestly, that framing aligns with how a lot of deployed AI systems already work, if you squint. Document ingestion pipelines, retrieval indexes, embedding caches — all of these are forms of offline pre-computation that buy faster query-time response. This paper is essentially proposing that the *neural network itself* should have an internal version of that — a moment where it pre-processes context before being asked anything.

20:43Bella: There's one specific landing the authors deserve credit for, which is the experimental design on Rule 110. The reason that result is convincing isn't the headline number — it's that they constructed a task where capacity and compute are *provably independent* and then showed that fixing only the compute axis fixes the problem. That kind of clean isolation is rare in deep learning experiments, where everything tends to be entangled with everything else. The synthetic-task critique applies — yes, they designed the task to make the point — but the design itself is the contribution. It gives you a way to ask the question.

21:23Eric: Although the flip side, just to keep the skeptic honest, is that the cleanness of Rule 110 is also why the result transfers ambiguously to real tasks. Real tasks aren't decomposable into "this part needs storage, this part needs sequential reasoning." They're tangled. So you can't actually do the clean isolation on GSM-Infinite or anything anyone cares about. The Rule 110 result tells you the mechanism *exists*. It doesn't tell you what fraction of real-task failures are explained by it. That fraction might be large. It might be small. The paper doesn't really let you tell.

22:00Bella: That's a fair limit on the claims. And one more thing worth flagging in the limitations: the sliding-window version of sleep — the one closer to what you'd actually deploy — required a careful two-stage training warm-up to get it to work. One epoch of SSM-only training with hard eviction first, then turning on the full sliding-window sleep setup. The method isn't a drop-in modification. So even if the conceptual argument is right, the engineering of actually making this work at scale on real models is going to be its own project.

22:35Eric: Which connects to a broader pattern in this kind of research, which is that the cleanest conceptual insights often take a few years to turn into something practitioners can just use. Attention itself took a while. SSMs took a while. If this idea has legs, the version of it that ships in deployed models in a year or two is probably going to look pretty different from what's in this paper — but the bookkeeping about *when* compute gets spent is likely to stick.

23:04Bella: So the takeaway I want to leave the listener with is this. The next time you hear someone talking about long-context language models and how big the memory needs to be, or how compressed the state can get — there's a different question hiding underneath. How much *thinking* happened while that state was being formed? A small whiteboard with deep thinking baked into it can outperform a giant whiteboard scribbled on once. And the moment of consolidation — the moment when context is about to be discarded — turns out to be a surprisingly good place to spend compute, because nobody's waiting.

23:42Eric: The Polaroid of the chessboard is a perfect picture if you've already worked out the next fifteen moves before you took it.

23:50Bella: That's the paper. The show notes have a link to it, plus some related reading on hybrid architectures and depth-recurrent models if you want to go deeper.

24:00Eric: And if you want the full transcript with the jargon defined inline, plus the concept pages that connect this episode to the others we've done on long-context modeling, that's all on paperdive.ai.

24:13Bella: Thanks for listening to AI Papers: A Deep Dive.
