Episode 033 · May 11, 2026 · 24 min

Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval

Sridhar, Johansen

Sequence Modeling
Paper: Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Venue: arXiv:2605.06997
Year: 2026
Read the paper: arxiv.org/abs/2605.06997

A pure state-space model scores 3% on the canonical MQAR benchmark. Echo scores 100%, using a fixed-size state about five thousand times smaller than an equivalent KV cache. The argument isn't that state-space models got better; it's that retrieval was a regression problem all along, and the KV cache is an artifact of solving it the hard way.

What you'll take away

  • Why retrieval can be reframed as ridge regression solvable from running sufficient statistics, making the KV cache an implementation choice rather than a necessity
  • How Echo's Spectral Koopman Attention uses a lag-one covariance and spectral filter to suppress one-off distractors, a selectivity mechanism standard attention can't express
  • The concrete memory comparison: ~77 KB of state for Echo versus ~384 MB per layer of KV cache at 131k tokens
  • Why this method gets more accurate with longer sequences, inverting the 'memory cliff'
  • Where the headline result is most fragile: scale is capped at 180M parameters, benchmarks lean on synthetic retrieval tasks like MQAR, and ablations don't cleanly separate the closed-form solve from the spectral filter
  • Why the wall-clock speedup hasn't landed yet even though the memory win has

Chapters

  1. 00:00 The memory cliff and the three-percent floor
  2. 03:59 Retrieval as regression, not attention
  3. 07:59 Inside Spectral Koopman Attention
  4. 11:58 The headline numbers
  5. 15:58 Steelmanning the skeptics
  6. 19:57 Why the framing matters more than the benchmark


0:00 Eric: There's a benchmark in sequence modeling called MQAR, Multi-Query Associative Recall. It's deliberately designed to be cruel to state-space models. You stuff a sequence with key-value pairs, drop thousands of tokens of distractors in between, and ask: what value was bound to this key? On that benchmark, a pure Mamba-two model scores around three percent. That's chance accuracy. That's a coin flip with extra steps. The paper we're talking about today reports one hundred percent. Not "improved." Not "competitive." Perfect, across every configuration they tested.

0:38 Bella: And the way they get there is, honestly, the kind of result that makes you rethink what attention is even for. The paper is called "Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators," by Anupama Sridhar and Alexander Johansen, submitted to arXiv on May seventh, twenty-twenty-six, and we are recording four days later. Quick production note before we dig in: this episode is AI-generated. I'm Bella, that's Eric. We're both AI voices from Eleven Labs, and the script was written by Anthropic's Claude. The producer isn't affiliated with either company. And the reason we wanted a deep dive on this particular paper is that the jump from three percent to one hundred isn't an engineering win. It's a structural argument about what the KV cache is even for.

1:32 Eric: So let's set up the stakes, because that three-versus-hundred number doesn't mean much without the backstory. For the last few years, sequence modeling has been a tug-of-war between two architectures. On one side, Transformers: every token attends to every previous token. Beautiful, exact, expensive. The cost shows up as the KV cache. Every token you've ever seen leaves behind a "key" and a "value" stored in GPU memory, and at long contexts that cache eats more memory than the model weights themselves. Picture a meeting where, before every new sentence you speak, you have to re-read your complete notes from every previous sentence. The KV cache is those notes. The longer the meeting, the more shelf space they take.

2:22 Bella: On the other side, state-space models. Mamba, Mamba-two, that whole family. They process the sequence by maintaining one fixed-size state that gets updated as each new token arrives. Constant memory, no matter how long the context gets. The trade is that the state is a running summary: everything you've seen gets folded into it, and nothing is kept verbatim. It's like trying to remember a phone call by maintaining a single running impression of the conversation rather than a transcript. Useful for vibes. Bad for "what was the phone number Alice gave me twenty minutes ago?"

3:02 Eric: And that's where the cliff comes in. The authors have this phrase, the memory cliff, which I think is exactly the right metaphor. With a state-space model, accuracy doesn't gracefully degrade as the fact you need gets further away. It collapses. Once the gap between when a fact was stored and when you query for it exceeds the state's horizon, the model isn't just worse, it's at chance. Three percent on MQAR regardless of model size. You can scale a state-space model to be much bigger; you do not fix this by scaling.

3:37 Bella: Right, the cliff is structural. It's about the size of the state, not the size of the model. And the industry's response has been hybrids: mostly state-space layers for efficiency, with a handful of attention layers sprinkled in to handle the lookups. NVIDIA's Nemotron-H is the production example: they replace up to ninety-two percent of attention layers with state-space ones. It works, but you've reintroduced the KV cache for the remaining attention layers. You've made the problem smaller; you haven't eliminated it. The paper's opening question is the obvious one nobody had really pressed on: do we actually need attention to do retrieval at all? Or is there a constant-memory mechanism that does the same job?

4:22 Eric: That's the setup. Now Bella, this is where I want you to walk us through the reframing, because the core move in this paper is conceptual, not engineering. They're not optimizing attention. They're throwing it out.

4:35 Bella: Yeah, this is the part of the paper I keep going back to. So. There's a recent thread of theory work (Akyürek, Mahankali, Goel and collaborators) showing something kind of remarkable about what attention does when it's trained well. If you take softmax attention and train it to convergence on a retrieval-style objective, what it converges to is, mathematically, a classical statistical estimator. Specifically, ridge regression. The thing your grandfather did in the nineteen-seventies. Ridge regression is the workhorse of classical statistics. Given a bunch of input-output examples, find the best linear map from inputs to outputs, with a small penalty for overfitting. It has a closed-form solution. You can write it down. And the pieces of that solution are sums. They're running totals over your data. You don't need to revisit old examples. You stream through, add one term per example, and at the end you have everything you need to compute the answer.
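
For readers who want the closed form Bella is describing, here is the standard textbook ridge regression solution written as running sums. The symbols are assumptions chosen for illustration, not the paper's notation: keys k_i, values v_i, a query q, and a regularization strength λ.

```latex
\[
W^\star
  = \arg\min_{W} \sum_{i=1}^{N} \lVert v_i - W k_i \rVert_2^2 + \lambda \lVert W \rVert_F^2
  = C\,(G + \lambda I)^{-1},
\qquad
G = \sum_{i=1}^{N} k_i k_i^\top,
\quad
C = \sum_{i=1}^{N} v_i k_i^\top,
\]
\[
\hat{v}(q) = W^\star q .
\]
```

Both G and C have a size that depends only on the key and value dimensions, not on N, which is why each new example can be folded in with one additive update.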

5:37 Eric: Wait, so if the solution is a sum of running totals, and the totals never change once you've added them, you're saying you can compute this in constant memory.

5:48 Bella: That's the move. And this is the moment in the paper where, when I first read it, I had to put the laptop down. The standard mental model is: attention is the primitive, the KV cache is what makes it work, and a state-space model is an efficiency hack that loses accuracy. The authors invert the whole thing. They argue: retrieval is a regression problem. Attention is one way to solve it, by grinding through gradient descent during training, with linear memory cost. But there's another way: just compute the closed-form answer directly, from sufficient statistics that fit in a tiny fixed-size table. Here's the analogy that helps me. Imagine you're a bookkeeper at a store, but you're only allowed one small notebook. The state-space strategy is: write a running summary of "what shopping today has felt like." It's a vibe. Useful for general impressions, useless if someone asks "did Alice buy milk?" The new strategy is different. Instead of summaries, keep running totals of specific cross-products, totals you've pre-decided will let you answer any future "who bought what" question by doing a small calculation at query time. You haven't kept the receipts. You've kept exactly the bookkeeping that future retrieval questions will need.

7:08 Eric: And the notebook is the same size either way. The difference is in what you chose to write down.

7:15 Bella: The difference is what you chose to write down. The fixed-size state isn't a lossy bottleneck; it's an exact sufficient statistic for the retrieval computation. That's the thesis in one sentence. The paper actually uses that exact phrasing, and I think it's the right one. The KV cache, in this view, isn't a necessary feature of retrieval. It's an artifact of choosing to solve the regression problem by gradient descent over parameters rather than directly.

7:45 Eric: Okay. So if this works, if you can just accumulate running totals and solve the regression in closed form, what does the architecture actually look like? At some point this has to be code that runs on a GPU.

8:00 Bella: Right. So the architecture is called Echo, and the layer that replaces attention is called Spectral Koopman Attention, SKA. Each SKA layer maintains three accumulators, all small fixed-size matrices. The first is the Gram matrix of keys, a running table of products between key coordinates, summed over every key you've seen. The second is a cross-covariance between values and keys. Those two together are enough to do the ridge regression solve. Then there's a third accumulator, and this is the genuinely new ingredient: a lag-one covariance. It captures the relationship between each key and the one that came just before it. Each new token contributes a small additive update to each of these three matrices. No previous contributions get disturbed. The matrices are small, dimension around fifty. The total state is tiny.
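
The three accumulators Bella lists can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `StreamingRecallState`, the `ridge` parameter, and the orientation of the lag-one sum are all hypothetical, and the toy data uses orthonormal keys so that recall comes out nearly exact.

```python
import numpy as np


class StreamingRecallState:
    """Fixed-size accumulators updated once per token (illustrative, not the paper's code)."""

    def __init__(self, key_dim, value_dim, ridge=1e-3):
        self.gram = np.zeros((key_dim, key_dim))      # running sum of k k^T
        self.cross = np.zeros((value_dim, key_dim))   # running sum of v k^T
        self.lag_one = np.zeros((key_dim, key_dim))   # running sum of k_t k_{t-1}^T (assumed orientation)
        self.prev_key = None
        self.ridge = ridge                            # hypothetical regularization strength

    def update(self, key, value):
        """Add one token's contribution; earlier contributions are never revisited."""
        self.gram += np.outer(key, key)
        self.cross += np.outer(value, key)
        if self.prev_key is not None:
            self.lag_one += np.outer(key, self.prev_key)
        self.prev_key = key.copy()

    def retrieve(self, query):
        """Closed-form ridge readout: cross @ (gram + ridge * I)^-1 @ query."""
        regularized = self.gram + self.ridge * np.eye(self.gram.shape[0])
        return self.cross @ np.linalg.solve(regularized, query)


rng = np.random.default_rng(0)
key_dim, value_dim, num_pairs = 8, 4, 5

# Orthonormal keys make the toy recall nearly exact; real keys would not be this clean.
orthogonal, _ = np.linalg.qr(rng.standard_normal((key_dim, key_dim)))
keys = orthogonal[:num_pairs]
values = rng.standard_normal((num_pairs, value_dim))

state = StreamingRecallState(key_dim, value_dim)
for key, value in zip(keys, values):
    state.update(key, value)

recalled = state.retrieve(keys[2])
print(np.round(recalled - values[2], 3))  # small residual caused by the ridge term
```

The state here is three small matrices whose shapes never change, however many pairs are streamed in. The lag-one accumulator is filled but not used by `retrieve`, since the episode does not spell out how the spectral filter enters the readout.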

8:55 Eric: Hold on. Bella, you flagged the lag-one piece as the new one. The first two are basically the standard sufficient statistics for ridge regression. What does the lag-one accumulator buy you that the other two don't?

9:10 Bella: This is where the Koopman piece comes in, and it's the part of the paper that goes beyond what closed-form ridge regression alone would give you. The lag-one matrix lets you fit a linear operator to the key sequence. You treat the keys as if they were the state of some dynamical system, and you ask: what linear map best describes how the state evolves from one step to the next? That operator has eigenvalues. And those eigenvalues turn out to be a really useful diagnostic. Koopman operator theory comes out of nineteen-thirty-one physics. It's a way of analyzing nonlinear dynamical systems by finding a linear operator that describes how observable quantities evolve. The eigenvalues of that operator tell you which patterns persist over time and which decay. Eigenvalues near one are persistent modes. The pattern sticks around. Eigenvalues near zero are transient. Noise. One-off distractors that die out quickly.

10:15 Eric: So you're using "does this pattern persist" as a proxy for "is this worth retrieving."

10:21 Bella: That's the prior. The bindings worth retrieving look like stable modes of a dynamical system fitted to the key history. Bindings that look like one-off noise get filtered out. And the mechanism is just raising the fitted operator to a small power. Power two, in practice. If a mode has eigenvalue zero-point-nine, raising to the second power gives you zero-point-eight-one: you've kept about eighty percent of its energy. If a mode has eigenvalue zero-point-three, the second power gives you zero-point-zero-nine: you've crushed it down to nine percent. Persistent modes survive almost intact. Transient ones get exponentially suppressed.
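
The operator fit and the power-two filter can be checked numerically on a toy system. The sketch below is an illustration only: it assumes a hypothetical two-mode linear system with eigenvalues 0.9 and 0.3 (the numbers from Bella's example), fits the operator by regularized least squares from a Gram matrix and a lag-one matrix, and squares it. How the paper applies the filtered operator inside retrieval is not described in the episode, so that step is left out.

```python
import numpy as np

# Hypothetical two-mode system: one persistent mode (0.9) and one transient mode (0.3).
true_operator = np.diag([0.9, 0.3])

# Generate a short key trajectory k_t = A k_{t-1}.
keys = [np.array([1.0, 1.0])]
for _ in range(20):
    keys.append(true_operator @ keys[-1])
keys = np.array(keys)

previous, current = keys[:-1], keys[1:]
gram = previous.T @ previous        # sum of k_{t-1} k_{t-1}^T
lag_one = current.T @ previous      # sum of k_t k_{t-1}^T

# Least-squares operator fit with a tiny ridge term: A = lag_one @ (gram + ridge * I)^-1.
ridge = 1e-9
fitted = np.linalg.solve(gram + ridge * np.eye(2), lag_one.T).T

eigenvalues = np.sort(np.linalg.eigvals(fitted).real)
filtered = np.sort(np.linalg.eigvals(np.linalg.matrix_power(fitted, 2)).real)
print(eigenvalues)  # close to [0.3, 0.9]
print(filtered)     # close to [0.09, 0.81]
```

Squaring the operator squares each eigenvalue, so the persistent mode keeps 81 percent of its magnitude while the transient one drops to 9 percent.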

11:05 Eric: And here's what I want the listener to hear: attention has nothing that looks like this. There's no place in standard attention where you do something analogous to "fit a dynamical model to the keys and amplify the persistent modes." It just doesn't exist in the math.

11:23 Bella: It doesn't exist. This is the structural prior that Echo adds on top of what attention could in principle approximate. It's also why the paper isn't just "ridge regression instead of attention." It's "ridge regression plus a spectral selectivity mechanism that attention can't express even with infinite training."

11:45 Eric: Okay. Let's earn the headline number. Three percent versus one hundred on MQAR. Pure Mamba-two: three percent. Echo, with SKA layers: one hundred. Across every configuration they tested, including the hard ones: four-thousand-token distractor gaps with thirty-two key-value pairs scattered through them. Perfect.

12:08 Bella: And the length generalization result is, in some ways, even more telling. They trained at sequence length sixty-four; that's the training horizon. Then they evaluated at lengths up to four thousand ninety-six tokens. Sixty-four times the training length. At four thousand ninety-six tokens, pure state-space drops to two percent. The hybrid, state-space with attention layers, drops to five percent. State-space with SKA holds at sixty-five percent. Not perfect, but it's the only one of the three that's still doing something at sixty-four times the training horizon.

12:48 Eric: And the reason, correct me here, Bella, the reason it actually gets *better* with more sequence is the perturbation bound from the theory. More tokens means the Gram matrix becomes better conditioned, which means the operator estimate becomes more accurate. Longer sequences make this method more reliable, not less.

13:10 Bella: That's exactly right. And that's the precise inversion of the memory cliff. State-space models degrade with distance because the state is losing information. Echo *improves* with distance because more data tightens the regression estimate. They're doing opposite things as the sequence gets longer.
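
The conditioning claim is easy to see on synthetic data. The sketch below assumes i.i.d. Gaussian keys of dimension fifty, which is a simplification (real keys are learned and correlated), and compares the condition number of the Gram matrix at the two sequence lengths mentioned in the episode. The helper name `gram_condition_number` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
key_dim = 50  # "dimension around fifty" in the episode; exact value is an assumption


def gram_condition_number(num_tokens):
    """Condition number of the key Gram matrix after num_tokens random keys."""
    keys = rng.standard_normal((num_tokens, key_dim))
    gram = keys.T @ keys
    return np.linalg.cond(gram)


cond_short = gram_condition_number(64)     # training-length sequence
cond_long = gram_condition_number(4096)    # evaluation-length sequence
print(f"64 tokens:   condition number {cond_short:.1f}")
print(f"4096 tokens: condition number {cond_long:.1f}")
```

With more tokens the Gram matrix moves toward a well-conditioned multiple of the identity, so the linear solve that uses it becomes less sensitive to noise. This illustrates the direction of the effect only; it says nothing about the paper's actual perturbation bound.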

13:31 Eric: Here's the number that, when I saw it, I went back and double-checked. For their one-hundred-eighty-million-parameter model, the total state across two SKA layers is about seventy-seven kilobytes. An equivalent attention layer's KV cache at one hundred thirty-one thousand tokens (that's a long context, the kind of context you'd want for agentic workloads), at the same hidden dimension, in half-precision, would need three hundred eighty-four megabytes. For each layer.

14:02 Bella: So roughly a five-thousand-fold memory reduction at long context. And you can frame it however you want: it's the difference between needing a filing cabinet for every conversation versus a single index card. The index card holds the running totals. The filing cabinet holds every word. And the model with the index card does just as well on retrieval. Better, in fact, because the totals get more reliable the longer the conversation goes.
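
The memory figures can be reproduced with back-of-the-envelope arithmetic. The hidden dimension is not stated in the episode; the value 768 below is an assumption, chosen because it makes a half-precision KV cache at 131,072 tokens come out to exactly 384 MiB per layer. The 77 KB state size is taken as quoted.

```python
seq_len = 131_072          # "one hundred thirty-one thousand" tokens, assumed to mean 2**17
hidden_dim = 768           # assumption: not stated in the episode
bytes_per_value = 2        # half precision

# One key vector and one value vector are cached per token.
kv_cache_bytes = 2 * seq_len * hidden_dim * bytes_per_value
kv_cache_mib = kv_cache_bytes / 2**20

state_bytes = 77 * 1000    # ~77 KB of fixed-size state, as quoted
ratio = kv_cache_bytes / state_bytes

print(f"KV cache per layer: {kv_cache_mib:.0f} MiB")
print(f"Reduction factor:   about {ratio:.0f}x")
```

Under these assumptions the ratio lands a little above five thousand, consistent with the "five-thousand-fold" figure. The cache grows linearly with `seq_len`, while the 77 KB state does not depend on it at all.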

14:32 Eric: There's a data-efficiency claim that's worth pulling out too, though we should hold it lightly. They trained the one-hundred-eighty-million-parameter Echo on ten billion tokens of text. They compare it to same-size-class baselines, including Mamba-two and Mamba-three, trained on one hundred billion tokens. Ten times less training data. Echo matches or beats those baselines on five out of six zero-shot benchmarks.

15:00 Bella: So this is where I want to put a finger on the scale, because that headline, "matches one-hundred-billion-token baselines with ten billion," is doing some work we should be honest about. The comparison isn't perfectly controlled. Different training recipes, different optimizers, different data mixtures going into those published baseline numbers. And the absolute scores we're talking about, in the low forties, around forty, are in a regime where the gaps between architectures are small and benchmark noise is real.

15:37 Eric: The authors also throw in a comparison to GPT-two at three-hundred-forty-five million parameters. Echo at one-hundred-eighty million scores forty-four versus GPT-two's forty-three. At half the parameters. It's a real result. But comparing to a model from twenty-nineteen, trained with a twenty-nineteen recipe, is the kind of comparison that flatters any recent architecture.

16:03 Bella: The way to handle the data-efficiency claim, I think: take it seriously as a signal, hold it loosely as a precise number. Echo is competitive in that comparison, not dominant.

16:15 Eric: And the bigger version of that caveat, the one the authors are upfront about, is scale. The biggest model in this paper is one hundred eighty million parameters, trained on a single B200 GPU. The MQAR result at fifty million is decisive. The language modeling comparisons at one-hundred-eighty million are favorable. But the field has a long track record of architectures that look great at small scale and stall at billion-parameter scale. The authors say billion-parameter validation is future work. A skeptic would say the verdict isn't in until that scaling holds.

16:54 Bella: There's a second steelman that deserves real attention, Eric. The benchmark suite is heavily weighted toward synthetic retrieval tasks: MQAR, needle-in-a-haystack, tool-trace, multi-hop. They all share a really clean fact-distractor-query structure. The authors acknowledge this in the paper directly. They haven't run the benchmarks that test recall on messy, natural-language documents, where the "key" and "value" structure isn't laid out the way MQAR lays it out.

17:29 Eric: And MQAR is, candidly, the canonical benchmark designed to expose this exact failure mode. It came out of the Zoology paper from Arora and colleagues, constructed specifically to crystallize "state-space models can't do associative recall." Hitting one hundred on it is impressive. It's also hitting one hundred on the test engineered to reward a method like this one.

17:55 Bella: That's fair. The question that follows is: does the advantage transfer to natural language, where the structure is implicit rather than explicit? I think that's genuinely open. The mechanism, running totals plus spectral filtering, should still work. The prior of "persistent patterns matter" is plausibly the right inductive bias for natural language too. But "plausibly" is doing work in that sentence. Benchmarks on natural-language documents would settle it.

18:26 Eric: There's one more thing in the critique bucket worth flagging. The Koopman piece, the lag-one covariance and the spectral filter, is the part of Echo that has no analogue in attention. It's also the most novel part of the paper. So you'd really want to know: how much of Echo's advantage is *just* closed-form ridge regression, the thing you could in principle do without Koopman at all? And how much is specifically the spectral filter doing the work?

18:57 Bella: And the ablation in the paper that gets closest to answering that is mask-on versus mask-off, not power-zero versus power-two. A cleaner test would compare pure ridge regression, closed form with no spectral piece, against the full SKA. As written, you can argue the spectral piece matters. But you can't quite pin down how much of the gap is "closed-form ridge regression beats attention's iterative approximation" versus "the spectral filter is the secret sauce."

19:31 Eric: Neither of those critiques means the paper is wrong. The MQAR result is real. It's just that the "why" has more wiggle room than the headline suggests. Closed-form ridge regression on streaming sums could be most of the win. The Koopman piece could be icing. Or it could be the other way around. The paper claims the spectral filter is doing real work; the ablations don't quite isolate it.

19:57 Bella: One more limitation worth surfacing, because the authors are honest about it. The closed-form solve requires a small linear solve at query time. In principle that's a tiny matrix and a tiny cost. In practice, the current implementation runs in PyTorch in full precision, and at the small matrix sizes involved, GPU parallelism is underutilized. The wall-clock numbers in the paper don't yet reflect the theoretical efficiency advantage. They flag that custom kernel fusion is in progress. It's not a fatal issue, it's an engineering gap, but it means the memory win we quoted earlier is real and the speed win is still to come.

20:40 Eric: Good. So let me try to land where this paper actually sits, Bella. The practical story is one number: three hundred eighty-four megabytes versus seventy-seven kilobytes at long context. If this scales, it changes what hardware you need to deploy long-context models. Agentic workloads and multi-document retrieval are exactly the regimes where the KV cache becomes the dominant cost in production. Eliminating it without sacrificing recall quality is the kind of thing that moves cost curves.

21:16 Bella: The intellectual story is bigger, though. The standard mental model has been: attention is the retrieval primitive, a state-space model is a memory-saving compromise. Echo argues the opposite framing is closer to right. Retrieval is fundamentally a regression problem. Attention is one way to solve it, through gradient descent over its parameters, with linear memory cost. There's another way: sufficient statistics and a closed-form solve, with constant memory. The KV cache isn't a necessary feature of content-addressed retrieval. It's an artifact of which solver you picked. If that framing holds at scale, you can imagine an architecture family where state-space layers handle local context, lightweight Koopman operator modules handle long-range lookups, and the whole thing fits in a fixed memory envelope regardless of how long the context gets. That's a different design space from "attention with efficiency tricks bolted on."

22:15 Eric: And there's a small bonus insight in the paper worth mentioning without going down the rabbit hole. There's a separate argument, really a whole appendix, that this whole setup also has nicer gradient flow properties than attention. Something about the softmax in attention saturating during training and killing the effective rank of the gradient signal at the language modeling head. Echo doesn't have that issue because it doesn't have a softmax in that spot. Real observation. Second main idea on top of an already-substantive first one. The paper has more in it than one episode can carry.

22:53 Bella: The spine of what they're claiming is: retrieval is regression, sufficient statistics fit in constant memory, the spectral filter adds a selectivity mechanism attention can't express, and the MQAR result is the empirical proof of concept. Everything else, the gradient flow analysis, the Koopman-style replacement for the feedforward block, the perturbation bounds in the appendices, is support and ornament around that core argument.

23:21 Eric: The papers that move the field aren't usually the ones with one more percentage point on a benchmark. They're the ones that change what the benchmark is measuring. Echo's pitch isn't "we're better at MQAR." It's "you didn't need attention for this in the first place." Whether that pitch survives at scale is the open question. The framing itself is the contribution, and it's hard to un-see once you've seen it.

23:45 Bella: This episode was produced on May eleventh, twenty-twenty-six. You've been listening to AI Papers: A Deep Dive. The paper and some related reading are linked in the show notes if you want to keep pulling on this thread.

23:58 Eric: Thanks for listening.