Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
About this episode
A pure Mamba-2 scores 3% on the canonical associative recall benchmark. Echo scores 100% — using a fixed-size state about five thousand times smaller than an equivalent KV cache. The argument isn't that attention got better; it's that retrieval was a regression problem all along, and the KV cache is an artifact of solving it the hard way.
What you'll take away
- Why retrieval can be reframed as ridge regression solvable from running sufficient statistics, making the KV cache an implementation choice rather than a necessity
- How Echo's Spectral Koopman Attention uses a lag-one covariance and eigenvalue filter to suppress one-off distractors — a selectivity mechanism standard attention can't express
- The concrete memory comparison: ~77 KB of state for Echo versus ~384 MB per layer of KV cache at 131k tokens
- Why this method gets more accurate with longer sequences, inverting the state-space 'memory cliff'
- Where the headline result is most fragile: scale is capped at 180M parameters, benchmarks lean on synthetic retrieval tasks like MQAR, and ablations don't cleanly separate the closed-form solve from the spectral filter
- Why the wall-clock speedup hasn't landed yet even though the memory win has
Chapters
- 00:00 The memory cliff and the three-percent floor
- 03:59 Retrieval as regression, not attention
- 07:59 Inside Spectral Koopman Attention
- 11:58 The headline numbers
- 15:58 Steelmanning the skeptics
- 19:57 Why the framing matters more than the benchmark
References in this episode
- Zoology: Measuring and Improving Recall in Efficient Language Models — The paper that introduced the MQAR benchmark central to this episode and crystal
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces — The state-space architecture whose 3% MQAR score is the foil for Echo's 100%, an
- Transformers Learn In-Context by Gradient Descent — Background for the episode's key reframing — that trained attention implements a
- Jamba: A Hybrid Transformer-Mamba Language Model — A production example of the hybrid approach Echo argues against — keeping some a
Full transcript
0:00Eric: There's a benchmark in sequence modeling called MQAR, Multi-Query Associative Recall. It's deliberately designed to be cruel to state-space models. You stuff a sequence with key-value pairs, drop thousands of tokens of distractors in between, and ask: what value was bound to this key? On that benchmark, a pure Mamba-two model scores around three percent. That's chance accuracy. That's guessing with extra steps. The paper we're talking about today reports one hundred percent. Not "improved." Not "competitive." Perfect, across every configuration they tested.
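To make the task shape concrete, here is a toy generator in the spirit of that description. It is a simplified sketch, not the benchmark's actual construction: the function name `make_mqar_example`, its parameters, and the layout (all bindings first, then distractors, then queries) are assumptions made for illustration.

```python
import random

def make_mqar_example(num_pairs=4, gap=16, vocab=1000, seed=0):
    """Build one toy recall instance: bindings, then distractors, then queries."""
    rng = random.Random(seed)
    # Draw distinct token ids: the first half act as keys, the second half as values.
    ids = rng.sample(range(vocab), 2 * num_pairs)
    keys, values = ids[:num_pairs], ids[num_pairs:]
    bindings = dict(zip(keys, values))

    # Context is laid out as key, value, key, value, ...
    context = [tok for pair in zip(keys, values) for tok in pair]
    # Distractors come from a disjoint id range so they never collide with keys.
    distractors = [rng.randrange(vocab, 2 * vocab) for _ in range(gap)]
    # Queries revisit every key in shuffled order.
    queries = keys[:]
    rng.shuffle(queries)
    answers = [bindings[q] for q in queries]
    return context + distractors, queries, answers

sequence, queries, answers = make_mqar_example()
```

A model is scored on whether, given `sequence` and each entry of `queries`, it produces the matching entry of `answers`. Raising `gap` is what stretches the distance between binding and query.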
0:38Bella: And the way they get there is, honestly, the kind of result that makes you rethink what attention is even for. The paper is called "Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators," by Anupama Sridhar and Alexander Johansen, submitted to arXiv on May seventh, twenty-twenty-six, and we are recording four days later. Quick production note before we dig in: this episode is AI-generated. I'm Bella, that's Eric. We're both AI voices from ElevenLabs, and the script was written by Anthropic's Claude Opus 4.7. The producer isn't affiliated with either company. And the reason we wanted a deep dive on this particular paper is that the jump from three percent to one hundred isn't an engineering win. It's a structural argument about what the KV cache is even for.
1:32Eric: So let's set up the stakes, because that three-versus-hundred number doesn't mean much without the backstory. For the last few years, sequence modeling has been a tug-of-war between two architectures. On one side, transformers — every token attends to every previous token. Beautiful, exact, expensive. The cost shows up as the KV cache. Every token you've ever seen leaves behind a "key" and a "value" stored in GPU memory, and at long contexts that cache eats more memory than the model weights themselves. Picture a meeting where, before every new sentence you speak, you have to re-read your complete notes from every previous sentence. The KV cache is those notes. The longer the meeting, the more shelf space they take.
2:22Bella: On the other side, state-space models. Mamba, Mamba-two, that whole family. They process the sequence by maintaining one fixed-size hidden state that gets updated as each new token arrives. Constant memory, no matter how long the context gets. The trade is that the state is a running summary — everything you've seen gets folded into it, and nothing is kept verbatim. It's like trying to remember a phone call by maintaining a single running impression of the conversation rather than a transcript. Useful for vibes. Bad for "what was the phone number Alice gave me twenty minutes ago?"
3:02Eric: And that's where the cliff comes in. The authors have this phrase — the memory cliff — which I think is exactly the right metaphor. With a state-space model, accuracy doesn't gracefully degrade as the binding gets further away. It collapses. Once the gap between when a fact was stored and when you query for it exceeds the state's horizon, the model isn't just worse, it's at chance. Three percent on MQAR regardless of model size. You can scale a Mamba to be much bigger; you do not fix this by scaling.
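One way to build intuition for the cliff is a toy scalar recurrence with a fixed decay. This is a deliberately crude stand-in: real state-space models such as Mamba use input-dependent updates and vector-valued states, and the decay value of 0.9 and both function names below are assumptions for illustration only.

```python
def contribution_after(distance, decay=0.9):
    """Weight a past input still carries in h_t = decay * h_{t-1} + x_t."""
    return decay ** distance

def run_recurrence(inputs, decay=0.9):
    """Scalar fixed-size state: each step folds the new input into one number."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h

# A single "fact" of size 1.0 followed by 100 empty steps.
state = run_recurrence([1.0] + [0.0] * 100)
```

After 100 steps the stored fact contributes `0.9 ** 100`, which is on the order of 1e-5 of its original size. The state never grew, but the fact is effectively gone.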
3:37Bella: Right — the cliff is structural. It's about the size of the state, not the size of the model. And the industry's response has been hybrids: mostly state-space layers for efficiency, with a handful of attention layers sprinkled in to handle the lookups. NVIDIA's Nemotron is the production example — they replace up to ninety-two percent of attention layers with state-space ones. It works, but you've reintroduced the KV cache for the remaining attention layers. You've made the problem smaller; you haven't eliminated it. The paper's opening question is the obvious one nobody had really pressed on: do we actually need attention to do retrieval at all? Or is there a constant-memory mechanism that does the same job?
4:22Eric: That's the setup. Now Bella, this is where I want you to walk us through the reframing, because the core move in this paper is conceptual, not engineering. They're not optimizing attention. They're throwing it out.
4:35Bella: Yeah, this is the part of the paper I keep going back to. So. There's a recent thread of theory work — Akyürek, Mahankali, Goel and collaborators — showing something kind of remarkable about what attention does when it's trained well. If you take softmax attention and train it to convergence on a retrieval-style objective, what it converges to is, mathematically, a classical statistical estimator. Specifically, ridge regression. The thing your grandfather did in the nineteen-seventies. Ridge regression is the workhorse of classical statistics. Given a bunch of input-output examples, find the best linear map from inputs to outputs, with a small penalty for overfitting. It has a closed-form solution. You can write it down. And the pieces of that solution — they're sums. They're running totals over your data. You don't need to revisit old examples. You stream through, add one term per example, and at the end you have everything you need to compute the answer.
5:37Eric: Wait — so if the solution is a sum of running totals, and the totals never change once you've added them, you're saying you can compute this in constant memory.
5:48Bella: That's the move. And this is the moment in the paper where, when I first read it, I had to put the laptop down. The standard mental model is: attention is the primitive, the KV cache is what makes it work, and recurrence is an efficiency hack that loses accuracy. The authors invert the whole thing. They argue: retrieval is a regression problem. Attention is one way to solve it — by grinding through gradient descent during training, with linear memory cost. But there's another way: just compute the closed-form answer directly, from sufficient statistics that fit in a tiny fixed-size table. Here's the analogy that helps me. Imagine you're a bookkeeper at a store, but you're only allowed one small notebook. The state-space strategy is: write a running summary of "what shopping today has felt like." It's a vibe. Useful for general impressions, useless if someone asks "did Alice buy milk?" The new strategy is different. Instead of summaries, keep running totals of specific cross-products — totals you've pre-decided will let you answer any future "who bought what" question by doing a small calculation at query time. You haven't kept the receipts. You've kept exactly the bookkeeping that future retrieval questions will need.
7:08Eric: And the notebook is the same size either way. The difference is in what you chose to write down.
7:15Bella: The difference is what you chose to write down. The fixed-size state isn't a lossy bottleneck — it's an exact sufficient statistic for the retrieval computation. That's the thesis in one sentence. The paper actually uses that exact phrasing, and I think it's the right one. The KV cache, in this view, isn't a necessary feature of retrieval. It's an artifact of choosing to solve the regression problem by gradient descent over attention parameters rather than directly.
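The bookkeeping argument can be written out in a few lines. The notation here is assumed for illustration rather than taken from the paper: keys $k_t$, values $v_t$, a ridge penalty $\lambda$, and a query $q$.

```latex
\begin{aligned}
W^\star &= \arg\min_W \sum_{t=1}^{n} \lVert v_t - W k_t \rVert^2 + \lambda \lVert W \rVert_F^2
         = C_n \,(G_n + \lambda I)^{-1}, \\
G_n &= \sum_{t=1}^{n} k_t k_t^\top = G_{n-1} + k_n k_n^\top, \\
C_n &= \sum_{t=1}^{n} v_t k_t^\top = C_{n-1} + v_n k_n^\top, \\
\hat v(q) &= C_n \,(G_n + \lambda I)^{-1} q .
\end{aligned}
```

The closed form follows from setting the gradient of the objective to zero, which gives $W (G_n + \lambda I) = C_n$. Both $G_n$ and $C_n$ have sizes that depend only on the key and value dimensions, not on $n$, which is the sense in which the state is fixed-size.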
7:45Eric: Okay. So if this works — if you can just accumulate running totals and solve the regression in closed form — what does the architecture actually look like? At some point this has to be code that runs on a GPU.
8:00Bella: Right. So the architecture is called Echo, and the layer that replaces attention is called Spectral Koopman Attention, SKA. Each SKA layer maintains three accumulators, all small fixed-size matrices. The first is the Gram matrix of keys: a running sum of each key's outer product with itself, so a table of how the key dimensions co-vary rather than a table indexed by token. The second is a cross-covariance between values and keys. Those two together are enough to do the ridge regression solve. Then there's a third accumulator, and this is the genuinely new ingredient: a lag-one covariance. It captures the relationship between each key and the one that came just before it. Each new token contributes a small additive update to each of these three matrices. No previous contributions get disturbed. The matrices are small, with dimension around fifty. The total state is tiny.
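A minimal sketch of the three accumulators just described, assuming plain NumPy and a single head. The class name `StreamingRecallState`, its method names, and the ridge value are hypothetical. The retrieval step here is ordinary ridge regression only: the lag-one matrix is accumulated but not used, since the episode does not spell out how the paper wires it into the solve.

```python
import numpy as np

class StreamingRecallState:
    """Three fixed-size accumulators, each updated additively per token."""

    def __init__(self, key_dim, value_dim, ridge=1e-3):
        self.gram = np.zeros((key_dim, key_dim))      # sum of k k^T
        self.cross = np.zeros((value_dim, key_dim))   # sum of v k^T
        self.lag_one = np.zeros((key_dim, key_dim))   # sum of k_t k_{t-1}^T
        self.prev_key = None
        self.ridge = ridge

    def update(self, key, value):
        key = np.asarray(key, dtype=float)
        value = np.asarray(value, dtype=float)
        self.gram += np.outer(key, key)
        self.cross += np.outer(value, key)
        if self.prev_key is not None:
            self.lag_one += np.outer(key, self.prev_key)
        self.prev_key = key

    def retrieve(self, query):
        d = self.gram.shape[0]
        # Ridge solve: value_hat = cross @ (gram + ridge * I)^-1 @ query
        regularized = self.gram + self.ridge * np.eye(d)
        return self.cross @ np.linalg.solve(regularized, query)

    def num_floats(self):
        return self.gram.size + self.cross.size + self.lag_one.size

rng = np.random.default_rng(0)
keys = np.eye(8)                        # orthonormal keys, the easy case
values = rng.standard_normal((8, 4))
state = StreamingRecallState(key_dim=8, value_dim=4)
for k, v in zip(keys, values):
    state.update(k, v)
recalled = state.retrieve(keys[3])      # close to values[3], shrunk slightly by the ridge term
```

Note that `num_floats()` depends only on the key and value dimensions. Feeding in more tokens changes the contents of the matrices, never their shape.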
8:55Eric: Hold on — Bella, you flagged the lag-one piece as the new one. The first two are basically the standard sufficient statistics for ridge regression. What does the lag-one accumulator buy you that the other two don't?
9:10Bella: This is where the Koopman piece comes in, and it's the part of the paper that goes beyond what closed-form ridge regression alone would give you. The lag-one matrix lets you fit a linear operator to the key sequence. You treat the keys as if they were the state of some dynamical system, and you ask: what linear map best describes how the state evolves from one step to the next? That operator has eigenvalues. And those eigenvalues turn out to be a really useful diagnostic. Koopman operator theory comes out of nineteen-thirty-one physics. It's a way of analyzing nonlinear dynamical systems by finding a linear operator that describes how observable quantities evolve. The eigenvalues of that operator tell you which patterns persist over time and which decay. Eigenvalues near one — those are persistent modes. The pattern sticks around. Eigenvalues near zero — those are transient. Noise. One-off distractors that die out quickly.
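The operator fit described here resembles a standard least-squares fit of one-step dynamics, which can be sketched as below. This is an illustration under assumptions, not the paper's procedure: the function name `fit_linear_operator`, the toy 2-by-2 system `true_A`, and the tiny ridge term are all made up for the example.

```python
import numpy as np

def fit_linear_operator(states, ridge=1e-8):
    """Least-squares A with states[t+1] ~ A @ states[t], built from two sums."""
    prev, nxt = states[:-1], states[1:]
    gram = prev.T @ prev        # sum of x_t x_t^T
    lag_one = nxt.T @ prev      # sum of x_{t+1} x_t^T
    d = gram.shape[0]
    return lag_one @ np.linalg.inv(gram + ridge * np.eye(d))

# Toy system with one persistent mode (0.9) and one transient mode (0.3).
true_A = np.array([[0.9, 0.2],
                   [0.0, 0.3]])
trajectory = [np.array([1.0, 1.0])]
for _ in range(30):
    trajectory.append(true_A @ trajectory[-1])
states = np.array(trajectory)

fitted = fit_linear_operator(states)
eigenvalues = np.sort(np.linalg.eigvals(fitted).real)   # transient first, persistent second
```

On this noiseless toy trajectory the fitted eigenvalues come back close to 0.3 and 0.9, so the spectrum separates the mode that fades from the mode that sticks around.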
10:15Eric: So you're using "does this pattern persist" as a proxy for "is this binding worth retrieving."
10:21Bella: That's the prior. The bindings worth retrieving look like stable modes of a dynamical system fitted to the key history. Bindings that look like one-off noise — they get filtered out. And the mechanism is just raising the fitted operator to a small power. Power two, in practice. If a mode has eigenvalue zero-point-nine, raising to the second power gives you zero-point-eight-one — you've kept about eighty percent of its energy. If a mode has eigenvalue zero-point-three, the second power gives you zero-point-zero-nine — you've crushed it down to nine percent. Persistent modes survive almost intact. Transient ones get exponentially suppressed.
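The arithmetic in this exchange can be checked directly. The sketch below assumes a diagonal operator so the eigenvalues are visible on the diagonal. The name `spectral_filter` is hypothetical, and how the filtered operator is combined with the ridge solve inside Echo is not covered here.

```python
import numpy as np

def spectral_filter(operator, power=2):
    """Raise the operator to an integer power; each eigenvalue lam becomes lam**power."""
    return np.linalg.matrix_power(operator, power)

operator = np.diag([0.9, 0.3])          # persistent mode, transient mode
filtered = spectral_filter(operator, power=2)
kept = np.diag(filtered)                # 0.81 and 0.09

ratio_before = 0.9 / 0.3                # modes differ by 3x
ratio_after = kept[0] / kept[1]         # after filtering they differ by 9x
```

The filter does not remove the transient mode outright. It widens the gap between the two modes, from a factor of three to a factor of nine at power two.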
11:05Eric: And here's what I want the listener to hear — attention has nothing that looks like this. There's no place in standard attention where you do something analogous to "fit a dynamical model to the keys and amplify the persistent modes." It just doesn't exist in the math.
11:23Bella: It doesn't exist. This is the structural prior that Echo adds on top of what attention could in principle approximate. It's also why the paper isn't just "ridge regression instead of attention." It's "ridge regression plus a spectral selectivity mechanism that attention can't express even with infinite training."
11:45Eric: Okay. Let's earn the headline number. Three percent versus one hundred on MQAR. Pure Mamba-two: three percent. Echo, with Spectral Koopman Attention layers: one hundred. Across every configuration they tested — including the hard ones, four-thousand-token distractor gaps with thirty-two key-value pairs scattered through them. Perfect.
12:08Bella: And the length generalization result is, in some ways, even more telling. They trained at sequence length sixty-four — that's the training horizon. Then they evaluated at lengths up to four thousand ninety-six tokens. Sixty-four times the training length. At four thousand ninety-six tokens, pure state-space drops to two percent. The hybrid — state-space with attention layers — drops to five percent. State-space with Spectral Koopman Attention holds at sixty-five percent. Not perfect, but it's the only one of the three that's still doing something at sixty-four times the training horizon.
12:48Eric: And the reason — correct me here, Bella — the reason it actually gets *better* with more sequence is the perturbation bound from the theory. More tokens means the Gram matrix becomes better conditioned, which means the operator estimate becomes more accurate. Longer sequences make this method more reliable, not less.
13:10Bella: That's exactly right. And that's the precise inversion of the memory cliff. State-space models degrade with distance because the recurrent state is losing information. SKA *improves* with distance because more data tightens the regression estimate. They're doing opposite things as the sequence gets longer.
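A small numerical experiment shows the general statistical effect behind "more data tightens the regression estimate." It is not the paper's perturbation bound. It assumes i.i.d. Gaussian keys, values generated by a noisy linear map, and a hypothetical helper named `ridge_estimate_error`.

```python
import numpy as np

def ridge_estimate_error(num_tokens, dim=8, noise=0.1, ridge=1e-3, seed=0):
    """Frobenius error of the ridge estimate after seeing num_tokens pairs."""
    rng = np.random.default_rng(seed)
    true_map = rng.standard_normal((dim, dim))
    keys = rng.standard_normal((num_tokens, dim))
    values = keys @ true_map.T + noise * rng.standard_normal((num_tokens, dim))
    gram = keys.T @ keys        # sum of k k^T
    cross = values.T @ keys     # sum of v k^T
    estimate = cross @ np.linalg.inv(gram + ridge * np.eye(dim))
    return np.linalg.norm(estimate - true_map)

short_err = ridge_estimate_error(64)      # the training length quoted in the episode
long_err = ridge_estimate_error(4096)     # the longest evaluation length quoted
```

Under these assumptions `long_err` comes out well below `short_err`: the accumulated sums average out the noise as more pairs stream in, which is the opposite of what a decaying recurrent state does.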
13:31Eric: Here's the number that, when I saw it, I went back and double-checked. For their one-hundred-eighty-million-parameter Echo model, the total SKA state across two layers is about seventy-seven kilobytes. An equivalent attention layer's KV cache at one hundred thirty-one thousand tokens — that's a long context, the kind of context you'd want for agentic workloads — at the same hidden dimension, in half-precision, would need three hundred eighty-four megabytes. For each layer.
14:02Bella: So roughly a five-thousand-fold memory reduction at long context. And you can frame it however you want — it's the difference between needing a filing cabinet for every conversation versus a single index card. The index card holds the running totals. The filing cabinet holds every word. And the model with the index card does just as well on retrieval. Better, in fact, because the totals get more reliable the longer the conversation goes.
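The quoted figures can be reproduced with back-of-the-envelope arithmetic. The hidden width of 768 is an assumption chosen because it reproduces the 384 MB figure at half precision; the episode does not state the width. The 77 KB value is taken as quoted, not derived.

```python
def kv_cache_bytes(seq_len, hidden_dim, bytes_per_value=2):
    """Keys plus values for one attention layer: two tensors of seq_len x hidden_dim."""
    return 2 * seq_len * hidden_dim * bytes_per_value

KIB, MIB = 1024, 1024 ** 2

cache = kv_cache_bytes(seq_len=131_072, hidden_dim=768)   # 768 is an assumed width
echo_state = 77 * KIB                                      # figure quoted in the episode
ratio = cache / echo_state
```

With these inputs `cache` is exactly 384 MiB and `ratio` lands a little above five thousand. Since the episode quotes the 77 KB as the total across two layers and the cache figure per layer, a strict layer-for-layer comparison would be larger still.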
14:32Eric: There's a data-efficiency claim that's worth pulling out too, though we should hold it lightly. They trained the one-hundred-eighty-million Echo on ten billion tokens of FineWeb-Edu. They compare it to same-size-class transformer, GatedDeltaNet, Mamba-two, and Mamba-three baselines trained on one hundred billion tokens. Ten times less training data. Echo matches or beats those baselines on five out of six zero-shot benchmarks.
15:00Bella: So this is where I want to put a finger on the scale, because that headline — "matches one-hundred-billion-token baselines with ten billion" — is doing some work we should be honest about. The comparison isn't perfectly controlled. Different training recipes, different optimizers, different data mixtures going into those published baseline numbers. And the absolute scores we're talking about — HellaSwag in the low forties, LAMBADA around forty — they're in a regime where the gaps between architectures are small and benchmark noise is real.
15:37Eric: The authors also throw in a comparison to GPT-two at three-hundred-forty-five million parameters. Echo at one-hundred-eighty scores forty-four on HellaSwag versus GPT-two's forty-three. At half the parameters. It's a real result. But comparing to a model from twenty-nineteen, trained with a twenty-nineteen recipe, is the kind of comparison that flatters any recent architecture.
16:03Bella: The way to handle the data-efficiency claim, I think — take it seriously as a signal, hold it loosely as a precise number. Echo is competitive in that comparison, not dominant.
16:15Eric: And the bigger version of that caveat — the one the authors are upfront about — is scale. The biggest model in this paper is one hundred eighty million parameters, trained on a single B200 GPU. The MQAR result at fifty million is decisive. The language modeling comparisons at one-hundred-eighty are favorable. But the field has a long track record of architectures that look great at small scale and stall at billion-parameter scale. The authors say billion-parameter validation is future work. A skeptic would say the verdict isn't in until that scaling holds.
16:54Bella: There's a second steelman that deserves real weight, Eric. The benchmark suite is heavily weighted toward synthetic retrieval tasks — MQAR, needle-in-a-haystack, tool-trace, multi-hop. They all share a really clean fact-distractor-query structure. The authors acknowledge this in the paper directly. They haven't run RULER. They haven't run BABILong. Those are the benchmarks that test recall on messy, natural-language documents where the "key" and "value" structure isn't laid out the way MQAR lays it out.
17:29Eric: And MQAR is, candidly, the canonical benchmark designed to expose this exact state-space failure mode. It came out of the Zoology paper from Arora and colleagues, constructed specifically to crystallize "state-space models can't do associative recall." Hitting one hundred on it is impressive. It's also hitting one hundred on the test engineered to reward a method like this one.
17:55Bella: That's fair. The question that follows is: does the advantage transfer to natural language, where the binding structure is implicit rather than explicit? I think that's genuinely open. The mechanism — running totals plus spectral filtering — should still work. The eigenvalue prior of "persistent patterns matter" is plausibly the right inductive bias for natural language too. But "plausibly" is doing work in that sentence. RULER and BABILong would settle it.
18:26Eric: There's one more thing in the critique bucket worth flagging. The Koopman piece — the lag-one covariance and the spectral filter — is the part of Echo that has no analogue in attention. It's also the most novel part of the paper. So you'd really want to know: how much of Echo's advantage is *just* closed-form ridge regression, the thing you could in principle do without Koopman at all? And how much is specifically the spectral filter doing the work?
18:57Bella: And the ablation in the paper that gets closest to answering that — it's mask-on versus mask-off, not power-zero versus power-two. A cleaner test would compare pure ridge regression — closed form, no spectral piece — against the full Spectral Koopman Attention. As written, you can argue the spectral piece matters. But you can't quite pin down how much of the gap is "closed-form ridge regression beats attention's iterative approximation" versus "the eigenvalue filter is the secret sauce."
19:31Eric: Neither of those critiques means the paper is wrong. The result is real. It's just that the "why" has more wiggle room than the headline suggests. Closed-form ridge regression on streaming sums could be most of the win. The Koopman piece could be icing. Or it could be the other way around. The paper claims the spectral filter is doing real work; the ablations don't quite isolate it.
19:57Bella: One more limitation worth surfacing, because the authors are honest about it. The closed-form solve requires a small Cholesky factorization at query time. In principle that's a tiny matrix and a tiny cost. In practice, the current implementation runs in PyTorch in full precision, and at the small matrix sizes involved, GPU parallelism is underutilized. The wall-clock numbers in the paper don't yet reflect the theoretical efficiency advantage. They flag that custom kernel fusion is in progress. It's not a fatal issue — it's an engineering gap — but it means the memory win we quoted earlier is real and the speed win is still on the come.
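For reference, a Cholesky-based solve of the regularized system looks roughly like this in NumPy. It is a sketch under assumptions: the function name `cholesky_solve`, the ridge value, and the 50-dimensional test data are illustrative, and this is not the paper's implementation.

```python
import numpy as np

def cholesky_solve(gram, rhs, ridge=1e-3):
    """Solve (gram + ridge * I) x = rhs via a Cholesky factor and two solves."""
    d = gram.shape[0]
    lower = np.linalg.cholesky(gram + ridge * np.eye(d))   # A = L @ L.T
    # np.linalg.solve is a general solver and does not exploit the triangular
    # structure; a dedicated triangular solver would be the efficient choice.
    y = np.linalg.solve(lower, rhs)       # forward step: L y = rhs
    return np.linalg.solve(lower.T, y)    # backward step: L^T x = y

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 50))      # dimension around fifty, as in the episode
gram = keys.T @ keys
query = rng.standard_normal(50)

x_chol = cholesky_solve(gram, query)
x_ref = np.linalg.solve(gram + 1e-3 * np.eye(50), query)
```

The factorization is valid because the Gram matrix plus a positive ridge term is symmetric positive definite. At a 50-by-50 size the work per query is small, which is consistent with the point that a GPU is hard to keep busy on it without fused kernels.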
20:40Eric: Good. So let me try to land where this paper actually sits, Bella. The practical story is one number: three hundred eighty-four megabytes versus seventy-seven kilobytes at long context. If this scales, it changes what hardware you need to deploy long-context models. Agentic workloads, long chain-of-thought reasoning, multi-document retrieval — those are exactly the regimes where the KV cache becomes the dominant cost in production. Eliminating it without sacrificing recall quality is the kind of thing that moves cost curves.
21:16Bella: The intellectual story is bigger, though. The standard mental model has been: attention is the retrieval primitive, recurrence is a memory-saving compromise. Echo argues the opposite framing is closer to right. Retrieval is fundamentally a regression problem. Attention is one way to solve it — through gradient descent over its parameters, with linear memory cost. There's another way — sufficient statistics and a closed-form solve, with constant memory. The KV cache isn't a necessary feature of content-addressed retrieval. It's an artifact of which solver you picked. If that framing holds at scale, you can imagine an architecture family where state-space models handle local context, lightweight operator modules handle long-range lookups, and the whole thing fits in a fixed memory envelope regardless of how long the context gets. That's a different design space from "transformer with efficiency tricks bolted on."
22:15Eric: And there's a small bonus insight in the paper worth mentioning without going down the rabbit hole. There's a separate argument — really a whole appendix — that this whole setup also has nicer gradient properties than attention. Something about the softmax in attention saturating during training and killing the effective rank of the gradient signal at the language modeling head. SKA doesn't have that issue because it doesn't have a softmax in that spot. Real observation. Second main idea on top of an already-substantive first one. The paper has more in it than one episode can carry.
22:53Bella: The spine of what they're claiming is: retrieval is regression, sufficient statistics fit in constant memory, the Koopman filter adds a selectivity mechanism attention can't express, and the MQAR result is the empirical proof of concept. Everything else — the gradient flow analysis, the Koopman-style replacement for the feedforward block, the perturbation bounds in the appendices — is support and ornament around that core argument.
23:21Eric: The papers that move the field aren't usually the ones with one more percentage point on a benchmark. They're the ones that change what the benchmark is measuring. Echo's pitch isn't "we're better at attention." It's "you didn't need attention for this in the first place." Whether that pitch survives at scale is the open question. The framing itself is the contribution, and it's hard to un-see once you've seen it.
23:45Bella: This episode was produced on May eleventh, twenty-twenty-six. You've been listening to AI Papers: A Deep Dive. The paper and some related reading are linked in the show notes if you want to keep pulling on this thread.
23:58Eric: Thanks for listening.