Episode 198 · Jul 03, 2026 · 17 min

The Model That Knows the Answer and Can't Say It

Gollapudi, Gupta, Singhal et al.

NLP

Concepts in this episode

Evaluation & Benchmarks Training Methods Long Context RAG Transformer Attention Attention Heads Context Fatigue Needle-in-a-Haystack Dense Retrieval LIMIT Benchmark Inference Cost Ablation Studies

About this episode

Paper: Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

Venue: arXiv:2607.01538

Year: 2026

Read the paper: arxiv.org/abs/2607.01538

A language model reading a million tokens ranks the correct document first on 100% of queries — and still answers correctly just 0.2% of the time. This episode dissects the first controlled test of whether an LLM can replace the vector database, traces the failure to one piece of softmax arithmetic that drowns the answer as the context grows, and walks through the two fixes that recover most of it. The verdict reframes 'context rot' entirely: for retrieval, long-context failure looks like plumbing, not a capability wall.

What you'll take away

Why acing needle-in-a-haystack tests tells you close to nothing about real retrieval — and how corpora built from hard negatives expose the gap
The autopsy result: at layer nineteen, at least one attention head ranks the gold document first on 100% of queries at a million tokens, while answer accuracy sits at 0.2%
The mechanism: softmax's fixed-pie denominator smears attention across the crowd, dropping the correct document's share of the layer's output from 91% to 1%
How multiplying attention scores by the log of corpus size — a one-line contrast knob — resurrects million-token retrieval from 0.2% to 16.5%
The existence proof: a half-billion-parameter model beats a dense retriever by 3-4x on LIMIT, a benchmark single-vector embeddings provably can't solve
The steelman catch: the best-performing fix rebuilds retrieve-then-read inside the transformer, the paper reports no latency or cost numbers, and on abstract-similarity retrieval every variant scores near zero

Chapters

00:01 It knows the answer, can't say it
02:26 Why kill a retriever that works?
04:07 A half-billion model reads a million tokens
06:21 The autopsy: was it ever confused?
09:54 Can one multiplication resurrect retrieval?
12:41 Who likes Joshua Trees?
13:55 The tables are darker than the abstract
16:09 Plumbing, not a capability wall

References in this episode

On the Theoretical Limitations of Embedding-Based Retrieval — The paper behind the LIMIT benchmark discussed in the episode, proving the theor
Scalable-Softmax Is Superior for Attention — Introduces the log-of-context-size softmax scaling ('SSMax') that the episode ca
Efficient Streaming Language Models with Attention Sinks — The streaming-stability work the failed attention-sink fix was borrowed from — u
Lost in the Middle: How Language Models Use Long Contexts — The classic empirical study of long-context degradation whose 'model gets confus

Full transcript

0:00Bella: The spotlight never misses. Ten thousand documents sit in this language model's context, about a million tokens, and somewhere in its attention there is always at least one head pointing straight at the correct document. One hundred percent of queries. Then the model opens its mouth to answer, and it gets the question right zero point two percent of the time. It knows the answer. It can't say it.

0:27Eric: One fact before anything else: this is an AI-made explainer, and both of our voices are AI.

0:33Bella: By the end of this video you'll have the mechanism behind that gap: a specific piece of arithmetic inside every transformer layer that drowns the answer as the context grows, plus the two fixes that recover most of it. And this matters beyond one paper, because every serious LLM application today runs on a bolt-on vector database doing the searching. This work, from a Berkeley and UT Austin team including Sewon Min, is the first controlled test of whether the model itself can take that job.

1:07Eric: And the case for "obviously it can" is strong. Context windows hit a million tokens. Models ace needle-in-a-haystack tests. Hide a magic sentence in a wall of unrelated text and they find it every time; vendors put it on the launch slide. So skip the vector database. Dump the corpus into context, ask the model directly, let it apply judgment instead of geometry. The retriever becomes legacy plumbing.

1:35Bella: Except it collapses the moment you test it honestly, and diagnosing that collapse is the entire paper. The needle tests mislead because the needle is lexically distinctive. Finding a red marble in a bowl of white ones tells you nothing about picking the slightly redder marble out of ten thousand red ones. This paper builds its test corpora from hard negatives, documents a strong embedding model thinks look relevant. And a rival model you'll meet later aces the synthetic needle tests, then degrades on real retrieval until it falls behind a simple baseline. Passing needle-in-a-haystack tells you close to nothing about real retrieval ability.

2:17Eric: Fine, so the naive version dies. Before the collapse means anything, though: what exactly is being replaced, and why replace something that works?

2:27Bella: The incumbent works like this: when an AI assistant "searches your documents," it never reads them. A separate embedding model turns each document into a single vector, one point on a huge map, computed once, in advance. Your question becomes a point too, and the system grabs whichever documents sit closest. That's dense retrieval, the engine under retrieval-augmented generation. Fast, proven, and structurally limited, because relevance gets reduced to geometric closeness between two points. Some questions don't live anywhere sensible on that map, and there are provable limits on what one vector per document can encode. We'll meet a benchmark later that was built around exactly that theorem.
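The retrieval step Bella describes can be sketched in a few lines. This is a minimal illustration, not any production system: the two-dimensional vectors, the document ids, and the function names `cosine` and `dense_retrieve` are all made up for the example, and real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dense_retrieve(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents whose vectors sit closest to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, computed once in advance (one vector per document).
doc_vecs = {
    "refund-policy": [0.9, 0.1],
    "shipping-times": [0.1, 0.9],
    "warranty": [0.7, 0.3],
}
query_vec = [0.8, 0.2]  # the question, embedded into the same space

top = dense_retrieve(query_vec, doc_vecs, k=2)
print(top)
```

Relevance here is nothing more than closeness between two points, which is the structural limit the episode returns to later.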

3:12Eric: Which is the real motivation, right — not convenience. A model that reads the corpus can apply whatever notion of relevance the task demands. But this idea isn't new. Google made noise about long-context Gemini subsuming retrieval two years ago.

3:29Bella: Made noise, yes — tested, no. Prior evaluations were either proprietary systems with no controlled comparison, or reranking setups where a real retriever narrows the corpus to a handful of candidates first and the model just sorts them. Reranking dodges the two hard requirements: corpora of millions of tokens, and generalizing to corpus sizes far beyond training, because nobody retrains their retriever every time the document collection grows. So the authors built the honest version themselves. And watching where it breaks is where this gets good.

4:05Eric: With a deliberately small model, too.

4:08Bella: Just over half a billion parameters: Qwen3, native context thirty-two thousand tokens, pushed here past a million, thirty times its design limit. They call their system BlockSearch. Every document goes into context wrapped with a random four-digit code, the query goes at the end, and the model answers by generating the code of the relevant document. The corpus gets processed once into a cache and reused across every query, like a new employee reading the whole company wiki on day one, then answering questions from notes. During that read, each document only attends within itself, which is what makes million-token processing affordable.
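A rough sketch of the context layout described here, under stated assumptions: the bracket tag format, the `Query:` and `Answer code:` strings, and the function name `build_context` are hypothetical. The episode only says that each document is wrapped with a random four-digit code and that the query goes at the end.

```python
import random

def build_context(documents, query, rng):
    """Wrap each document with a random four-digit code; put the query last.

    The tag format is an assumption made for illustration only.
    """
    # Draw unique codes from 0000..9999, so up to 10,000 documents fit.
    numbers = rng.sample(range(10_000), k=len(documents))
    codes = [f"{n:04d}" for n in numbers]
    blocks = [f"[{code}] {text} [/{code}]" for code, text in zip(codes, documents)]
    prompt = "\n".join(blocks) + f"\nQuery: {query}\nAnswer code:"
    return prompt, dict(zip(codes, documents))

rng = random.Random(0)
docs = ["Refunds are issued within 14 days.", "Orders ship in 2 business days."]
prompt, code_to_doc = build_context(docs, "How long do refunds take?", rng)
print(prompt)
```

Because the codes are redrawn on every call, a code carries no information about where its document sits in the context, which is the property the next exchange in the transcript turns on.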

4:50Eric: The random codes sound like a throwaway detail. I suspect they're not. Prior recipes numbered the documents in order?

4:58Bella: They did, and it's fatal. A model trained on two hundred fifty-six numbered documents has never seen the ID five thousand, and it learns spurious ties between a code and a position. Randomize the codes every training step and the code becomes a pure pointer — nothing to overfit. Two more moves in one breath: one expensive corpus prefill gets shared across sixteen queries' worth of training signal, and an on-policy loss fixes a subtle bug: normal training is learning to drive by watching a perfect driver, while here the model drives, drifts, and gets corrected along its own path. The payoff: the old position-coded recipe is dead by five thousand documents. BlockSearch stays above ninety-five percent at small scale and keeps meaningful accuracy out to roughly half a million tokens, ten times anything it saw in training.

5:55Eric: And then the cliff. At ten thousand documents, the full million tokens, it drops to that zero point two percent from the open. The obvious read is capacity: half a billion parameters staring at a million tokens, and at some point it just loses track. That's the folk story behind every long-context failure. People literally call it context rot: the model gets confused.

6:22Bella: The autopsy says the folk story is wrong, and the autopsy is the densest stretch of this episode. It pays off in the strangest table in the paper: a layer that keeps writing at full volume while its content gets swapped out for noise. Three things to track, all on screen now. The raw score: for every token in context, the model computes an unbounded compatibility number — how relevant is this to my query. The softmax weight: those raw scores get converted into shares of a fixed pie that always sums to one hundred percent. And the blend: what a layer passes downstream is a weighted average of information from every token, weighted by those shares. Scores rank; softmax slices; the blend is all anything downstream ever sees.
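The three quantities Bella names map directly onto standard single-head attention. The sketch below uses plain Python and tiny made-up vectors; the names `raw`, `weights`, and `blend` are chosen to match the transcript and are not taken from the paper.

```python
import math

def softmax(scores):
    """Convert unbounded scores into shares of a fixed pie that sums to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention over a handful of tokens."""
    scale = math.sqrt(len(query))
    # Raw score: one unbounded compatibility number per token.
    raw = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax weight: each token's share of the pie.
    weights = softmax(raw)
    # Blend: the weighted average of value vectors that goes downstream.
    dim = len(values[0])
    blend = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return raw, weights, blend

query = [1.0, 0.0]
keys = [[4.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # token 0 matches the query best
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # token 0 carries the "answer"
raw, weights, blend = attend(query, keys, values)
print(raw, weights, blend)
```

Only `blend` leaves the layer. The ranking implied by `raw` is never passed on directly.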

7:12Eric: Then the diagnostic question writes itself. At a million tokens, do the raw scores still rank the gold document, the correct one, at the top? If the ranking has decayed, it's a capacity story after all.

7:26Bella: They could check directly, because they always know which document is gold. So they went head by head, layer by layer... and the ranking was intact. Not degraded — intact. At layer nineteen, at least one attention head puts the gold document's raw score first for one hundred percent of queries, at every corpus size, up to and including a million tokens. While the generated answer is right zero point two percent of the time. The internal search never fails.
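One way to picture the check described here, as a sketch only: assume each head's raw scores have already been reduced to one number per document (that aggregation is an assumption, the episode does not say how it was done), then ask whether any head puts the gold document strictly first. The function names and the toy scores are hypothetical.

```python
def gold_ranked_first(head_scores, gold_index):
    """True if at least one head gives the gold document the strictly top score.

    head_scores: one list of per-document raw scores for each head.
    """
    for scores in head_scores:
        gold = scores[gold_index]
        if all(gold > s for i, s in enumerate(scores) if i != gold_index):
            return True
    return False

def any_head_hit_rate(queries):
    """Fraction of queries where some head ranks the gold document first."""
    hits = sum(gold_ranked_first(heads, gold) for heads, gold in queries)
    return hits / len(queries)

# Two toy queries, two heads each, three documents each.
queries = [
    ([[0.1, 2.0, 0.3], [1.0, 0.9, 0.8]], 1),  # head 0 ranks gold (index 1) first
    ([[0.5, 0.4, 0.1], [0.2, 0.1, 3.0]], 2),  # head 1 ranks gold (index 2) first
]
rate = any_head_hit_rate(queries)
print(rate)
```

Note that the winning head can differ from query to query; the metric only asks that one exists.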

7:58Eric: Then I'm missing something, Bella. If some head is holding the correct ranking, why can't a later layer just read the ranking off?

8:07Bella: Because no layer ever sees a ranking. It sees the blend, and the blend uses softmax shares, and the softmax denominator sums over every token in context, relevant or not. This is the spotlight from the cold open: attention has a fixed hundred watts of total light to hand out. In a room of five hundred documents, aimed at the right one, the target glows. Add nine and a half thousand more objects and the aim stays perfect, but the same hundred watts smears across everything, and the target gets a fraction of a watt. Now watch the figure on screen. The gold bar is the share of this layer's output that comes from the correct document. At five hundred documents: ninety-one percent. At ten thousand: one percent. And watch the total height of the output next to it. It barely moves, down about a third. The layer keeps writing into the network at full strength. It's just writing the average of ten thousand distractors instead of the answer.
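The spotlight arithmetic can be reproduced with a toy model. The assumptions are loud ones: every distractor gets a raw score of 0, the gold item gets a fixed lead of `gap`, and each document counts as a single item. The value 8.5 is picked only so the small-corpus share lands near 91%; the toy does not reproduce the paper's drop to 1%, which comes from real score distributions over hard negatives.

```python
import math

def gold_share(n_items, gap):
    """Softmax weight of one item that leads n_items - 1 rivals by `gap`.

    Rivals all score 0, so the share is e^gap / (e^gap + n_items - 1),
    written here in a form that cannot overflow.
    """
    return 1.0 / (1.0 + (n_items - 1) * math.exp(-gap))

gap = 8.5  # assumed lead of the gold score; not a number from the paper
for n in (500, 2_000, 10_000):
    print(n, round(gold_share(n, gap), 3))
```

The gold item wins the ranking at every size, since its score never changes. Only the denominator grows, and the share falls with it.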

9:12Eric: That last part is the nasty bit. If the signal faded, a later layer might notice something missing. Instead the layer never goes quiet. It starts confidently saying the average of everything irrelevant, at the same amplitude the answer used to have. Downstream has no cue that a swap happened. So, checkpoint: the model ranks perfectly and still fails, because...?

9:36Bella: Because winning the ranking only guarantees you the biggest slice of the pie, and at a million tokens the biggest slice is one percent. The model stays on target; the averaging drowns it. Which means the fixes shouldn't touch the model's knowledge at all. They should attack the denominator.

9:55Eric: Three candidates, and one fails in a useful way, right?

9:58Bella: The failure first: a learned attention sink, a constant added to the denominator, borrowed from streaming-stability work. It barely helps, and the reason is clean: a constant can let an unconfident layer quiet itself down, but it can't fight a denominator that grows with every document you add. The fix that works is almost embarrassingly small: multiply the raw scores by the logarithm of the corpus size before softmax. A contrast knob. The bigger the crowd, the more you amplify the gaps between scores, and log of N turns out to be just enough to cancel the crowd's growth in the denominator. Because the corpus size is plugged in at inference time rather than learned, it keeps working at sizes the model never trained on. And here's the prediction that makes it a real test: if dilution is the bottleneck — not capacity — one multiplication should resurrect million-token retrieval. It does. On MS MARCO, from zero point two percent to sixteen and a half. An eighty-two-fold recovery from a contrast knob.
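A toy version of the contrast knob, under the same simplification as before (distractors score 0, the gold item leads by a fixed `gap`, one item per document). The normalization by `log(n_ref)` is an assumption added here so that the scaling is neutral at a reference size of 500; the paper's exact coefficient is not given in the episode.

```python
import math

def gold_share(n_items, gap, scale=1.0):
    """Softmax weight of the gold item when raw scores are multiplied by `scale`."""
    return 1.0 / (1.0 + (n_items - 1) * math.exp(-scale * gap))

def log_scale(n_items, n_ref=500):
    """Multiplier that grows with the log of corpus size (n_ref is assumed)."""
    return math.log(n_items) / math.log(n_ref)

gap = 8.5  # assumed lead of the gold score; not a number from the paper
plain_10k = gold_share(10_000, gap)
scaled_10k = gold_share(10_000, gap, scale=log_scale(10_000))
print(round(plain_10k, 3), round(scaled_10k, 3))
```

With scale proportional to log N, the gold term becomes a power of N, so it can keep pace with a denominator that grows linearly in N. In this toy the share at 10,000 items stays above its 500-item value instead of collapsing.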

11:05Eric: The second fix is blunter: clear the room. Run the first sixteen layers, use the model's own mid-network attention scores to shortlist the top two hundred fifty-six documents, and finish with only those in context. The shortlist keeps the gold document about ninety-six percent of the time. Stack both fixes and the million-token score edges past the dense retriever, twenty point five versus twenty point two, from a model running at thirty times its design limit. It also matches or beats MSA, a concurrent model seven times larger, trained on much longer contexts, the same MSA that aces the synthetic needle tests. But Bella, look at what routing is. A first stage that narrows candidates, a second stage that reads them closely. That's retrieve-then-read, rebuilt inside the model that was supposed to delete it. The authors admit this in so many words. Hold that thought.
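The routing fix has the shape of a two-stage pipeline, which can be sketched as follows. Everything here is a stand-in: `doc_scores` plays the role of the mid-network attention scores, `reader` plays the role of the remaining layers, and the function names are hypothetical.

```python
def shortlist(doc_scores, k=256):
    """Ids of the k highest-scoring documents, best first."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]

def route_then_read(doc_scores, documents, reader, k=256):
    """Stage 1 narrows the corpus; stage 2 reads only the survivors."""
    keep = shortlist(doc_scores, k)
    return reader({doc_id: documents[doc_id] for doc_id in keep})

documents = {
    "a": "about refunds",
    "b": "about shipping",
    "c": "about returns",
    "d": "about careers",
}
doc_scores = {"a": 2.1, "b": 0.3, "c": 1.7, "d": -0.5}  # made-up stage-1 scores

# The toy reader just reports which documents it was handed.
seen = route_then_read(doc_scores, documents, reader=sorted, k=2)
print(seen)
```

Written this way, Eric's point is visible in the code: a narrowing stage followed by a close-reading stage is retrieve-then-read, whichever side of the model boundary it runs on.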

12:10Bella: Held — and hold a second one with it, Eric: the attention ceiling is one hundred percent and the best readout recovers about twenty. These fixes mitigate dilution; they don't close the gap, and the authors say that plainly too. Still, matching dense retrieval on dense retrieval's home turf was never the endgame. The real question is whether reading can beat distance-on-a-map at something the map can't do.

12:41Eric: That's the benchmark called LIMIT, and it's built on a theorem, not a vibe: there exist combinations of relevance judgments that no one-vector-per-document scheme can represent, however good the embeddings. The test itself is almost silly. Short biographies: "Geneva Durben likes Quokkas, River Otters, Tapirs, Joshua Trees, Pansies, Soy Sauce, Cards Against Humanity and Elm Trees." Query: who likes Joshua Trees? Trivial to state, hostile to geometry, because one point per person can't encode every combination of hobbies.
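A one-dimensional toy shows the flavor of the limitation, though it is not the theorem or the construction behind LIMIT. With a single number per document and dot-product scoring, a query can only favor the high end or the low end of the line, so some pairs of documents can never be the top two together. The embeddings and query grid below are arbitrary.

```python
from itertools import combinations

def reachable_top2(doc_embs, queries):
    """Every strict top-2 set that some query produces under dot-product scoring."""
    reached = set()
    for q in queries:
        scored = sorted(((q * d, i) for i, d in enumerate(doc_embs)), reverse=True)
        if scored[1][0] > scored[2][0]:  # ignore ties at the cut-off
            reached.add(frozenset(i for _, i in scored[:2]))
    return reached

doc_embs = [-1.0, 0.2, 1.5]                 # one scalar per document
queries = [x / 10 for x in range(-50, 51)]  # scalar queries from -5.0 to 5.0

reached = reachable_top2(doc_embs, queries)
all_pairs = {frozenset(p) for p in combinations(range(len(doc_embs)), 2)}
missing = all_pairs - reached
print(sorted(sorted(pair) for pair in missing))
```

No query returns documents 0 and 2 together, because document 1 always sits between them on the line. Higher dimensions push the wall further out rather than removing it, which is the kind of claim the episode attributes to the LIMIT paper.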

13:21Bella: And BlockSearch never trained on anything lexical like this, so it's fully out of distribution. The fixed variants beat the dense retriever at every corpus size, by three to four times at the larger ones. At five thousand documents, around eight hundred fifty thousand tokens, it's point one four nine versus point zero three five. A model a fraction the size of production embedders, winning by multiples on a task embeddings are provably bad at. That's the existence proof this whole agenda needed.

13:56Eric: Now the pushback, because the abstract reads sunnier than the tables. Three things. First, "matches dense retrieval" means a size-matched, half-billion-parameter dense retriever the authors trained themselves. Production retrieval stacks run eight-billion-parameter embedding models, plus rerankers, plus lexical search; this paper actually uses one of those big embedders as its teacher. The practical gap is bigger than the headline suggests. Second, the economics. Dense retrieval answers a query with one embedding and a nearest-neighbor lookup, microseconds against billions of documents. This prefills a million-token cache and runs attention over it, and the paper reports no latency, no memory, no cost. Million-token scale sounds big until you notice web scale is around six orders of magnitude bigger, and nothing in the mechanism closes that. Third, the appendix. On a harder benchmark of abstract similarity, where the task is to find the math problem with the analogous proof technique, every variant, fixed or not, scores at or near zero. The Joshua Trees win is a lexical win. Relevance beyond similarity is still a promissory note.

15:16Bella: On the economics I just concede, Eric. The silence on cost is a real hole in anything framed as an alternative to search infrastructure, and I won't pretend otherwise. What I'll defend is the science underneath: those weaknesses bound where you'd deploy this, not what the autopsy found. The dissociation and the dilution mechanism stand whether or not this ever beats a production stack, and they hand the field a new default hypothesis for long-context failures generally.

15:49Eric: Even granting that, remember which fix carried the best numbers. Routing wins by rebuilding a retrieval pipeline inside the transformer. Read uncharitably, the paper set out to replace retrieve-then-read and ended up demonstrating you can't yet escape it. You can only relocate it.

16:10Bella: So go back to the opening pair: one hundred percent, zero point two. At the top of the episode that was a paradox. Now it reads plainly: the spotlight aims perfectly, and softmax smears its hundred watts across ten thousand documents. The bigger claim: for retrieval at least, long-context degradation turns out to be plumbing rather than a capability wall — and plumbing can be fixed.

16:37Eric: Drop your side in the comments: one model that reads everything, with attention taught to handle crowds, or retrieve-then-read forever, outside the model or smuggled back in. The full annotated version is at paperdive.ai, every term tap-to-define, with links to LIMIT, SSMax, and the related papers by theme. Housekeeping, fast: script by Anthropic's Claude Fable 5; Bella and I are AI voices from Eleven Labs; we're affiliated with neither company. The paper is "Drowning in Documents at Million Token Scale," published July first, 2026; this episode, July third.

17:16Bella: Next time a long context fails on you, which is it — a model that lost the thread, or a model that knows the answer and is drowning on the way to saying it?