Episode 198 · Jul 03, 2026 · 17 min

The Model That Knows the Answer and Can't Say It

Gollapudi, Gupta, Singhal et al.

NLP
AI Papers: A Deep Dive
Paper: Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Venue: arXiv:2607.01538
Year: 2026
Link: arxiv.org/abs/2607.01538

A language model reading a million tokens ranks the correct document first on 100% of queries, and still answers correctly just 0.2% of the time. This episode dissects the first controlled test of whether an LLM can replace the retriever, traces the failure to one piece of arithmetic that drowns the answer as the context grows, and walks through the two fixes that recover most of it. The verdict reframes long-context failure entirely: for retrieval, it looks like plumbing, not a wall.

What you'll take away

  • Why acing needle-in-a-haystack tests tells you close to nothing about real retrieval, and how corpora built from hard negatives expose the gap
  • The autopsy result: at layer nineteen, an attention head ranks the gold document first on 100% of queries at a million tokens, while answer accuracy sits at 0.2%
  • The mechanism: softmax's fixed-pie denominator smears attention across the crowd, dropping the correct document's share of the layer's output from 91% to 1%
  • How multiplying scores by the log of corpus size, a one-line contrast knob, resurrects million-token retrieval from 0.2% to 16.5%
  • The existence proof: a half-billion-parameter model beats a dense retriever by 3-4x on a benchmark that single-vector embeddings provably can't solve
  • The catch: the best-performing fix rebuilds retrieve-then-read inside the model, the paper reports no latency or cost numbers, and on abstract-similarity retrieval every variant scores near zero

Chapters

  1. 00:01 It knows the answer, can't say it
  2. 02:26 Why kill a retriever that works?
  3. 04:07 A half-billion-parameter model reads a million tokens
  4. 06:21 The autopsy: was it ever confused?
  5. 09:54 Can one multiplication resurrect retrieval?
  6. 12:41 Who likes Joshua Trees?
  7. 13:55 The tables are darker than the abstract
  8. 16:09 Plumbing, not a capability wall


0:00 Bella: The spotlight never misses. Ten thousand documents sit in this language model's context, about a million tokens, and somewhere in its attention there is always at least one head pointing straight at the correct document. One hundred percent of queries. Then the model opens its mouth to answer, and it gets the question right zero point two percent of the time. It knows the answer. It can't say it.

0:27 Eric: One fact before anything else: this is an AI-made explainer, and both of our voices are AI.

0:33 Bella: By the end of this video you'll have the mechanism behind that gap: a specific piece of arithmetic inside every layer that drowns the answer as the context grows, plus the two fixes that recover most of it. And this matters beyond one paper, because every serious LLM application today runs on a bolt-on retriever doing the searching. This work, from a Berkeley and UT Austin team including Sewon Min, is the first controlled test of whether the model itself can take that job.

1:07 Eric: And the case for "obviously it can" is strong. Context windows hit a million tokens. Models ace needle-in-a-haystack tests. Hide a magic sentence in a wall of unrelated text and they find it every time; vendors put it on the launch slide. So skip the retriever. Dump the corpus into context, ask the model directly, let it apply judgment instead of geometry. The retriever becomes legacy plumbing.

1:35 Bella: Except it collapses the moment you test it honestly, and diagnosing that collapse is the entire paper. The needle tests mislead because the needle is lexically distinctive. Finding a red marble in a bowl of white ones tells you nothing about picking the slightly redder marble out of ten thousand red ones. This paper builds its test corpora from hard negatives, documents a strong model thinks look relevant. And a rival model you'll meet later aces the synthetic needle tests, then degrades on real retrieval until it falls behind a simple baseline. Passing needle-in-a-haystack tells you close to nothing about real retrieval ability.

2:17 Eric: Fine, so the naive version dies. Before the collapse means anything, though: what exactly is being replaced, and why replace something that works?

2:27 Bella: The incumbent works like this: when an AI assistant "searches your documents," it never reads them. A separate model turns each document into a single vector, one point on a huge map, computed once, in advance. Your question becomes a point too, and the system grabs whichever documents sit closest. That's dense retrieval, the engine under retrieval-augmented generation. Fast, proven, and structurally limited, because relevance gets reduced to geometric closeness between two points. Some questions don't live anywhere sensible on that map, and there are provable limits on what one vector per document can encode. We'll meet a benchmark later that was built around exactly that theorem.
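Show-notes sketch: the "closest point on the map" lookup described above can be illustrated in a few lines. This is a toy under stated assumptions, not the paper's retriever: the three-dimensional vectors, the document names, and the helper names `cosine` and `dense_retrieve` are all hypothetical, and a real system would use a learned embedding model and an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dense_retrieve(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents whose vectors sit closest to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical precomputed document vectors: one point per document.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.9, 0.2]  # the question, embedded into the same space

top = dense_retrieve(query_vec, doc_vecs, k=2)
print(top)
```

Everything the retriever knows about relevance is inside that one similarity number, which is the structural limit the episode returns to later.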

3:12 Eric: Which is the real motivation, right, not convenience. A model that reads the corpus can apply whatever notion of relevance the task demands. But this idea isn't new. Google made noise about subsuming retrieval two years ago.

3:29 Bella: Made noise, yes; tested, no. Prior evaluations were either proprietary systems with no controlled comparison, or setups where a real retriever narrows the corpus to a handful of candidates first and the model just sorts them. Reranking dodges the two hard requirements: corpora of millions of tokens, and generalizing to corpus sizes far beyond training, because nobody retrains their retriever every time the document collection grows. So the authors built the honest version themselves. And watching where it breaks is where this gets good.

4:05 Eric: With a deliberately small model, too.

4:08 Bella: Just over half a billion parameters, with a native context of thirty-two thousand tokens, pushed here past a million, thirty times its design limit. In their system, every document goes into context wrapped with a random four-digit code, the query goes at the end, and the model answers by generating the code of the relevant document. The corpus gets processed once into a cache and reused across every query, like a new employee reading the whole company wiki on day one, then answering questions from notes. During that read, each document only attends within itself, which is what makes million-token processing affordable.

4:50 Eric: The random codes sound like a throwaway detail. I suspect they're not. Prior recipes numbered the documents in order?

4:58 Bella: They did, and it's fatal. A model trained on two hundred fifty-six numbered documents has never seen the ID five thousand, and it learns spurious ties between a code and a position. Randomize the codes every training step and the code becomes a pure pointer, with nothing to memorize. Two more moves in one breath: one expensive corpus gets shared across sixteen queries' worth of training signal, and on-policy training fixes a subtle bug. Normal training is learning to drive by watching a perfect driver; here the model drives, drifts, and gets corrected along its own path. The payoff: the old position-coded recipe is dead by five thousand documents. The new recipe stays above ninety-five percent at small scale and keeps meaningful accuracy out to roughly half a million tokens, ten times anything it saw in training.
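Show-notes sketch: the random-code idea can be illustrated as follows. This is a minimal sketch, not the authors' data pipeline: the bracketed `[code] text` layout, the helper names `assign_random_codes` and `build_context`, and the sample biographies are assumptions made for illustration. The only details taken from the episode are that codes are four digits, random, and redrawn every training step.

```python
import random

def assign_random_codes(doc_ids, rng):
    """Draw a fresh, unique four-digit code for every document."""
    # Sampling without replacement keeps codes unique within one context.
    codes = rng.sample(range(10_000), len(doc_ids))
    return {doc_id: f"{code:04d}" for doc_id, code in zip(doc_ids, codes)}

def build_context(docs, codes):
    """Wrap each document with its code (hypothetical layout)."""
    return "\n".join(f"[{codes[doc_id]}] {text}" for doc_id, text in docs.items())

docs = {
    "d1": "Geneva likes quokkas.",
    "d2": "Rafael likes elm trees.",
    "d3": "Mina likes soy sauce.",
}

rng = random.Random(0)
# Redrawing the mapping at every step means a code carries no stable
# meaning and no stable position, so it can only act as a pointer.
step_1 = assign_random_codes(list(docs), rng)
step_2 = assign_random_codes(list(docs), rng)

print(build_context(docs, step_1))
print(build_context(docs, step_2))
```

The training target for a query is then simply the code currently attached to the gold document, so sequential IDs never appear.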

5:55 Eric: And then the cliff. At ten thousand documents, the full million tokens, it drops to that zero point two percent from the open. The obvious read is capacity: half a billion parameters staring at a million tokens, and at some point it just loses track. That's the folk story behind every long-context failure: the model gets confused.

6:22 Bella: The autopsy says the folk story is wrong, and the autopsy is the densest stretch of this episode. It pays off in the strangest table in the paper: a layer that keeps writing at full volume while its content gets swapped out for noise. Three things to track, all on screen now. The raw score: for every token in context, the model computes an unbounded compatibility number: how relevant is this to my query. The softmax: those raw scores get converted into shares of a fixed pie that always sums to one hundred percent. And the blend: what a layer passes downstream is a weighted average of information from every token, weighted by those shares. Scores rank; softmax slices; the blend is all anything downstream ever sees.
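Show-notes sketch: the three quantities named above (raw scores, softmax shares, and the blend) fit in a short toy. This is standard single-query attention written out by hand, not code from the paper; the scores and the two-dimensional value vectors are made-up numbers.

```python
import math

def softmax(scores):
    """Convert unbounded raw scores into shares that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend(shares, values):
    """Weighted average of the value vectors: what the layer passes on."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(shares, values)) for i in range(dim)]

# Token 0 is the relevant one; its value vector points along the first axis.
scores = [4.0, 1.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

shares = softmax(scores)
output = blend(shares, values)

print([round(s, 3) for s in shares])
print([round(x, 3) for x in output])
```

Later layers receive only `output`. The ordering of `scores` is not passed along, which is why an intact ranking does not by itself guarantee a usable signal.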

7:12 Eric: Then the diagnostic question writes itself. At a million tokens, do the raw scores still rank the gold document, the correct one, at the top? If the ranking has decayed, it's a capacity story after all.

7:26 Bella: They could check directly, because they always know which document is gold. So they went head by head, layer by layer... and the ranking was intact. Not degraded: intact. At layer nineteen, at least one attention head puts the gold document's raw score first for one hundred percent of queries, at every corpus size, up to and including a million tokens. While the generated answer is right zero point two percent of the time. The internal search never fails.

7:58 Eric: Then I'm missing something, Bella. If some head is holding the correct ranking, why can't a later layer just read the ranking off?

8:07 Bella: Because no layer ever sees a ranking. It sees the blend, and the blend uses shares, and the softmax denominator sums over every token in context, relevant or not. This is the spotlight from the cold open: attention has a fixed hundred watts of total light to hand out. In a room of five hundred documents, aimed at the right one, the target glows. Add nine and a half thousand more objects and the aim stays perfect, but the same hundred watts smears across everything, and the target gets a fraction of a watt. Now watch the figure on screen. The gold bar is the share of this layer's output that comes from the correct document. At five hundred documents: ninety-one percent. At ten thousand: one percent. And watch the total height of the output next to it. It barely moves, down about a third. The layer keeps writing into the network at full strength. It's just writing the average of ten thousand distractors instead of the answer.
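Show-notes sketch: the dilution arithmetic can be reproduced with a deliberately crude toy. The assumptions are loud ones: each document is treated as a single token, every distractor gets raw score 0, and the gold document gets a fixed raw score `gap` (5.0 here, an arbitrary choice). The numbers are illustrative only and are not meant to match the paper's 91% and 1%.

```python
import math

def gold_share(gap, n_docs):
    """Softmax share of the gold item when it scores `gap` and the
    other n_docs - 1 items all score 0 (toy assumption)."""
    gold = math.exp(gap)
    distractors = (n_docs - 1) * math.exp(0.0)
    return gold / (gold + distractors)

gap = 5.0  # gold always outranks every distractor, at every corpus size
shares = {n: gold_share(gap, n) for n in (500, 2_000, 10_000)}

for n, share in shares.items():
    print(f"{n:>6} docs: gold ranked first, share = {share:.3f}")
```

The ranking never changes, because `gap` is positive at every size. Only the denominator grows, and the gold document's share of the blend shrinks with it.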

9:12 Eric: That last part is the nasty bit. If the signal faded, a later layer might notice something missing. Instead the layer never goes quiet. It starts confidently saying the average of everything irrelevant, at the same amplitude the answer used to have. Downstream has no cue that a swap happened. So the model ranks perfectly and still fails, because...?

9:36 Bella: Because winning the ranking only guarantees you the biggest slice of the pie, and at a million tokens the biggest slice is one percent. The model stays on target; the averaging drowns it. Which means the fixes shouldn't touch the model's knowledge at all. They should attack the denominator.

9:55 Eric: Three candidates, and one fails in a useful way, right?

9:58 Bella: The failure first: a learned attention sink, a constant added to the denominator, borrowed from streaming-stability work. It barely helps, and the reason is clean: a constant can let an unconfident layer quiet itself down, but it can't fight a denominator that grows with every document you add. The fix that works is almost embarrassingly small: multiply the raw scores by the logarithm of the corpus size before the softmax. A contrast knob. The bigger the crowd, the more you amplify the gaps between scores, and log of N turns out to be just enough to cancel the crowd's growth in the denominator. Because the corpus size is plugged in at inference time rather than learned, it keeps working at sizes the model never trained on. And here's the prediction that makes it a real test: if dilution is the bottleneck, not capacity, one multiplication should resurrect million-token retrieval. It does: from zero point two percent to sixteen and a half. An eighty-two-fold recovery from a contrast knob.
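Show-notes sketch: why log N is "just enough" can be checked in a toy. Assume, purely for illustration, that each document is one token, distractors score 0, and the gold document scores `gap`. Multiplying by log N turns the gold term into exp(gap · log N) = N^gap, while the distractor terms still sum to N − 1, so the gold share is N^gap / (N^gap + N − 1), which stays high whenever gap exceeds 1. The episode does not give the paper's exact formulation (log base, per-head details), so treat the function below as an assumption, not the authors' code.

```python
import math

def gold_share(gap, n_docs, scale=1.0):
    """Gold item's softmax share when raw scores are multiplied by `scale`.
    Toy assumption: gold scores `gap`, every distractor scores 0."""
    gold = math.exp(scale * gap)
    return gold / (gold + (n_docs - 1))

gap = 1.5
sizes = (500, 10_000)

plain = {n: gold_share(gap, n) for n in sizes}
scaled = {n: gold_share(gap, n, scale=math.log(n)) for n in sizes}

for n in sizes:
    print(f"{n:>6} docs: plain = {plain[n]:.4f}, log-N scaled = {scaled[n]:.4f}")
```

Because N is supplied when the query runs rather than learned, the same line applies unchanged at corpus sizes never seen in training, which matches the point made in the episode.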

11:05 Eric: The second fix is blunter: clear the room. Run the first sixteen layers, use the model's own mid-network scores to shortlist the top two hundred fifty-six documents, and finish with only those in context. The shortlist keeps the gold document about ninety-six percent of the time. Stack both fixes and the million-token score edges past the dense retriever, twenty point five versus twenty point two, from a model running at thirty times its design limit. It also matches or beats MSA, a concurrent model seven times larger, trained on much longer contexts, the same MSA that aces the synthetic needle tests. But Bella, look at what routing is. A first stage that narrows candidates, a second stage that reads them closely. That's retrieve-then-read, rebuilt inside the model that was supposed to delete it. The authors admit this in so many words. Hold that thought.
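Show-notes sketch: the routing fix can be pictured as a top-k shortlist followed by a softmax over the survivors only. Everything concrete here is a toy assumption: one score per document, synthetic distractor scores between 0 and 1.9, a gold document at 5.0, and the helper names `shortlist` and `share_within`. Only the shortlist size of 256 and the corpus size of ten thousand come from the episode.

```python
import heapq
import math

def shortlist(mid_scores, k):
    """Keep the k documents with the highest mid-network scores."""
    return heapq.nlargest(k, mid_scores, key=mid_scores.get)

def share_within(scores, doc_ids, target):
    """Softmax share of `target` when only `doc_ids` remain in context."""
    top = max(scores[d] for d in doc_ids)
    exps = {d: math.exp(scores[d] - top) for d in doc_ids}
    return exps[target] / sum(exps.values())

# Toy corpus: 10,000 documents with deterministic distractor scores.
mid_scores = {f"doc{i}": (i % 20) / 10 for i in range(10_000)}
gold = "doc4242"
mid_scores[gold] = 5.0

kept = shortlist(mid_scores, k=256)
full_share = share_within(mid_scores, list(mid_scores), gold)
routed_share = share_within(mid_scores, kept, gold)

print(f"gold survives shortlist: {gold in kept}")
print(f"share over full corpus:  {full_share:.4f}")
print(f"share over shortlist:    {routed_share:.4f}")
```

The first stage narrows and the second stage reads closely, which is the retrieve-then-read shape Eric points out.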

12:10 Bella: Held, and hold a second one with it, Eric: the ceiling is one hundred percent and the best readout recovers about twenty. These fixes mitigate dilution; they don't close the gap, and the authors say that plainly too. Still, matching on dense retrieval's home turf was never the endgame. The real question is whether reading can beat distance-on-a-map at something the map can't do.

12:41 Eric: That's the next benchmark, and it's built on a theorem, not a vibe: there exist combinations of relevance judgments that no one-vector-per-document scheme can represent, however good the embeddings. The test itself is almost silly. Short biographies: "Geneva Durben likes Quokkas, River Otters, Tapirs, Joshua Trees, Pansies, Soy Sauce, Cards Against Humanity and Elm Trees." Query: who likes Joshua Trees? Trivial to state, hostile to geometry, because one point per person can't encode every combination of hobbies.

13:21 Bella: And the model never trained on anything lexical like this, so it's fully out of distribution. The fixed variants beat the dense retriever at every corpus size, by three to four times at the larger ones. At five thousand documents, around eight hundred fifty thousand tokens, it's point one four nine versus point zero three five. A model a fraction the size of production embedders, winning by multiples on a task single-vector embeddings are provably bad at. That's the existence proof this whole agenda needed.

13:56 Eric: Now the pushback, because the abstract reads sunnier than the tables. Three things. First, "matches dense retrieval" means a size-matched, half-billion-parameter dense retriever the authors trained themselves. Production retrieval stacks run eight-billion-parameter models, plus rerankers, plus lexical search; this paper actually uses one of those big embedders as its teacher. The practical gap is bigger than the headline suggests. Second, the economics. Dense retrieval answers a query with one embedding and a nearest-neighbor lookup, microseconds against billions of documents. This system prefills a million-token cache and attends over it, and the paper reports no latency, no memory, no cost. Million-token scale sounds big until you notice web scale is around six orders of magnitude bigger, and nothing in the mechanism closes that. Third, the appendix. On a harder benchmark of abstract similarity, where the task is to find the math problem with the analogous proof technique, every variant, fixed or not, scores at or near zero. The Joshua Trees win is a lexical win. Relevance beyond similarity is still a promissory note.

15:16 Bella: On the economics I just concede, Eric. The silence on cost is a real hole in anything framed as an alternative to search infrastructure, and I won't pretend otherwise. What I'll defend is the science underneath: those weaknesses bound where you'd deploy this, not what the autopsy found. The dissociation and the dilution mechanism stand whether or not this ever beats a production stack, and they hand the field a new default hypothesis for long-context failures generally.

15:49 Eric: Even granting that, remember which fix carried the best numbers. Routing wins by rebuilding a retrieval stage inside the model. Read uncharitably, the paper set out to replace retrieve-then-read and ended up demonstrating you can't yet escape it. You can only relocate it.

16:10 Bella: So go back to the opening pair: one hundred percent, zero point two. At the top of the episode that was a paradox. Now it reads plainly: the spotlight aims perfectly, and smears its hundred watts across ten thousand documents. The bigger claim: for retrieval at least, long-context degradation turns out to be plumbing rather than a wall, and plumbing can be fixed.

16:37 Eric: Drop your side in the comments: one model that reads everything, with attention taught to handle crowds, or retrieve-then-read forever, outside the model or smuggled back in. The full annotated version is at paperdive.ai, every term tap-to-define, with links to the related papers by theme. Housekeeping, fast: script by Anthropic's Fable 5; Bella and I are AI voices from Eleven Labs; we're affiliated with neither company. The paper is "Drowning in Documents at Million Token Scale," published July first, 2026; this episode, July third.

17:16 Bella: Next time a long context fails on you, which is it: a model that lost the thread, or a model that knows the answer and is drowning on the way to saying it?