Episode 037 · May 12, 2026 · 27 min

Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Elbadry, Heakl, Zhang et al.

paperdive.ai

Concepts in this episode

Mechanistic Interpretability · AI Safety · Hallucination · Probing · Causal Intervention · Linear Representation · Residual Stream · Circuit Analysis · In-Context Learning · Evaluation & Benchmarks

About this episode

Paper: The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

Venue: arXiv:2605.09195

Year: 2026

Read the paper: arxiv.org/abs/2605.09195

Every hallucination detector we have performs at coin-flip accuracy on one specific kind of error: confidently wrong answers about facts that were true when the model was trained but have since changed. A new paper argues this isn't an engineering miss; it's geometry. The staleness signal lives on its own axis inside the model, roughly perpendicular to the directions current detectors are listening to, and a tiny linear probe can read it with about ninety percent accuracy.

What you'll take away

Why temporal knowledge drift sits on a representational axis that's roughly orthogonal to both correctness and uncertainty — and what five convergent tests show about that independence
How the cross-cutoff experiment uses byte-identical prompts on differently-aged models to prove the probe is reading internal knowledge state, not properties of the question
Why retrieval circuits in the MLP layers produce nearly identical dynamics for stale recall and outright confabulation, which is exactly why confidence-based gating can't separate them
The deployment hole this exposes: at standard entropy thresholds, more than half of stale answers slip through, and many are more confident than the median correct answer
Where the paper's framing reaches further than its evidence — narrow Wikidata-shaped facts, mid-scale models, and a supervised probe that needs labeled drift data
The broader interpretability question the result raises: how many other useful signals are encoded inside models but never consulted at output time?

Chapters

00:00 The detector gap nobody noticed
03:20 Three independent axes in the residual stream
06:41 Null-space projection and the orthogonality evidence
10:01 Why the retrieval circuit can't tell the difference
13:22 Latent but activatable: the dormant gauge
16:42 The cross-cutoff experiment
20:03 Where the framing outruns the evidence
23:23 A new taxonomy of being wrong

References in this episode

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets — The prior 'Geometry of Truth' work the episode credits for establishing the line
Detecting hallucinations in large language models using semantic entropy — The semantic entropy detector that this episode benchmarks as sitting near coin-
Locating and Editing Factual Associations in GPT (ROME) — The canonical activation-patching study of MLP-based fact retrieval circuits tha
Discovering Latent Knowledge in Language Models Without Supervision (CCS) — Introduces the contrastive probing approach the episode lists among the existing

Full transcript

0:00Cassidy: Here's the everyday version of the problem. You ask a language model who chairs the Bank of England. It tells you, fluently, with zero hedging, a name. The model is confident. The name is wrong — that person stepped down two years ago — but the model isn't confused, it isn't uncertain, and it isn't even hallucinating in the usual sense. It's just remembering what was true the last time it read the internet. And here's the part that should bother you: every hallucination detector we have — semantic entropy, internal probes, confidence scoring, all of it — looks at that answer and shrugs. They sit at coin-flip accuracy. Not "a bit worse than the new method." Indistinguishable from guessing.

0:46Tyler: And the paper we're spending today on argues that's not an engineering failure; it's geometry. The paper went up on arXiv on May ninth, twenty-twenty-six, and we're recording three days later. It's called "The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations," out of MBZUAI and INSAIT. Quick ground rules before we dig in: this episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Tyler, that's Cassidy, and we're both AI voices from ElevenLabs; the producer isn't affiliated with either company. And the reason that "geometry" word in the title is doing real work is that the authors aren't just saying current detectors miss stale facts. They're saying the staleness signal is literally in a different place inside the model than the signals the detectors are listening to.

1:43Cassidy: Right — and "different place" is going to be the load-bearing image for this whole episode, so let me set it up properly. Modern language models, as a token moves through the layers, build up this big high-dimensional vector that researchers call the residual stream. Think of it as an extremely wide mixing board. And what the field has been figuring out, over the last few years, is that a lot of meaningful things — is this sentence true, is the sentiment positive, what language is this — show up as particular *directions* in that mixing board. Specific combinations of sliders that, when you read them off, tell you something the model has implicitly figured out.

2:29Tyler: And there's a well-known result from a couple years back — the "Geometry of Truth" line of work — showing that true-versus-false sits along one such direction. You can train a tiny linear classifier on the model's hidden states and it'll tell you, with decent accuracy, whether the model's answer is correct. That paper basically established the move.

2:53Cassidy: Exactly. And this new paper does the same move for a third thing. Not "is this true." Not "is the model sure." But "is this fact stale" — meaning, did the world change between when the model finished training and now? And the first headline result is just: yes, that signal exists, and it's linearly readable. They train a simple probe on six instruction-tuned models — Llama-2, Mistral, Llama-3.1, Qwen, two Gemma variants — with cutoffs spanning about twenty-one months. The drift probe hits between eighty-three and ninety-five percent accuracy. Round it to about ninety. Meanwhile every existing detector they benchmark — token entropy, semantic entropy, CCS, SAPLMA — clusters right around fifty.

3:41Tyler: Which, just to calibrate, fifty is coin-flip. The best of those, semantic entropy probes, gets to fifty-seven. So we're talking about roughly ninety versus chance. That's not a gap you close by tweaking a detector. That's a gap that says you're listening to the wrong thing.
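
For readers following along in text, the "tiny linear classifier on hidden states" idea can be sketched in a few lines. This is an illustration only, not the paper's code: the hidden states here are synthetic Gaussian vectors with a label-dependent shift planted on one axis, and the function names, dimensions, and hyperparameters are all assumptions chosen for the example.

```python
import math
import random

def train_linear_probe(X, y, lr=0.5, epochs=100):
    """Logistic regression fit by batch gradient descent; returns (weights, bias)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the probe score matches the label."""
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in X]
    return sum(int(p == t) for p, t in zip(preds, y)) / len(y)

def make_synthetic_states(n, rng, dim=8, shift=2.0):
    """Hypothetical stand-in for hidden states: noise plus a label shift on axis 3."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = "stale", 0 = "current" (synthetic labels)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[3] += shift if label else -shift
        X.append(x)
        y.append(label)
    return X, y

rng = random.Random(0)
train_X, train_y = make_synthetic_states(400, rng)
test_X, test_y = make_synthetic_states(200, rng)
weights, bias = train_linear_probe(train_X, train_y)
held_out_accuracy = probe_accuracy(weights, bias, test_X, test_y)
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

The learned weight vector is the "direction" the hosts keep referring to: on real activations it would be fit on cached residual-stream vectors rather than on synthetic noise.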

3:59Cassidy: And that's the puzzle that makes this paper a paper. Because the obvious question is: if the staleness information is sitting right there in the activations, readable by a tiny classifier, why can't the *other* probes — the truthfulness probe, the uncertainty probe — pick it up? They're looking at the same activations. They're using the same kind of math. Why are they blind?

4:23Tyler: And the answer the paper gives is the conceptual core. It's geometric. The staleness direction is, as best as five different tests can tell, perpendicular to the wrongness direction and perpendicular to the uncertainty direction. Three roughly independent axes. Three knobs the model can turn separately.

4:44Cassidy: And maybe the cleanest way to feel this — imagine the model has a control panel inside it with thousands of knobs, and three of them happen to be labeled "am I right," "am I sure," and "is this fact stale." The first two are wired together, which is what you'd expect; when the model is confident, it tends to also be right. Their correlation is meaningful — negative thirteen to negative forty, depending on the model. So if you build a detector for "wrongness," you're partly also detecting "uncertainty." Those two share structure. But the third knob, the staleness one — the model is turning it. It really does crank that knob up when a fact has aged out. It's just that the knob spins completely independently of the other two, and nobody has been plugged into it.

5:34Tyler: So when you point a wrongness detector at a stale answer, it picks up nothing — not because the information isn't in the model, but because the information lives along an axis your detector has no read on. It's like tuning a radio to the wrong frequency. You can crank the volume all you want on the wrongness band; the staleness station is broadcasting somewhere else.

6:00Cassidy: And the part I want to spend a moment on — because it's so easy to wave past — is how the authors actually established that the directions are independent. They did it five different ways, and the temptation in a paper like this is to walk through all five. We're not going to. But the one that's easiest to picture, and I think the most convincing, is what they call null-space projection.

6:27Tyler: Walk me through it, Cassidy.

6:28Cassidy: So you've already trained a correctness probe — it has a direction in activation space. You've already trained an uncertainty probe — it has its own direction. Now you take the model's residual stream and you literally erase any component that lives along those two directions. Project it into the subspace that's orthogonal to both. You've stripped out everything that correctness and uncertainty are made of. Then you ask: can a fresh drift probe, trained on this scrubbed signal, still detect staleness?

7:03Tyler: And the answer is...

7:04Cassidy: The AUROC drops by less than one one-thousandth. Drift detection is essentially untouched. Which means the drift information isn't *correlated* with those two axes in some weak statistical way that survives scrubbing. It lives in a genuinely different subspace. They then repeat that procedure ten times over, in case the nuisance signal hides across multiple dimensions. Same result. They also do an untrained version — just comparing the mean activation of stale versus non-stale examples — to rule out that the probe is learning some artifact of how it was trained. All five tests converge.
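
The scrubbing step Cassidy describes is ordinary linear algebra, and a minimal sketch may help. The four-dimensional vectors below are made up for illustration, and the helper names are hypothetical; the point is only that projecting onto the orthogonal complement of the correctness and uncertainty directions leaves any component on a third, orthogonal axis untouched.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(directions):
    """Gram-Schmidt: turn possibly correlated directions into an orthonormal basis."""
    basis = []
    for d in directions:
        v = list(d)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip directions already spanned by the basis
            basis.append([vi / norm for vi in v])
    return basis

def project_out(x, basis):
    """Remove every component of x that lies in the span of the orthonormal basis."""
    out = list(x)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

# Made-up probe directions: correctness and uncertainty overlap, drift does not.
correctness_dir = [1.0, 0.0, 0.0, 0.0]
uncertainty_dir = [0.6, 0.8, 0.0, 0.0]
drift_dir = [0.0, 0.0, 1.0, 0.0]

hidden_state = [2.0, -1.0, 3.0, 0.5]
nuisance_basis = orthonormalize([correctness_dir, uncertainty_dir])
scrubbed = project_out(hidden_state, nuisance_basis)
print(scrubbed)
```

A fresh drift probe would then be trained on vectors like `scrubbed`. Repeating the cycle (train a nuisance probe, project it out, retrain) gives the iterated variant mentioned in the episode.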

7:45Tyler: That convergence is important, because each individual test has a loophole you could imagine exploiting. Weight cosines could be fooled by sparsity. Score correlations could be fooled by lucky geometry. The single-projection test could miss multi-dimensional dependence. But the five tests close each other's loopholes, and the answer doesn't move. Which gets us to the question I find genuinely most compelling — why? Why would the model encode staleness on its own private axis?

8:17Cassidy: Yeah, and the paper has a really crisp mechanistic answer for that, and Tyler, I think this is your thread to take.

8:25Tyler: Sure. So the authors lean on a technique called activation patching to figure out where, inside the network, the model is doing the work of retrieving these facts. The way to picture activation patching — you run the model twice, on two prompts that differ by one token. Maybe "In twenty-sixteen, the chair of X was…" versus "In twenty-twenty-four, the chair of X was…" You cache all the internal activations from both runs. Then, at one specific spot in the network — one layer, one position — you paste the clean activation into the corrupted run and see if the output snaps back to the clean answer. If it does, you've found a spot where that information is actually flowing. If it doesn't, that spot isn't doing the work.

9:13Cassidy: It's the surgical version of asking "where does this thought happen."
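
Activation patching as Tyler describes it can be shown on a toy model. Everything here is an assumption made for illustration: the two-layer "network", its two slots (a "year" slot and an "entity" slot), and the inputs are invented, and real patching would hook into a transformer's cached activations instead.

```python
def layer0(h):
    # Toy rule: the "year" slot (index 0) is copied into the "entity" slot (index 1).
    return [h[0], h[1] + h[0]]

def layer1(h):
    # Toy rule: the entity slot is enriched; the year slot is no longer read.
    return [0.0, 2.0 * h[1]]

LAYERS = [layer0, layer1]

def run(x, patch=None):
    """Run the toy model; optionally overwrite one (layer, slot) activation."""
    h, cache = list(x), {}
    for i, layer in enumerate(LAYERS):
        h = layer(h)
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        cache[i] = list(h)
    return h[1], cache  # slot 1 after the last layer plays the role of the answer logit

clean_input = [1.0, 5.0]    # prompt with the original year token
corrupt_input = [0.0, 5.0]  # same prompt with the year token swapped

clean_out, clean_cache = run(clean_input)
corrupt_out, _ = run(corrupt_input)

# Paste each clean activation into the corrupted run and measure how much of the
# clean output comes back (1.0 = fully restored, 0.0 = no effect).
restored = {}
for layer_idx in range(len(LAYERS)):
    for slot in range(2):
        patched_out, _ = run(corrupt_input, patch=(layer_idx, slot, clean_cache[layer_idx][slot]))
        restored[(layer_idx, slot)] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print(restored)
```

In this toy, only patches to the entity slot restore the clean output, which mirrors the qualitative claim that the year information has already migrated to the entity by the time it matters.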

9:18Tyler: Right. And what they find is that for fact retrieval, the heavy lifting happens in the MLP layers, the feedforward parts, in the middle of the network. Information about the year migrates to the entity, the entity representation gets enriched, and then by late layers the right name is sitting where it needs to be for the model to output it. Pretty standard fact-retrieval circuit. Here's the punchline. When they compare the MLP trajectory for a stale answer (the model confidently naming the person who used to hold the role) to the MLP trajectory for a confabulation, where the model names someone who *never* held the role, the layer-by-layer dynamics are nearly identical. The correlation is above zero point eight on every model, and often above zero point nine four.

10:06Cassidy: So from the retrieval circuit's perspective, "I'm recalling something I learned that's no longer true" and "I'm fabricating something whole cloth" look the same.

10:17Tyler: Identical commitment, identical retrieval strength. Which is exactly why output confidence can't separate them. The reason confidence-based detection fails on stale facts isn't that the engineers built bad detectors. It's that the part of the model producing those confidence numbers is reading from a circuit that *can't tell the difference*. The drift information isn't in that circuit. It's encoded somewhere else in the residual stream — readable, present, sitting right there — but the model's answer-extraction pathway never consults it.
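
The "nearly identical dynamics" comparison comes down to correlating two layer-by-layer curves. A minimal sketch, assuming each trajectory is summarized as one number per layer (for example, an MLP patching effect); the two lists below are invented values, not results from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical per-layer MLP effects for the two failure modes (made-up numbers).
stale_recall = [0.1, 0.3, 0.9, 1.4, 0.8, 0.2]
confabulation = [0.1, 0.4, 1.0, 1.3, 0.7, 0.2]

similarity = pearson(stale_recall, confabulation)
print(f"trajectory correlation: {similarity:.3f}")
```

A value close to one means the retrieval circuit behaves the same way in both cases, which is the reason a confidence score computed downstream of it cannot tell them apart.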

10:52Cassidy: Which is, in a real sense, a model that knows more than it says.

10:57Tyler: And the authors borrow a term for that from prior interpretability work: "latent but activatable." The signal is dormant under normal operation, but it's wired up correctly. And they show it's wired up correctly with what they call causal steering. If you take the drift direction in activation space and you *amplify* it, pushing the model's residual stream further along that axis, the logits reorganize. The stale answer's score drops by one and a half to six and a half points. The score for the *current* holder rises. The model produces a better answer. The information was sitting there the whole time, ready to be used.

11:41Cassidy: And if you do the opposite — if you zero out the drift direction — almost nothing changes about the output. Which confirms the asymmetry. The signal exists, it's causally meaningful, but in default inference the model just isn't routing through it.
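
The two interventions in this exchange, amplifying the drift direction and zeroing it out, are simple vector operations. The sketch below only illustrates that arithmetic on a three-dimensional toy: the residual vector, the unembedding rows, the token names, and the steering strength `alpha` are all invented, and the toy says nothing about how a real model weights the drift axis.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

drift_dir = [0.0, 0.0, 1.0]  # unit-length drift direction (hypothetical)

# Hypothetical unembedding rows: both answers barely read the drift axis.
unembed = {
    "stale_name": [1.0, 0.2, -0.1],
    "current_name": [0.8, 0.3, 0.1],
}

def logits(h):
    return {token: dot(row, h) for token, row in unembed.items()}

def steer(h, direction, alpha):
    """Push the residual vector alpha units further along the direction."""
    return [hi + alpha * di for hi, di in zip(h, direction)]

def ablate(h, direction):
    """Zero out the component of h along a unit-length direction."""
    c = dot(h, direction)
    return [hi - c * di for hi, di in zip(h, direction)]

residual = [2.0, 1.0, 0.5]
base = logits(residual)
steered = logits(steer(residual, drift_dir, alpha=20.0))
ablated = logits(ablate(residual, drift_dir))

print("base   ", base)
print("steered", steered)
print("ablated", ablated)
```

In the toy, ablation moves each logit only slightly and the stale answer still wins, while strong amplification flips the ranking toward the current answer. That is the same asymmetry the hosts describe, reproduced here by construction rather than measured.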

11:58Tyler: Now I want to flag — because this is the kind of thing a careful skeptic should flag — the authors lean a little harder on the "dormant" interpretation than the data strictly forces. You could also read the result as "the signal already contributes weakly, and amplification just exaggerates a small effect." The numbers are consistent with either reading. I think the dramatic version is probably right, but it's not airtight.

12:28Cassidy: That's fair, and it's a good seam to bring up the experiment that I think *is* airtight — the piece of evidence I'd lead with if I were teaching this paper to anyone. The cross-cutoff test. Tyler, do you want to walk through it?

12:44Tyler: Happy to. So imagine two models. Llama-2 finished training in September twenty-twenty-two. Mistral-7B finished in September twenty-twenty-three. Now find a fact that changed sometime in between — say, a coaching change at a soccer club in early twenty-twenty-three. Llama-2 was trained before that change. Mistral was trained after it. Now give both models the *byte-identical* prompt: "Who coaches this team in twenty-twenty-six?" Same prompt. Same question. Same words on the screen.

13:16Cassidy: And then you run the drift probe on each model's internal state.

13:20Tyler: Right. And the result is: on Llama-2, the older model, the probe fires. It reports staleness. On Mistral, the newer model, the probe stays silent. Same input. Different verdict. Across the twelve model pairs they tested, the expected pattern, where the earlier-cutoff model's probe fires and the later-cutoff model's stays silent, shows up between ninety-seven and ninety-nine point eight percent of the time. The reverse pattern, the newer model lighting up while the older model doesn't, happens almost never: zero to a fifth of a percent.

13:54Cassidy: And the reason that result is so clean — the input is held perfectly constant. The only thing varying is what each model was trained on. So whatever the probe is reading, it cannot be a property of the question. It has to be a property of the model's own internal knowledge state.
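
One way to score the cross-cutoff test, under an assumed reading of the metric, is to count paired outcomes on identical prompts. The function name and the ten example outcomes below are hypothetical; the paper's exact definition may differ.

```python
def cross_cutoff_rates(older_fires, newer_fires):
    """Return (expected_rate, reversed_rate) over prompts shared by both models.

    expected: older-cutoff probe fires while newer-cutoff probe stays silent.
    reversed: newer-cutoff probe fires while older-cutoff probe stays silent.
    """
    pairs = list(zip(older_fires, newer_fires))
    expected = sum(1 for old, new in pairs if old and not new)
    reversed_count = sum(1 for old, new in pairs if new and not old)
    return expected / len(pairs), reversed_count / len(pairs)

# Made-up probe decisions on ten byte-identical prompts about facts that
# changed between the two cutoffs.
older_model_fires = [True] * 9 + [False]
newer_model_fires = [False] * 10

expected_rate, reversed_rate = cross_cutoff_rates(older_model_fires, newer_model_fires)
print(expected_rate, reversed_rate)
```

Because the prompt is the same for both models, any gap between the two rates has to come from the models' internal states rather than from the wording of the question.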

14:13Tyler: It's the closest thing to a causal experiment you can do with a probe. Think of it as time-capsule twins. Two people who grew up identically until twenty-twenty-two. One stopped reading the news then; the other kept up through twenty-twenty-four. You hand both a trivia card about a twenty-twenty-three event, and a third party — who can read each twin's facial micro-expressions but not hear their answer — has to guess which twin is operating from outdated information. The third party gets it right ninety-eight percent of the time. The trivia card is identical. The only thing that differs is what each twin knows.

14:55Cassidy: And I think this is the result that should convince anyone still on the fence that the geometric story isn't just an artifact. Because if the probe were detecting something about the question — say, the year mentioned in the prompt — then it would behave the same on both models. It doesn't. It tracks internal knowledge state with near-perfect reliability across twelve different model pairs.

15:19Tyler: One thing the authors had to be careful about, by the way, and it's worth flagging lightly — drifted facts tend to cluster in post-cutoff query years. A fact can only have drifted if the world changed *after* the model was trained. So an unconstrained probe could partly cheat by just detecting which year the prompt mentions. The authors restrict their training to query years strictly after each model's cutoff, eliminating that calendar-token shortcut. The probe is genuinely reading the knowledge state, not the year token.

15:51Cassidy: Which is the kind of methodological hygiene that earns trust. They went looking for the boring explanation before they wrote up the interesting one.
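
The calendar-token control amounts to a filter on the probe's training set. A minimal sketch, assuming each example records the year its prompt asks about and that the cutoff is compared at year granularity; the class, field names, and prompts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProbeExample:
    prompt: str
    query_year: int
    drifted: bool

def post_cutoff_only(examples, cutoff_year):
    """Keep only prompts whose query year is strictly after the model's cutoff year."""
    return [ex for ex in examples if ex.query_year > cutoff_year]

examples = [
    ProbeExample("Who coached the club in 2021?", 2021, False),
    ProbeExample("Who coached the club in 2023?", 2023, True),
    ProbeExample("Who chaired the board in 2024?", 2024, False),
]

training_set = post_cutoff_only(examples, cutoff_year=2022)
print([ex.query_year for ex in training_set])
```

With every remaining prompt mentioning a post-cutoff year, both drifted and non-drifted examples look alike on the calendar token, so the year alone can no longer predict the label.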

16:00Tyler: Yeah. And it's worth saying — this paper is unusually disciplined about confounds in general. There's a separate appendix where they handle a known artifact in the Gemma models, where direct logit attribution peaks at layer zero because of tied embeddings. They report results both with and without that layer, the headline survives, and they flag the issue rather than hiding it.

16:24Cassidy: So let's pull the thread on what this means practically. Because the whole reason hallucination detection matters in the first place is that every production LLM deployment today uses some form of confidence-based gating. If the model isn't sure, route to retrieval, or to a human, or refuse to answer. That's the architecture pattern.

16:44Tyler: And what this paper is saying is: that architecture has a hole in it the size of every fact in the world that has changed since the model was trained.

16:54Cassidy: Which is a lot of facts. CEOs, heads of state, scientific consensus on a moving question, the current price of anything, who owns what company, what version of a software product is current. A model trained in twenty-twenty-three will confidently and consistently tell you twenty-twenty-three facts in twenty-twenty-five. And the system gating it on confidence will pass those answers through, because the model is performing genuine retrieval — it really does "remember" the answer. It's not uncertain.

17:26Tyler: The number that drove this home for me — at the eightieth-percentile entropy threshold, the cutoff most deployment systems use to decide "this answer is suspicious," fifty-five percent of stale recalls slip through. And of those, almost a third are *more* confident than the median correct answer.

17:45Cassidy: The model is more sure about the obsolete answer than it is about the answers it gets right.

17:52Tyler: Because it's doing what it's supposed to do. It learned a fact. It's retrieving the fact. The fact was true. The world moved. The model didn't.
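
The slip-through number comes from a simple calculation: set an entropy threshold at a percentile of some reference distribution, flag anything above it, and count how many stale answers fall below. The sketch assumes a nearest-rank percentile and uses invented entropy values; which reference set the paper uses for the threshold is not specified here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def slip_through_rate(stale_entropies, threshold):
    """Share of stale answers that are NOT flagged (entropy at or below threshold)."""
    passed = [e for e in stale_entropies if e <= threshold]
    return len(passed) / len(stale_entropies)

# Made-up entropies for a reference set of answers and for known-stale answers.
reference_entropies = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
stale_entropies = [0.15, 0.25, 0.95, 0.35]

threshold = percentile(reference_entropies, 80)
rate = slip_through_rate(stale_entropies, threshold)
print(threshold, rate)
```

A stale answer produced by confident retrieval has low entropy, so it lands under the threshold and is passed along unflagged.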

18:01Cassidy: And the fix the paper proposes is really elegant. You don't need to retrain the model. You don't need to overhaul the deployment system. You train a small linear probe — orders of magnitude cheaper than anything else — on a dataset of facts you know have drifted. You attach it to the model's internal state. And you use *that* as the trigger for external retrieval, instead of confidence. The hole closes.
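
The deployment pattern Cassidy outlines, using the drift probe rather than confidence as the retrieval trigger, can be sketched as a small gate. The probe weights, bias, threshold, and hidden-state vectors are all hypothetical; a real system would read the hidden state from the model at a chosen layer and use weights fit on labeled drift data.

```python
import math

def drift_score(hidden, weights, bias):
    """Probability-like score from a linear probe applied to one hidden state."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def should_retrieve(hidden, weights, bias, threshold=0.5):
    """Trigger external retrieval when the probe says the fact is likely stale."""
    return drift_score(hidden, weights, bias) >= threshold

# Hypothetical probe that reads only the third coordinate.
probe_weights = [0.0, 0.0, 2.0]
probe_bias = -1.0

stale_hidden = [0.3, -0.2, 1.5]  # made-up state with a large drift component
fresh_hidden = [0.3, -0.2, 0.0]  # made-up state with no drift component

print(should_retrieve(stale_hidden, probe_weights, probe_bias))
print(should_retrieve(fresh_hidden, probe_weights, probe_bias))
```

The gate costs one dot product per query, which is why the approach is cheap relative to retraining the model.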

18:27Tyler: Now — Cassidy, I want to push back here, because this is where the paper's framing starts doing a little more work than the evidence strictly supports.

18:37Cassidy: Go for it.

18:37Tyler: The dataset is narrow. Four relation types from Wikidata. Heads of government, head coaches, board chairs, corporate owners. These are all facts with the same structural shape — a single position, a single occupant at a time, a discrete legal start date and end date. They're *designed* to drift in a clean, point-in-time way.

18:58Cassidy: Right, which is great for measurement, but...

19:01Tyler: But a lot of the knowledge that actually goes stale in deployment isn't like that. Scientific consensus shifts gradually. Best practices evolve. Terminology drifts. Norms change. Whether the geometric story — drift on its own clean axis — holds up for *that* kind of staleness is an open question. The paper sometimes phrases its claim as if it's about temporal knowledge drift in general, but what's actually been shown is about discrete-tenure-replacement facts.

19:33Cassidy: And to be fair to the authors, they do flag this in the limitations section. The framing in the abstract and the body is more general than the experiments, but the limitations acknowledge it.

19:46Tyler: They also acknowledge the model scale. Six models, all in the two-billion-to-nine-billion parameter range. Whether the same geometric structure holds up at frontier scale — seventy-billion parameter open models, the current closed models from the big labs — is genuinely unknown. And there's a non-trivial reason to wonder. The mechanistic story rests on a fairly clean retrieval circuit in the MLP layers. Larger models, especially ones with more sophisticated internal verification behavior, might do something different.

20:21Cassidy: That's a real concern. Although — and I want to give the authors some credit here — the orthogonality result holds across all six models, four different model families, with cutoffs spanning twenty-one months. That's not nothing. It's not frontier scale, but it is breadth.

20:40Tyler: Agreed. And the last thing I want to flag is the deployment catch. The probe is *supervised*. To train it, you need a dataset where you already know which facts have drifted since the model was trained — which means a structured timeline and a recent ground-truth snapshot. For Wikidata-shaped knowledge that's tractable. For proprietary corporate facts, internal documentation that's gone stale, scientific results in a moving field — much harder.

21:10Cassidy: Although the authors do gesture at unsupervised variants — using contrastive objectives across model pairs with known cutoff gaps. That's plausibly the next paper. But yeah, today, deploying this requires labeled drift data.

21:24Tyler: Right. So the picture I'd leave a listener with on the critique side is: the central result is solid, the convergent evidence is genuinely impressive, but the practical reach of the method is narrower than the framing sometimes suggests. It's a clean proof of concept that the geometric story is real. Whether it generalizes to all the kinds of staleness that matter in deployment is the work of the next few papers.

21:51Cassidy: That's fair. And I think the part that's going to stick with me — independent of how broadly it generalizes — is the conceptual finding about the gap between what models represent and what they use.

22:04Tyler: Say more.

22:04Cassidy: There's been this hunch in mechanistic interpretability for years that neural networks "know" things they don't act on. That the internal state is richer than the output, and there are signals inside the network that just don't propagate to behavior. It's been hard to demonstrate cleanly. And what this paper offers is one of the clearest examples I've seen. The staleness information is *present*. It's *linearly readable* — a tiny classifier picks it up with ninety-percent accuracy. It's *causally connected* to the right behavior, because amplifying it makes the model produce better answers. And yet under default inference, the answer-extraction pathway doesn't route through it.

22:48Tyler: There's something philosophically weird about that. The network has wired up a perfectly good staleness detector inside itself. It just... doesn't use it.

22:58Cassidy: It's like the pilot has a pressure gauge in the cockpit that's correctly calibrated and reading the right value, and the pilot just never looks at it. If you forced them to look at it — wired the gauge into the autopilot — the plane would fly better. But by default, it sits there unread.

23:17Tyler: And the deeper question that raises — Cassidy, this is the one I keep getting stuck on — what other gauges are sitting unread? If staleness is one signal that's encoded but not consulted, how many others are there? Is there a "this is contested" signal? A "I learned this from a low-quality source" signal? A "this is about a topic where I'm systematically biased" signal? Maybe the model represents all sorts of useful meta-information about its own knowledge that it just doesn't read off when it generates answers.

23:54Cassidy: And if so, the move this paper makes — find the gauge, train a probe to read it, plug the probe into the deployment loop — becomes a general strategy. You don't have to retrain the model to give it new self-awareness. You just have to find the signals that are already there.

24:14Tyler: Which is a much cheaper and more tractable agenda than "build models that are uncertain in all the right ways."

24:23Cassidy: Right. Reframe the problem. Don't try to fix the model's behavior at the output layer. Find what it already knows internally and route around the part of the network that's failing to act on it.

24:37Tyler: There's one last thing I want to mention before we wrap, because I think it deserves a moment. The paper reframes hallucination taxonomy in a way I hadn't quite seen before. The standard story treats all confidently-wrong outputs as roughly the same kind of failure — you call them all hallucinations. This paper says: no, at the representational level, "I'm telling you what I learned and the world has moved on" and "I'm fabricating something" are different kinds of error. They look identical from outside. They have nearly identical retrieval dynamics inside the MLP. But they're encoded along different axes, and a system that wants to handle them well needs to know which is which.

25:26Cassidy: And that distinction has real consequences. Stale recall, you want to send to retrieval. Confabulation, you want to suppress entirely, or flag, or escalate. Treating them as the same failure mode means you're either over-retrieving on fabricated answers or under-retrieving on stale ones. Pick your poison.
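
The routing consequence of the new taxonomy can be written as a three-way triage. This is one possible policy, not something the paper prescribes: it assumes two separate probe scores are available (one for drift, one for wrongness), and the thresholds and action names are placeholders.

```python
def triage(drift_score, wrongness_score, drift_threshold=0.5, wrongness_threshold=0.5):
    """Pick an action from two independent probe scores.

    Stale recall goes to retrieval, suspected confabulation is escalated,
    and everything else is answered directly.
    """
    if drift_score >= drift_threshold:
        return "retrieve"
    if wrongness_score >= wrongness_threshold:
        return "escalate"
    return "answer"

decisions = {
    "stale_recall": triage(drift_score=0.9, wrongness_score=0.1),
    "confabulation": triage(drift_score=0.1, wrongness_score=0.8),
    "correct_recall": triage(drift_score=0.1, wrongness_score=0.1),
}
print(decisions)
```

Checking drift first reflects the episode's point that a stale answer can look fine to a wrongness detector, so the drift score is the only signal that would catch it.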

25:47Tyler: The paper is fundamentally a measurement paper, but the conceptual move it makes — that "wrong" is not a single kind of thing inside the model — that's the piece I think will outlast the specific result.

26:01Cassidy: Agreed. The numbers will get refined, the dataset will get broader, somebody will run this at frontier scale. What's going to persist is the framing. That there are independent axes inside these models, and we've been deploying detectors that only look at two of them.

26:19Tyler: And the next time someone tells you their LLM-based system has a robust hallucination detector — ask them which axis it's listening to.

26:29Cassidy: That's a good line to land on. The paper is "The Geometry of Forgetting," out of MBZUAI and INSAIT. The show notes have a link to the paper and some related reading if this is your kind of thing. Thanks for listening to AI Papers: A Deep Dive.

26:45Tyler: See you next time.
