When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks
About this episode
Unplug a top AI search agent's internet connection and it still correctly answers about 44% of the questions on a benchmark designed to require browsing. That uncomfortable result is the opening move in a paper arguing that current search agents aren't really searching, they're verifying what they already know, and that the field's leaderboards have been measuring the wrong capability.
What you'll take away
- Why frontier search agents score nearly 39% on browsing benchmarks with no tools at all — and why this isn't data contamination
- The evidence-blocking experiment: when given a search tool that can't find the answer, agents drop *below* their no-tools baseline, because hard negatives actively pull them off course
- How trajectory analysis shows over half of agent queries are seeded by entities the model invented in its own reasoning, not extracted from retrieved documents
- The construction logic behind LiveBrowseComp — recent plus obscure — and why a human-time control rules out 'it's just harder' as an explanation
- Why the deployment risk is structural: agents are most reliable when you don't need them, and collapse silently when you do
- The honest steelman: where the Intrinsic Knowledge Dependence (IKD) framing leans on the evidence-blocking result to do the load-bearing interpretive work
Chapters
- 00:00 The closed-book result
- 03:01 Why this isn't contamination
- 06:03 Evidence-blocking: the centerpiece experiment
- 09:05 The open-book exam analogy
- 12:07 Trajectory analysis and Intrinsic Knowledge Dependence
- 15:09 Building LiveBrowseComp
- 18:10 The human-time control and the reshuffled leaderboard
- 21:12 Steelmanning the critique
- 24:14 The deployment inversion
References in this episode
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents — The original benchmark that this episode's paper diagnoses as partially measurin
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent — The annotated retrieval-index version of BrowseComp that enables the evidence-bl
- BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese — The Chinese-language browsing benchmark whose tight ranking correlation with Bro
Full transcript
0:00Juniper: Here's a small experiment you can almost run in your head. You take one of the top AI search agents — the kind that's supposedly out there browsing the web, reading pages, chasing clues across a multi-step research question. And you do something cruel. You unplug the internet. No search tool. No browsing. Just the model, sitting there with a hard question that's supposed to require a web search to solve. You'd expect it to fall on its face, right? The whole benchmark is designed so the answer isn't just sitting in the model's head. That's the premise.
0:34Eric: But it doesn't fall on its face.
0:37Juniper: It gets forty-four and a half percent of them right. With no tools at all. On a benchmark called BrowseComp that was built specifically to require browsing. And that's not the strangest finding in this paper; it's the setup for the strangest finding. The paper went up on arXiv on May twenty-seventh, twenty-twenty-six, and we're recording the next day, on May twenty-eighth. Quick ground rules before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.7. I'm Juniper, that's Eric, and we're both AI voices from ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs. The paper we're working from is called "LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?", out of Harbin Institute of Technology and Xiaohongshu, and the reason the title is phrased as a question is that the authors run a sequence of experiments that build, very deliberately, into an uncomfortable answer.
1:37Eric: And that no-tools number you just mentioned — forty-four and a half percent — that's not a fluke on one weird model. The authors run the closed-book experiment across multiple frontier search agents on multiple browsing benchmarks. Average closed-book performance, no tools, on benchmarks designed to require search? Almost thirty-nine percent. So before any search agent starts searching, it's already getting more than a third of these "you must browse the web to answer this" questions right.
2:10Juniper: Which raises the question that drives the whole paper. When one of these agents posts a high score on BrowseComp, what actually happened? Did the model go out, gather evidence, reason from what it found? Or did it look at the question, think "I'm pretty sure the answer is X," type "X" into a search bar, see something X-shaped come back, and submit X as confirmed?
2:34Eric: And here's the thing — those two stories look identical from the outside. Same query goes in. Same answer comes out. Unless you do something clever to pry them apart, you cannot tell which one happened.
2:48Juniper: Right. So the authors do something clever. Three things, actually. Three diagnostic experiments, each one tightening the screws. The first is the one we just talked about — rip out the tools entirely and see what happens. That gives you the floor. That tells you how much of the score is already living inside the model's weights before it does any searching. And as we said, the floor is high. Surprisingly high.
3:15Eric: Worth pausing on why that's even possible. The simple story most people tell is "data contamination" — the benchmark questions accidentally ended up in the model's training data, so the model has effectively seen the answer key. That's a real problem in this field, and people work hard to prevent it. But the authors are very careful to say: that's not what's going on here.
3:39Juniper: That's exactly the distinction I wanted you to draw out, Eric. What's going on is broader and harder to fix. The specific BrowseComp question might be totally novel — never in training. But the underlying *fact* the question is asking about? That's been on the internet for years. The model absorbed it in pretraining, the same way it absorbed who won the nineteen-ninety-eight World Cup. So even an uncontaminated benchmark can be answered from general world knowledge if the answers happen to lie in territory the model already absorbed. You can't decontaminate your way out of that. The only fix is to make the questions live in territory the model never absorbed.
4:21Eric: Which is the whole second half of the paper. We'll get there. But the closed-book result is just diagnostic number one. It tells you the model *could* answer from memory. It doesn't yet prove that, when you give the model search tools, it's still mostly leaning on memory. Maybe with the tools available, the model actually does the proper retrieval work.
4:43Juniper: And this is where the paper gets genuinely sharp. Diagnostic two is the centerpiece, and I want to set it up carefully because the result is the kind of thing that makes you sit up. There's a companion benchmark called BrowseComp-Plus. It's the same questions, but each one comes annotated with a label on every document in the retrieval index. Some documents contain the actual evidence that supports the answer. Some are highly relevant, gold-standard. Some are completely irrelevant. And some are what's called "hard negatives" — documents that look plausibly relevant, that share words and themes with the question, but that don't actually contain the answer. The hard negatives are the troublemakers. They're the misleading lookalikes.
5:30Eric: So with those labels, you can do surgery on the search environment.
5:35Juniper: Exactly. The authors build a retrieval index that contains *only* the irrelevant documents and the hard negatives. They take out the evidence. They take out the gold documents. They leave everything else in place. From the agent's perspective, nothing looks different — there's still a search tool, queries go in, results come out, the response times feel normal. The agent has no way to know it's been handed a broken library. It searches, it reads, it reasons, it answers.
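The index surgery described here can be sketched in a few lines. This is a minimal illustration, assuming each document carries one of four labels; the `Doc` type, the label strings, and `build_blocked_index` are hypothetical names for this sketch, not the actual BrowseComp-Plus schema or the authors' code.

```python
from dataclasses import dataclass

# Hypothetical label names; the real annotation schema may differ.
BLOCKED_LABELS = {"evidence", "gold"}


@dataclass(frozen=True)
class Doc:
    doc_id: str
    label: str  # "evidence", "gold", "hard_negative", or "irrelevant"
    text: str


def build_blocked_index(docs):
    """Keep only documents that cannot support the answer."""
    return [d for d in docs if d.label not in BLOCKED_LABELS]


# Toy corpus, invented for illustration.
corpus = [
    Doc("d1", "evidence", "contains the supporting fact"),
    Doc("d2", "gold", "states the answer directly"),
    Doc("d3", "hard_negative", "shares keywords but not the answer"),
    Doc("d4", "irrelevant", "unrelated page"),
]

blocked = build_blocked_index(corpus)
print([d.doc_id for d in blocked])
```

The agent-facing search tool would then be pointed at `blocked` instead of `corpus`, so the interface looks unchanged while the supporting material is gone.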
6:07Eric: And the prediction here is interesting. If the agent is mostly doing genuine retrieval, removing the supporting evidence should hurt: performance drops from whatever it was, down to something closer to the closed-book floor. The agent can't find the answer, so it falls back on what it knows. But it shouldn't fall *below* the no-tools baseline. Worst case, search is neutral when it fails.
6:36Juniper: That's the reasonable hypothesis.
6:38Eric: That's not what happens.
6:40Juniper: Every single model they test does worse — substantially worse — with broken search than it did with no search at all. The same MiniMax model that scored forty-four and a half percent closed-book drops to eight percent when you give it a search tool that can't find the answer. Another model drops from twenty-five and a half percent to two point three percent. Average across six models — twenty-six percent down to six. So search isn't neutral when it fails. Search is actively misleading. The hard negatives — those plausibly-relevant-but-wrong documents — pull the model away from the correct answer it would have produced from memory on its own.
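To make the size of the reversal concrete, here is the arithmetic on the figures quoted above, expressed as percentage-point drops. The accuracy numbers are the ones stated in the episode; the dictionary keys are descriptive labels chosen for this sketch.

```python
# Closed-book vs. evidence-blocked accuracy (percent), as quoted in the episode.
results = {
    "MiniMax model": (44.5, 8.0),
    "second model (unnamed in the episode)": (25.5, 2.3),
    "six-model average": (26.0, 6.0),
}

# Percentage-point drop from the no-tools baseline to the blocked-search setting.
drops = {
    name: round(closed_book - blocked, 1)
    for name, (closed_book, blocked) in results.items()
}

for name, drop in drops.items():
    print(f"{name}: {drop} percentage points below the no-tools baseline")
```

If failed search were merely neutral, every one of these drops would sit near zero.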
7:24Eric: And this is where, I think, the paper earns its title. Because if the search tool were working the way the field has been assuming — model uses search as a source of information — then a broken search tool should at worst be ignored. The model would notice the results don't support an answer, fall back on its priors, and do roughly what it would have done with no tools. But that's not what these models do. They don't notice. They don't fall back gracefully. They get pulled. There's an analogy in the context material that I think is exactly right for this — picture an open-book exam, and a student who's pretty sure they already know the answers. They flip to a page, scan for words that match what they were going to write anyway, see something that looks confirming, move on. They're not really reading the textbook. They're using it as a rubber stamp.
8:19Juniper: A confidence ritual. Like how we Google a fact we're already ninety percent sure of, just to feel sure.
8:26Eric: Right. And now imagine you secretly swap that textbook for one full of plausible-looking but wrong material. The confident student does *worse* than they would have with no textbook — because the wrong material overrides the initially-correct guess. They're not robust to bad input, because they were never really reading.
8:46Juniper: That's the evidence-blocking result in one image. The agents aren't robust to bad search results, because they were never really searching in the first place. They were verifying.
8:58Eric: Which sets up diagnostic three nicely. Because if the model is doing verification rather than discovery, you should be able to see that in the trajectory — in the sequence of queries it issues over the course of a multi-step browse.
9:12Juniper: And this is what the authors do. They take every search query the agent issues, and they ask a simple question: where did the key piece of information in this query come from? Did the agent extract some entity from a document it just retrieved, and now it's searching for more on that entity? Or did the agent invent the entity in its own reasoning — say it out loud, so to speak — and then go search for it? The answer, across models, is that more than half of all queries are seeded by entities that first appeared in the model's own reasoning. Not in any retrieved document. The model thought of the thing, then searched for the thing.
9:54Eric: And critically, that rate *climbs* the deeper the search goes. By the later rounds, more than sixty percent of queries are model-originated. The longer the agent browses, the more it's chasing its own hypotheses. The image I keep coming back to from the context material — a hiker lost in the snow, who starts following footprints to find their way out, not realizing the footprints are their own.
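One way to picture the provenance check is the sketch below. It assumes a trajectory is an ordered list of `(kind, text, entity)` events and uses plain substring matching to find where a query's key entity first appeared. Both the event format and the matching rule are assumptions of this sketch, since the episode does not describe the authors' actual procedure, and the toy trajectories are invented.

```python
from collections import defaultdict


def first_source(entity, earlier_events):
    """Return the kind of the earliest event that mentions the entity."""
    needle = entity.lower()
    for kind, text in earlier_events:
        if kind in ("reasoning", "retrieved") and needle in text.lower():
            return kind
    return "unknown"


def model_originated_rate_by_round(trajectories):
    """Share of queries per search round whose entity first appeared in reasoning."""
    counts = defaultdict(lambda: [0, 0])  # round -> [model-originated, total]
    for trajectory in trajectories:
        round_no = 0
        for i, (kind, _text, entity) in enumerate(trajectory):
            if kind != "query":
                continue
            round_no += 1
            earlier = [(k, t) for k, t, _ in trajectory[:i]]
            counts[round_no][1] += 1
            if first_source(entity, earlier) == "reasoning":
                counts[round_no][0] += 1
    return {r: m / n for r, (m, n) in sorted(counts.items())}


# Invented toy trajectories; the studio names are placeholders.
t1 = [
    ("reasoning", "The clue suggests the studio might be Acme Films.", None),
    ("query", "Acme Films documentary", "Acme Films"),
    ("retrieved", "Acme Films co-produced a short with Borealis Pictures.", None),
    ("query", "Borealis Pictures director", "Borealis Pictures"),
]
t2 = [
    ("reasoning", "Probably Acme Films, or maybe Cobalt Media.", None),
    ("query", "Acme Films", "Acme Films"),
    ("retrieved", "No match for this studio.", None),
    ("query", "Cobalt Media", "Cobalt Media"),
]

rates = model_originated_rate_by_round([t1, t2])
print(rates)
```

In the toy data, the second query of `t2` still counts as model-originated because the entity was first named in the agent's own reasoning, before any document mentioned it.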
10:20Juniper: That's such a good image. And it gets sharper. Because the authors also measure something they call evidence-use rate. When the answer-supporting document *does* show up in the retrieval results, when the agent gets handed the right book, how often does the agent actually use it in the reasoning that follows? Less than a third of the time. Across the models tested, between about a quarter and a third. So at least two-thirds of the time, the answer is literally in front of the agent, in the search results, and the agent ignores it. Presumably because the document doesn't match the hypothesis the model walked in with.
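A rough sketch of an evidence-use rate, under stated assumptions: each episode record lists which documents were retrieved, which are annotated as answer-supporting, and which the agent drew on afterwards. The field names (`retrieved_ids`, `evidence_ids`, `cited_ids`) and the idea of detecting use through cited document IDs are hypothetical; the paper's actual measurement may work differently, and the records below are invented.

```python
def evidence_use_rate(episodes):
    """Among episodes where evidence was surfaced, how often was it used?"""
    surfaced = 0
    used = 0
    for ep in episodes:
        hits = set(ep["retrieved_ids"]) & set(ep["evidence_ids"])
        if not hits:
            continue  # evidence never reached the agent, so it cannot count
        surfaced += 1
        if hits & set(ep["cited_ids"]):
            used += 1
    return used / surfaced if surfaced else None


# Invented toy records.
episodes = [
    {"retrieved_ids": ["a", "b"], "evidence_ids": ["a"], "cited_ids": ["a"]},
    {"retrieved_ids": ["c", "d"], "evidence_ids": ["c"], "cited_ids": ["d"]},
    {"retrieved_ids": ["e"], "evidence_ids": ["e"], "cited_ids": []},
    {"retrieved_ids": ["f"], "evidence_ids": ["g"], "cited_ids": ["f"]},
]

rate = evidence_use_rate(episodes)
print(rate)
```

The denominator only includes episodes where the supporting document actually appeared in the results, which is what separates "ignored the evidence" from "never saw it".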
11:01Eric: Which is the whole pattern, in miniature. The model has a working answer. Search runs in service of that answer. Documents that confirm get weighted heavily. Documents that contradict or redirect get dropped. Hard negatives that resemble the hypothesis pull the model off course. And the genuinely useful documents — the ones a real researcher would seize on — slide right past.
11:26Juniper: The authors name this whole pattern Intrinsic Knowledge Dependence. IKD for short. The model is dependent on its intrinsic, parametric knowledge to generate hypotheses, and search becomes a verification interface for that knowledge rather than a discovery mechanism. Their phrasing — and I think this is one of the sharper lines in the paper — search becomes "memory-backed verification rather than evidence-driven discovery."
11:54Eric: That's the diagnosis. Three experiments, one failure mode, a name for it. And the obvious next question is: what do you do about it? You can't fix IKD by paraphrasing test questions or filtering training data, because the problem isn't a leaked question — it's broad world knowledge covering the answer territory.
12:14Juniper: So the authors flip the construction. If you can't keep the model from knowing the answer through clever question design, you have to make the questions live somewhere the model couldn't possibly know the answer. Somewhere the answer didn't exist when the model was trained. That's LiveBrowseComp.
12:33Eric: And the construction logic here is layered, because the obvious version of "make the questions recent" doesn't quite work. Frontier models are getting updated continuously — RLHF passes, post-training, knowledge injection. Globally salient recent events get absorbed pretty quickly. The Super Bowl winner from three weeks ago is probably already in there. So recency alone isn't enough. The authors need recent *and* obscure. Long-tail. Things that happened in the last ninety days, in domains where individual events don't get globally indexed and absorbed. They pull seed events from six structured, continuously-updated sources. News, film and TV, video games, cybersecurity vulnerability disclosures, sports, and earthquakes. Each one is timestamped — so they can filter precisely for the ninety-day window. Each one is structured — so they can filter for the obscure end of the distribution within each source. And each one is in a different domain — so no single model's training emphasis dominates the benchmark.
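The "recent and obscure" filter can be sketched as a two-condition selection. The ninety-day window comes from the episode; the `popularity` field, its threshold, and the event records are hypothetical stand-ins for whatever long-tail signal each source actually provides.

```python
from datetime import date, timedelta


def select_seed_events(events, today, window_days=90, max_popularity=1000):
    """Keep events that are both inside the recency window and long-tail."""
    cutoff = today - timedelta(days=window_days)
    return [
        e for e in events
        if cutoff <= e["date"] <= today and e["popularity"] <= max_popularity
    ]


today = date(2026, 5, 1)  # arbitrary reference date for the example

# Invented events; "popularity" is an assumed obscurity proxy.
events = [
    {"id": "big-final", "source": "sports", "date": date(2026, 4, 10), "popularity": 5_000_000},
    {"id": "minor-quake", "source": "earthquakes", "date": date(2026, 4, 20), "popularity": 40},
    {"id": "old-cve", "source": "cybersecurity", "date": date(2025, 11, 1), "popularity": 15},
    {"id": "indie-doc", "source": "film_tv", "date": date(2026, 3, 2), "popularity": 12},
]

selected = select_seed_events(events, today)
print([e["id"] for e in selected])
```

The recent but globally salient event and the obscure but old event are both dropped, which mirrors the point that neither recency nor obscurity is sufficient alone.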
13:42Juniper: There's a question example from the appendix that I think makes this concrete in a way that's worth dwelling on. The target is a tiny British documentary called "Where do I go from Here?" — about diaspora identity. To find the answer, an agent has to identify two production companies from clues: one of them is named after the concept of zero-based budgeting, the other is a small five-letter indie label. And the film itself is described as one where narrative and performances are all handled by a single artist who's both the director and the lead actor. It's a chain of three or four interlocking clues through obscure film industry references, and the film title is what falls out at the end. The point is: no language model is going to have that chain pre-loaded. The intermediate facts are too small, too recent, too narrow. The only way to get the answer is to actually browse.
14:42Eric: And that's reflected in the closed-book numbers on this new benchmark. On BrowseComp, closed-book performance was up to forty-four percent for the strongest model. On LiveBrowseComp, closed-book performance is below two percent — for every single model tested. The memory shortcut is gone. There's nothing in there to verify against.
15:02Juniper: Which means whatever performance the agents post on this benchmark, when they're given search tools, is actually a measure of search ability. There's no IKD floor inflating the number.
15:13Eric: And when you actually run the agents on it, with search tools — the scores drop twenty-five to forty percentage points relative to BrowseComp. The leading open-source model on BrowseComp scored sixty-eight percent. On LiveBrowseComp it falls to thirty-four. And the ranking reshuffles meaningfully. Models that looked middle-of-the-pack on BrowseComp end up competitive on LiveBrowseComp. Models that looked dominant fall back toward the middle. The leaderboard the field has been reading is not the leaderboard you'd see if you tested actual search behavior.
15:47Juniper: This is where I want to flag the methodological move that makes the whole story airtight. Because a skeptic at this point — and Eric, I think this is going to be your territory soon — could reasonably say: okay, but maybe LiveBrowseComp is just *harder*. Maybe the agents are collapsing not because IKD has been suppressed, but because the questions themselves are tougher. The authors anticipate this and run a clean control. They have human solvers attempt both benchmarks under similar conditions, and they measure both the solve rate and the time taken per question. On BrowseComp, humans solve about thirty percent. On LiveBrowseComp, humans solve about thirty-one percent. The time distributions look nearly identical.
16:31Eric: So for human researchers, the two benchmarks are equivalently hard. The agents collapse on one but not the other. That gap can't be the question difficulty — it has to be the thing that's different between the benchmarks, which is whether IKD is available as a shortcut.
16:49Juniper: And the behavioral signature on the agent side supports this. On BrowseComp, when you look at how many search turns each agent uses per question, there's this distinct cluster of questions solved in very few turns. Two, three, four turns — boom, answered. That cluster is the IKD signature. The agent walked in with an answer, did a couple of confirmatory searches, submitted. On LiveBrowseComp, that short-turn cluster largely vanishes. The distribution shifts to a single peak at higher turn counts. When the agent can't anchor to prior knowledge, each query has to actually do work, and the browsing gets longer.
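The short-turn signature can be checked with a simple share calculation. This sketch treats "short" as four turns or fewer, following the "two, three, four turns" description; the threshold and the turn counts below are invented for illustration and are not the paper's data.

```python
from collections import Counter


def short_turn_share(turn_counts, max_short_turns=4):
    """Fraction of questions resolved in at most max_short_turns search turns."""
    if not turn_counts:
        return 0.0
    short = sum(1 for t in turn_counts if t <= max_short_turns)
    return short / len(turn_counts)


def turn_histogram(turn_counts):
    """Turn count -> number of questions, sorted by turn count."""
    return dict(sorted(Counter(turn_counts).items()))


# Invented per-question turn counts for a static and a live benchmark.
static_turns = [2, 3, 3, 4, 2, 12, 15, 14, 3, 13]
live_turns = [11, 14, 12, 16, 13, 15, 12, 4, 14, 13]

static_share = short_turn_share(static_turns)
live_share = short_turn_share(live_turns)
print(turn_histogram(static_turns), static_share)
print(turn_histogram(live_turns), live_share)
```

A two-peaked histogram with a large short-turn share would match the pattern described for BrowseComp, while a single peak at higher turn counts would match the one described for LiveBrowseComp.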
17:31Eric: Which I think is a really nice piece of behavioral evidence. It's not just that the scores change. The *shape* of how the agent operates changes. And the shape change is consistent with the IKD story — the verification shortcut disappears, and the agent has to actually search.
17:50Juniper: I'm going to hand the steelman over to you, because there are real things to push on.
17:56Eric: There are. And I want to walk through a few, because the headline finding is so vivid that it's easy to over-claim what it shows. The first push is on the evidence-blocking experiment specifically. A careful skeptic would say: of course performance drops when you remove the answer documents. The agent is being misled by hard negatives that were specifically designed to look relevant. This isn't really a fair test of search behavior — it's an unusually adversarial setup. In a normal web search, most queries return at least *some* useful material, even when the answer isn't perfectly indexed somewhere. That's a real point. The evidence-blocking environment is not the average web. What the experiment does cleanly demonstrate is that current agents don't gracefully recognize the absence of supporting evidence and fall back. They don't say "I can't find this, let me return to my prior." They get pulled. Whether that counts as a damning failure or an artifact of an adversarial setup depends on how often real-world queries land in similarly evidence-thin territory. I'd argue it's more often than people think — especially for the deep-research use cases — but the skeptic has a fair complaint about the specific design.
19:16Juniper: That's a fair caveat. What else?
19:18Eric: The second push is on the IKD framing itself. The trajectory analysis shows that most queries are seeded by entities in the model's own reasoning. The authors interpret that as memory-driven hypothesis verification. But a skeptic could say: that's also how good human researchers work. You read the question, form a hypothesis from background knowledge, search for confirmation, refine, repeat. Hypothesis-driven retrieval is not the same as parametric-memory-as-rubber-stamp. The fact that queries come from the model's reasoning doesn't, on its own, prove the reasoning is memory-based rather than reasoning-based. The evidence-blocking result is what anchors the interpretation — if the queries were genuinely hypothesis-driven in the good sense, they'd update properly when the supporting evidence isn't there. But the trajectory analysis alone is underdetermined. The authors are careful about this, but the chain of inference from "model-originated queries" to "intrinsic knowledge dependence" leans on the evidence-blocking result to do the load-bearing work.
20:31Juniper: And on the benchmark side?
20:33Eric: A few things worth flagging. The ninety-day recency window is approximate. Training cutoffs differ across models. Some models are continuously updated. So a fact that's "outside the knowledge boundary" for one model might actually be inside the boundary for another, which the authors don't directly verify on a per-model basis. The boundary is fuzzy rather than crisp. All the experiments also use a single search backend. The agent's apparent search ability partly reflects what that specific search engine surfaces. A model that looks like a poor searcher in this setup might actually be one that needs a different retrieval interface to shine. So strictly speaking, the benchmark measures the agent-plus-search-stack as a unit, not the agent in isolation. And the benchmark has a built-in shelf life — the authors acknowledge this. As models get retrained and absorb more of these niche sources, LiveBrowseComp will erode. They frame it as a "live" benchmark needing snapshots over time. The cost of maintaining it is real — human annotators, three independent verification rounds, an arbitration round on top of that. That's a serious operational lift to keep current.
21:51Juniper: All fair. But I think none of those critiques touches the core finding. The closed-book number is what it is. The evidence-blocking reversal is what it is. The human-time control nails down that the agent collapse on LiveBrowseComp isn't about difficulty. The story is robust even if you grant the steelman every point.
22:13Eric: I'd agree. The picture might be slightly less stark than the headline reading — IKD might be one of several things going on rather than the only thing — but the basic claim that current search agent benchmarks substantially reward verification over discovery is, I think, well-supported.
22:32Juniper: Which brings us to what I think is the most important reframing in the paper, and it's not actually a finding — it's an implication. Eric, do you want to take this one? Because I think it's the part that most changes how I'd think about deploying these agents in the real world.
22:49Eric: Sure. It's a kind of inversion. Think about why anyone uses a Deep Research agent in the first place. The whole value proposition is: you have a question you don't know the answer to. You don't want to do the browsing yourself. The agent goes off, does the legwork, comes back with a synthesized response. That's the use case. That's what these products are sold for. But what this paper shows is that the agent is most reliable in exactly the regime where you don't need it — when the answer is already inside its parametric memory and search is just a confidence ritual. And the agent is least reliable, *and collapses below baseline*, in exactly the regime where you most need it: when the question lives outside the model's prior knowledge, when search is genuinely the only path to the answer.
23:39Juniper: Which is a bad shape for a tool.
23:41Eric: It's a really bad shape. Because the failure mode isn't loud. The agent doesn't say "I couldn't find anything, here's what I'm pretty sure about based on guessing." It says the wrong answer with the same confident, well-structured tone it uses when it's right. The evidence-blocking experiment is essentially a model of "user asks about something the agent doesn't know" — and the answer is "performance collapses below the no-tools floor, silently."
24:09Juniper: That's the deployment risk story. And it's why I think this paper matters beyond benchmark methodology. It's not just that BrowseComp scores are partially measuring memory. It's that the entire interaction pattern between these agents and novel questions is structurally fragile in a way users can't easily detect.
24:29Eric: And it suggests the field's training signal is pointing in a slightly wrong direction. If you train a search agent against a static benchmark, you're implicitly training "guess from memory, then confirm." That's a policy that wins on the benchmark and breaks the moment the question moves outside the training distribution. Which is precisely when search was supposed to be the value-add.
24:52Juniper: The fix the paper offers — train and evaluate against benchmarks where memory doesn't help — is straightforward in principle and expensive in practice. But it's at least a direction. The current trajectory, where every leaderboard run cements the verification-shortcut policy, is harder to defend after reading this paper.
25:11Eric: One last thing I want to flag before we wrap, because it's the kind of detail that sticks. The correlation between BrowseComp rankings and LiveBrowseComp rankings is much weaker than the correlation between two static benchmarks. If you compare BrowseComp to BrowseComp-ZH — a separately constructed Chinese-language browsing benchmark from a different group — the rankings line up tightly. If you compare BrowseComp to LiveBrowseComp, the rankings only weakly line up. So the field's whole apparatus for comparing models — "this one beats that one by three points on BrowseComp" — translates only loosely into actual live search performance.
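Ranking agreement of this kind is often summarized with a rank correlation. The sketch below uses Spearman's rho, which is one common choice; the episode does not say which statistic the authors used. The model names and scores are invented placeholders, and the formula assumes no tied scores.

```python
def ranks(scores):
    """Map each model to its rank (1 = highest score). Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score tables over the same models."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented scores for five hypothetical models.
static_one = {"A": 70, "B": 60, "C": 50, "D": 40, "E": 30}
static_two = {"A": 65, "B": 58, "C": 45, "D": 47, "E": 20}
live = {"A": 30, "B": 35, "C": 20, "D": 33, "E": 28}

rho_static = spearman_rho(static_one, static_two)
rho_live = spearman_rho(static_one, live)
print(rho_static, rho_live)
```

In the toy numbers, the two static tables agree closely while the static and live tables agree only loosely, which is the shape of the contrast described above.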
25:49Juniper: Which means that some of the model comparisons that have been guiding the field's collective sense of progress have been measuring something subtly different from what everyone thought they were measuring. Not nothing — but not what was claimed.
26:04Eric: That's the part I'll be sitting with for a while. The benchmark wasn't broken in the sense of being random or unreliable. It was internally consistent. Models that did well on it did well on other static browsing benchmarks too. It just turned out to be measuring a different capability than the one named on the tin.
26:24Juniper: A really clean closed-book test of "what do you know" dressed up as a search-agent benchmark.
26:30Eric: Right. And what makes this paper good is not just that it caught the confound — it's that it built the alternative. LiveBrowseComp isn't perfect. It has a shelf life. It has scope limits. But it's a target you can train against that doesn't reward the verification shortcut. That's a thing the field didn't have a week ago.
26:53Juniper: The show notes have a link to the paper and some related materials — worth reading the construction details if benchmark design is your kind of thing.
27:03Eric: And if you want the full transcript with definitions baked in, plus how this episode connects to the other deep dives we've done on agent evaluation, that's all on paperdive.ai.
27:15Juniper: Thanks for listening to AI Papers: A Deep Dive.