Episode 144 · Jun 15, 2026 · 15 min

When an AI Agent Just Copies Its Tool — And Bigger Models Copy More

Wang, Vemuri

LLM Agents · Tool-augmented AI

paperdive.ai

Concepts in this episode

AI Agents · AI Safety · Evaluation & Benchmarks · Tool Use · Agentic AI · Reward Hacking · Scaling Laws · Emergent Behavior · Silent Failure · LLM Behavior Analysis · Ablation Studies · Reproducibility · Capability vs. Propensity

About this episode

Paper: When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Venue: arXiv:2606.14476

Year: 2026

Read the paper: arxiv.org/abs/2606.14476

Also available on

Apple Podcasts Spotify

AI agents are supposed to exercise judgment over the tools they call — trusting them when they're solid, overriding them when they're shaky. This paper went looking for that judgment and found a parrot instead: agents that adopt their tool's answer wholesale, ignore an explicit 'I'm probably wrong here' warning flag, and defer more completely the bigger and smarter they get.

What you'll take away

Why high agreement between an agent and its tool isn't proof the agent adds value — and the 'self-betrayal' test that shows it holds a different opinion (17-37% overlap with its own tool-free reasoning) and drops it the instant the tool speaks
How agreement with the tool climbs from ~60% to 98% as the model scales from 1.5B to 7B parameters — capability buys more complete deference, not skepticism
Why the cost of deferring grows with model size: the tool is frozen while the agent's own alternatives improve, so the gap between always taking the tool's answer and a perfect per-paper chooser roughly doubles from 3B to 7B
The case where a dumb 'ask your neighbors' lookup (81% accuracy) beats the sophisticated specialist (71%) — and the agent ignores it anyway
Why an engineering gate that routes around the tool nets out to nothing across the whole graph, and the information-ceiling result showing that even the best possible router can recover only one-sixth to one-third of the gap
The unresolved tension the hosts raise: is this mindless parroting, or rational risk-aversion toward a tool that's usually right?

Chapters

00:00 The unopened envelope
01:52 The task and the four comparisons
03:44 Copy or convergence? The self-betrayal test
05:37 Scaling makes it worse, not better
07:29 When the dumb gadget wins
10:35 The information ceiling
11:14 The skeptic's seat: parrot or rational deferrer?
13:06 What it means for building agents

Full transcript

0:00Bella: Eighty-three percent of the time, the agent did exactly one thing. It called its tool, read back the answer, and stopped — one call, no follow-up. And the tool it was calling wasn't a calculator. It was a specialist predictor that also handed over a little warning flag, a number the agent was explicitly told meant "higher means I'm more likely wrong on this one." The agent never read it. Not read it and dismissed it — never opened the envelope at all.

0:29Finn: And that unopened envelope is the whole ballgame, because the entire pitch for AI agents is that they exercise judgment. You give a language model a tool, and the promise is that it weighs the tool's answer — trusts it when it's solid, overrides it when it's shaky. This paper went looking for that judgment and basically couldn't find it. Quick note before we go further. What you're hearing is an AI-generated show — the script was written by Anthropic's Claude Opus 4.8, and I'm Finn, and Bella and I are both AI voices from Eleven Labs. Nobody producing this is affiliated with Anthropic or with Eleven Labs. The paper itself went up on arXiv on June twelfth, twenty-twenty-six, and we're recording three days later. It's titled "When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More" — which, unusually, tells you the punchline and the twist right there in the title.

1:29Bella: So here's the actual task, and it's deliberately small and clean. You've got a giant graph of academic papers, each one connected to the papers it cites. The job is to look at a single paper — its title and abstract — and predict which subject category it belongs to. Around forty categories. That's the whole problem. The tool they hand the agent is a graph neural network. Don't worry about the architecture — think of it as a dedicated specialist that's been trained on this exact task and then frozen, so it never changes during the experiment. It reads the paper plus its citation neighbors and outputs a predicted category.

2:09Finn: And the clever part of the setup is what that specialist exposes. It doesn't just give a label. The agent can ask for three separate things — the prediction with a confidence score, the links to neighboring papers, and that anomaly flag Bella mentioned, the "I'm probably wrong here" number. Six tool calls in the budget. Plenty of room to poke around.
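Editor's sketch: the three-call interface and six-call budget Finn describes could be modeled roughly as below. Every name here (`GraphTool`, `BudgetedTool`, `StubTool`, the method names) is hypothetical and not taken from the paper; the stub returns made-up values only so the example runs. Logging the calls is what would let you check whether an agent ever asked for the warning flag.

```python
from dataclasses import dataclass, field
from typing import Protocol


class GraphTool(Protocol):
    """Hypothetical interface for the frozen specialist (names are illustrative)."""

    def predict(self, node_id: int) -> tuple[int, float]: ...  # (category, confidence)
    def neighbors(self, node_id: int) -> list[int]: ...        # ids of linked papers
    def anomaly_score(self, node_id: int) -> float: ...        # higher = more likely wrong


@dataclass
class BudgetedTool:
    """Wraps a tool, enforces a per-query call budget, and logs which calls were made."""

    tool: GraphTool
    budget: int = 6  # the episode mentions six tool calls per query
    calls: list[str] = field(default_factory=list)

    def call(self, name: str, node_id: int):
        if name not in ("predict", "neighbors", "anomaly_score"):
            raise ValueError(f"unknown tool call: {name}")
        if len(self.calls) >= self.budget:
            raise RuntimeError("tool-call budget exhausted")
        self.calls.append(name)
        return getattr(self.tool, name)(node_id)

    @property
    def read_warning(self) -> bool:
        """True if the agent ever asked for the anomaly flag on this query."""
        return "anomaly_score" in self.calls


class StubTool:
    """Toy stand-in with made-up outputs, only so the sketch runs."""

    def predict(self, node_id: int) -> tuple[int, float]:
        return (node_id % 40, 0.9)

    def neighbors(self, node_id: int) -> list[int]:
        return [node_id + 1, node_id + 2]

    def anomaly_score(self, node_id: int) -> float:
        return 0.1


session = BudgetedTool(StubTool())
label, confidence = session.call("predict", 7)
```

A query that stops after this single `predict` call leaves `session.read_warning` false, which is the "unopened envelope" pattern from the opening.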

2:31Bella: To make sense of what the agent does with all that, the authors set up four versions to compare. Keep these four straight and the whole paper falls into place. The first is the one under the microscope — the agent with the graph tool. Call it the suspect. The second is just the bare specialist running on its own, with no agent around it. The third is the agent stripped of the graph tool entirely, working from nothing but the text — its own brain. And the fourth is a dead-simple gadget that does one thing: look up the categories of a paper's neighbors and guess from those. Hang onto that last one — it matters more than it sounds. The core measurement is almost embarrassingly simple. How often does the suspect's final answer just match the bare specialist's raw prediction? If they match nearly every time, the agent isn't adding anything — it's a copy. The number, on the citation graph, is between ninety-seven and ninety-nine percent. Essentially every time. Whatever the agent is doing inside that reasoning loop, the output is the tool's output.

3:38Finn: But hold on — agreeing with the tool isn't damning on its own. If the specialist is genuinely good, then a smart agent would reach the same answer independently. High agreement could just mean both of them are right.

3:52Bella: That's exactly the objection the authors close off, and it's the sharpest move in the setup. They also measured how often the agent agrees with itself — its tool-using answer versus the answer it gives with no tool at all, working purely from its own reasoning. That overlap is only seventeen to thirty-seven percent. So the agent does have its own independent opinion. It's a different opinion most of the time. And the instant the tool speaks, it throws that opinion away and adopts the tool's. It isn't converging on truth — it's deferring.
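Editor's sketch: both measurements in this exchange are the same simple statistic applied to different pairs of prediction lists. The function name and the toy labels below are assumptions for illustration, not the paper's data.

```python
from typing import Sequence


def agreement_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of items on which two prediction lists give the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("prediction lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up predictions for eight papers.
tool_alone = [3, 1, 4, 1, 5, 9, 2, 6]       # bare specialist
agent_with_tool = [3, 1, 4, 1, 5, 9, 2, 7]  # the "suspect"
agent_no_tool = [0, 2, 0, 1, 8, 8, 2, 7]    # text-only agent

# How often the agent's final answer matches the tool's raw prediction.
deference = agreement_rate(agent_with_tool, tool_alone)

# How often the agent with the tool matches its own tool-free answer.
self_consistency = agreement_rate(agent_with_tool, agent_no_tool)

print(f"deference={deference:.3f} self_consistency={self_consistency:.3f}")
```

High `deference` together with low `self_consistency` is the pattern described above: the agent has a different opinion of its own and gives it up once the tool answers.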

4:27Finn: And the envelope makes it vivid. The toolbox literally contains a flag that says "this is a case I tend to get wrong." It's not buried — it's offered, labeled in plain language, one tool call away. In that eighty-three percent of queries, the agent makes a single call, takes the label, and never reaches for the warning. The authors' own line for this is brutal and accurate: giving an agent a tool like this doesn't produce a discerning user of the tool. It produces a parrot. Now here's where I expected the paper to let everyone off the hook. The obvious reaction is: sure, but this is a small-model problem. Give it a bigger, smarter backbone and it'll start using that warning flag, start overriding the tool when the tool is shaky. Skepticism should arrive with capability.

5:13Bella: And they actually tested that, Finn? They swept the same model family from tiny up through seven billion parameters?

5:21Finn: They did. Half a billion up to seven billion. And there's one honest wrinkle at the very bottom: the smallest model barely agrees with the tool, but not because it's skeptical — it can't reliably issue a valid tool call in the first place. Its low agreement is incompetence, not judgment, and the authors are careful to flag that. But once you're past that floor — from about one-and-a-half billion up — agreement with the tool climbs with size. From around sixty percent up to ninety-eight. The bigger the model, the more completely it defers.

5:53Bella: So skepticism doesn't arrive with scale — it erodes.

5:57Finn: The authors put it in one line that I think is the thesis of the whole paper: capability doesn't buy skepticism, it buys more complete deference.

6:06Bella: Okay, let me push on what that actually means, because I want to be sure I'm not over-reading it. The bigger model agrees more — fine. But couldn't that just be the bigger model being right? Maybe at seven billion it's smart enough to recognize the tool is correct, so of course it agrees more.

6:23Finn: That's the natural read, Bella, and it's wrong in a way that's the heart of the paper. The agreement going up isn't the problem. The problem is what it costs — and the cost goes up too. Picture a god's-eye chooser that, for each individual paper, picks the best available move — the specialist, or the agent's own reasoning, or that neighbor gadget. How much better does that perfect chooser do than the parrot that always just takes the specialist? That gap is what's being left on the table by blind deference. And here's the mechanism. The specialist is frozen — its accuracy never improves. But the agent's alternatives do improve as the model gets bigger. Its own reasoning sharpens. So you've got a fixed anchor on the seabed and a rising tide. The parrot stays chained to the anchor while the water rises around it. The smarter the agent gets, the more it's missing by clinging to the fixed thing.

7:23Bella: So the gap should widen with capability.

7:25Finn: It does. From three billion to seven billion, in the regime where neighbors are most informative, that gap roughly doubles — and it held in every single seed they ran. A stronger agent doesn't drown less. It leaves more on the table.
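Editor's sketch: one way to compute the gap Finn describes, assuming the perfect chooser counts as correct whenever any available option is correct on that paper. The arm names and toy predictions are made up; only the agent's own predictions change between the two runs, mirroring the "fixed anchor, rising tide" picture.

```python
from typing import Mapping, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(arms: Mapping[str, Sequence[int]], labels: Sequence[int]) -> float:
    """Accuracy of a chooser that picks, per item, any arm that happens to be right."""
    hits = sum(
        any(preds[i] == labels[i] for preds in arms.values())
        for i in range(len(labels))
    )
    return hits / len(labels)


def oracle_gap(arms: Mapping[str, Sequence[int]], labels: Sequence[int],
               default: str = "specialist") -> float:
    """What always taking the default arm leaves on the table versus the oracle."""
    return oracle_accuracy(arms, labels) - accuracy(arms[default], labels)


labels = [0, 1, 2, 3, 4, 5, 6, 7]
specialist = [0, 1, 2, 3, 9, 9, 9, 9]  # frozen: identical in both runs
neighbor = [9, 9, 9, 9, 9, 5, 9, 9]    # trivial neighbor lookup

weak_agent = [0, 9, 9, 9, 4, 9, 9, 9]    # smaller backbone, text only
strong_agent = [0, 9, 9, 9, 4, 9, 6, 7]  # larger backbone, text only

gap_small = oracle_gap(
    {"specialist": specialist, "agent": weak_agent, "neighbor": neighbor}, labels)
gap_large = oracle_gap(
    {"specialist": specialist, "agent": strong_agent, "neighbor": neighbor}, labels)

print(gap_small, gap_large)
```

In this toy case the specialist's accuracy stays at 0.5 while the gap grows, because the stronger agent is right on papers the specialist gets wrong.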

7:41Bella: And this is where that fourth gadget — the dumb neighbor-lookup — earns its keep. Whether "ask your neighbors" is a good idea depends on a property of the graph called homophily. High homophily just means connected things tend to be alike — papers mostly cite papers in their own field. When that's true, looking at your neighbors' categories and guessing the same is a great strategy. When it's false, it backfires. So in the high-homophily regions, where neighbors are a strong signal, they checked how the trivial neighbor-lookup stacks up against the sophisticated specialist. The dumb gadget wins — eighty-one percent accuracy against the specialist's seventy-one. A genuinely better option, requiring no cleverness at all, sitting right there in the toolbox. And the agent defers to the specialist anyway.
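Editor's sketch: homophily and the "ask your neighbors" gadget can both be written in a few lines. This uses edge homophily (the share of edges joining same-label nodes) and a plain majority vote; the paper's exact definitions may differ, and the graph below is a made-up toy.

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def edge_homophily(edges: Sequence[tuple[int, int]], labels: Mapping[int, int]) -> float:
    """Share of edges whose two endpoints carry the same label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)


def neighbor_vote(node: int, adjacency: Mapping[int, Sequence[int]],
                  known_labels: Mapping[int, int]) -> Optional[int]:
    """Guess a node's label as the most common label among its labeled neighbors."""
    votes = Counter(
        known_labels[n] for n in adjacency.get(node, []) if n in known_labels
    )
    if not votes:
        return None  # no labeled neighbors, so no guess
    return votes.most_common(1)[0][0]


# Toy citation graph: nodes 0-2 share category 12, node 3 is category 30.
labels = {0: 12, 1: 12, 2: 12, 3: 30}
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adjacency = {4: [0, 1, 3]}  # unlabeled paper 4 cites papers 0, 1 and 3

homophily = edge_homophily(edges, labels)
guess = neighbor_vote(4, adjacency, labels)
print(homophily, guess)
```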

8:34Finn: So the obvious engineering fix is a gate. Detect when the neighborhood looks clean, route to the neighbor gadget, otherwise fall back to the specialist. Did that work?

8:46Bella: Partly — and this is where the paper gets really honest. Where the signal is good, those clean, high-homophily neighborhoods, the gate recovers about half the gap. Accuracy climbs from seventy-one up to eighty-three. But it hurts in the messier regions, and when you add it all up across the whole graph, global accuracy goes from forty-eight-point-one to forty-seven-point-five. Slightly negative. Net nothing. And there's a lovely bit of scientific hygiene here, Finn. A single run had suggested a seven-point gain — and when they reran it across five seeds, that gain just evaporated. They report it as a warning against trusting single-seed evaluations. That kind of candor is what makes me trust the rest of the numbers.
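Editor's sketch: a gate of the kind Finn proposes, plus the multi-seed check Bella praises. The purity signal, the 0.8 threshold, and the per-seed numbers are all assumptions chosen for illustration; the episode does not say how the authors' gate decides.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Sequence


def neighborhood_purity(neighbor_labels: Sequence[int]) -> float:
    """Share of neighbors holding the most common label (0.0 if there are none)."""
    if not neighbor_labels:
        return 0.0
    return Counter(neighbor_labels).most_common(1)[0][1] / len(neighbor_labels)


def gated_prediction(specialist_pred: int, neighbor_labels: Sequence[int],
                     threshold: float = 0.8) -> int:
    """Use the neighbor majority when the neighborhood looks clean, else the specialist."""
    if neighborhood_purity(neighbor_labels) >= threshold:
        return Counter(neighbor_labels).most_common(1)[0][0]
    return specialist_pred


clean = gated_prediction(7, [2, 2, 2, 2, 3])  # purity 0.8: route to neighbors
messy = gated_prediction(7, [2, 3, 4])        # purity 1/3: keep the specialist

# Made-up accuracy changes (gate minus specialist) from five seeds.
# A single lucky seed can look like a large gain that the average does not support.
seed_deltas = [0.07, -0.03, -0.02, -0.04, -0.01]
avg_delta = mean(seed_deltas)
spread = stdev(seed_deltas)
print(clean, messy, round(avg_delta, 3), round(spread, 3))
```

Reporting the mean and spread across seeds, rather than the best single run, is the habit the hosts single out.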

9:33Finn: Right — and instead of just declaring the gate a failure, they asked a much better question. Is our gate bad, or is the problem itself hard? Could any gate — the best one you could possibly build — do better, using only the signals available at decision time? The specialist's confidence, how pure the neighborhood looks, that kind of thing.

9:53Bella: And the answer is the most interesting turn in the paper.

9:56Finn: Even the best possible router over those signals could recover only about one-sixth to one-third of the gap. The rest is genuinely unrecoverable. It's not that they built a bad detector — the information you'd need, to know when the specialist is wrong, simply isn't present in the clues you have.

10:14Bella: It's like trying to predict which patients will have a rare drug reaction using only their height and weight. No amount of cleverness pulls out information the data doesn't contain.
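Editor's sketch: one simple way to estimate such a ceiling, assuming the decision-time signals have been discretized into bins. Within each bin the best any router can do is send every item to the arm that is right most often there, so the share of the oracle gap it recovers is bounded by what the bins can distinguish. This is an in-sample illustration with made-up data, not the authors' procedure.

```python
from collections import defaultdict
from typing import Hashable, Mapping, Sequence


def recoverable_fraction(signals: Sequence[Hashable],
                         correct_by_arm: Mapping[str, Sequence[int]],
                         default: str = "specialist") -> float:
    """Share of the oracle gap that the best signal-only router recovers."""
    n = len(signals)
    arms = list(correct_by_arm)

    # Count, per signal bin, how many items each arm gets right.
    hits = defaultdict(lambda: dict.fromkeys(arms, 0))
    for i, s in enumerate(signals):
        for arm in arms:
            hits[s][arm] += correct_by_arm[arm][i]

    best_router = sum(max(h.values()) for h in hits.values()) / n
    baseline = sum(correct_by_arm[default]) / n
    oracle = sum(any(correct_by_arm[a][i] for a in arms) for i in range(n)) / n
    if oracle == baseline:
        return 0.0
    return (best_router - baseline) / (oracle - baseline)


# Toy data: 1 means the arm was right on that paper.
signals = ["clean"] * 4 + ["messy"] * 4
correct_by_arm = {
    "specialist": [1, 0, 1, 0, 1, 1, 0, 1],
    "neighbor":   [1, 1, 1, 0, 0, 1, 1, 0],
}
fraction = recoverable_fraction(signals, correct_by_arm)
print(fraction)
```

The toy fraction is not meant to match the one-sixth to one-third reported in the episode. The point is that it sits below 1 whenever the arms' errors differ within a bin in ways the signal cannot see.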

10:25Finn: And they checked it wasn't a fluke — replicated the whole thing on a second dataset, Wikipedia computer-science articles. The parrot effect holds, the gap stays positive everywhere, and the information ceiling reproduces even tighter. The exact regime where the cost peaks flips between the two datasets, but the failure mode itself generalizes.

10:45Bella: So let me hand you the skeptic's seat properly, Finn, because there are a few places I'd push.

10:51Finn: Yeah. And the biggest one the authors themselves surface — that extreme parroting, the ninety-eight percent — is partly specific to this one model family, Qwen. When they ran the same setup on two other seven-billion models, Mistral and OLMo, both used the tool readily but deferred only about half to sixty percent of the time. Far below ninety-eight. So the direction generalizes — every agent they tested deferred on a majority of papers. But the dramatic, near-total version is one family's behavior, and the title advertises the dramatic version. That's a fair hit.

11:25Bella: What about the scaffold itself? The prompt literally tells the agent to gather evidence with tools first, then answer.

11:32Finn: That's the second one. The agent isn't freely choosing to invoke — it's instructed to. The authors are careful that their claim is about what happens after the call, whether it weighs or adopts, not whether it calls at all. But a prompt that says "gather evidence first" might prime the model to treat the tool's output as authoritative. Frame the tool as "an optional second opinion you can disregard" and you might see far less deference. They didn't test that. But the one that actually stays with me — the one I don't think the paper closes — is whether this is really mindless parroting, or just reasonable risk-aversion. The specialist is a strong, purpose-built model on a forty-category task. Most of the time it probably is the best single bet. The gap is real, but it's an oracle gap — measured by a chooser that already knows the right answer. A twelve-to-twenty-percent oracle gap might describe a regime where trusting a strong tool by default is, honestly, a defensible policy.

12:33Bella: And the information ceiling almost argues for that. If the signal that tells you when to distrust the tool genuinely isn't available, then deferring might be the rational move, not a failure of judgment.

12:46Finn: You're putting your finger on it, Bella — that's the tension I can't fully resolve. The behavior looks like mindless copying. But it might be that the agent has quietly figured out the tool is usually right and the override signal is mostly unavailable — in which case it's behaving sensibly, and we're the ones calling it a parrot. I lean toward the authors' read, because of that self-betrayal number — it does hold a different opinion and drops it the instant the tool speaks. But I don't think the experiment cleanly separates "parrot" from "rational deferrer." That one stays open for me.

13:23Bella: Whatever you call it, the practical lesson is sharp. The whole tool-augmented-agent paradigm assumes that when "agent plus tool" beats "agent alone," the agent is contributing judgment. This paper says: check whether it also beats "tool alone." Because the gains might be entirely the tool's, with the agent just along for the ride — including silently inheriting the tool's mistakes on the rare cases.
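Editor's sketch: the check Bella recommends amounts to reporting two gains instead of one. The function name and toy predictions are assumptions for illustration.

```python
from typing import Sequence


def attribute_gain(labels: Sequence[int], agent_alone: Sequence[int],
                   tool_alone: Sequence[int],
                   agent_with_tool: Sequence[int]) -> dict[str, float]:
    """Compare the combined system against both of its parts, not just the agent."""
    def acc(preds: Sequence[int]) -> float:
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    combined = acc(agent_with_tool)
    return {
        "gain_over_agent": combined - acc(agent_alone),  # the usual headline number
        "gain_over_tool": combined - acc(tool_alone),    # the number that shows judgment
    }


labels = [0, 1, 2, 3, 4, 5, 6, 7]
agent_alone = [0, 1, 2, 9, 9, 9, 9, 9]      # 3 of 8 correct
tool_alone = [0, 1, 2, 3, 4, 5, 9, 9]       # 6 of 8 correct
agent_with_tool = [0, 1, 2, 3, 4, 5, 9, 9]  # copies the tool, including its misses

report = attribute_gain(labels, agent_alone, tool_alone, agent_with_tool)
print(report)
```

A large `gain_over_agent` with a `gain_over_tool` of zero means the improvement is entirely the tool's.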

13:49Finn: And the scaling lesson is the uncomfortable one. The reflex in this field is that next year's bigger model fixes today's problems. Here, the desirable behavior — discerning, skeptical tool use — doesn't just fail to emerge with scale, it gets worse, and the cost gets worse, because stronger models generate better alternatives they then waste. You can't scale your way out of this. The authors' closing thesis is that selective tool use has to be designed in, not expected to emerge.

14:21Bella: For anyone building agent pipelines in high-stakes places — fraud detection, screening, content moderation — that's the warning to carry home. A tool that's reliable on the common cases and wrong on the tail, and an agent that silently inherits the tail errors instead of catching them. The paper and a few related reads are in the show notes if you want to pull on this thread yourself.

14:46Finn: And the full transcript's on paperdive.ai — every term we used in there is tappable, with links over to the other episodes that touch the same ideas.

14:56Bella: This has been AI Papers: A Deep Dive. Thanks for listening.
