Episode 066 · May 22, 2026 · 26 min

Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer

Hu, Zhang, Xu et al.

Agentic AI · Computer Use Agents · GUI Automation
Paper: ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Venue: arXiv:2605.12481
Year: 2026
Read the paper: arxiv.org/abs/2605.12481

Hand Claude 4.5 Sonnet a more powerful action space for operating a computer, and its success rate drops thirteen points. That counterintuitive collapse is the diagnostic at the heart of a new paper that argues the field has been conflating capability with judgment — and shows a surprisingly clever recipe for training the latter.

What you'll take away

  • Why adding tool-calling abilities to agents often degrades performance, and how the same expanded action space produces opposite failure modes in different models
  • The 'forked road' framing: every step is a choice between clicking and tool-calling, and most training regimes never teach the agent how to choose
  • A data-synthesis trick that manufactures hybrid GUI-and-tool training trajectories from click-only data, with a grounding constraint that keeps the synthesis honest
  • Why the paper rewards path efficiency rather than tool use directly — and how that indirect signal trains judgment instead of a proxy
  • Where the headline 66% relative improvement is honest and where it's flattering framing, including the dependence on a strong model in the synthesis loop
  • The VS Code case study where the agent uses tool calls to set up folders but correctly switches back to clicking when it hits a dialog box no tool can dismiss

Chapters

  1. 00:00 The thirteen-point collapse
  2. 03:14 The forked road problem
  3. 06:37 Manufacturing hybrid trajectories from click-only data
  4. 09:55 Bootstrapping and sharpening judgment at the forks
  5. 13:14 Reward design: appropriateness and path efficiency
  6. 16:32 Results and the selective-use signature
  7. 19:51 Honest pushback
  8. 23:09 What generalizes beyond this paper


0:00Juniper: Take Claude 4.5 Sonnet — arguably one of the best general-purpose AI models in the world — and put it in front of a computer with a mouse and a keyboard. Tell it to operate the machine the way you would: open files, fill in spreadsheets, navigate menus, click buttons. On a standard benchmark called OSWorld, it succeeds about sixty-two percent of the time. Now you do something that should obviously help. You give it more options. In addition to clicking and typing, you let it call structured tools — single function calls that do things like "set the fill color of column B to blue" in one shot, instead of fifteen clicks through nested menus. More capability. Faster shortcuts. You'd bet on the score going up.

0:47Eric: It drops. From sixty-two to forty-eight. A thirteen-point collapse from making the agent *more* powerful.

0:54Juniper: And it's not just Claude. The paper we're digging into today goes through frontier model after frontier model and finds the same pattern. The paper went up on arXiv on May twelfth, twenty-twenty-six, and we're recording ten days later. What you're hearing is AI-generated — I'm Juniper, that's Eric, we're both AI voices from Eleven Labs, and the script is from Anthropic's Claude Opus 4.7. The producer isn't affiliated with either company. The paper is called "ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents," and the reason that thirteen-point Claude collapse matters is that it isn't an embarrassing edge case the paper found and fixed. It's the central diagnostic. It's the thing the authors think the whole field has been getting wrong about what makes agents reliable.

1:46Eric: So let's stay with the puzzle before we go anywhere near the solution. Why would adding tools make the agent worse? On its face, that's strange. The agent still has every option it used to have — clicking still works, typing still works. It just now also has these faster shortcuts available. In the worst case, you'd expect it to ignore the shortcuts and perform exactly the same as before.

2:12Juniper: Right, and the data makes it weirder, not simpler. Because the failure isn't uniform. Different models fail in opposite directions. Take Qwen — the eight-billion-parameter version. Give it tools, and it almost never uses them. The paper measures average tool calls per task at zero-point-zero-zero-three. Essentially zero. It freezes. It treats the tools like they're not there, and its performance barely budges. Now look at the same model family at two-hundred-and-thirty-five billion parameters. Same action space, same tools available. It calls tools an average of six times per task. It hammers them constantly, including on tasks where the tool is the wrong move and just makes things worse.

2:58Eric: Same options, opposite pathologies. One model is too shy to touch the new capability, the other is too eager and won't stop reaching for it.

3:07Juniper: And that's the framing the authors land on. They call it the forked road problem. At every single step the agent takes, there's a fork. Continue clicking through the GUI? Switch to a tool call? And the kicker is that each fork doesn't just decide that one step — it commits the agent to a branch of subsequent behavior. If you call a tool, you're now in tool-world for a while. If you click, you're still navigating screens. The agent has to make that decision constantly, and most current training regimes don't give it any help making it.

3:42Eric: There's an analogy that lives nicely with this. Imagine hiring a contractor who's excellent with hand tools to renovate your kitchen. First day, you lay out a full set of power tools they've never used. A great contractor adapts. A mediocre one either ignores the power tools and works slowly with what they know — that's the eight-billion Qwen — or grabs the circular saw for everything, including jobs that needed a screwdriver. That's the two-hundred-thirty-five-billion Qwen. The power tools didn't make the contractor worse in some absolute sense. They exposed a gap in judgment that was always there. It just didn't matter when the toolbox was smaller.

4:24Juniper: And that framing is the thing I find most interesting about this paper, honestly. Because there's a default assumption in a lot of research that goes: more actions, more abilities, more capability. Add to the repertoire and the agent gets stronger. What this paper is documenting is that capability and judgment are not the same thing. You can hand a model a more powerful action space and degrade its end-to-end performance, because the bottleneck wasn't the size of the action space — it was knowing how to choose within it.

4:58Eric: Okay, so the authors of ToolCUA take this puzzle seriously, and they ask: what would it take to actually train that judgment? And the answer they give has three parts, all of which depend on solving a chicken-and-egg problem first. Which is the part of the paper I want to dig into, because it's the part I find most genuinely clever.

5:19Juniper: Go for it.

5:20Eric: The chicken-and-egg is this. To train the agent on when to switch between clicking and tool-calling, you need training data that shows good switching behavior. You need trajectories — sequences of actions — where the agent uses the tool sometimes, and clicks sometimes, and the choice is appropriate at every step. The problem is that this kind of data essentially doesn't exist at scale. There are thousands of GUI-only trajectories — researchers have been collecting those for a while, people demonstrating tasks by clicking around. But interleaved GUI-and-tool trajectories? Nobody has those, because collecting them means having a real application with real tools exposed via a real API, and instrumenting all of it in a sandbox, and getting humans to demonstrate hybrid workflows. Every application is different, the APIs change, the sandboxing is painful. It's the kind of data collection problem that just doesn't scale.

6:21Juniper: And the authors' move here is to invert the problem entirely.

6:25Eric: Exactly. Instead of collecting hybrid trajectories, they manufacture them. From click-only data that already exists. And the way they do it is worth walking through, because there's one specific design choice that I think is the real engineering contribution. So here's the recipe. You start with a click-only trajectory — a recording of a human or a model completing some task using only GUI actions. You feed that to a multimodal model — screenshots, actions, the goal — and you ask the model to invent a library of tools that *could have* been used to accomplish the same effect. Not generic tool templates — tools grounded in this specific trajectory, at varying levels of abstraction. So one might be a single-action wrapper like "open Chrome settings." Another might be a multi-step composite like "open Chrome language settings."

7:20Juniper: A screenplay adaptation, in a way. The click trajectory is the novel — every detail of what happened, every keystroke and click. The tool trajectory is the screenplay — compressing sequences of actions into single named scenes. And just like a real adaptation, you can produce hybrid versions where some scenes are written out in full and others are compressed into stage directions.

7:45Eric: That's the right picture. And here's the specific clever move. When the model generates a tool-only version of the trajectory, it has to ground each tool call to a real screenshot from the original. The post-execution state of every synthetic tool call has to point at an actual frame that exists in the source data. If no real screenshot matches, the synthetic call is suspect and gets thrown out. That's the rule that keeps the whole pipeline honest. It's the screenwriter's constraint that every compressed scene has to end on a frame that's actually in the novel's continuity. Without that constraint, the model would just hallucinate tools doing whatever it imagined they'd do, and you'd be training the agent on synthetic effects that don't correspond to anything real.
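
A minimal sketch of the grounding rule described above, in Python. The `SyntheticToolCall` record, its `post_state_frame` field, and the forward-in-time check are assumptions made for illustration. The episode only says that each synthetic call must point at a real source screenshot or be thrown out; it does not describe the paper's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticToolCall:
    name: str  # hypothetical tool name invented by the synthesizer
    # Index of the source screenshot showing the state right after the call,
    # or None if the synthesizer could not point at any real frame.
    post_state_frame: Optional[int]


def keep_grounded(calls, num_source_frames):
    """Keep only calls whose post-state is a real frame that moves forward in time."""
    kept = []
    last_frame = -1
    for call in calls:
        frame = call.post_state_frame
        if frame is None or not (0 <= frame < num_source_frames):
            continue  # no real screenshot backs this call, so discard it
        if frame <= last_frame:
            continue  # assumption: a call that rewinds the recording is suspect
        kept.append(call)
        last_frame = frame
    return kept
```

In this reading, a call is trusted only because a recorded frame vouches for its effect, which is what stops imagined tool behavior from entering the training set.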

8:35Juniper: And then there's a third step where they create the actual hybrid training data.

8:40Eric: Right. They take that tool-only trajectory and randomly swap some tool calls back to the original GUI action sequences. And — this is the part that matters — when they do the swap, they also remove the swapped-out tool from the available library. So now you have a trajectory where the agent sometimes uses tools, and sometimes clicks, and crucially, when it clicks, it's because the tool wasn't available. Every swap creates what the paper calls a "switching point" — a moment where the agent has to make an explicit choice between modes. They generate ten thousand of these interleaved trajectories, totaling about a hundred and eighty thousand steps, with five thousand critical switching points called out as a separate dataset for later.
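
The swap step can be sketched as follows, under stated assumptions. `ToolStep`, `make_hybrid`, the `swap_prob` parameter, and the choice to decide swaps per tool name are all hypothetical. The episode specifies only that some tool calls are randomly swapped back to their original GUI actions and that the swapped-out tool is removed from the library.

```python
import random
from dataclasses import dataclass


@dataclass
class ToolStep:
    tool: str          # name of the synthesized tool (hypothetical)
    gui_actions: list  # original GUI actions this tool call replaced


def make_hybrid(tool_steps, library, swap_prob, rng):
    """Swap some tools back to their GUI spans and hide them from the library."""
    # Decide per tool name, so a removed tool never appears later in the trajectory.
    names = sorted({step.tool for step in tool_steps})
    swapped = {name for name in names if rng.random() < swap_prob}

    hybrid, switch_points = [], []
    for step in tool_steps:
        if step.tool in swapped:
            # Record the index where the agent must fall back to the GUI.
            switch_points.append(len(hybrid))
            hybrid.extend(("gui", action) for action in step.gui_actions)
        else:
            hybrid.append(("tool", step.tool))
    return hybrid, set(library) - swapped, switch_points
```

Returning the reduced library alongside the trajectory is what makes each GUI span explainable: in the resulting example, clicking is the right move because the shortcut is not on offer.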

9:29Juniper: And I want to flag what's load-bearing about that for the rest of the paper. The pipeline isn't just a data factory — it's a data factory that specifically manufactures the moments where judgment has to be exercised. Those switching points aren't a side effect. They're the whole point. Because the training that comes next is going to lean directly on them.

9:53Eric: Which gets us to act two and act three of the recipe — the actual training. You want to take this?

10:00Juniper: Yeah. So you've got the synthetic data. Act two is what they call bootstrapping. You do standard supervised fine-tuning on all hundred-and-eighty-thousand steps, which just teaches the base model the shape of tool calling — schemas, parameters, how to format a response. Nothing fancy. Then you take the five thousand switching points — the forks in the road — and you do single-turn reinforcement learning on just those. At each switching point, the model produces multiple candidate completions, you score them against the ground-truth right choice, and you nudge the model toward the better answers. So before the agent ever attempts a full multi-step task, its judgment at decision boundaries has been sharpened directly.

10:48Eric: Quick note on the reinforcement learning method, because the paper uses one of these acronyms — GRPO — that's worth one sentence of unpacking and then setting aside. The intuition is this: instead of scoring each attempt against some absolute target, you have the agent attempt the same task several times in parallel, and you reward attempts that did better than their own sibling attempts. It's like grading on a curve where everyone in the curve just took the same test. Nice property — no separate value network to train, naturally normalized signals. The paper uses this throughout, and the technique becomes load-bearing in act three for a specific reason we'll get to.
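
The "grading on a curve" intuition corresponds to a group-relative advantage. Below is a minimal sketch, assuming the common formulation in which each attempt's reward is centered on the group mean and scaled by the group's standard deviation. The function name, the `eps` guard, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to sibling attempts at the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    # eps keeps the division defined when every attempt earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1, 0, 0, 1]`, the two successful attempts get advantages close to +1 and the two failures close to -1. If every attempt earns the same reward, all advantages are zero and the group carries no learning signal.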

11:33Juniper: Act three is where the most original piece of the paper lives. Online reinforcement learning, full multi-step tasks, real sandboxed environment — and a custom reward function that adds two terms beyond just "did the task succeed." These two terms are the heart of the training judgment they're after, and I want to spend some time on them because the design is genuinely well-thought-out. The first term is what they call tool appropriateness. Every training task gets a binary label up front: tool-beneficial or not. Some tasks genuinely benefit from tool calls — modifying a column of a spreadsheet, creating a pivot table, anything where one structured call replaces dozens of clicks. Other tasks don't — sometimes the natural path is just clicking through. The reward is: if the task was tool-beneficial and the agent used at least one tool, bonus. If the task was non-tool-beneficial and the agent used zero tools, also bonus. Otherwise, nothing.

12:40Eric: So it's not just rewarding tool use. It's rewarding the *match* between tool use and task type.

12:47Juniper: Exactly, and that's why it's clever. Think about what happens if you just rewarded "task completed." The agent can't tell whether it succeeded *because* the tool was a good choice or *despite* an unnecessary tool call that almost broke things. The signal is muddy. By giving a separate signal for appropriateness, the authors prevent two failure modes at once. They prevent the agent from learning "tools are dangerous, avoid them" — the under-using pathology. And they prevent it from learning "tools are helpful, use them everywhere" — the over-using pathology.
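
The appropriateness rule as described is small enough to write down directly. In this sketch the function name and the `bonus` magnitude are assumptions; the episode gives only the logic: a bonus when tool use matches the task's binary label, and nothing otherwise.

```python
def tool_appropriateness_bonus(tool_beneficial: bool, num_tool_calls: int,
                               bonus: float = 1.0) -> float:
    """Reward the match between the task label and whether any tool was used."""
    used_tools = num_tool_calls > 0
    # Bonus for tools on a tool-beneficial task, or no tools on the other kind.
    return bonus if used_tools == tool_beneficial else 0.0
```

Because the rule only checks whether at least one tool was called, it separates the under-using and over-using pathologies but says nothing about how many calls were made. That part is left to the path-efficiency term.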

13:27Eric: The GPS analogy is the one that nails this for me. A GPS that just rewards "did you reach the destination" will sometimes route you through twenty extra turns because, hey, you arrived. A better GPS rewards arrival *and* whether the route matched the kind of trip — highway when you wanted speed, scenic when you wanted scenery. The tool-appropriateness reward is the second kind. It asks not just "did the task get done" but "was the style of getting it done appropriate for this task." Two different questions, two different signals.

14:03Juniper: The second reward term is the one I find most elegant, structurally. It's a path-efficiency reward, and it's where GRPO becomes load-bearing. Remember, at every step the agent is making multiple parallel attempts at the same task. Within each group of parallel attempts, you can compute the average step count — how long these attempts ran on average. The reward is then defined relative to that group average. If your attempt finished in fewer steps than the group's average, you get a linear bonus proportional to how much shorter you were. If your attempt was longer than average, you get a penalty — but the penalty decays exponentially toward zero as you approach the maximum allowed length.

14:51Eric: The asymmetry is the part to linger on. Linear bonus going down, exponential decay going up. Picture a running club where eight people race the same route. The coach pays out small bonuses: anyone faster than the group's average time gets a bonus proportional to how much faster they were. Anyone slower gets a penalty, but the penalty flattens out. Finishing dead last isn't catastrophically worse than finishing just below average. The result is that runners are strongly motivated to be among the fastest, but not destroyed by the occasional bad day.

15:29Juniper: And the group-relative framing handles the obvious objection. A fixed step-count threshold would be wrong for every task — some tasks are inherently long, others short. Comparing to the agent's own peer attempts at the same task normalizes naturally. You're always being judged against runners on the same route.
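
One way to picture the path-efficiency term is the sketch below. The episode does not give the paper's formula, so the functional form here is an assumption: a linear bonus for finishing under the group's average step count, and, for longer attempts, a penalty whose growth slows exponentially so that it flattens out below a cap as the attempt nears the maximum length. The names `scale`, `cap`, and `rate` are hypothetical.

```python
import math


def path_efficiency_reward(steps, group_steps, max_steps,
                           scale=1.0, cap=0.5, rate=3.0):
    """Group-relative step-count shaping: linear bonus, flattening penalty."""
    avg = sum(group_steps) / len(group_steps)
    if steps <= avg:
        # Linear bonus, proportional to how far below the group average we are.
        return scale * (avg - steps) / avg
    # Fraction of the way from the group average to the step limit.
    excess = (steps - avg) / max(max_steps - avg, 1e-8)
    # The penalty grows quickly at first, then saturates below `cap`.
    return -cap * (1.0 - math.exp(-rate * excess))
```

With a group that took `[10, 14, 18, 22]` steps, the average is 16, so a 10-step attempt earns (16 - 10) / 16 = 0.375, while attempts at 26 and 30 steps receive nearly the same penalty. Because the average is recomputed per group, the same code applies to inherently long and inherently short tasks.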

15:51Eric: Here's the question I want to put to the listener: why is path-efficiency the right way to reward good tool use, instead of just... rewarding tool use directly?

16:02Juniper: That's the move that makes this work. The authors deliberately don't make "use tools" an explicit objective. They reward shorter paths. And it turns out that tool calls typically replace multiple GUI operations, so a shorter trajectory is usually one that found the right tool at the right moment. The path-efficiency reward rewards good tool use indirectly, through the consequence — finishing faster — rather than the mechanism. Which means the agent isn't being told "use more tools." It's being told "find shorter paths." If the shorter path happens to use a tool, great. If the shorter path happens to be pure clicks, also great.

16:46Eric: That's the design principle. Reward the outcome you want, not the proxy for it.

16:51Juniper: And the two terms together — appropriateness plus efficiency — give the agent a signal that's much richer than success or failure. They turn the question from "did you finish?" into "did you finish in a way that matched the task and beat your peers?" Which is the kind of signal that actually trains judgment.

17:12Eric: Okay, so what falls out of all this? They run the training, they evaluate on this benchmark called OSWorld-MCP — which is the OSWorld environment extended with a tool layer through the MCP standard, the same protocol behind a lot of recent tool-calling work. The starting model — the eight-billion Qwen — scores about twenty-eight percent on the benchmark. ToolCUA, which is that same model after the three-stage training, scores almost forty-seven percent. A sixty-six percent relative improvement. Beats Sonnet. Within two points of Claude 4.5 Sonnet. With an eight-billion-parameter model.

17:55Juniper: And the diagnostic numbers underneath that score are honestly more interesting than the score itself. Remember the two failure modes? The eight-billion Qwen was barely touching tools — zero-point-zero-zero-three tool calls per task. The two-hundred-thirty-five-billion version was at six per task. ToolCUA settles at zero-point-seven-four tool calls per task. Almost an order of magnitude less than the over-users. But — and this is the key — selective. It's also got the lowest average completion length of any model evaluated. Under fifteen steps per task, versus over nineteen for the baseline and most of the variants.

18:39Eric: Selective and efficient. It's using tools less often than the over-users, and it's also finishing tasks faster. Those two facts together are what "good judgment" looks like in the data.

18:52Juniper: There's a case study from the paper that does more emotional work for this idea than any number can. It's the VS Code example. The agent is asked to set up a workspace with multiple folders. So it uses tool calls to add the folders — clean, fast, structured. Then it hits a dialog box. "Do you trust the authors of the files in this workspace?" And no tool can dismiss that dialog. It's a UI element that exists only as a click target. And ToolCUA correctly switches back to GUI mode, clicks "yes, I trust the authors," and continues with the rest of the task.

19:32Eric: That's the moment. That's what mode-switching looks like when it works. The agent isn't trying to use tools because tools are good. It's using tools for the structured work and clicking for the click-only work, because it understands that those are different kinds of problems.

19:50Juniper: The analogy that lives nicely with this is a skilled assistant who uses email and calendar software for most of their job but knows to walk down the hall when they need a wet signature on a piece of paper. Good orchestration isn't "always use the fastest channel." It's knowing when the fastest channel doesn't apply.

20:10Eric: Okay. Juniper, I want to push on this a bit, because I think there are some genuine open questions even granting that the result is real.

20:19Juniper: Go.

20:20Eric: First one. The tool-appropriateness reward depends entirely on labels the authors provide. Every training task gets a binary tag — tool-beneficial or not — and the reward fires based on that label. Where do the labels come from? The paper says they're from annotations with manual verification, but the entire reward signal for appropriate tool use is only as good as that human judgment. If a task is mislabeled, you're training the agent to do the wrong thing on it. The paper doesn't quantify how noisy those labels are, and it doesn't test what happens if you perturb them. So there's a kind of supervision dependency hidden inside what looks like a self-directed reward.

21:04Juniper: That's fair. And to be honest, I think a binary label is the part of the design that feels least robust. The world doesn't divide neatly into tool-beneficial and not-tool-beneficial. There are tasks where tools help on some steps and hurt on others, and a binary tag can't capture that.

21:23Eric: Second push. The synthesis pipeline depends on having a strong model in the loop. The authors mention in their limitations that when they tried a weaker model as the synthesizer, the quality of generated trajectories dropped noticeably. So the recipe isn't fully self-bootstrapping. You can train a great eight-billion-parameter target model, but only if you have access to a much bigger model to manufacture the training data. That's a real cost, and it tempers the "small model beats Claude" framing a little.

21:56Juniper: Yeah. And I'd add — the synthesized tools aren't real APIs. They're semantic abstractions. The pipeline manufactures tool calls and their effects from screenshots; during the offline data generation, those tool calls never actually touch real software. The grounding trick anchors visual outcomes to real frames, but the calls themselves are imagined. Only the online RL stage uses a real environment. So a question I'd want answered in follow-up work: how robust is this when it encounters tools that fail unexpectedly, or return weird results, or have side effects the training data didn't include?

22:36Eric: Third one is about the headline number. The sixty-six percent relative improvement is framed against the base model, which is a weak baseline. Against the strongest prior eight-billion specialized model — one-point-five — the improvement is more like three absolute points, or about seven percent relative. Both numbers are true. The abstract foregrounds the more flattering framing. That's a thing to keep in mind whenever a paper leads with relative gains.
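
The gap between the two framings is plain arithmetic. The sketch below uses the rounded figures as spoken in the episode; the prior model's score of 44 is an assumption inferred from "about three absolute points" below "almost forty-seven", not a number quoted from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    return (new - baseline) / baseline


# Rounded, spoken figures (percent success on the benchmark).
vs_base_model = relative_improvement(28.0, 47.0)   # against the weak baseline
vs_prior_model = relative_improvement(44.0, 47.0)  # against the assumed prior score
```

With these rounded inputs the first ratio comes out near two-thirds and the second near seven percent. The quoted sixty-six percent presumably comes from the unrounded scores. Both figures describe the same final score, and only the denominator changes.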

23:07Juniper: Right. And the benchmark itself is narrow. OSWorld-MCP is basically the only hybrid-action benchmark that has the right structure right now. The cross-platform Windows result is some evidence that the model generalizes, but the headline lives or dies on a single benchmark that the team also drew training data from. The authors acknowledge this — they hold out one subset for evaluation — but the overall coverage is narrow because the field doesn't yet have the benchmarks to test more broadly.

23:39Eric: All of which is to say: I think the diagnostic finding — that adding tool calls hurts most existing models — is the strongest contribution of the paper, and I'd believe it even if everything else turned out to be over-claimed. The pipeline is clever and the reward shaping is principled. But the deeper thing this paper is doing is reframing what we should be training for. Not "what can the agent do?" but "can the agent reliably choose among the things it can do?"

24:09Juniper: That's the line I'd want listeners to take away. Capability and judgment aren't the same thing. You can scale the first without getting the second. And once you see that distinction, a lot of the recent disappointments with AI agents start to look less mysterious. The agents can do the individual steps. They're just standing at every fork in the road making coin-flip decisions about which way to go, and a thousand-step task that requires a thousand coin flips of judgment will fall apart no matter how good each individual step is.

24:44Eric: And the recipe in this paper — manufacture training data that includes the forks, train judgment specifically at the forks, then run online reinforcement learning with rewards that decouple "did it work" from "did it work appropriately" — that recipe is generalizable. The specific instance is hybrid GUI-and-tool action spaces for computer use. But the same shape applies anywhere an agent has multiple modes of action and has to pick the right one. Which is essentially everywhere agentic AI is going.

25:18Juniper: One last thing worth flagging. The data-bootstrapping pattern in this paper — using a strong model as a factory to generate training data for a smaller one, with grounding tricks that anchor the synthetic data to real observations — that recipe shows up across a lot of recent research. ToolCUA is a particularly clean worked example, but it's part of a broader move in the field. The bottleneck for agentic AI right now isn't model size; it's training data that captures the right kind of behavior. And the people who figure out how to manufacture that data while keeping it honest are the ones moving the field forward.

26:00Eric: A good place to end. The paper's linked in the show notes, along with some further reading if you want to go deeper on hybrid action spaces or the synthetic-data bootstrapping pattern. And if you want the full transcript with definitions baked in, plus the concept pages that tie this episode to related ones we've done on training and reward design, that's all on paperdive.ai.

26:27Juniper: Thanks for listening to AI Papers: A Deep Dive.