Episode 156 · Jun 18, 2026 · 20 min

Why More Human Demonstrations Made a Computer-Use Agent Worse

Jung, Lu, Cui et al.

Computer-use Agents
Paper: ProCUA-SFT Technical Report
Venue: arXiv:2606.17321
Year: 2026
Read the paper: arxiv.org/abs/2606.17321

An NVIDIA team fed their computer-use agent the largest pile of real human demonstrations ever released, and watched its success rate fall from one task in four to one in ten. Then they threw the human data out entirely, let a single model generate its own training set, and nearly doubled the baseline. This episode digs into why the obvious fix backfired, and what the more defensible version of "synthetic beats human" actually is.

What you'll take away

  • Why 22,500 real human demonstrations made the model substantially worse: too-easy single-app tasks, annotation noise, and a pull away from the cross-application reasoning the benchmark demands
  • The structural fix at the heart of the paper: collapsing the planner and actor into one model so it never proposes goals it can't carry out, closing the gap by construction rather than by filtering
  • How a 'mise en place' feasibility-verification step stops the model from inventing tasks involving files and apps that don't exist, and why infeasible tasks breed a hallucinating agent
  • The counterintuitive diversity result: balancing training data by action type actively hurt, while balancing by application combination was the only strategy that beat the baseline
  • Why the synthetic data teaches a more robust interaction style (more keyboard shortcuts, fewer brittle pixel-perfect clicks)
  • The case for skepticism: the 45% gain is really distillation from a strong teacher, everything is measured on OSWorld using data partly seeded from OSWorld's own configs, and the most novel idea, the verification loop, has the least clean evidence behind it

Chapters

  1. 00:00 The collapse: gold-standard human data poisons the model
  2. 02:11 Why real human data caused negative transfer
  3. 04:23 Infeasible tasks and the mise-en-place fix
  4. 06:35 Seeding realistic, cluttered desktops
  5. 08:47 Collapsing planner and actor into one model
  6. 10:59 Turning one trajectory into many training samples
  7. 13:11 The results and what diversity actually helps
  8. 15:23 Complexity, robustness, and the keyboard shift
  9. 17:35 The caveats: distillation, benchmark fit, and weak evidence for the verifier


0:00 Juniper: An NVIDIA team takes one of the best open-source computer-use models available, a model that already knows how to look at a desktop and click around, and they do the most obvious thing you can do to make it better. They fine-tune it on the largest pile of human demonstrations anyone has ever released: twenty-two thousand five hundred recordings of real people doing real tasks on real computers, across three operating systems. Hours of human expertise. And the model gets worse. Not slightly worse. It falls off a cliff. On the benchmark they care about, it goes from succeeding at about one task in four down to maybe one in ten.

0:40 Finn: Worse than where they started, before any of the human data ever touched it.

0:46 Juniper: Substantially worse. They poured in the gold-standard data and it actively damaged the model. That collapse is the puzzle at the center of the "ProCUA-SFT Technical Report," which went up on arXiv on June fifteenth, twenty-twenty-six, and we're recording three days later. Quick note before we dig in: this episode is AI-generated. The script was written by one of Anthropic's models, and the two voices you're hearing (I'm Juniper, and that was Finn) are both AI voices from Eleven Labs, with a producer who isn't affiliated with either company. And the reason that collapse is worth a whole episode is what the team did next: they threw out the human data entirely, generated their own training set with no people in the loop at all, and pushed the same model up to forty-five percent.

1:36 Finn: Forty-five, from a baseline around twenty-six. So the synthetic data didn't just undo the damage. It nearly doubled where they started. I want to come back to whether that number is as clean as it sounds, because there's a framing question buried in it. But the collapse first. Juniper, why does real human data poison the model?

1:57 Juniper: Three reasons, and they compound. The first is that the human tasks are too easy. Most of those demonstrations live inside a single application (open one app, do one thing, done) with a median length around seventeen steps. The benchmark they test on is full of tasks that bounce across applications: pull a number out of a spreadsheet, look something up, drop it into a slide. So the human data never trains the cross-application reasoning the test actually demands. Second, it's crowd-sourced, which means annotation noise: people fumbling, taking weird paths, mislabeling what they did. And third, taken together, you've got a model that already had real skill, and you spend thousands of training steps teaching it to imitate easy, slightly sloppy, single-app behavior. The analogy I keep coming back to is a tennis player who's already pretty good, and then spends months watching thousands of recordings of casual weekend players. The footage is real, real people playing real tennis, but it's easy rallies and loose form, and none of it looks like a competitive match. You come out the other side having absorbed habits that don't transfer. More practice footage, worse player. That's negative transfer, and it's the thing that makes "more real data made it worse" feel natural instead of paradoxical.

3:23 Finn: Right, and once you've diagnosed it that way, the instinct is "fine, generate harder data, more cross-app stuff, cleaner trajectories." But the moment you ask a model to invent its own computer tasks, you walk straight into the nastiest failure mode in synthetic data generation.

3:40 Juniper: The model invents tasks that can't actually be done.

3:45 Finn: Exactly. It writes a goal like "open the Q3 report sitting on the desktop," and there is no Q3 report. The file doesn't exist, or the app it needs isn't installed. And the paper has a sharp line about why that's not just wasted effort. Infeasible tasks, they say, waste budget and, worse, teach the model to invent state. You're not just burning compute on something impossible. You're training the agent to act as if files and apps exist when they don't. An infeasible task breeds a hallucinating agent.

4:18 Juniper: And that's the first real design move in the pipeline, the fix for exactly that. Think about a careful cook doing mise en place. Before they commit to a recipe, they lay every ingredient out on the counter and confirm it's actually there (butter, eggs, flour, yes, yes, yes) so they never get halfway in and discover there's no butter. The pipeline does the same thing. Before it attempts any task, the model has to commit to a goal plus a little checklist of yes-or-no checks: does this spreadsheet exist on the desktop? Is the app it needs installed? Then a separate pass (same model, different job) goes and checks each item against the actual state of the machine. Only if every box is ticked does the task get attempted. And when a check fails, that failure gets fed back into the next attempt, so the generator steers toward what this particular machine can actually deliver. There's even a nice wrinkle: because the setup configuration is handed to both the generator and the judge, the loop can reason about things that will exist but aren't on screen yet, like a file an upload step is about to drop, or a server about to launch in a terminal. So it isn't trapped by whatever happens to be visible at the start.
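
A minimal sketch of the propose, check, and retry loop Juniper describes, under loud assumptions: `propose` and `judge` are hypothetical stand-ins for the two passes of the same model, the candidate tasks and the `file:` / `app:` check strings are invented for illustration, and `machine_state` is a plain set that is assumed to already include anything the setup configuration will create. None of these names come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    goal: str
    checks: list[str]  # yes-or-no conditions the goal depends on


def propose(feedback: list[str]) -> Proposal:
    """Hypothetical generator stub: returns the first candidate task that
    does not rely on any check that already failed on this machine."""
    candidates = [
        Proposal("Summarize q3_report.xlsx", ["file:q3_report.xlsx", "app:spreadsheet"]),
        Proposal("Chart the totals in budget.xlsx", ["file:budget.xlsx", "app:spreadsheet"]),
    ]
    for candidate in candidates:
        if not any(check in feedback for check in candidate.checks):
            return candidate
    return candidates[-1]  # nothing avoids the failures; the judge will reject it


def judge(proposal: Proposal, machine_state: set[str]) -> list[str]:
    """Hypothetical judge stub: returns the checks that do NOT hold."""
    return [check for check in proposal.checks if check not in machine_state]


def generate_feasible_task(machine_state: set[str], max_attempts: int = 3):
    """Only return a task once every check passes; feed failures back."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        proposal = propose(feedback)
        failed = judge(proposal, machine_state)
        if not failed:
            return proposal
        feedback.extend(failed)  # steer the next proposal away from these
    return None  # give up rather than attempt an infeasible task


state = {"file:budget.xlsx", "app:spreadsheet"}
print(generate_feasible_task(state))
```

In this toy run the first proposal is rejected because the Q3 report is missing, and the failed check pushes the generator to the budget task, which passes. A real system would replace both stubs with model calls.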

5:36 Finn: That second-order bit is genuinely clever, but let's not skip the content problem, because it's the other half.

5:43 Juniper: Right, and it solves a completely different bottleneck. The hardest, most valuable tasks aren't limited by what the agent can do. They're limited by what's on the desktop. You can't ask an agent to find the total in a quarterly budget if the desktop only has empty spreadsheets. So they seed the machines with genuinely complex real-world documents. Over nine hundred real spreadsheets pulled from a collection of messy Excel-forum files, and we're talking tables with more than a hundred columns and over twenty thousand rows. Around ten thousand real presentations. These are not toy files.

6:20 Finn: And there's one detail in here I genuinely love. The presentations come from a repository where co-published files are grouped under a parent record, so instead of dropping in one file at a time, they upload whole clusters of related files together. Which means the desktop ends up realistically cluttered: multiple versions of a thing, several related documents sitting side by side. And that's what makes a whole class of task even meaningful: compare two versions, find the file by name. Those tasks only exist if the desktop looks like a real person's messy desktop. It's a small thing that quietly does a lot of work.

6:59 Juniper: That clutter pays off later, Finn, but the move that's really the heart of the paper comes next. In a typical setup, you'd use two models: a strong, smart planner that proposes the tasks, and a cheaper actor that carries them out. And that sounds reasonable until you see what it does to your data. Picture two project managers. The first only ever assigns the team work they can actually finish with the tools they have, so almost everything gets completed, and you end up with a clean record of successful projects to learn from. The second dreams big, hands over impossible deadlines, asks for things the team can't deliver, and the team fails most of the time, so the files fill up with botched attempts. A strong-planner, weak-actor setup is the second manager. The planner keeps proposing goals the actor can't reach, and you harvest failure after failure.

7:55 Finn: So the fix is to fire the second manager.

7:58 Juniper: The fix is to make it one person. One single model plays every role: it generates the goal, it judges feasibility, and it executes the task. And because the thing proposing the goal is the exact same thing that has to carry it out, it never proposes something beyond its own ability. The paper's phrasing is that this closes the planner-actor gap: the model never proposes goals beyond what it can carry out. The gap doesn't get tuned away or filtered out after the fact. It closes by construction. There's nothing left to mismatch.

8:35 Finn: And that's elegant, it really is. But notice what it quietly costs you, because I think it's the single most important caveat in the paper and I want to flag it now and come back to it. If the proposer and the executor are the same model, then the entire dataset is bounded by that one model's competence. The good manager who only assigns doable work also never stretches the team. Hold that thought.

9:02 Juniper: I will, because it matters. But there's one more design choice, and it's the one most likely to be misunderstood. Once you have a successful trajectory (a full run of, say, thirty steps), how many training examples do you think you get out of it?

9:19 Finn: One run, one example, presumably. That's the obvious assumption.

9:23 Juniper: That's the natural guess, and it's wrong in a useful way. They turn one trajectory of thirty steps into thirty separate training samples.

9:33 Finn: Wait, so that's data augmentation? You're just duplicating the same run thirty times with little tweaks to pad out the dataset?

9:41 Juniper: Not quite, and the difference is the whole point. Think of a single recorded chess game. That game isn't one lesson. It's a lesson at every move: given the board looks like this, and these moves came before, play that. A forty-move game gives you forty study cards, each capturing the exact position a player faces at that moment. Nothing is distorted or duplicated. Each card is a real, distinct decision point. That's what they're doing. Each of the thirty samples reproduces the precise screen and history the model would actually see at that step during a real run.

10:20 Finn: So the point is matching what inference actually looks like, more than padding the count.

10:27 Juniper: That's exactly it. And there's a subtle part that makes it work. They only keep the three most recent screenshots as actual images, and older steps get compressed into a short text summary. That same windowing gets reproduced exactly when they build the training samples, so the model trains on precisely the context layout it'll see when it's actually running. No mismatch between practice and the real match. So put it all together. They fine-tune that same model on the synthetic data: three point one million training samples, drawn from ninety-three thousand trajectories, across nearly twenty-five hundred different application combinations. And on OSWorld (which, to ground it, is a benchmark where the agent has to actually accomplish real tasks on a real computer and gets scored pass or fail on whether it succeeded) the model climbs steadily to forty-five percent. The base model was at twenty-six point three. The human-data version had collapsed to around ten. So that's almost nineteen points over baseline, and a thirty-five-point gap over the human demonstrations.
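
The per-step expansion with a three-screenshot window can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: `Step`, `Sample`, and `expand_trajectory` are hypothetical names, the window is assumed to include the current screen, and the "text summary" of older steps is faked by joining their action strings.

```python
from dataclasses import dataclass


@dataclass
class Step:
    screenshot: str  # placeholder for image data, e.g. a file path
    action: str      # the action taken at this screen


@dataclass
class Sample:
    history_summary: str       # text stand-in for steps older than the window
    recent_screens: list[str]  # up to `window` most recent screenshots
    target_action: str         # what the model should output at this step


def expand_trajectory(steps: list[Step], window: int = 3) -> list[Sample]:
    """Turn one trajectory into one training sample per step.

    Each sample mirrors the context layout assumed at inference time:
    the last `window` screenshots as images, everything older as text.
    """
    samples = []
    for i, step in enumerate(steps):
        start = max(0, i - window + 1)
        older, visible = steps[:start], steps[start : i + 1]
        summary = "; ".join(s.action for s in older)  # assumed summary format
        samples.append(
            Sample(summary, [s.screenshot for s in visible], step.action)
        )
    return samples


trajectory = [Step(f"screen_{i}.png", f"action_{i}") for i in range(30)]
samples = expand_trajectory(trajectory)
print(len(samples), samples[-1].recent_screens)
```

A thirty-step run yields thirty samples, and each one is a distinct decision point rather than a perturbed copy, which is the chess-card distinction drawn above.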

11:36 Finn: And the picture of that is striking: two training curves starting from the exact same point, diverging immediately. One climbs and keeps climbing. The other sinks below where it started and just flattens out down there. Same model, same starting line, opposite directions. There's also a tell in how the two datasets differ that I find genuinely persuasive. The human demonstrations are about sixty-three percent mouse clicks. The synthetic trajectories are only around forty-one percent clicks, and they shift toward keyboard shortcuts and typing. And the authors' argument for why that matters is a nice intuition: keyboard actions are far less brittle than pixel-accurate clicks. If a button moves three pixels, your click misses. A keyboard shortcut works regardless of layout. So the synthetic data isn't just harder. It's teaching a more robust style of interaction. Now I want to spend a minute on the result I think is the quietest, most interesting thing in the whole paper. They ran an experiment on diversity: given a fixed budget, how should you select which trajectories to train on? The obvious answer is "make it as varied as possible." So they tried balancing by action type, making sure you've got a good mix of clicks and keypresses and scrolls. And they tried balancing by which combination of applications a task used. And the surprise is this: balancing by action type, at around twenty-five percent, actually did worse than doing no diversity-aware selection at all, which sat at twenty-seven.

13:10 Juniper: So being deliberately diverse hurt you.

13:14 Finn: Some kinds of diversity hurt. The only strategy that beat the baseline was round-robin by application combination, making sure you cover the different sets of apps that show up together. That one hit almost thirty-one percent. And the lesson is sharp: which apps you combine matters far more than balancing the mix of action types. It's like training for a specific competition: obsessively rotating your equipment types matters less than practicing the actual event pairings you'll face.
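
One plausible reading of "round-robin by application combination" under a fixed budget is sketched below. The function name, the `(trajectory_id, apps)` input shape, and the app names are all hypothetical, and the paper's actual selection procedure may differ in its details.

```python
from collections import defaultdict, deque


def round_robin_by_app_combo(trajectories, budget):
    """Pick up to `budget` trajectory ids, cycling across app combinations.

    `trajectories` is a list of (trajectory_id, apps) pairs, where `apps`
    is any iterable of app names. Order within a combination is kept.
    """
    # Group trajectories by the unordered set of apps they touch.
    buckets = defaultdict(deque)
    for traj_id, apps in trajectories:
        buckets[frozenset(apps)].append(traj_id)

    selected = []
    queues = deque(buckets.values())
    while queues and len(selected) < budget:
        bucket = queues.popleft()
        selected.append(bucket.popleft())  # take one from this combination
        if bucket:
            queues.append(bucket)          # revisit it after the others
    return selected


pool = [
    ("a1", ["spreadsheet"]),
    ("a2", ["spreadsheet"]),
    ("a3", ["spreadsheet"]),
    ("b1", ["spreadsheet", "slides"]),
    ("c1", ["files", "docs"]),
]
print(round_robin_by_app_combo(pool, 4))
```

With a budget of four, the rare combinations each get a slot before the common single-app bucket gets its second one, so coverage of app pairings is favored over raw frequency.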

13:45 Juniper: And that connects to something they found when they mapped out what the agent was actually doing on each task, drawing every trajectory as a graph of screen-to-screen transitions. The headline there is that complexity isn't about how many apps you touch. It's about the pattern of cross-referencing between them. They've got one vivid example: an invoice-extraction task that bounces between the file manager, a document viewer, and a spreadsheet, with ten cycles, thirteen backtracks, nine app switches, the most tangled trajectory in the whole set. Two tasks can both use four apps; one is a clean straight line, the other is that mess. The mess is the hard one.
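
To make the "same apps, different tangle" point concrete, here is a small sketch that counts structure in a screen sequence. The definitions are assumptions chosen for illustration and are not the paper's metrics: an app switch is any step whose app differs from the previous one, a revisit is any step that lands on an already-seen screen, and a backtrack is a step that returns to the screen from two steps earlier. The `(app, screen)` tuples are made up.

```python
def trajectory_stats(screens):
    """Count simple structural features of a list of (app, screen) pairs."""
    seen = set()
    stats = {"app_switches": 0, "revisits": 0, "backtracks": 0}
    for i, screen in enumerate(screens):
        if i > 0 and screen[0] != screens[i - 1][0]:
            stats["app_switches"] += 1  # moved to a different app
        if screen in seen:
            stats["revisits"] += 1      # landed on a screen seen before
        if i >= 2 and screen == screens[i - 2]:
            stats["backtracks"] += 1    # immediately undid the last move
        seen.add(screen)
    return stats


straight = [("files", "a"), ("viewer", "b"), ("sheet", "c"), ("slides", "d")]
tangled = [
    ("files", "home"), ("viewer", "invoice1"), ("files", "home"),
    ("sheet", "ledger"), ("files", "home"), ("viewer", "invoice2"),
    ("sheet", "ledger"),
]
print(trajectory_stats(straight))
print(trajectory_stats(tangled))
```

The straight-line task touches four apps with no revisits, while the tangled one touches only three apps yet switches twice as often and keeps returning to the same screens.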

14:28 Finn: Which is the dinner-party point, really. Cooking four dishes one after another is four easy steps. Timing a four-course meal so everything lands together, running between stations, re-checking: same number of dishes, completely different difficulty. So let me come back to the caveat I flagged, because I think it reframes the whole headline. The entire pipeline is driven by one strong model: proposing, judging, and executing. Which means what's really happening here is distillation: you're transferring competence from a strong model into a smaller one. The pipeline can't teach the student to do anything the teacher couldn't already do. And the paper doesn't disentangle how much of the forty-five percent comes from the clever pipeline versus simply from having a more capable teacher in the loop. So "synthetic beats human" is true, but the more precise claim is "a well-orchestrated strong teacher beats a pile of casual human recordings." Which is a different, and frankly more defensible, statement.

15:31 Juniper: I think that's fair, Finn, and the project-manager analogy actually predicts it: the manager who only assigns doable work also caps how far the team can ever grow. Though I'd push back gently: even if it's distillation, the contribution is the orchestration. Knowing that you should collapse the roles, verify feasibility, seed real content: that recipe is the portable insight, and it's open. Anyone with compute can run it.

15:58 Finn: Agreed on the recipe, Juniper. My bigger worry is the evaluation. Everything here is OSWorld. And the synthetic data is partly seeded with OSWorld's own configurations and deliberately tuned to match its app distribution, the same apps the test loves. So a skeptic has to ask: how much of the gain is genuine capability, and how much is just fitting the shape of the one benchmark you're measured on? There's no second benchmark showing it generalizes anywhere else.

16:28 Juniper: That's the one I'd most want them to answer too. Although, to be fair to the design, matching the distribution of the tasks you actually care about isn't automatically cheating. That's arguably just sensible training. The line between "training for the event" and "overfitting to the event" is genuinely blurry here.

16:48 Finn: Two more, quickly. The headline comparison isn't perfectly clean: the synthetic run trained for around forty-eight hundred steps to reach its peak, while the human-data version plateaus after roughly seven-fifty. The datasets are wildly different sizes, so "one epoch" means very different amounts of training. The fairest read is that the synthetic run recovers baseline at comparable step counts and then keeps climbing. Still a real win, but a more nuanced story than "forty-five versus ten." And the feasibility verifier, the mise-en-place step they emphasize so heavily, was only run in full on part of the released data. The rest used a single-shot version with no judge at all. So there's no clean ablation turning the verifier on and off to isolate how much it actually contributed.

17:39 Juniper: That last one bugs me too, honestly. The verification loop is the most novel idea in the paper, and it's the one we have the least clean evidence for. What keeps this from being a pure benchmark artifact, though, is that a subset of this data went into the training mix for one of NVIDIA's actually-shipped models: Nemotron 3 Nano Omni. So it cleared the bar of being useful enough to put into a real product, not just a number on a leaderboard. And that's the shift I'll take away. The assumed path to better agents was always "pay more humans to demonstrate." This says the bottleneck isn't access to annotators. It's the design of the loop that generates the data. And the planner-actor idea travels: anytime you have a strong model generating tasks for a weaker one to learn from, you systematically overshoot and harvest failures. Collapsing the roles is a structural fix, not a tuning trick.

18:35 Finn: I buy the structural insight completely. I'm just still not convinced the forty-five percent tells us what it looks like it tells us. Until someone runs this against a benchmark it wasn't seeded from, I read it as a strong result about distillation and careful orchestration, not proof that synthetic has dethroned human data in general. Real jump, genuinely clever, open question on how far it actually reaches.

19:00 Juniper: And worth saying plainly: forty-five percent means succeeding at fewer than half the tasks. This is a big step in an early field, not a solved problem. Which is maybe the honest place to leave it: the most obvious way to build these agents was making them worse, and the fix was to stop recording humans and start orchestrating a model carefully enough that it only ever teaches itself things it can actually do. Whether that scales past one benchmark is the next paper.

19:29 Finn: The paper and a few related reads are in the show notes if you want to pull on this thread yourself.

19:35 Juniper: And if you want the full transcript with every term defined inline, plus the links over to other episodes that touch these same ideas, that's all on paperdive.ai. Thanks for listening to AI Papers: A Deep Dive.