Episode 156 · Jun 18, 2026 · 20 min

Why More Human Demonstrations Made a Computer-Use Agent Worse

Jung, Lu, Cui et al.

Computer-use Agents

Concepts in this episode

Training Methods · Agentic AI · Evaluation & Benchmarks · Computer-Use Agents · Synthetic Data · Supervised Fine-Tuning · Knowledge Distillation · Agentic Workflows · Trajectory Quality · Agent Benchmarks · Task Decomposition · Rollout Sampling · Multimodal Models

About this episode

Paper: ProCUA-SFT Technical Report
Venue: arXiv:2606.17321
Year: 2026
Read the paper: arxiv.org/abs/2606.17321

An NVIDIA team fed their computer-use agent the largest pile of real human demonstrations ever released — and watched its success rate fall from one task in four to one in ten. Then they threw the human data out entirely, let a single model generate its own training set, and nearly doubled the baseline. This episode digs into why the obvious fix backfired, and what the more defensible version of "synthetic beats human" actually is.

What you'll take away

Why 22,500 real human demonstrations made the model substantially worse — too-easy single-app tasks, annotation noise, and negative transfer away from the cross-application reasoning the benchmark demands
The structural fix at the heart of the paper: collapsing the planner and actor into one model so it never proposes goals it can't carry out, closing the capability gap by construction rather than by filtering
How a 'mise en place' precondition-verification step stops the model from inventing tasks involving files and apps that don't exist — and why hallucinated tasks breed a hallucinating agent
The counterintuitive diversity result: balancing training data by action type actively hurt, while balancing by application combination was the only strategy that beat the baseline
Why the synthetic data teaches a more robust interaction style (more keyboard shortcuts, fewer brittle pixel-perfect clicks)
The case for skepticism: the climb to 45% (about 19 points over the 26.3% baseline) is really distillation from a strong teacher, everything is measured on OSWorld using data partly seeded from OSWorld's own configs, and the most novel idea, the verifier, has the least clean evidence behind it

Chapters

00:00 The collapse: gold-standard human data poisons the model
02:11 Why real human data caused negative transfer
04:23 Infeasible tasks and the mise-en-place fix
06:35 Seeding realistic, cluttered desktops
08:47 Collapsing planner and actor into one model
10:59 Turning one trajectory into many training samples
13:11 The results and what diversity actually helps
15:23 Complexity, robustness, and the keyboard shift
17:35 The caveats: distillation, benchmark fit, and weak evidence for the verifier

References in this episode

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — The exact benchmark this episode's results live and die on — essential for asses
Distilling the Knowledge in a Neural Network — The foundational distillation paper behind Finn's central reframing that 'synthe
STaR: Bootstrapping Reasoning With Reasoning — A canonical example of a model generating its own training data and learning onl

Full transcript

0:00Juniper: An NVIDIA team takes one of the best open-source computer-use agents available — a model that already knows how to look at a desktop and click around — and they do the most obvious thing you can do to make it better. They fine-tune it on the largest pile of human demonstrations anyone has ever released: twenty-two thousand five hundred recordings of real people doing real tasks on real computers, across three operating systems. Hours of human expertise. And the model gets worse. Not slightly worse — it falls off a cliff. On the benchmark they care about, it goes from succeeding at about one task in four down to maybe one in ten.

0:40Finn: Worse than where they started — before any of the human data ever touched it.

0:46Juniper: Substantially worse. They poured in the gold-standard data and it actively damaged the model. That collapse is the puzzle at the center of the "ProCUA-SFT Technical Report," which went up on arXiv on June fifteenth, twenty-twenty-six, and we're recording three days later. Quick note before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing — I'm Juniper, and that was Finn — are both AI voices from Eleven Labs, with a producer who isn't affiliated with either company. And the reason that collapse is worth a whole episode is what the team did next: they threw out the human data entirely, generated their own training set with no people in the loop at all, and pushed the same model up to forty-five percent.

1:36Finn: Forty-five, from a baseline around twenty-six. So the synthetic data didn't just undo the damage — it nearly doubled where they started. I want to come back to whether that number is as clean as it sounds, because there's a framing question buried in it. But the collapse first — Juniper, why does real human data poison the model?

1:57Juniper: Three reasons, and they compound. The first is that the human tasks are too easy. Most of those demonstrations live inside a single application — open one app, do one thing, done — with a median length around seventeen steps. The benchmark they test on is full of tasks that bounce across applications: pull a number out of a spreadsheet, look something up, drop it into a slide. So the human data never trains the cross-application reasoning the test actually demands. Second, it's crowd-sourced, which means annotation noise — people fumbling, taking weird paths, mislabeling what they did. And third, taken together, you've got a model that already had real skill, and you spend thousands of steps teaching it to imitate easy, slightly sloppy, single-app behavior. The analogy I keep coming back to is a tennis player who's already pretty good, and then spends months watching thousands of recordings of casual weekend players. The footage is real — real people playing real tennis — but it's easy rallies and loose form, and none of it looks like a competitive match. You come out the other side having absorbed habits that don't transfer. More practice footage, worse player. That's negative transfer, and it's the thing that makes "more real data made it worse" feel natural instead of paradoxical.

3:23Finn: Right — and once you've diagnosed it that way, the instinct is "fine, generate harder data, more cross-app stuff, cleaner trajectories." But the moment you ask a model to invent its own computer tasks, you walk straight into the nastiest failure mode in synthetic data.

3:40Juniper: The model invents tasks that can't actually be done.

3:45Finn: Exactly. It writes a goal like "open the Q3 report sitting on the desktop" — and there is no Q3 report. The file doesn't exist, or the app it needs isn't installed. And the paper has a sharp line about why that's not just wasted effort. Infeasible tasks, they say, waste rollout budget and — worse — teach the model to invent state. You're not just burning compute on something impossible. You're training the agent to act as if files and apps exist when they don't. A hallucinated task breeds a hallucinating agent.

4:18Juniper: And that's the first real design move in the pipeline — the fix for exactly that. Think about a careful cook doing mise en place. Before they commit to a recipe, they lay every ingredient out on the counter and confirm it's actually there — butter, eggs, flour, yes, yes, yes — so they never get halfway in and discover there's no butter. The pipeline does the same thing. Before it attempts any task, the model has to commit to a goal plus a little checklist of yes-or-no preconditions: does this spreadsheet exist on the desktop? Is LibreOffice Calc installed? Then a separate pass — same model, different job — goes and checks each item against the actual state of the machine. Only if every box is ticked does the task get attempted. And when a check fails, that failure gets fed back into the next attempt, so the generator steers toward what this particular machine can actually deliver. There's even a nice wrinkle: because the setup configuration is handed to both the generator and the judge, the loop can reason about things that will exist but aren't on screen yet — a file an upload step is about to drop, a server about to launch in a terminal. So it isn't trapped by whatever happens to be visible at the start.
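
The propose, check, and retry loop Juniper describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Proposal`, `propose_feasible_task`, `toy_propose`, and `toy_check` are hypothetical, and the model calls are replaced by plain Python stubs so the control flow is visible on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    goal: str
    preconditions: list[str]  # yes/no statements about the machine's state


def propose_feasible_task(
    propose: Callable[[list[str]], Proposal],
    check: Callable[[str], bool],
    max_attempts: int = 3,  # assumed retry budget, not a figure from the paper
) -> Optional[Proposal]:
    """Return the first proposal whose preconditions all hold, else None."""
    feedback: list[str] = []  # failed checks, fed back into the next attempt
    for _ in range(max_attempts):
        proposal = propose(feedback)
        failed = [p for p in proposal.preconditions if not check(p)]
        if not failed:
            return proposal  # every box is ticked, so the task may be attempted
        feedback.extend(failed)
    return None


# Toy stand-ins for the generator and the judge (both hypothetical).
machine_state = {
    "budget.xlsx exists on the desktop",
    "LibreOffice Calc is installed",
}


def toy_propose(feedback: list[str]) -> Proposal:
    if not feedback:
        # First attempt invents a file that is not there.
        return Proposal("Open the Q3 report", ["q3_report.xlsx exists on the desktop"])
    # After a failed check, steer toward what the machine actually has.
    return Proposal(
        "Sum column B in budget.xlsx",
        ["budget.xlsx exists on the desktop", "LibreOffice Calc is installed"],
    )


def toy_check(precondition: str) -> bool:
    return precondition in machine_state


task = propose_feasible_task(toy_propose, toy_check)
print(task.goal if task else "no feasible task found")
```

In the pipeline as described, both roles are played by the same vision-language model and the check is made against the real machine state plus the setup configuration. The stubs here only stand in for those calls.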

5:36Finn: That second-order bit is genuinely clever, but let's not skip the content problem, because it's the other half.

5:43Juniper: Right — and it solves a completely different bottleneck. The hardest, most valuable tasks aren't limited by what the agent can do. They're limited by what's on the desktop. You can't ask an agent to find the total in a quarterly budget if the desktop only has empty spreadsheets. So they seed the machines with genuinely complex real-world documents. Over nine hundred real spreadsheets pulled from a collection of messy Excel-forum files — we're talking tables with more than a hundred columns and over twenty thousand rows. Around ten thousand real presentations. These are not toy files.

6:20Finn: And there's one detail in here I genuinely love. The presentations come from a repository where co-published files are grouped under a parent record — so instead of dropping in one file at a time, they upload whole clusters of related files together. Which means the desktop ends up realistically cluttered: multiple versions of a thing, several related documents sitting side by side. And that's what makes a whole class of task even meaningful — compare two versions, find the file by name. Those tasks only exist if the desktop looks like a real person's messy desktop. It's a small thing that quietly does a lot of work.

6:59Juniper: That clutter pays off later, Finn — but the move that's really the heart of the paper comes next. In a typical setup, you'd use two models: a strong, smart planner that proposes the tasks, and a cheaper actor that carries them out. And that sounds reasonable until you see what it does to your data. Two project managers. The first only ever assigns the team work they can actually finish with the tools they have, so almost everything gets completed, and you end up with a clean record of successful projects to learn from. The second dreams big — hands over impossible deadlines, asks for things the team can't deliver — and the team fails most of the time, so the files fill up with botched attempts. A strong-planner, weak-actor setup is the second manager. The planner keeps proposing goals the actor can't reach, and you harvest failure after failure.

7:55Finn: So the fix is to fire the second manager.

7:58Juniper: The fix is to make it one person. One single vision-language model plays every role — it generates the goal, it judges feasibility, and it executes the task. And because the thing proposing the goal is the exact same thing that has to carry it out, it never proposes something beyond its own ability. The paper's phrasing is that this closes the planner-actor capability gap: the model never proposes goals beyond what it can carry out. The gap doesn't get tuned away or filtered out after the fact. It closes by construction. There's nothing left to mismatch.

8:35Finn: And that's elegant — it really is. But notice what it quietly costs you, because I think it's the single most important caveat in the paper and I want to flag it now and come back to it. If the proposer and the executor are the same model, then the entire dataset is bounded by that one model's competence. The good manager who only assigns doable work also never stretches the team. Hold that thought.

9:02Juniper: I will, because it matters. But there's one more design choice, and it's the one most likely to be misunderstood. Once you have a successful trajectory — a full run of, say, thirty steps — how many training examples do you think you get out of it?

9:19Finn: One run, one example, presumably. That's the trajectory.

9:23Juniper: That's the natural guess, and it's wrong in a useful way. They turn one trajectory of thirty steps into thirty separate training samples.

9:33Finn: Wait — so that's data augmentation? You're just duplicating the same run thirty times with little tweaks to pad out the dataset?

9:41Juniper: Not quite — and the difference is the whole point. Think of a single recorded chess game. That game isn't one lesson. It's a lesson at every move: given the board looks like this, and these moves came before, play that. A forty-move game gives you forty study cards, each capturing the exact position a player faces at that moment. Nothing is distorted or duplicated — each card is a real, distinct decision point. That's what they're doing. Each of the thirty samples reproduces the precise screen and history the agent would actually see at that step during a real run.

10:20Finn: So the point is matching what inference actually looks like, more than padding the count.

10:27Juniper: That's exactly it. And there's a subtle part that makes it work. They only keep the three most recent screenshots as actual images, and older steps get compressed into a short text summary. That same windowing gets reproduced exactly when they build the training samples — so the model trains on precisely the context layout it'll see when it's actually running. No mismatch between practice and the real match. So put it all together. They fine-tune that same model on the synthetic data — three point one million training samples, drawn from ninety-three thousand trajectories, across nearly twenty-five hundred different application combinations. And on OSWorld — which, to ground it, is a benchmark where the agent has to actually accomplish real tasks on a real computer and gets scored pass or fail on whether it succeeded — the model climbs steadily to forty-five percent. The base model was at twenty-six point three. The human-data version had collapsed to around ten. So that's almost nineteen points over baseline, and a thirty-five-point gap over the human demonstrations.
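
A rough sketch of the per-step expansion with the three-screenshot window might look like this. The `Step` and `Sample` types and the function name are hypothetical. Whether the current screenshot counts as one of the three, and what exactly the text summary contains, are assumptions made here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Step:
    screenshot: str  # identifier of the screen the agent saw before acting
    action: str      # action the agent took at this step
    summary: str     # short text description of the step


@dataclass
class Sample:
    history_text: list[str]  # older steps, kept only as text
    images: list[str]        # most recent screenshots, kept as images
    target_action: str       # the action to predict


def trajectory_to_samples(steps: list[Step], image_window: int = 3) -> list[Sample]:
    """Turn one trajectory into one training sample per step."""
    samples = []
    for t, step in enumerate(steps):
        # Assumption: the window covers the current screenshot plus the
        # most recent earlier ones, for `image_window` images in total.
        first_image = max(0, t + 1 - image_window)
        images = [s.screenshot for s in steps[first_image : t + 1]]
        # Everything older than the window survives only as text.
        history_text = [s.summary for s in steps[:first_image]]
        samples.append(Sample(history_text, images, step.action))
    return samples


steps = [Step(f"shot_{i}.png", f"action_{i}", f"summary of step {i}") for i in range(30)]
samples = trajectory_to_samples(steps)
print(len(samples))        # one sample per step
print(samples[-1].images)  # the last three screenshots
```

Each sample is a distinct decision point with the context the agent would have at that step, which is why a thirty-step run yields thirty samples rather than thirty copies. The reported totals are in the same range: 3.1 million samples over 93,000 trajectories works out to roughly 33 samples per trajectory.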

11:36Finn: And the picture of that is striking — two training curves starting from the exact same point, diverging immediately. One climbs and keeps climbing. The other sinks below where it started and just flattens out down there. Same model, same starting line, opposite directions. There's also a tell in how the two datasets differ that I find genuinely persuasive. The human demonstrations are about sixty-three percent mouse clicks. The synthetic trajectories are only around forty-one percent clicks — they shift toward keyboard shortcuts and typing. And the authors' argument for why that matters is a nice intuition: keyboard actions are far less brittle than pixel-accurate clicks. If a button moves three pixels, your click misses. A keyboard shortcut works regardless of layout. So the synthetic data isn't just harder — it's teaching a more robust style of interaction. Now I want to spend a minute on the result I think is the quietest, most interesting thing in the whole paper. They ran an experiment on diversity — given a fixed budget, how should you select which trajectories to train on? The obvious answer is "make it as varied as possible." So they tried balancing by action type — make sure you've got a good mix of clicks and keypresses and scrolls. And they tried balancing by which combination of applications a task used. And the surprise is this: balancing by action type, at around twenty-five percent, actually did worse than doing no diversity-aware selection at all, which sat at twenty-seven.

13:10Juniper: So being deliberately diverse hurt you.

13:14Finn: Some kinds of diversity hurt. The only strategy that beat the baseline was round-robin by application combination — making sure you cover the different sets of apps that show up together. That one hit almost thirty-one percent. And the lesson is sharp: which apps you combine matters far more than balancing the mix of action types. It's like training for a specific competition — obsessively rotating your equipment types matters less than practicing the actual event pairings you'll face.
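
One plausible reading of "round-robin by application combination" is sketched below: group trajectories by the set of apps they touch, then take one from each group in turn until the budget is spent. The function name, the dictionary layout, and the tie-breaking order are assumptions, not details from the paper.

```python
from collections import defaultdict, deque


def round_robin_by_app_combo(trajectories, budget):
    """Pick up to `budget` trajectories, cycling through app combinations."""
    buckets = defaultdict(deque)
    for traj in trajectories:
        # The order in which apps were used is ignored; only the set matters.
        buckets[frozenset(traj["apps"])].append(traj)

    queues = deque(buckets.values())  # combinations in first-seen order
    selected = []
    while queues and len(selected) < budget:
        queue = queues.popleft()
        selected.append(queue.popleft())
        if queue:  # combinations with trajectories left rejoin the rotation
            queues.append(queue)
    return selected


trajectories = (
    [{"id": f"calc-{i}", "apps": ["calc"]} for i in range(5)]
    + [{"id": f"calc+impress-{i}", "apps": ["calc", "impress"]} for i in range(2)]
    + [{"id": "chrome+calc-0", "apps": ["chrome", "calc"]}]
)
picked = [t["id"] for t in round_robin_by_app_combo(trajectories, budget=4)]
print(picked)
```

With a budget of four, the rare combinations each get a slot before the common single-app group gets its second one, which is the coverage effect Finn is describing.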

13:45Juniper: And that connects to something they found when they mapped out what the agent was actually doing on each task — drawing every trajectory as a graph of screen-to-screen transitions. The headline there is that complexity isn't about how many apps you touch. It's about the pattern of cross-referencing between them. They've got one vivid example: an invoice-extraction task that bounces between the file manager, a document viewer, and a spreadsheet — ten cycles, thirteen backtracks, nine app switches, the most tangled trajectory in the whole set. Two tasks can both use four apps; one is a clean straight line, the other is that mess. The mess is the hard one.
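
To make the "same apps, different shape" point concrete, here is a small sketch that scores a trajectory by its sequence of active apps. The definitions are simplified assumptions and not the paper's metrics: an app switch is any change of active app between consecutive steps, and a backtrack is an immediate return to the app that was just left (A, B, A). The function name `trajectory_shape` is hypothetical.

```python
def trajectory_shape(apps):
    """Summarize how tangled a sequence of per-step active apps is."""
    # Count every change of active app between consecutive steps.
    app_switches = sum(1 for a, b in zip(apps, apps[1:]) if a != b)

    # Collapse runs of the same app so each entry is one visit.
    visits = [a for i, a in enumerate(apps) if i == 0 or a != apps[i - 1]]

    # Assumed definition: a backtrack is a visit pattern A, B, A.
    backtracks = sum(1 for i in range(2, len(visits)) if visits[i] == visits[i - 2])

    return {
        "distinct_apps": len(set(apps)),
        "app_switches": app_switches,
        "backtracks": backtracks,
    }


straight = ["files", "files", "viewer", "viewer", "calc", "calc"]
tangled = ["files", "viewer", "calc", "viewer", "calc", "files", "viewer", "calc"]

print(trajectory_shape(straight))
print(trajectory_shape(tangled))
```

Both toy runs touch the same three apps, but the second one switches far more often and doubles back on itself, which is the kind of difference the episode attributes to the hard tasks.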

14:28Finn: Which is the dinner-party point, really. Cooking four dishes one after another is four easy steps. Timing a four-course meal so everything lands together, running between stations, re-checking — same number of dishes, completely different difficulty. So let me come back to the caveat I flagged, because I think it reframes the whole headline. The entire pipeline is driven by one strong vision-language model — proposing, judging, and executing. Which means what's really happening here is distillation: you're transferring competence from a strong model into a smaller one. The pipeline can't teach the student to do anything the teacher couldn't already do. And the paper doesn't disentangle how much of the forty-five percent comes from the clever pipeline versus simply from having a more capable teacher in the loop. So "synthetic beats human" is true — but the more precise claim is "a well-orchestrated strong teacher beats a pile of casual human recordings." Which is a different, and frankly more defensible, statement.

15:31Juniper: I think that's fair, Finn, and the project-manager analogy actually predicts it — the manager who only assigns doable work also caps how far the team can ever grow. Though I'd push back gently: even if it's distillation, the contribution is the orchestration. Knowing that you should collapse the roles, verify feasibility, seed real content — that recipe is the portable insight, and it's open. Anyone with compute can run it.

15:58Finn: Agreed on the recipe, Juniper. My bigger worry is the evaluation. Everything here is OSWorld. And the synthetic data is partly seeded with OSWorld's own configurations and deliberately tuned to match its app distribution — LibreOffice-heavy, the same apps the test loves. So a skeptic has to ask: how much of the gain is genuine capability, and how much is just fitting the shape of the one benchmark you're measured on? There's no second benchmark showing it generalizes anywhere else.

16:28Juniper: That's the one I'd most want them to answer too. Although — to steelman the design — matching the distribution of the tasks you actually care about isn't automatically cheating. That's arguably just sensible training. The line between "training for the event" and "overfitting to the event" is genuinely blurry here.

16:48Finn: Two more, quickly. The headline comparison isn't perfectly clean — the synthetic run trained for around forty-eight hundred steps to reach its peak, while the human-data version plateaus after roughly seven-fifty. The datasets are wildly different sizes, so "one epoch" means very different amounts of training. The fairest read is that the synthetic data recovers baseline at comparable step counts and then keeps climbing — still a real win, but a more nuanced story than "forty-five versus ten." And the precondition verifier — the mise-en-place step they emphasize so heavily — was only run in full on part of the released data. The rest used a single-shot version with no judge at all. So there's no clean ablation turning the verifier on and off to isolate how much it actually contributed.

17:39Juniper: That last one bugs me too, honestly. The verification loop is the most novel idea in the paper, and it's the one we have the least clean evidence for. What keeps this from being a pure benchmark artifact, though, is that a subset of this data went into the training mix for one of NVIDIA's actually-shipped models — Nemotron 3 Nano Omni. So it cleared the bar of being useful enough to put into a real product, not just a number on a leaderboard. And that's the shift I'll take away. The assumed path to better computer-use agents was always "pay more humans to demonstrate." This says the bottleneck isn't access to annotators — it's the design of the loop that generates the data. And the planner-actor idea travels: anytime you have a strong model generating tasks for a weaker one to learn from, you systematically overshoot and harvest failures. Collapsing the roles is a structural fix, not a tuning trick.

18:35Finn: I buy the structural insight completely. I'm just still not convinced the forty-five percent tells us what it looks like it tells us. Until someone runs this against a benchmark it wasn't seeded from, I read it as a strong result about distillation and careful orchestration, not proof that synthetic has dethroned human data in general. Real jump, genuinely clever pipeline — open question on how far it actually reaches.

19:00Juniper: And worth saying plainly — forty-five percent means succeeding at fewer than half the tasks. This is a big step in an early field, not a solved problem. Which is maybe the honest place to leave it: the most obvious way to build these agents was making them worse, and the fix was to stop recording humans and start orchestrating a model carefully enough that it only ever teaches itself things it can actually do. Whether that scales past one benchmark is the next paper.

19:29Finn: The paper and a few related reads are in the show notes if you want to pull on this thread yourself.

19:35Juniper: And if you want the full transcript with every term defined inline, plus the links over to other episodes that touch these same ideas, that's all on paperdive.ai. Thanks for listening to AI Papers: A Deep Dive.
