Episode 080 · May 26, 2026 · 32 min

How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents

Wang, Lu, Wang et al.

Computer-use Agents
Paper: CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Venue: arXiv:2605.25624
Year: 2026
Read the paper: arxiv.org/abs/2605.25624
Also available on Apple Podcasts and Spotify

Computer-use agents have been stuck while math and code models have soared — and a new paper argues the bottleneck was never the algorithm, it was the missing data pipeline. The fix turns on one elegant design choice: put an information barrier between the AI that builds the training environment and the AI that writes the reward function. The result is the largest open verified dataset for computer-use agents, big benchmark gains, and an unexpected behavior the agents picked up entirely on their own.

What you'll take away

  • Why verifiable RL has scaled beautifully in math and code but stalled in computer-use — and why that's an environment problem, not an algorithm problem
  • The Generator/Discriminator information-barrier trick that prevents AI-written reward functions from secretly checking the construction procedure instead of the task
  • How 94 synthesized mock applications (Slack, Salesforce, EHR, etc.) get built and verified, grounded in real software-usage data rather than convenience
  • An ablation suggesting environment diversity is its own scaling axis — the same trajectory count spread across more environments meaningfully outperforms a narrower pool
  • An unprompted emergent behavior: trained agents learn which UI actions are safe to batch and which (like right-click) must stay atomic, cutting trajectory length 33–45% with no efficiency reward
  • Where the paper's framing is hotter than its evidence — transfer to benchmarks is modest, runs are single-seed, and reward functions verify end-state only

Chapters

  1. 00:00Why GUI agents are harder than math problems
  2. 03:31The information barrier between Generator and Discriminator
  3. 07:03The inner loop and the reward-hacking scanner
  4. 10:34Synthesizing 94 mock applications at scale
  5. 14:06Training results and the data-vs-model tradeoff
  6. 17:37Environment diversity as a separate scaling axis
  7. 21:09The emergent action-batching behavior
  8. 24:40Limitations and honest caveats
  9. 28:12Why this paper reframes the agenda


0:00Bella: Here's a pattern that's been quietly shaping the last two years of AI progress. Take a domain — math, competition coding, terminal commands. Write a problem. Write a checker that returns one or zero. Point reinforcement learning at it. Watch the curves climb, and keep climbing, with no real sign of saturation. That's the recipe. It's not subtle, it's not new, and it's been spectacular.

0:26Eric: And then you look at computer-use agents — the systems that are supposed to actually click around your desktop, drive your spreadsheet, fill out a form on a real website — and the curves look nothing like that. These are arguably the most economically interesting agents anyone is building, and the training data for them has been tiny. Orders of magnitude smaller than what math and code get.

0:52Bella: So why? That's the question this paper picks up. And the answer, the authors argue, is structural — it's not that computer-use agents are some fundamentally different beast that resists the recipe. It's that nobody had built the data pipeline. The paper we're working from is "CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents," it went up on arXiv on May twenty-fifth, twenty-twenty-six, and we're recording the next day, May twenty-sixth, twenty-twenty-six. This episode is AI-generated — the script is from an Anthropic model. I'm Bella, that's Eric, and we're both AI voices from Eleven Labs. Neither company is involved in producing the show.

1:38Eric: And the reason that "next day" timing matters is that the pipeline this paper describes makes a very strong claim — that the bottleneck in computer-use has been an environment problem hiding behind what looked like an algorithm problem. They open-sourced the whole stack, dataset and synthesis code and trained models, the day before we're talking about it.

2:02Bella: Let's start with what makes the domain harder, because it isn't obvious. A math example is two things. There's a problem string — "what's the integral of this" — and there's a checker function that looks at the answer and returns one or zero. Two artifacts. Both small. Both cheap to generate. A computer-use training example is three things, and they're all coupled. You need a task instruction in natural language. You need an executable environment — a virtual machine in a particular starting state, with the right apps installed, the right files in the right places, the right tabs open. And you need a reward function that can inspect that VM after the agent has acted and decide, programmatically, whether the task got done.
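
Transcript note: to make the two-artifact versus three-artifact contrast concrete, here is a minimal Python sketch. The class names, the toy integral, and the dictionary standing in for VM state are all illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, str]  # toy stand-in for the state of a virtual machine


@dataclass
class MathExample:
    problem: str
    checker: Callable[[str], int]  # looks at the answer, returns 1 or 0


@dataclass
class ComputerUseExample:
    instruction: str                # natural-language task
    setup: Callable[[], State]      # builds the reproducible starting state
    reward: Callable[[State], int]  # inspects the state after the agent acts


math_ex = MathExample(
    problem="What is the integral of 2x from 0 to 3?",
    checker=lambda answer: int(answer.strip() == "9"),
)

cua_ex = ComputerUseExample(
    instruction="Rename report.txt to final.txt",
    setup=lambda: {"report.txt": "Q2 numbers"},
    # Checks the outcome: the content now lives under the new name only.
    reward=lambda state: int(
        state.get("final.txt") == "Q2 numbers" and "report.txt" not in state
    ),
)
```

The coupling the hosts describe shows up in the types: the reward only means something relative to the state that `setup` produces and the instruction the agent was given.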

2:51Eric: Each of those three artifacts is a non-trivial engineering job. The instruction has to be unambiguous. The starting state has to be reproducible across thousands of parallel rollouts. The reward function has to actually check whether the task happened, not whether some proxy of it happened. And — this is the killer — each new application brings its own setup quirks and its own verification surface. The way you check "did the user properly format this LibreOffice cell" is completely different from "did the user post the right message in this Slack channel."

3:28Bella: Right. So the field has been stuck in a dilemma. You can hand-author these tuples, in which case they're clean but you have maybe a few hundred, because each one is hours of expert effort. Or you can scrape and approximate, in which case you have more but the verification is fuzzy — and fuzzy rewards, it turns out, destabilize RL training in a way that actively makes the policy worse. There's been no path to "large, clean, and broad" simultaneously.

3:58Eric: And the paper's bet is that this isn't a problem about the RL algorithm. The algorithm we have — and they use a mixture-of-experts-friendly variant of the standard recipe — works fine once you feed it the right data. The whole intellectual contribution is the data manufacturing pipeline, and the load-bearing idea inside that pipeline is honestly elegant.

4:21Bella: This is the part I want to spend real time on, because once you see it you can't unsee it. Imagine you ask one AI to do the whole job. Build the starting environment, build the finished "golden" environment, and write the reward function that distinguishes them. What goes wrong?

4:40Eric: The reward ends up checking the construction procedure rather than the task. Because the same agent built the environment and wrote the checker, the checker knows exactly which files got placed, which flags got set, which database fields got touched during setup — so it just looks for those breadcrumbs. And now an RL agent training against that reward learns to leave those same breadcrumbs. The reward isn't measuring "did the task happen," it's measuring "did the construction sequence happen." There's no useful signal pointing at the actual skill.

5:16Bella: So the fix — the single design choice the rest of the paper is built around — is to run two agents in completely separate processes with disjoint views of the file system. One agent, the Generator, builds the starting VM and the golden completed VM. A second agent, the Discriminator, never sees what the Generator wrote. It can read the two finished environments through a state-only API, and it can read the task description in plain English. From those alone, it has to write the reward function.

5:49Eric: The image the paper invites — and it's a good one — is two contractors building the same room from opposite sides of a sealed wall. One builds the empty version, the other builds the finished version. Then there's an inspector who's only ever seen the original blueprints. The inspector can peek through small windows at the two finished rooms, but never saw either construction process. So when the inspector writes the inspection checklist, it has to describe what the finished room *is* — what it looks like, what's in it, what's been changed — not which steps were taken to get there.

6:27Bella: That's the trick. The information barrier forces the reward function to be a statement about the *outcome*, not the *procedure*. And then the two of them iterate, mediated by an Orchestrator, until five conditions are simultaneously true. Both scripts run. The reward returns one on the golden state. The reward returns zero on the starting state. And — this is the safeguard — a static scanner finds no shortcut patterns in the reward code.
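
Transcript note: the acceptance checks named in this turn can be sketched as a single predicate. This is a minimal illustration that encodes only the checks the hosts spell out; the `Candidate` fields and the toy states are hypothetical names, not the paper's Orchestrator code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Candidate:
    start_state: State                 # built by the Generator's setup script
    golden_state: State                # built by the Generator's golden script
    scripts_ran: bool                  # both scripts executed without error
    reward_fn: Callable[[State], int]  # written by the Discriminator
    scanner_hits: List[str] = field(default_factory=list)  # shortcut patterns found


def accept(c: Candidate) -> bool:
    """True only when every named check holds at the same time."""
    return (
        c.scripts_ran
        and c.reward_fn(c.golden_state) == 1  # reward fires on the finished state
        and c.reward_fn(c.start_state) == 0   # and stays silent on the starting state
        and not c.scanner_hits                # and the static scan is clean
    )


candidate = Candidate(
    start_state={"channel": "general", "last_message": None},
    golden_state={"channel": "general", "last_message": "Standup moved to 10am"},
    scripts_ran=True,
    reward_fn=lambda s: int(s.get("last_message") == "Standup moved to 10am"),
)
```

A reward that returned one on both states, or zero on both, would fail this predicate, which is what pushes the Discriminator toward a check that actually separates "done" from "not done".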

6:56Eric: Tell me about the scanner, because I think that's where the engineering taste shows.

7:02Bella: Yeah, it's a list of six forbidden patterns. Things like: assigning a verified flag to true without actually computing anything. Checking for bare file existence as if that proved the task. Hardcoded return values. Shelling out to subprocesses in suspicious ways. Comments that *claim* to check something with no actual code doing the check. If any of those patterns shows up, the round aborts and the matched pattern gets fed back as criticism. They cap the loop at five rounds, because empirically tuples that don't converge by round five are usually specifications no amount of script revision can save.
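
Transcript note: a static scan plus a capped revision loop might look roughly like the sketch below. The three regular expressions are crude, hypothetical stand-ins for the kinds of shortcut mentioned above; the paper's actual pattern list and matching logic are not given in the episode.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical patterns, written for illustration only.
FORBIDDEN: Dict[str, "re.Pattern[str]"] = {
    "flag set without computation": re.compile(r"^\s*verified\s*=\s*True\s*$", re.M),
    "bare file-existence check": re.compile(r"return\s+(int\()?os\.path\.exists\("),
    "hardcoded return value": re.compile(r"^\s*return\s+(1|1\.0|True)\s*$", re.M),
}

MAX_ROUNDS = 5  # the loop cap mentioned in the episode


def scan(reward_source: str) -> List[str]:
    """Return the names of every forbidden pattern found in the reward code."""
    return [name for name, pat in FORBIDDEN.items() if pat.search(reward_source)]


def refine(write_reward: Callable[[List[str]], str]) -> Optional[str]:
    """Ask for reward code, feeding matched patterns back as criticism each round."""
    feedback: List[str] = []
    for _ in range(MAX_ROUNDS):
        source = write_reward(feedback)
        feedback = scan(source)
        if not feedback:
            return source  # clean scan: keep this version
    return None  # did not converge within the cap: discard the tuple


bad = "def reward(state):\n    verified = True\n    return 1\n"
good = "def reward(state):\n    return int(state.get('vendor') == 'Acme')\n"
```

Here `write_reward` stands in for the Discriminator: it receives the names of the patterns that matched last round and returns revised source.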

7:41Eric: And this whole inner loop produces a tuple that's internally consistent — but the authors are clear that internal consistency isn't enough. A reward function and a golden state can perfectly agree with each other and still describe a task that no real agent could possibly accomplish, or one that's hopelessly ambiguous. So there's a second filter.

8:04Bella: The dataset-level filter. Two stages. First, an ensemble of LLM critics votes on each surviving tuple — scoring it on whether the task is consistent, executable, hack-resistant, clearly worded, and difficulty-calibrated. Majority vote keeps or kills it. Then, the surviving tuples get attempted by a strong teacher model. Multiple attempts each.

8:27Eric: And if the teacher can never solve it, drop it — the task is broken or impossible. If the teacher trivially solves it every time, downweight it — it's not teaching anything. If both the programmatic reward and an independent vision-language judge agree the teacher succeeded, accept the tuple. About thirty percent of what survives the inner loop gets rejected at this stage.
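
Transcript note: the teacher-based triage described in this turn can be sketched as a small decision function. It assumes, purely for illustration, that an attempt counts as a success only when the programmatic reward and the vision-language judge agree; how the paper handles disagreements and how much it downweights are not stated in the episode.

```python
from typing import List


def triage(program_rewards: List[int], judge_verdicts: List[bool]) -> str:
    """Classify one tuple from several teacher attempts.

    program_rewards[i] is the 0/1 output of the reward function on attempt i;
    judge_verdicts[i] is the independent judge's opinion of the same attempt.
    """
    successes = [r == 1 and j for r, j in zip(program_rewards, judge_verdicts)]
    if not any(successes):
        return "drop"        # never solved: task is likely broken or impossible
    if all(successes):
        return "downweight"  # solved every time: little left to learn
    return "accept"          # solved sometimes, with both signals agreeing
```

For example, `triage([1, 0, 1], [True, False, False])` returns `"accept"`: one attempt is confirmed by both signals, and the third is not counted because the judge disagrees with the reward.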

8:51Bella: So that's the per-task pipeline. But Eric, this only gets you a better way to manufacture tuples for *applications you already have*. And the existing benchmarks cover maybe a dozen desktop apps. If you actually want to teach an agent to operate the digital tools knowledge workers use every day, a dozen apps doesn't even start to scratch the surface.

9:14Eric: Right, and this is where the paper makes its second big move. They argue — and the ablations back this up, we'll get to that — that environment diversity is its own scaling axis, separate from how much data you have per environment. So they need a lot more environments. The problem is you cannot just train against real websites. Authentication, rate limits, state that doesn't reset cleanly between thousands of parallel workers — real production websites are categorically unusable as RL training environments.

9:48Bella: So they synthesize fake ones. Ninety-four of them. Mock Slack, mock Salesforce, mock Shopify, mock electronic health records, mock government portals. Each one a self-contained single-page web app, somewhere between twenty-five hundred and ten thousand lines of code. And they pick what to build by mapping occupational categories from the U.S. labor taxonomy onto software categories, weighted by software-usage frequencies from Anthropic's Economic Index — which estimates how often different kinds of software show up in large-scale traffic logs.

10:26Eric: That mapping move is one I really like, because it pushes back on a failure mode you see a lot in benchmarks, where everyone builds the same five toy environments because those are the ones in the previous benchmark. Grounding the environment list in observed software usage rather than convenience is a deliberate refusal to optimize for what's easy to measure.

10:51Bella: And the same multi-agent paradigm builds the mocks themselves. A Plan Agent does web research and writes a spec — color palettes, work queues, data models. A Dev Agent implements the single-page app. A Web Agent drives the result with browser automation, comparing live behavior against the spec and feeding bugs back. They iterate until zero high-priority issues remain.

11:16Eric: And every mock exposes the same four-endpoint state API. You can read the current state, inject state, get a structured diff between two states, and so on — with session isolation, so thousands of parallel rollouts can hit the same mock implementation and each see their own private world. Which means the per-task pipeline we just walked through — Generator, Discriminator, the iteration loop — drops in cleanly. Same machinery, new environments.
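
Transcript note: session-isolated state access can be sketched in a few lines. This in-memory class is a hypothetical stand-in covering only the three operations named in this turn (read, inject, diff); the real mocks are web apps, and their endpoint names and the remaining endpoint are not given in the episode.

```python
import copy
from typing import Any, Dict, Tuple

State = Dict[str, Any]


class MockStateStore:
    """In-memory stand-in for one mock app's state endpoints."""

    def __init__(self, default_state: State) -> None:
        self._default = default_state
        self._sessions: Dict[str, State] = {}

    def read(self, session_id: str) -> State:
        # Each session lazily gets its own private copy of the default state.
        if session_id not in self._sessions:
            self._sessions[session_id] = copy.deepcopy(self._default)
        return copy.deepcopy(self._sessions[session_id])

    def inject(self, session_id: str, state: State) -> None:
        # Overwrite one session's world without touching any other session.
        self._sessions[session_id] = copy.deepcopy(state)


def diff(before: State, after: State) -> Dict[str, Tuple[Any, Any]]:
    """Structured diff: key -> (old value, new value) for every changed key."""
    keys = sorted(before.keys() | after.keys())
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


store = MockStateStore({"pending_threads": 0})
store.inject("worker-a", {"pending_threads": 50})
```

After the injection, `worker-a` sees fifty pending threads while any other session still sees the default. That is also how one mock implementation can host different tasks from different injected states.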

11:46Bella: One detail worth flagging here, because it's a quietly powerful design move. The same mock can host many different tasks. The Slack mock with an empty inbox supports one set of tasks. The Slack mock with fifty pending threads supports a different set. Same code, different injected state, different feasible tasks. So ninety-four mocks doesn't translate to ninety-four tasks — it translates to tens of thousands.

12:14Eric: Which gets us to the actual numbers. The final dataset is just over thirty-two thousand verified training tuples across a hundred and ten environments — desktop plus web. The largest open-sourced computer-use dataset by, the authors claim, a wide margin. And then they train.

12:33Bella: Walk me through the training side. Which models, which algorithm, what scale?

12:39Eric: Two backbones. A smaller model at thirty-five billion parameters total, three billion active. And a larger one at three hundred ninety-seven billion total, seventeen billion active. They warm-start with supervised fine-tuning on roughly thirty-five hundred successful teacher trajectories, then run reinforcement learning for a thousand steps using a variant of group-relative RL adapted for mixture-of-experts models. The intuition for the algorithm is pretty clean: for each task, the agent makes a group of parallel attempts. Each attempt gets a one or a zero. The "advantage" for any single attempt is just how much better it did than the group average. You nudge the policy toward whatever the better attempts had in common, with a safety belt that prevents any single update from being too aggressive.

13:35Bella: Right — the cocktail-making intuition. You don't have a master recipe to compare against, so you make a batch of versions, rank them against each other, and tilt your technique toward whatever the better ones had in common. The clip is "don't change your technique by more than twenty percent in any one session, even if one batch was incredible."
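
Transcript note: the group-relative advantage and the "safety belt" can be written down in a few lines. This is a simplified sketch of the general idea as described by the hosts, using a PPO-style clipped term with the twenty percent figure from the analogy; it omits the standard-deviation normalization that some group-relative methods use, and it says nothing about the paper's mixture-of-experts adaptation.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each attempt = its reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-attempt objective with the policy ratio clipped to [1 - eps, 1 + eps].

    ratio is new_policy_prob / old_policy_prob for the attempt's actions.
    Taking the min keeps a single update from moving the policy too far.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four parallel attempts at one task, two succeed.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

With two successes out of four, the successful attempts get an advantage of +0.5 and the failures get -0.5. A ratio of 1.5 on a successful attempt is credited as if it were only 1.2.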

13:59Eric: Exactly. And the results — the smaller model goes from fifty-four-point-five percent on OSWorld-Verified to sixty-two-point-one percent. That's the standard computer-use benchmark. The larger model goes from sixty-two-point-two to seventy-two-point-six.

14:18Bella: The detail I want to land on there is that the smaller trained model essentially matches the larger *un*trained one. Roughly ten times fewer total parameters, and it gets to the same place. That is a meaningful compression — and it lives entirely in the data, not the algorithm.

14:37Eric: And the seventy-two-point-six closes most of the gap to the nearest proprietary model on the same benchmark, which sits at seventy-two-point-nine. The very top of the closed-model leaderboard is still further out — Claude Opus 4.7 sits at seventy-eight, and another closed model at seventy-eight-point-seven — so this isn't catching the proprietary frontier, but it's closing the gap to the nearest comparable. The previous best open-source numbers were in the mid-fifties and mid-forties. So this isn't a small move. The authors are also careful to show that the curves don't saturate — performance keeps climbing with data volume in the regime they explored.

15:21Bella: And then there's the ablation I want to spend a beat on, because Eric, this is the result that I think most changes how the field should think about its bottleneck.

15:31Eric: Figure eight, right? The environment-diversity result.

15:36Bella: Yeah. Three conditions. First: train on ten environments, three hundred trajectories each. Total: three thousand trajectories. Second: train on eighty environments, thirty-eight trajectories each. Roughly the same total — about three thousand trajectories. Third: train on eighty environments, seventy-five trajectories each. Double the data, same broad environment pool. The result is that the second condition beats the first by a meaningful margin. Same number of trajectories, just spread across more environments, and you do better. And then the third condition — more data on the broader pool — beats both by a lot.
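
Transcript note: the budgets in the three conditions are easy to check by hand, and a few lines of Python make the comparison explicit. The condition labels are made up for this sketch; the counts are the ones quoted above.

```python
# (number of environments, trajectories per environment)
conditions = {
    "narrow": (10, 300),
    "broad, matched budget": (80, 38),
    "broad, double budget": (80, 75),
}

totals = {name: envs * per_env for name, (envs, per_env) in conditions.items()}
```

The totals come out to 3,000, 3,040, and 6,000 trajectories, so the first two conditions are matched to within about one percent and the third doubles the budget on the same eighty environments.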

16:15Eric: The chess analogy is the one that landed for me. If you train a player by having them play ten thousand games against one opponent, they get extremely good at beating that opponent and plateau. If you train them by having them play ten thousand games spread across a hundred opponents, same total game count, you get a much stronger generalist. There's only so much that one opponent can teach you. The variety of situations matters independently from the volume of practice.

16:46Bella: And the uncomfortable implication for the field is that if you've been adding more trajectories on a small set of environments and watching numbers go up, you might be hitting a ceiling that's invisible. Because adding trajectories *does* keep helping, just not as much as it would if you also broadened the environments. The bottleneck might have been a diversity bottleneck nobody had a clean name for.

17:12Eric: That reframes a lot of how I'd read prior work. A lot of computer-use papers have been measured on the same handful of applications. If diversity is its own scaling axis, then "we got better numbers on the same five apps" might not be telling you what you think it is.

17:30Bella: Now, Eric, let me hand you the next part, because I think you should be the one to land it. There's a finding in here that the agent does on its own — not because anyone designed a reward for it, not because anyone wrote it into the prompt — that I genuinely did not expect.

17:48Eric: Yeah, this is the result that made me re-read the section twice to make sure I had it right. So the agent operates by emitting tool calls. Click here, type this, press this key, scroll. Before RL, the supervised-fine-tuned policy emits about one tool call per model step — one action, then one observation, then one action. Pretty standard. After RL, that number rises to somewhere between one-point-four and one-point-nine tool calls per step. The agent has started bundling actions together. And the total trajectory length drops by thirty-three to forty-five percent. Same task success rate. Roughly forty percent fewer steps.
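
Transcript note: the two statistics quoted here, tool calls per step and the drop in step count, can be computed from a trajectory represented as a list of steps, each step being a list of tool calls. The two toy trajectories below are invented for illustration and are not taken from the paper.

```python
from typing import List

Trajectory = List[List[str]]  # one inner list of tool calls per model step


def calls_per_step(traj: Trajectory) -> float:
    return sum(len(step) for step in traj) / len(traj)


def step_reduction(before: Trajectory, after: Trajectory) -> float:
    """Fraction of model steps saved by the second trajectory."""
    return 1.0 - len(after) / len(before)


# One tool call per step: the pre-RL pattern described above.
unbatched: Trajectory = [
    ["right_click item"], ["click Rename"], ["type final"], ["key Enter"],
    ["click File"], ["click Export"], ["click PDF"],
    ["type report.pdf"], ["key Enter"], ["double_click report.pdf"],
]

# Same ten tool calls, with predictable chains fused into single steps.
batched: Trajectory = [
    ["right_click item"],
    ["click Rename"],
    ["type final", "key Enter"],
    ["click File", "click Export", "click PDF"],
    ["type report.pdf", "key Enter"],
    ["double_click report.pdf"],
]
```

In this toy case the batched trajectory makes the same ten tool calls in six steps instead of ten, which works out to about 1.7 calls per step and a 40 percent cut in steps, inside the ranges quoted in the episode.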

18:32Bella: And critically, nobody trained it to be efficient.

18:35Eric: Nobody trained it to be efficient. There is no efficiency reward. There is a binary success reward and a step budget — if you take too long, you time out and fail. And what falls out of that, statistically, is that trajectories that fit inside the budget have higher relative reward than ones that just barely time out. So the algorithm implicitly selects for policies that pack actions together. But the wild part isn't just that it batches. It's *what* it batches. The single most common batched sequence in the trained policy is type-then-key, almost four thousand times. Sequences like "click File, click Export, click PDF" — deterministic chains, where you know what each click is going to surface — get batched constantly. And there's a small set of actions that *never appear alone* in the trained policy: scroll, key down, key up, click-and-drag. Those are the mechanical sub-components of larger gestures. They only ever appear inside a batch.
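
Transcript note: the implicit selection pressure can be seen in a toy calculation. The step budget of 15 and the four attempts below are hypothetical numbers chosen for illustration; the only ingredients taken from the episode are the binary reward, the timeout, and the group-relative comparison.

```python
from typing import List, Tuple

STEP_BUDGET = 15  # hypothetical budget for this sketch


def reward(would_solve: bool, steps_needed: int) -> int:
    """Binary reward: success only counts if it fits inside the step budget."""
    return int(would_solve and steps_needed <= STEP_BUDGET)


def group_advantages(rewards: List[int]) -> List[float]:
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Four attempts that would all reach the goal, needing different step counts.
group: List[Tuple[bool, int]] = [(True, 9), (True, 14), (True, 18), (True, 22)]
rewards = [reward(ok, n) for ok, n in group]
advantages = group_advantages(rewards)
```

All four attempts take a correct route, but only the two that finish within the budget score one, so they end up with positive advantage and the slower two with negative advantage. No term in the reward mentions efficiency, yet shorter trajectories are the ones reinforced.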

19:41Bella: And then there are actions on the opposite end.

19:44Eric: Actions like right-click and double-click. These get emitted alone ninety-four to ninety-eight percent of the time. And the reason is intuitive once you see it — they pop context menus whose contents the policy genuinely can't predict ahead of time. If I right-click and then immediately try to send another action, I don't know what menu options will appear, so I can't reliably plan past the right-click. The policy has somehow figured this out. It has learned, with no explicit signal pointing at it, *which actions safely admit batching and which don't*.

20:19Bella: That's the part I find genuinely striking. The barista analogy is the closest thing I have. A barista who's been doing the job for a year has never been told to be efficient — they've been told to make correct drinks under time pressure. Over time they learn to start the espresso, then steam the milk while it pulls, then grab the cup. They've also learned which steps can't be parallelized. They won't pour latte art while looking away, because that step depends on what the milk is doing right now. The agent has independently arrived at the same kind of discrimination. Deterministic sub-sequences get fused; sub-sequences whose outcomes depend on unpredictable state stay atomic.

21:03Eric: And it parallels a finding from the math-reasoning RL literature, where models spontaneously develop chains of self-correction — "wait, let me check that" patterns — that nobody put in the reward. The system learns structural behaviors nobody designed. Here the structural behavior happens to be a coarse model of which UI operations have predictable consequences.

21:26Bella: It's one of those results that feels qualitatively important in a way that's hard to defend rigorously. The agent didn't just get better at the benchmark. It changed its relationship to the action space.

21:39Eric: The Shopify case study in the appendix is a really concrete example of this. The instruction is something any of us could imagine doing — change the vendor field for all products under two specific brands, then update those product descriptions to include a "now part of" note. Don't touch other vendors. Pretty quotidian.

22:01Bella: And the agent works through it across about nineteen turns. Early in the trajectory, it's doing one action per turn — click into a product, change a field, save. By turn thirteen, it's batching select-all, type, save into a single turn. By turn fourteen, an unexpected save dialog pops up and it handles the recovery cleanly. Then it verifies both products were updated. Reward: one-point-zero. Just watching the trajectory get visibly more efficient as it goes — without that being anywhere in the prompt — is the kind of thing that gives the finding texture.

22:38Eric: So all of that's the positive story. I want to give the limitations real airtime, because the paper deserves credit for being unusually direct about them and the audience deserves to hear them.

22:51Bella: Yeah, let's do that — and the authors flag most of these themselves, which I appreciate. Where do you want to start?

22:59Eric: Start with the transfer claim, because it's where the framing is slightly hotter than the evidence supports. The big numbers are on OSWorld-Verified, which uses applications and task structures that are pretty similar to the training distribution. They also evaluate on WebArena — a separate browser benchmark the model never saw during training. And the WebArena lift is real but small. Two points for the larger model, about three and a half for the smaller. Compare that to a ten-point in-distribution gain. The transfer is positive, but the paper sometimes phrases this as "the skills generalize beyond the training environments," and a careful reader might say the bulk of the benefit is on the in-distribution side.

23:47Bella: And related to that — the mocks are deliberately approximations. They strip out authentication flows. They strip out third-party integrations. They strip out network latency, rate limits, and rare server-side failure states. That's a defensible choice for training, because those are exactly the things that would make parallel RL impossible. But those are also exactly the conditions where real-world deployed agents currently fail most often. The flight simulator analogy is the one I keep coming back to — you can train thousands of hours of reflexes in a simulator and they'll transfer beautifully, until the engine throws an error code the simulator never showed you.

24:30Eric: The paper concedes this in the limitations section but doesn't measure it, which is honest but also leaves the question open.

24:37Bella: A third one, which I think is worth being clear about: the headline RL runs are single-seed. The authors say so explicitly — they cite compute cost. RL training is famously high-variance. We do not know how much of the seven-point-six and ten-point-four-point gains is robust signal versus a fortunate seed. The smooth scaling curves with respect to data volume are reassuring, but they are not the same thing as multi-seed confidence intervals. The authors frame their results carefully — "evidence that verified scale is valuable," not "a final characterization of the ceiling" — and I think that framing is exactly right.

25:17Eric: And then there's a more subtle one about the reward functions themselves. They verify terminal state, not process. Which means a clean targeted edit and a destructive sequence that nukes everything and rebuilds the right final state earn the same reward. For training purposes this might be fine. For deployment, an agent that has learned that *any* sequence of actions leading to the right end state is acceptable might behave very badly when it has access to data it could destroy and recreate.
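
Transcript note: the end-state-only limitation is easy to reproduce in miniature. In this hypothetical sketch, a reward that compares only the final state to a golden state cannot tell a targeted edit from a delete-everything-and-rebuild sequence. The product table and function names are invented for illustration.

```python
from typing import Dict

State = Dict[str, str]  # product id -> vendor

GOLDEN: State = {"p1": "Acme", "p2": "Acme", "p3": "Other"}


def reward(final_state: State) -> int:
    """Terminal-state check: only the end result is inspected."""
    return int(final_state == GOLDEN)


def targeted_edit(state: State) -> State:
    new = dict(state)
    new["p1"] = "Acme"
    new["p2"] = "Acme"
    return new


def delete_and_rebuild(state: State) -> State:
    new = dict(state)
    new.clear()         # destroys every record, including untouched ones
    new.update(GOLDEN)  # then recreates the expected final state
    return new


start: State = {"p1": "OldCo", "p2": "OldCo", "p3": "Other"}
```

Both `reward(targeted_edit(start))` and `reward(delete_and_rebuild(start))` evaluate to 1. Telling them apart would take a process-level signal, such as a log of intermediate states, which this kind of reward does not look at.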

25:49Bella: That's not a hypothetical risk — that's a real consequence of the design choice. And the authors flag it.

25:55Eric: They also flag, in their own words, that the information barrier "reduces but does not formally eliminate" reward hacking. The static scanner catches the obvious cheats. A sufficiently clever Discriminator could still write a reward function that happens to correlate with task success without actually measuring the task, and the scanner wouldn't catch it. The teacher-rollout filter catches some of this in practice, but the guarantee is empirical, not theoretical.

26:26Bella: I want to pull all of that together, because Eric, even with those caveats this is a paper that changes what I think the next year of computer-use work is going to look like.

26:38Eric: Yeah, say more about that.

26:39Bella: There are three reasons I think it matters, in order. The smallest is the immediate practical one — anyone trying to build agents that operate real software now has an order of magnitude more training data than existed last week, plus open-sourced pipeline code to generate more. That's a meaningful gift to the field. The middle one is methodological. Before this paper it was genuinely plausible that agent training was different in kind — that something about high-dimensional pixel inputs, huge action spaces, long horizons, multimodal observations meant the math-and-code RL recipe wouldn't transfer. The evidence here is that it does transfer. The curves look the same. The unsaturated scaling looks the same. The reason it wasn't working before was the data infrastructure wasn't there. That's a meaningful update on what the rest of the agent-training research agenda should look like.

27:38Eric: And the biggest one?

27:39Bella: Environment diversity as a separate axis. Because if that's real — and the ablation is pretty clean — then a lot of the field has been measuring itself against a benchmark that's narrow enough to mask the ceiling. The path forward isn't more clever training tricks on the same five applications. It's heavy investment in environment infrastructure. Building more *worlds*. That's harder and less glamorous than algorithmic work, but it's doing more of the work.

28:09Eric: There's a bigger pattern in modern AI training that this paper fits into, and it's worth naming. For about two years the dominant story in model improvement has been: find a domain where verification is mechanical, build the checker, scale RL against it. Math was the proof of concept. Code extended it. Each time the cleverness was less in the algorithm and more in the data and environment infrastructure. The bottleneck in AI progress has been quietly migrating from architecture and algorithms toward data and environments — the unglamorous engineering work of building the worlds models train against.

28:49Bella: And this paper is a pretty pure instance of that shift. The central intellectual move is not a new loss function. It's a way to manufacture verified training environments at scale, with a specific design — the information barrier between Generator and Discriminator — that prevents the most obvious failure mode of doing it with AI agents.

29:12Eric: One thing I'd add, Bella, is that the information-barrier idea has a longer arc than just this paper. The principle — separate the agent that creates from the agent that evaluates — has cousins all over computer science. Adversarial training in generative models. Proof-checker design, where the prover and the checker have asymmetric roles. Adversarial code review, where the reviewer didn't write the code. The pattern keeps showing up because it solves a recurring problem: when the same process produces both the artifact and the evaluation, the evaluation collapses into the procedure. Separating them forces the evaluation to be about the outcome.

29:54Bella: That's a nice frame, because it suggests the same idea will keep appearing. Anywhere we use AI to generate training data for other AI agents — which, given the rise of synthetic data, is going to be everywhere — the same failure mode is going to threaten, and the same kind of structural fix is going to be needed.

30:16Eric: The implicit bet of the paper — and it's worth stating plainly — is that if you can manufacture verified training environments at scale, computer-use agents will follow the same path math and code did. We'll get systems that can reliably operate the digital tools knowledge workers actually use. The paper is one piece of evidence that the bet is reasonable, not a proof that the destination is reachable. But it's substantial evidence.

30:45Bella: And the cost numbers are worth noting too. Generating the entire dataset cost roughly thirteen thousand dollars in API and VM compute. Training the smaller model took five days of wall-clock on a hundred and ninety-two H200 GPUs. These are real numbers, but they're not "only-a-frontier-lab-can-do-this" numbers. The bottleneck genuinely has been the missing infrastructure, not access to compute.

31:11Eric: Which is, I think, the most quietly hopeful thing in the paper. The reason computer-use agents have been worse than they should be wasn't that the problem is intractable. It's that nobody had built the pipeline. And now somebody has.

31:25Bella: That feels like the right place to leave it. The show notes have a link to the paper and some related reading if you want to keep pulling on this thread.

31:34Eric: And if you want the full transcript with the jargon defined inline and links over to the other episodes that share these ideas, that's all on paperdive.ai. Thanks for listening to AI Papers: A Deep Dive.