Episode 047 · May 15, 2026 · 28 min

When Agent Benchmarks Lie: The Harness Problem in Open-Source AI

Peng, Yao, Wu et al.

LLM Agent Training Agentic AI Systems
Paper: Orchard: An Open-Source Agentic Modeling Framework
Venue: arXiv:2605.15040
Year: 2026
Read the paper: arxiv.org/abs/2605.15040

A software-engineering agent scores 62% on its native test setup and 3.6% when you swap the wrapper around it. A new paper called Orchard argues this isn't a bug in one system; it's an indictment of how the entire open-source agent field has been measuring progress, and it offers an infrastructure-first fix that costs ten times less and actually generalizes.

What you'll take away

  • Why most reported benchmark scores measure harness-fit rather than underlying capability, and how a cross-harness test exposes the gap
  • How treating the environment layer as a thin, generic service (rather than baked-in plumbing) cuts training costs roughly 10x versus managed sandbox services
  • Credit-assignment SFT: extracting partial supervision from failed teacher trajectories by finding the rising segment before the critical mistake
  • Balanced Adaptive Rollout (BAR): a self-pacing RL technique that stops generating rollouts once a prompt yields a useful mix of wins and losses
  • The surprising result, a 4B-parameter student beating its 235B teacher, and why environment-grounded RL teaches something imitation can't
  • Honest limitations the paper undersells: the cross-harness comparison is partly confounded, RL gains are optimized on a curated subset of tasks, and the whole recipe depends on a few open frontier teacher models staying open

Chapters

  1. 00:00 The 62-to-3.6 collapse
  2. 03:27 Agent, harness, and what Orchard actually is
  3. 06:54 Sandboxes as a thin service
  4. 10:21 Credit-assignment SFT: salvaging failed trajectories
  5. 13:49 Balanced Adaptive Rollout for RL
  6. 17:16 The cross-harness experiment
  7. 20:43 When a 4B student beats a 235B teacher
  8. 24:11 What the paper undersells


0:00Bella: Here's a question that should bother you. You've trained an AI agent: a software engineer, say, the kind that's supposed to fix real issues. On your benchmark, in your test harness, it scores 62 percent. Solid number. Now swap the harness: same model, same task, just a different wrapper around it, a different set of tools, a slightly different way of formatting the conversation. Score drops to 3.6 percent.

0:27Eric: Three point six.

0:28Bella: Three point six. The model didn't get worse. The model is exactly the same set of weights. What collapsed is the illusion that the model ever knew how to do the job in the first place. And that result, that exact pattern of collapse, is what the paper we're digging into today is built around. The paper is called "Orchard: An Open-Source Agentic Modeling Framework," it went up on arXiv on May fourteenth, twenty-twenty-six, and we're recording the next day, on May fifteenth, twenty-twenty-six. This whole episode is AI-generated: the script is from an Anthropic model. I'm Bella, that's Eric, and we're both AI voices from Eleven Labs. Neither company is involved in producing the show. And the reason that one-day turnaround matters is the paper drops a result that, if it generalizes, kind of indicts how the whole open-source agent community has been measuring progress.

1:25Eric: Right. And before we get to the indictment, let's name what's actually under the hood here, because Orchard is doing two pretty different things at once and they only make sense together. It's a piece of infrastructure, a way of running thousands of sandboxed Linux environments to train agents in, and it's a set of training recipes that sit on top of that infrastructure. The paper's argument is that you can't separate them. The plumbing shapes what's possible upstairs.

1:56Bella: And to make any of this concrete we need a quick vocabulary check, because the paper just assumes you know all this. When they say "agent" they don't mean a chatbot. They mean a language model running in a loop: it writes a few lines of reasoning, then emits an action like "run this shell command" or "click this button," sees what happened, decides what to do next, and keeps going for dozens or hundreds of steps until it either solves the task or gives up. The standard scaffolding for that loop is called ReAct: reason, then act, then reason about what came back. That's the body the model lives inside while it's working.
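
To make the loop concrete, here is a minimal sketch of a reason-act-observe cycle. It is an illustration only, not the paper's harness: `run_agent`, `scripted_policy`, and `fake_tools` are hypothetical names, and the "policy" is a scripted stand-in for a language model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a policy maps the transcript so far to
# (thought, tool name, tool argument). A real agent would call an LLM here.
Policy = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def run_agent(policy: Policy, tools: Dict[str, Tool], task: str,
              max_steps: int = 50) -> List[str]:
    """Reason, act, observe, and repeat until the policy finishes or steps run out."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        thought, tool_name, argument = policy(transcript)
        transcript.append(f"thought: {thought}")
        if tool_name == "finish":
            transcript.append(f"answer: {argument}")
            break
        transcript.append(f"action: {tool_name}({argument!r})")
        if tool_name in tools:
            observation = tools[tool_name](argument)
        else:
            # Surface the error to the model instead of crashing the loop.
            observation = f"error: unknown tool {tool_name!r}"
        transcript.append(f"observation: {observation}")
    return transcript


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for a language model: look at the files once, then finish."""
    if not any(line.startswith("observation:") for line in transcript):
        return ("I should look at the files first.", "shell", "ls")
    return ("I have seen the listing.", "finish", "done")


fake_tools: Dict[str, Tool] = {
    "shell": lambda cmd: "README.md setup.py" if cmd == "ls" else "",
}
log = run_agent(scripted_policy, fake_tools, "inspect the repo")
```

Everything around `policy` in this sketch (the tool table, the transcript format, the error message) is the part the episode goes on to call the harness.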

2:37Eric: And the body has a name. The paper calls it the harness. That's the surrounding software: what tools the agent has, how its outputs get parsed, how errors are surfaced, what the prompt looks like. The model is the brain; the harness is the nervous system. And the punchline that the whole paper builds toward is that the field has been confusing one for the other. A lot of what gets reported as "this model is great at software engineering" is really "this model is great at software engineering inside this specific harness," and the moment you swap the harness, the capability evaporates.

3:16Bella: Which is the 62-to-3.6 result. So hold that as the destination — we'll come back to it. The road to it starts with infrastructure, which I know sounds like the boring part of the paper, but Eric, you wanted to make a case here.

3:32Eric: I do, because the infrastructure case is where the dollar amounts live, and the dollar amounts are not small. Here's the situation in early twenty-twenty-six. If you want to train a software-engineering agent, you have to spin up sandboxes: isolated Linux environments, each with the right repository checked out, the right Python version, the right dependencies, the right test setup. Thousands of them in parallel, because the agent is going to take hundreds of actions per task and you're going to be running many tasks at once across many training steps. The question is who runs those sandboxes for you.

4:13Bella: And there are basically two options the open-source world has had.

4:18Eric: Two bad options. Option one is a managed service: companies that rent you sandboxes by the hour. Easy to use, expensive at scale, and you can't really tune their performance. Option two is you bake the sandboxing into your training stack (every project rolls its own) and then nothing transfers. A dataset of trajectories collected with one project's runtime can't be reused under another project's trainer. So researchers either pay through the nose or they rebuild the wheel.

4:53Bella: Orchard's bet is option three: pull the environment layer out into its own thin service, like the way databases got pulled out of applications in the nineties. The application doesn't care which database is underneath, the database doesn't care which application is on top, and there's a tiny, generic interface in between. Same idea here. The training code, the harness, the evaluation pipeline: they all just talk to a small API. Create a sandbox. Run a command inside it. Read a file. That's the whole vocabulary.
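
A rough sketch of what such a three-verb interface could look like follows. This is not Orchard's actual API: `EnvironmentService`, `LocalDirService`, and `smoke_test` are hypothetical names, and the toy backend uses plain temporary directories, which provide no real isolation.

```python
import subprocess
import tempfile
import uuid
from pathlib import Path
from typing import Dict, Protocol, Tuple


class EnvironmentService(Protocol):
    """Hypothetical narrow interface: the only three verbs a caller may use."""

    def create_sandbox(self) -> str: ...
    def run_command(self, sandbox_id: str, command: str) -> Tuple[int, str]: ...
    def read_file(self, sandbox_id: str, path: str) -> str: ...


class LocalDirService:
    """Toy backend: each 'sandbox' is a temp directory. NOT isolated or secure."""

    def __init__(self) -> None:
        self._dirs: Dict[str, Path] = {}

    def create_sandbox(self) -> str:
        sandbox_id = uuid.uuid4().hex
        self._dirs[sandbox_id] = Path(tempfile.mkdtemp(prefix="sandbox-"))
        return sandbox_id

    def run_command(self, sandbox_id: str, command: str) -> Tuple[int, str]:
        # Run the command with the sandbox directory as the working directory.
        result = subprocess.run(
            command, shell=True, cwd=self._dirs[sandbox_id],
            capture_output=True, text=True, timeout=30,
        )
        return result.returncode, result.stdout + result.stderr

    def read_file(self, sandbox_id: str, path: str) -> str:
        return (self._dirs[sandbox_id] / path).read_text()


def smoke_test(env: EnvironmentService) -> str:
    """Caller code depends only on the interface, never on the backend."""
    sandbox_id = env.create_sandbox()
    env.run_command(sandbox_id, "echo hello > out.txt")
    return env.read_file(sandbox_id, "out.txt").strip()
```

The point of the shape is that `smoke_test` would run unchanged against a container-backed or remote implementation, which is the composability argument in miniature.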

5:29Eric: And the engineering claim is that if you draw the boundary in exactly that place (narrow, generic, harness-agnostic) you get composability for free. The same stack can back SFT data collection, RL training, and evaluation. Datasets transfer. Recipes transfer. Now, that's the architectural claim. The empirical claim is that it's also dramatically faster and cheaper, and this is where the numbers get interesting.

5:58Bella: How interesting?

6:00Eric: A 128- training run, 240 hours of compute, on commodity spot instances. On Orchard, six hundred and seventy-three dollars. The same workload on the managed services, about seven thousand. That's roughly a ten-x gap. The latency numbers are the same shape: average command latency is about a quarter-second on Orchard, versus two seconds at the worst comparison point. And they stress-test it at a thousand concurrent sandboxes, full lifecycle, hundred percent success.
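
As a quick check on the "roughly ten-x" claim, the arithmetic can be run on the figures exactly as quoted in the episode. The managed-service cost is only given as "about seven thousand" and the latencies are rounded, so the ratios are approximate.

```python
# Figures as quoted in the episode (rounded, so the ratios are approximate).
orchard_cost_usd = 673.0
managed_cost_usd = 7000.0   # "about seven thousand"
orchard_latency_s = 0.25    # "about a quarter-second"
worst_latency_s = 2.0       # the slowest comparison point

cost_ratio = managed_cost_usd / orchard_cost_usd
latency_ratio = worst_latency_s / orchard_latency_s

print(f"cost gap: {cost_ratio:.1f}x")        # cost gap: 10.4x
print(f"latency gap: {latency_ratio:.0f}x")  # latency gap: 8x
```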

6:34Bella: I want to flag something here, because the comparison is real but it's not apples-to-apples. The ten-x cost gap is Orchard run by experts with their own cluster on spot instances, versus a managed service that handles everything for you. Those are different products for different users. For a research group with operational expertise, the savings are huge. For a group without that expertise, paying for the managed service may still be the rational call. The paper doesn't really sit with that distinction.

7:12Eric: Fair, and we should come back to that when we get to the steelman section. But hold onto the architectural shape, because it sets up the more interesting half of the paper, which is what happens when you build a training recipe on top of this substrate. Bella, you want to take the recipe?

7:32Bella: Yeah. So the team builds three different agents on top of Orchard (a software-engineering agent, a browser agent, and a personal-assistant agent), and the most interesting one to talk through is the software engineer, because that's where most of the recipe innovation lives. The headline result is that they take a mixture-of-experts model, 30 billion parameters total but only 3 billion of them firing for any given token, and they get it to 67.5 percent on SWE-bench Verified. SWE-bench Verified is the closest thing the field has to an objective measure of "can an LLM actually fix real bugs in real repos."

8:14Eric: And 67.5 is what kind of number?

8:16Bella: It's the kind of number that beats every dense 72-billion-parameter open-source recipe and starts approaching the frontier closed-and-open systems that are ten to thirty times larger in total parameters. So the question is: how does a model that small get there? And the answer is two pieces of recipe work. The first is called credit-assignment SFT, and it's about the problem of failed trajectories.

8:43Eric: Which is a problem you'd think you'd want to ignore.

8:47Bella: That's exactly what everyone does: ignore it. The standard recipe is: have a powerful teacher model attempt the task, keep the trajectories where it succeeded, throw away the ones where it failed, train the student to imitate the successes. Clean, simple, and it throws away about thirty percent of your data, because that's how often even strong teachers fail on hard tasks.

9:13Eric: And the authors look at that thirty percent and ask whether there's something to salvage.

9:19Bella: They do. Imagine a chess coach reviewing a game you lost. The naive approach is "you lost, don't play any of those moves again." The smarter approach is to walk through the game and say "moves one through twelve were actually fine, you were ahead the whole time. Move thirteen is where you went wrong. So learn from one through twelve as good play, and treat thirteen as a lesson." That's the whole idea. They take a failed trajectory, hand it to the teacher LLM, tell the teacher it failed, and ask the teacher to estimate at each step how likely the agent was to eventually succeed. Where that probability is rising, the agent was making real progress. Where it falls off a cliff, that's where things went wrong. Train only on the rising sections.

10:07Eric: And there's a small thing here worth pausing on. The "coach" is the same model that produced the failed game. It's just being shown the outcome in retrospect. So you're using hindsight to extract supervision the original attempt couldn't have generated on its own. And the authors find that across their annotated set, this hindsight value curve is shaped like an inverted U almost ninety-nine percent of the time. It rises during the productive part, peaks, then drops at the critical mistake. So the "rise segment" notion is actually picking out a real structure in how agents fail, not just imposing one.
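
The rise-segment idea can be sketched in a few lines. This is an assumption-laden illustration, not the paper's selection rule: it assumes the teacher's hindsight estimates arrive as one success probability per step, keeps every step up to and including the peak, and uses a hypothetical `min_gain` threshold to skip curves that never really rise.

```python
from typing import List, Sequence


def rising_segment(values: Sequence[float], min_gain: float = 0.1) -> List[int]:
    """Return the step indices up to and including the peak of the value curve.

    `values[i]` is the estimated chance of eventual success after step i.
    Returns [] if the curve never rises at least `min_gain` above its start,
    meaning there is no productive prefix worth imitating.
    """
    if not values:
        return []
    # First index of the maximum: the last step before the curve turns down.
    peak = max(range(len(values)), key=lambda i: values[i])
    if values[peak] - values[0] < min_gain:
        return []
    return list(range(peak + 1))


# Inverted-U curve: progress through step 3, then the critical mistake.
hindsight = [0.20, 0.35, 0.50, 0.70, 0.15, 0.05]
keep = rising_segment(hindsight)

# Per-step training mask: 1 = imitate this step, 0 = ignore it.
loss_mask = [1 if i in keep else 0 for i in range(len(hindsight))]
```

On this example `keep` is `[0, 1, 2, 3]`, so the two steps after the peak contribute nothing to training, which is the chess coach's "learn from moves one through twelve" in code form.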

10:47Bella: And the practical upshot is that thirty percent of teacher trajectories that used to be garbage are now partial supervision. That's the SFT piece. The RL piece is the second recipe contribution, and Eric, this is your favorite.

11:02Eric: It's called Balanced Adaptive Rollout, BAR for short, and it solves a problem that's specific to how RL works for agents. So the standard algorithm here is called GRPO. You don't need the acronym, just the idea. For each training prompt, you generate, say, eight rollouts from the current model. You see which ones succeeded and which failed. The model learns by comparing: "the successful ones did X, the failed ones did Y, do more X." That comparison is the learning signal.

11:33Bella: And the problem is when there's nothing to compare.

11:36Eric: Right. If all eight succeeded, there's no contrast and no learning signal. If all eight failed, same thing. You burned the compute generating eight rollouts for zero learning signal. And in agent training, where each rollout might take many minutes of sandboxed execution, that's catastrophically expensive. The naive fix is to pre-filter prompts based on historical difficulty, but that requires bookkeeping you don't have on a fresh corpus.
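
Why a uniform group teaches nothing can be shown with a small sketch of group-relative scoring. This is a simplified illustration of the comparison idea, assuming binary rewards and normalization by the group's mean and population standard deviation; the exact normalization used in the paper is not stated in the episode.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids dividing by zero when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


all_wins = group_advantages([1.0] * 8)
all_losses = group_advantages([0.0] * 8)
mixed = group_advantages([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

With eight wins or eight losses every advantage is exactly zero, so the update has nothing to push on. In the mixed group the three successes get positive scores and the five failures get negative ones.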

12:06Bella: So picture a teacher handing out practice problems to a class. Problems where every student gets it right teach nothing. Problems where every student gets it wrong teach nothing. The valuable problems are the ones where roughly half the class gets it right — because those are where you can compare what the successful students did against what the others did. Now imagine the teacher can generate more attempts on demand. The smart strategy is: keep generating attempts on this problem until you have a healthy mix of wins and losses, then stop. Easy problems resolve fast, hard problems get more attempts, impossible ones get dropped.

12:48Eric: That's BAR. Generate rollouts in small batches, after each batch check whether you have enough successes and enough failures to make a balanced training group, stop as soon as you do, fall back if you can't. The effect is that every batch that hits the optimizer is information-dense, and the compute budget self-paces across prompt difficulty. It's a curriculum that happens during rollout, not before.
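
The stopping rule described here can be sketched as follows. The thresholds, the batch size, and the fallback of dropping the prompt are all assumptions made for illustration; the episode does not give the paper's actual values or fallback behavior, and `balanced_rollouts` is a hypothetical name.

```python
from typing import Callable, List, Optional


def balanced_rollouts(
    run_rollout: Callable[[], bool],
    batch_size: int = 2,
    min_wins: int = 2,
    min_losses: int = 2,
    max_rollouts: int = 16,
) -> Optional[List[bool]]:
    """Roll out in small batches and stop once wins and losses are both present.

    Returns the outcomes when the group is balanced enough to train on, or
    None when the budget runs out first (assumed fallback: drop the prompt).
    """
    outcomes: List[bool] = []
    while len(outcomes) < max_rollouts:
        outcomes.extend(run_rollout() for _ in range(batch_size))
        wins = sum(outcomes)
        losses = len(outcomes) - wins
        if wins >= min_wins and losses >= min_losses:
            return outcomes  # stop early: the group already carries contrast
    return None


# Deterministic stand-ins for a sandboxed rollout (True = task solved).
scripted = iter([True, True, True, False, True, False, True, True])
medium_prompt = balanced_rollouts(lambda: next(scripted))
too_easy_prompt = balanced_rollouts(lambda: True)
```

Here the medium-difficulty prompt stops after six rollouts with four wins and two losses, while the always-solved prompt burns its sixteen-rollout budget and is dropped.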

13:18Bella: And the two combine. Credit-assignment SFT means you're using your failed trajectories for partial supervision; BAR means your RL rollouts aren't wasted on prompts that are too easy or too hard. The team runs the full two-stage recipe, SFT first and then RL, and pulls a 30-billion / 3-billion-active model up to 67.5 percent.

13:41Eric: Which is where we should come back to the harness question, because the number 67.5 means nothing without it.

13:48Bella: Yeah. Take it.

13:49Eric: So this is the rhetorical center of the paper. Most published agent results are reported as a single number on a single benchmark under a single harness. The implicit claim is that this number reflects the model's underlying agentic capability. The Orchard team runs a deliberately mean experiment: take competing open-source agents, evaluate them not just under their native harness but under a different one, and under a third one that nobody trained on. The results are, frankly, embarrassing for the field.

14:25Bella: How embarrassing?

14:27Eric: One competing system scores 64 percent on its native harness. Under any other harness, it produces what the paper politely calls "malformed output," meaning, in practice, the model is emitting tool calls that don't even parse. It doesn't score lower; it doesn't score at all. Another scores 62.4 on its native harness, holds at 54.9 on a related one (that's reasonable), and then drops to 3.6 percent on a third harness that nobody trained on. Three point six.

15:05Bella: That's the number from the cold open.

15:07Eric: That's the number from the cold open. And Orchard-SWE, by contrast, scores 64.3 on its native harness, 62.1 on the second, and 45 on the unseen harness. It drops, but it doesn't fall off a cliff. The driver-on-one-road analogy is the right one here. A driver who has only ever driven one specific car on one specific road can look great until you put them in a different car. The cross-harness test is asking whether the agent ever knew how to drive, or just how to operate that one car. And by that test, most open-source software-engineering agents have not learned how to drive.
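
One way to read those numbers side by side is as retention: the fraction of the native-harness score that survives on the unseen harness. The scores below are the ones quoted in the episode; the `retention` helper and the dictionary labels are made up for this sketch.

```python
from typing import Dict

# Scores (percent) as quoted in the episode: native, related, and unseen harness.
scores: Dict[str, Dict[str, float]] = {
    "single_harness_baseline": {"native": 62.4, "related": 54.9, "unseen": 3.6},
    "orchard_swe": {"native": 64.3, "related": 62.1, "unseen": 45.0},
}


def retention(system: str, harness: str) -> float:
    """Fraction of the native-harness score that survives under another harness."""
    return scores[system][harness] / scores[system]["native"]


for name in scores:
    print(f"{name}: keeps {retention(name, 'unseen'):.0%} on the unseen harness")
```

By this measure the baseline keeps about 6 percent of its native score on the unseen harness, while Orchard's software-engineering agent keeps about 70 percent: a real drop, but not a collapse.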

15:47Bella: And the structural reason Orchard does better is exactly the architecture we walked through at the start. Because the environment service is harness-agnostic, the team could collect training trajectories under two different harnesses: OpenHands, which is a feature-rich platform, and a second one, which is a tiny three-tool shell harness. Two completely different bodies. Same brain learning to work inside both of them. And what the model learned isn't "how to use OpenHands"; it's something closer to a general skill at navigating a code repository, which then transfers to a third harness it never saw.

16:30Eric: And we should be careful here, because this is the place the steelman bites hardest. The cross-harness comparison isn't perfectly clean. Orchard was trained on two harnesses; the agent that collapsed was trained on one. So part of the generalization gap could just be "Orchard saw more harness diversity in training," which is a recipe choice, not a deep property of the architecture. The paper basically acknowledges this: its recommendation is multi-harness training, full stop. But a stricter experiment would hold the number of training harnesses constant and isolate the architectural contribution. As reported, those two things are confounded.

17:14Bella: That's a fair caveat, and it's a kind of caveat the paper itself partly invites. But even with that confound, the practical takeaway is real: when you look at a leaderboard number for an open-source agent in early twenty-twenty-six, you should be asking how that agent performs under a different harness. And in most cases, the honest answer is "we have no idea, and the paper that produced this number has no idea either."

17:43Eric: Which is, in its quiet way, a fairly damning thing to say about the state of the field.

17:49Bella: Alright. Let me shift gears, because the SWE story is the bulk of the empirical case, but the browser result is where the paper produces its most surprising finding. And I think, Eric, this one genuinely deserves a moment of "wait, really?"

18:05Eric: Lay it out.

18:07Bella: This one is a browser agent. Vision-language model: it looks at screenshots of web pages and decides what to click, what to type, what to scroll to. The model has four billion parameters. It was trained on twenty-six hundred tasks. That's it. Twenty-six hundred. And on a suite of three browser benchmarks, it averages 68.4 percent, which beats the 235-billion-parameter teacher model that generated its training data.

18:36Eric: A 4B student beating a 235B teacher.

18:38Bella: By more than seven points. The teacher, evaluated on the same benchmarks, comes in around 61 percent. The student averages 68.

18:49Eric: Okay. We need to slow down here, because the natural reaction is "that can't be right." So how do you square it?

18:56Bella: The framing the paper offers, and I think it's roughly correct, is that imitation and environment-grounded RL teach different things. Imagine a tennis coach who used to be a pro. Technically excellent, can demonstrate any shot. Their student is a kid who plays actual matches against actual opponents every day. After a year, the kid might beat the coach, not because the kid is more talented, but because the kid has been pressure-tested against reality while the coach has been demonstrating in practice. Imitation gives you the form. Outcomes give you the calibration.

19:34Eric: And in the browser case, the student is doing actual rollouts in actual browsers: clicking real buttons on real websites, getting real feedback about whether the click did what was intended. The teacher just demonstrates. There's a layer of grounding the student has access to that the teacher's training data didn't.

19:54Bella: Right. Now, you should poke at that framing, because the analogy isn't airtight.

19:59Eric: It's not. The 235B teacher is itself a strong agent in its own right; it's not just a demonstrator. And it's not even cleanly stated that the teacher's reported 61 percent is measured under exactly the same conditions as the student. It's plausible that some of the gap is harness fit, not student-exceeds-teacher in any deep sense. The paper doesn't explore that carefully, and it should have.

20:26Bella: But here's the part of the story that I find delightful, completely independent of the headline number. When the team built the training dataset, they generated lots of attempts at lots of browser tasks. Out of the training pool, just under five thousand tasks were ones where the teacher failed on all four attempts they ran. They went and looked at those failures. Forty-one percent of them, over two thousand tasks, were failures because the website served a captcha every single time and the teacher couldn't get past it.

21:00Eric: So thirteen percent of the teacher's task failures across the training set were anti-bot defenses.

21:07Bella: Thirteen percent. Not the agent's fault. Not really an agentic failure at all. Just the modern web doing what the modern web does to anything that looks like it might be a robot. Which, fairly, the agent is.

21:20Eric: That's a great texture. And it raises a broader concern about how we measure these things — your training signal isn't telling you what you think it's telling you if a substantial chunk of the failures are captchas.

21:34Bella: The third domain, Orchard-Claw, which is a personal-assistant tool-use agent, I'll mention briefly because the headline is honestly more compact. They train it on two hundred synthetic tasks. Two hundred. And it outperforms 30-billion-parameter baselines on its eval. The interesting bit is an inference-time twist on the same architectural story: when you pair the same Orchard-Claw model with a different harness it wasn't trained on (they call it ZeroClaw), the success rate jumps by more than fourteen points. The harness layer the model wasn't trained against actually unlocks more of the underlying capability. So it's a third data point for the same architectural thesis, not a separate story.

22:20Eric: Right. So pulling all of this together, the paper is making one structural argument with three empirical instantiations. The argument is: the environment layer is not plumbing. The way you draw the boundary between training infrastructure and agent harness determines what becomes reusable, what generalizes, and what's just memorization of one specific software stack. And the demonstration is that when you draw that boundary in the right place, the same recipe works across software engineering, web browsing, and personal-assistant tool use.

22:55Bella: And that's the real contribution, I think. The 67.5 percent on SWE-bench Verified is a nice number. The 4B-beats-235B is a fun surprise. But the durable idea is the layering, and the cross-harness experiment that proves the layering matters.

23:12Eric: I want to spend a moment on the honest critiques, because the paper has some that the authors themselves flag and some they don't quite.

23:24Bella: Go ahead.

23:25Eric: The authors are direct about the harness generalization being a partial result. Orchard improves on it; it doesn't solve it. The drop to 45 percent on the unseen harness is still a substantial drop. So this is progress in a direction, not arrival at the destination. They're also honest that when they run RL on top of a heavily SFT-saturated checkpoint, the in-distribution score keeps going up but the cross-harness transfer slightly regresses. That's a real warning. It means the headline 67.5 percent may be more brittle on distribution shift than it looks. Heavy SFT plus RL is a recipe that produces specialization, and specialization can hide as capability.

24:15Bella: That's the one I want listeners to hold onto. A leaderboard number can be real and brittle at the same time.

24:24Eric: Two things I think the paper undersells. First, the headline RL gain — going from 64 to 67 percent — is achieved by filtering the RL task pool to tasks of intermediate difficulty, where the model is partially competent but not yet reliable. That's a sensible choice for learning signal, but it means the gain is measured on the full benchmark while being optimized on a curated subset. The framing slightly understates that this is targeted improvement, not broad lift. Three points is real. The shape of those three points is narrower than the prose suggests.

25:08Bella: And the second?

25:09Eric: The whole open-source training stack here is sitting on top of teacher models that happen to be open: M2.5, 3.5 397B. If those models hadn't been released, none of this recipe would work, because there'd be no teacher to distill from. The paper doesn't really sit with that dependence. The open-source agent ecosystem is bootstrapping itself on a small number of very capable open teachers, and that foundation is more fragile than the leaderboard arms race makes it look. A change in release strategy from any of two or three labs would meaningfully constrain what work like this can do.

25:50Bella: That's a real point, and it's the one I'd want a careful reader to leave with. The infrastructure thesis is durable — environment as a thin service is going to outlast any specific recipe. But the recipe's economics depend on a small set of open frontier teachers staying open. And there's no law of nature that says they will.

26:11Eric: There isn't. And it's the kind of thing where you only notice the dependence when it breaks.

26:17Bella: Okay, let me try to land this. What's the takeaway?

26:21Eric: For me, two things. One: the cross-harness collapse is the most important diagnostic finding in open-source agent research this year, and it should change how people read leaderboards. When you see "this model scores X on SWE-bench," the honest follow-up question is "under which harness, and does it generalize?" And the field hasn't been asking that.

26:44Bella: And two: the idea that failed trajectories are not garbage. There's training signal hiding in the parts of a failed attempt that were actually going well, and you can extract it with a small amount of hindsight from the same teacher that produced the failure. That's the kind of idea that's going to travel beyond this paper. It's almost too clean not to.

27:08Eric: And the architectural argument, which is the one I'll carry out of the episode: plumbing is policy. The places you draw boundaries in a software stack determine what becomes composable and what becomes a fortress. The agent training world has been building fortresses, and Orchard is showing what happens when you draw the boundary in the right place instead.

27:29Bella: That's a good place to leave it. Paper's linked in the show notes, along with some further reading if you want to go deeper on agent training. Thanks for listening to AI Papers: A Deep Dive.