Episode 047 · May 15, 2026 · 28 min

When Agent Benchmarks Lie: The Harness Problem in Open-Source AI

Peng, Yao, Wu et al.

LLM Agent Training Agentic AI Systems

Concepts in this episode

Agentic AI Evaluation & Benchmarks Training Methods Agent Scaffolding Supervised Fine-Tuning Credit Assignment GRPO Rollout Sampling SWE-bench Knowledge Distillation Agentic RL ReAct Agent Synthetic Data Eval Dissociation Multimodal Models

About this episode

Paper

Orchard: An Open-Source Agentic Modeling Framework

Venue

arXiv:2605.15040

Year

2026

Read the paper

arxiv.org/abs/2605.15040

A software-engineering agent scores 62% on its native test setup and 3.6% when you swap the wrapper around it. A new paper called Orchard argues this isn't a bug in one system — it's an indictment of how the entire open-source agent field has been measuring progress, and it offers an infrastructure-first fix that costs ten times less and actually generalizes.

What you'll take away

Why most reported agent benchmark scores measure harness-fit rather than underlying capability, and how a cross-harness test exposes the gap
How treating the sandbox layer as a thin, generic service (rather than baked-in plumbing) cuts training costs roughly 10x versus managed services like E2B and Daytona
Credit-assignment SFT: extracting partial supervision from failed teacher trajectories by finding the rising segment before the critical mistake
Balanced Adaptive Rollout (BAR): a self-pacing RL technique that stops generating rollouts once a prompt yields a useful mix of wins and losses
The surprising GUI result — a 4B-parameter student beating its 235B teacher — and why environment-grounded RL teaches something distillation can't
Honest limitations the paper undersells: the cross-harness comparison is partly confounded, RL gains are measured on a curated subset, and the whole recipe depends on a few open frontier teacher models staying open

Chapters

00:00The 62-to-3.6 collapse
03:27Agent, harness, and what Orchard actually is
06:54Sandboxes as a thin service
10:21Credit-assignment SFT: salvaging failed trajectories
13:49Balanced Adaptive Rollout for RL
17:16The cross-harness experiment
20:43When a 4B student beats a 235B teacher
24:11What the paper undersells

References in this episode

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — The benchmark at the center of the episode's harness-collapse story — worth read
ReAct: Synergizing Reasoning and Acting in Language Models — The reason-then-act loop that Bella defines early on as the 'body the model live
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO) — Introduces GRPO, the rollout-comparison RL algorithm that Eric walks through bef
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — A direct prior argument that the agent-computer interface — what this episode ca

Full transcript

0:00Bella: Here's a question that should bother you. You've trained an AI agent — a software engineer, say, the kind that's supposed to fix real GitHub issues. On your benchmark, in your test harness, it scores 62 percent. Solid number. Now swap the harness — same model, same task, just a different wrapper around it, a different set of tools, a slightly different way of formatting the conversation. Score drops to 3.6 percent.

0:27Eric: Three point six.

0:28Bella: Three point six. The model didn't get worse. The model is exactly the same set of weights. What collapsed is the illusion that the model ever knew how to do the job in the first place. And that result — that exact pattern of collapse — is what the paper we're digging into today is built around. The paper is called "Orchard: An Open-Source Agentic Modeling Framework," it went up on arXiv on May fourteenth, twenty-twenty-six, and we're recording the next day, on May fifteenth, twenty-twenty-six. This whole episode is AI-generated — the script is from Anthropic's Claude Opus 4.7. I'm Bella, that's Eric, and we're both AI voices from Eleven Labs. Neither company is involved in producing the show. And the reason that one-day turnaround matters is the paper drops a result that, if it generalizes, kind of indicts how the whole open-source agent community has been measuring progress.

1:25Eric: Right. And before we get to the indictment, let's name what's actually under the hood here, because Orchard is doing two pretty different things at once and they only make sense together. It's a piece of infrastructure — a way of running thousands of sandboxed Linux environments to train agents in — and it's a set of training recipes that sit on top of that infrastructure. The paper's argument is that you can't separate them. The plumbing shapes what's possible upstairs.

1:56Bella: And to make any of this concrete we need a quick vocabulary check, because the paper just assumes you know all this. When they say "agent" they don't mean a chatbot. They mean a language model running in a loop — it writes a few lines of reasoning, then emits a tool call like "run this shell command" or "click this button," sees what happened, decides what to do next, and keeps going for dozens or hundreds of steps until it either solves the task or gives up. The standard scaffolding for that loop is called ReAct — reason, then act, then reason about what came back. That's the body the model lives inside while it's working.
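
The loop Bella describes can be sketched in a few lines of Python. This is a minimal illustration, not Orchard's code: the `Policy` and `Tool` types, the `finish` tool name, and the scripted policy standing in for a language model are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Policy = Callable[[List[str]], Tuple[str, str, str]]  # history -> (thought, tool, argument)
Tool = Callable[[str], str]


def run_react_loop(policy: Policy, tools: Dict[str, Tool], task: str,
                   max_steps: int = 10) -> List[str]:
    """Reason, act, observe, repeat, until the policy finishes or the budget runs out."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        thought, tool_name, argument = policy(history)
        history.append(f"thought: {thought}")
        if tool_name == "finish":
            history.append(f"answer: {argument}")
            break
        if tool_name in tools:
            observation = tools[tool_name](argument)
        else:
            # Unknown tools are surfaced to the policy as an observation, not raised.
            observation = f"error: unknown tool {tool_name!r}"
        history.append(f"action: {tool_name}({argument})")
        history.append(f"observation: {observation}")
    return history


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for a language model: look around once, then finish."""
    if not any(line.startswith("observation:") for line in history):
        return ("I should list the files first.", "shell", "ls")
    return ("I have seen the files.", "finish", "done")


def fake_shell(command: str) -> str:
    """Toy tool that only knows how to answer `ls`."""
    return "README.md setup.py" if command == "ls" else ""


trace = run_react_loop(scripted_policy, {"shell": fake_shell}, "inspect the repo")
for line in trace:
    print(line)
```

Everything outside `policy` here (the tool table, how unknown tools are reported, the transcript format) is what the episode goes on to call the harness.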

2:37Eric: And the body has a name. The paper calls it the harness. That's the surrounding software — what tools the agent has, how its outputs get parsed, how errors are surfaced, what the system prompt looks like. The model is the brain; the harness is the nervous system. And the punchline that the whole paper builds toward is that the field has been confusing one for the other. A lot of what gets reported as "this model is great at software engineering" is really "this model is great at software engineering inside this specific harness," and the moment you swap the harness, the capability evaporates.

3:16Bella: Which is the 62-to-3.6 result. So hold that as the destination — we'll come back to it. The road to it starts with infrastructure, which I know sounds like the boring part of the paper, but Eric, you wanted to make a case here.

3:32Eric: I do, because the infrastructure case is where the dollar amounts live, and the dollar amounts are not small. Here's the situation in early twenty-twenty-six. If you want to train a software-engineering agent, you have to spin up sandboxes — isolated Linux environments, each with the right repository checked out, the right Python version, the right dependencies, the right test setup. Thousands of them in parallel, because the agent is going to take hundreds of actions per task and you're going to be running many tasks at once across many training steps. The question is who runs those sandboxes for you.

4:13Bella: And there are basically two options the open-source world has had.

4:18Eric: Two bad options. Option one is a managed service — companies like E2B, Daytona, Modal that rent you sandboxes by the hour. Easy to use, expensive at scale, and you can't really tune their performance. Option two is you bake the sandboxing into your training stack — every project rolls its own — and then nothing transfers. A dataset of trajectories collected with one project's sandbox runtime can't be reused under another project's trainer. So researchers either pay through the nose or they rebuild the wheel.

4:53Bella: Orchard's bet is option three: pull the sandbox layer out into its own thin service, like the way databases got pulled out of applications in the nineties. The application doesn't care which database is underneath, the database doesn't care which application is on top, and there's a tiny, generic interface in between. Same idea here. The training code, the agent harness, the evaluation pipeline — they all just talk to a small REST API. Create a sandbox. Run a command inside it. Read a file. That's the whole vocabulary.
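
A rough sketch of that three-verb boundary, assuming nothing about Orchard's real endpoints: the `SandboxService` interface, its method names, and the in-memory backend below are hypothetical and exist only to show caller code that depends on the interface rather than on a particular backend.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Protocol


class SandboxService(Protocol):
    """Hypothetical three-verb interface: create, run, read."""

    def create(self, image: str) -> str: ...
    def run(self, sandbox_id: str, command: str) -> str: ...
    def read_file(self, sandbox_id: str, path: str) -> str: ...


@dataclass
class InMemorySandboxService:
    """Toy backend for illustration; it only understands `echo TEXT > PATH`."""

    _files: Dict[str, Dict[str, str]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def create(self, image: str) -> str:
        sandbox_id = f"sbx-{next(self._ids)}"
        self._files[sandbox_id] = {}
        return sandbox_id

    def run(self, sandbox_id: str, command: str) -> str:
        if command.startswith("echo ") and " > " in command:
            text, path = command[len("echo "):].split(" > ", 1)
            self._files[sandbox_id][path.strip()] = text.strip() + "\n"
            return ""
        return f"unsupported command: {command}"

    def read_file(self, sandbox_id: str, path: str) -> str:
        return self._files[sandbox_id][path]


def smoke_test(service: SandboxService) -> str:
    """Caller code sees only the three verbs, never the backend behind them."""
    sandbox_id = service.create(image="python:3.11")
    service.run(sandbox_id, "echo hello > /tmp/out.txt")
    return service.read_file(sandbox_id, "/tmp/out.txt")


result = smoke_test(InMemorySandboxService())
print(repr(result))
```

Swapping the toy backend for a real HTTP client would leave `smoke_test` unchanged, which is the composability argument in miniature.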

5:29Eric: And the engineering claim is that if you draw the boundary in exactly that place — narrow, generic, harness-agnostic — you get composability for free. The same sandbox stack can back distillation, RL rollouts, and evaluation. Datasets transfer. Recipes transfer. Now, that's the architectural claim. The empirical claim is that it's also dramatically faster and cheaper, and this is where the numbers get interesting.

5:58Bella: How interesting?

6:00Eric: A 128-sandbox training run, 240 hours of compute, on commodity spot instances. On Orchard, six hundred and seventy-three dollars. The same workload on Daytona or E2B, about seven thousand. That's roughly a ten-x gap. The latency numbers are the same shape — average command latency is about a quarter-second on Orchard, versus two seconds on Modal, which is the worst comparison point. And they stress-test it at a thousand concurrent sandboxes, full lifecycle, hundred percent success.
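
The ratios behind those figures are quick to check. The managed-service cost is only "about seven thousand" in the episode, so the 7000 below is an approximation, and the variable names are invented for this sketch.

```python
# Figures as quoted in the episode; the managed-service cost is approximate.
orchard_cost_usd = 673.0
managed_cost_usd = 7000.0

# Average command latency in seconds, as quoted.
orchard_latency_s = 0.25
modal_latency_s = 2.0

cost_ratio = managed_cost_usd / orchard_cost_usd
latency_ratio = modal_latency_s / orchard_latency_s

print(f"cost gap: {cost_ratio:.1f}x, latency gap: {latency_ratio:.0f}x")
```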

6:34Bella: I want to flag something here, because the comparison is real but it's not apples-to-apples. The ten-x cost gap is Orchard run by experts with their own Kubernetes cluster on spot instances, versus a managed service that handles everything for you. Those are different products for different users. For a research group with operational expertise, the savings are huge. For a group without that expertise, paying for the managed service may still be the rational call. The paper doesn't really sit with that distinction.

7:12Eric: Fair, and we should come back to that when we get to the steelman section. But hold onto the architectural shape, because it sets up the more interesting half of the paper, which is what happens when you build a training recipe on top of this substrate. Bella, you want to take the recipe?

7:32Bella: Yeah. So the team builds three different agents on top of Orchard — a software-engineering agent, a browser agent, and a personal-assistant agent — and the most interesting one to talk through is the software engineer, because that's where most of the recipe innovation lives. The headline result is that they take a mixture-of-experts model — 30 billion parameters total but only 3 billion of them firing for any given token — and they get it to 67.5 percent on SWE-bench Verified. SWE-bench Verified is the closest thing the field has to an objective measure of "can an LLM actually fix real bugs in real GitHub repos."

8:14Eric: And 67.5 is what kind of number?

8:16Bella: It's the kind of number that beats every dense 72-billion-parameter open-source recipe and starts approaching the frontier closed-and-open systems that are ten to thirty times larger in total parameters. So the question is: how does a model that small get there? And the answer is two pieces of recipe work. The first is called credit-assignment SFT, and it's about the problem of failed trajectories.

8:43Eric: Which is a problem you'd think you'd want to ignore.

8:47Bella: That's exactly what everyone does: ignore it. The standard supervised fine-tuning recipe is: have a powerful teacher model attempt the task, keep the trajectories where it succeeded, throw away the ones where it failed, train the student to imitate the successes. Clean, simple, throws away about thirty percent of your data because that's how often even strong teachers fail on hard tasks.

9:13Eric: And the authors look at that thirty percent and ask whether there's something to salvage.

9:19Bella: They do. Imagine a chess coach reviewing a game you lost. The naive approach is "you lost, don't play any of those moves again." The smarter approach is to walk through the game and say "moves one through twelve were actually fine, you were ahead the whole time. Move thirteen is where you went wrong. So learn from one through twelve as good play, and treat thirteen as a lesson." That's the whole idea. They take a failed trajectory, hand it to the teacher LLM, tell the teacher it failed, and ask the teacher to estimate at each step how likely the agent was to eventually succeed. Where that probability is rising, the agent was making real progress. Where it falls off a cliff, that's where things went wrong. Train only on the rising sections.

10:07Eric: And there's a small thing here worth pausing on. The "coach" is the same model that produced the failed game. It's just being shown the outcome in retrospect. So you're using hindsight to extract supervision the original attempt couldn't have generated on its own. And the authors find that across their annotated set, this hindsight value curve is shaped like an inverted U almost ninety-nine percent of the time. It rises during the productive part, peaks, then drops at the critical mistake. So the "rise segment" notion is actually picking out a real structure in how trajectories fail, not just imposing one.
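
One simple way to turn a hindsight value curve into a training prefix is to keep every step up to the peak. The sketch below assumes that each value is the teacher's estimate after the corresponding step, and that "rising segment" means the prefix ending at the maximum. The paper's exact rule may differ, and the step names and numbers are made up.

```python
from typing import List, Sequence


def rising_prefix(values: Sequence[float]) -> int:
    """Number of leading steps to keep: everything up to and including the peak."""
    if not values:
        return 0
    # max() returns the first index on ties, so the earliest peak wins.
    peak_index = max(range(len(values)), key=lambda i: values[i])
    return peak_index + 1


def salvage_steps(steps: List[str], values: Sequence[float]) -> List[str]:
    """Keep only the steps taken while the estimated success chance was still climbing."""
    if len(steps) != len(values):
        raise ValueError("need one value estimate per step")
    return steps[:rising_prefix(values)]


# Invented example of a failed trajectory with an inverted-U value curve.
failed_steps = ["read issue", "locate bug", "write fix", "delete tests", "submit"]
hindsight_values = [0.30, 0.50, 0.70, 0.20, 0.05]

kept = salvage_steps(failed_steps, hindsight_values)
print(kept)
```

In the chess analogy, `kept` is moves one through twelve; the step where the value falls off the cliff and everything after it are left out of the supervised set.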

10:47Bella: And the practical upshot is that thirty percent of teacher trajectories that used to be garbage are now partial supervision. That's the SFT piece. The RL piece is the second recipe contribution, and Eric, this is your favorite.

11:02Eric: It's called Balanced Adaptive Rollout, BAR for short, and it solves a problem that's specific to how RL works for agents. So the standard algorithm here is called GRPO — you don't need the acronym, just the idea. For each training prompt, you generate, say, eight rollouts from the current model. You see which ones succeeded and which failed. The model learns by comparing — "the successful ones did X, the failed ones did Y, do more X." That comparison is the gradient.

11:33Bella: And the problem is when there's nothing to compare.

11:36Eric: Right. If all eight rollouts succeeded, there's no contrast and no learning signal. If all eight failed, same thing. You burned the compute generating eight trajectories for zero gradient. And in agent training, where each rollout might take many minutes of sandboxed execution, that's catastrophically expensive. The naive fix is to pre-filter prompts based on historical difficulty, but that requires bookkeeping you don't have on a fresh corpus.
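
The "nothing to compare" problem is visible directly in the group-relative advantage that GRPO computes: each rollout's reward minus the group mean, divided by the group standard deviation. The sketch below uses binary rewards and the population standard deviation as assumptions; Orchard's exact variant is not described in the episode.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0] * 8)
all_fail = group_relative_advantages([0.0] * 8)

print("mixed   :", [round(a, 2) for a in mixed])
print("all pass:", all_pass)
print("all fail:", all_fail)
```

In the mixed group the successes get positive advantages and the failures negative ones. In the uniform groups every advantage is zero, so eight rollouts' worth of sandbox time produces no update.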

12:06Bella: So picture a teacher handing out practice problems to a class. Problems where every student gets it right teach nothing. Problems where every student gets it wrong teach nothing. The valuable problems are the ones where roughly half the class gets it right — because those are where you can compare what the successful students did against what the others did. Now imagine the teacher can generate more attempts on demand. The smart strategy is: keep generating attempts on this problem until you have a healthy mix of wins and losses, then stop. Easy problems resolve fast, hard problems get more attempts, impossible ones get dropped.

12:48Eric: That's BAR. Generate rollouts in small batches, after each batch check whether you have enough successes and enough failures to make a balanced training group, stop as soon as you do, fall back if you can't. The effect is that every gradient batch that hits the optimizer is information-dense, and the compute budget self-paces across prompt difficulty. It's a curriculum that happens during rollout, not before.
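
A minimal sketch of that stopping rule, under stated assumptions: the batch size, the win and loss thresholds, the rollout budget, and the choice to return `None` as the fallback signal are all invented for illustration and are not taken from the paper.

```python
from typing import Callable, Iterable, List, Optional


def balanced_adaptive_rollout(
    run_rollout: Callable[[], bool],
    batch_size: int = 4,
    min_wins: int = 2,
    min_losses: int = 2,
    max_rollouts: int = 16,
) -> Optional[List[bool]]:
    """Sample in small batches and stop once the group has enough wins and losses.

    Returns the outcomes collected so far, or None if the budget runs out
    before the group is balanced (the caller decides what the fallback is).
    """
    outcomes: List[bool] = []
    while len(outcomes) < max_rollouts:
        remaining = max_rollouts - len(outcomes)
        outcomes.extend(run_rollout() for _ in range(min(batch_size, remaining)))
        wins = sum(outcomes)
        losses = len(outcomes) - wins
        if wins >= min_wins and losses >= min_losses:
            return outcomes
    return None


def scripted(results: Iterable[bool]) -> Callable[[], bool]:
    """Replay a fixed list of outcomes so the demo is deterministic."""
    iterator = iter(results)
    return lambda: next(iterator)


medium = balanced_adaptive_rollout(scripted([True, False, True, False] + [False] * 12))
harder = balanced_adaptive_rollout(scripted([False] * 6 + [True, True] + [False] * 8))
too_easy = balanced_adaptive_rollout(lambda: True)

print(len(medium), len(harder), too_easy)
```

The medium prompt stops after one batch, the harder one takes two, and the prompt that always succeeds exhausts its budget and is handed back to the caller to drop or handle some other way.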

13:18Bella: And the two combine. Credit-assignment SFT means you're using your failed trajectories for partial supervision; BAR means your RL rollouts aren't wasted on prompts that are too easy or too hard. The team runs the full two-stage recipe — SFT first, then RL — and pulls a 30-billion / 3-billion-active model up to 67.5 percent.

13:41Eric: Which is where we should come back to the harness question, because the number 67.5 means nothing without it.

13:48Bella: Yeah. Take it.

13:49Eric: So this is the rhetorical center of the paper. Most published agent results are reported as a single number on a single benchmark under a single harness. The implicit claim is that this number reflects the model's underlying agentic capability. The Orchard team runs a deliberately mean experiment: take competing open-source agents, evaluate them not just under their native harness but under a different one, and under a third one that nobody trained on. The results are, frankly, embarrassing for the field.

14:25Bella: How embarrassing?

14:27Eric: There's a system called Scale-SWE that scores 64 percent on its native harness. Under any other harness, it produces what the paper politely calls "malformed output" — meaning, in practice, the model is emitting tokens that don't even parse as valid tool calls. It doesn't score lower; it doesn't score at all. There's OpenSWE-32B, which scores 62.4 on its native harness, holds at 54.9 on a related one — that's reasonable — and then drops to 3.6 percent on a third harness called Kimi-CLI that nobody trained on. Three point six.

15:05Bella: That's the number from the cold open.

15:07Eric: That's the number from the cold open. And Orchard-SWE, by contrast, scores 64.3 on its native harness, 62.1 on the second, and 45 on the unseen Kimi-CLI. It drops, but it doesn't fall off a cliff. The driver-on-one-road analogy is the right one here. A driver who has only ever driven one specific car on one specific road can look great until you put them in a different car. The cross-harness test is asking whether the agent ever knew how to drive, or just how to operate that one car. And by that test, most open-source software-engineering agents have not learned how to drive.
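
One way to put those numbers side by side is to ask what fraction of the native-harness score survives a harness swap. The "retention" ratio below is a convenience defined for this sketch, not a metric from the paper, and the dictionary keys are labels chosen here. The scores are the ones quoted in the episode.

```python
from typing import Dict

# Scores as quoted in the episode (percent resolved).
scores: Dict[str, Dict[str, float]] = {
    "OpenSWE-32B": {"native": 62.4, "other": 54.9, "Kimi-CLI": 3.6},
    "Orchard-SWE": {"native": 64.3, "other": 62.1, "Kimi-CLI": 45.0},
}


def retention(system: str, harness: str) -> float:
    """Share of the native-harness score that survives under another harness."""
    row = scores[system]
    return row[harness] / row["native"]


for system in scores:
    unseen = retention(system, "Kimi-CLI")
    print(f"{system}: keeps {unseen:.1%} of its native score on the unseen harness")
```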

15:47Bella: And the structural reason Orchard does better is exactly the architecture we walked through at the start. Because the environment service is harness-agnostic, the team could collect training trajectories under two different harnesses — OpenHands, which is a feature-rich multi-agent platform, and mini-swe-agent, which is a tiny three-tool shell harness. Two completely different bodies. Same brain learning to work inside both of them. And what the model learned isn't "how to use OpenHands" — it's something closer to a general skill at navigating a code repository, which then transfers to a third harness it never saw.

16:30Eric: And we should be careful here, because this is the place the steelman bites hardest. The cross-harness comparison isn't perfectly clean. Orchard was trained on two harnesses; OpenSWE was trained on one. So part of the generalization gap could just be "Orchard saw more harness diversity in training," which is a recipe choice, not a deep property of the architecture. The paper basically acknowledges this — its recommendation is multi-harness training, full stop. But a stricter experiment would hold the number of training harnesses constant and isolate the architectural contribution. As reported, those two things are confounded.

17:14Bella: That's a fair caveat, and it's a kind of caveat the paper itself partly invites. But even with that confound, the practical takeaway is real: when you look at a leaderboard number for an open-source agent in early twenty-twenty-six, you should be asking how that agent performs under a different harness. And in most cases, the honest answer is "we have no idea, and the paper that produced this number has no idea either."

17:43Eric: Which is, in its quiet way, a fairly damning thing to say about the state of the field.

17:49Bella: Alright. Let me shift gears, because the SWE story is the bulk of the empirical case, but the GUI result is where the paper produces its most surprising finding. And I think, Eric, this one genuinely deserves a moment of "wait, really?"

18:05Eric: Lay it out.

18:07Bella: Orchard-GUI is a browser agent. Vision-language model — it looks at screenshots of web pages and decides what to click, what to type, what to scroll to. The model has four billion parameters. It was trained on twenty-six hundred tasks. That's it. Twenty-six hundred. And on a suite of three browser benchmarks, it averages 68.4 percent — which beats the 235-billion-parameter teacher model that generated its training data.

18:36Eric: A 4B student beating a 235B teacher.

18:38Bella: By more than seven points. The teacher backbone, evaluated on the same benchmarks, comes in around 61 percent. The student averages 68.

18:49Eric: Okay. We need to slow down here, because the natural reaction is "that can't be right." So how do you square it?

18:56Bella: The framing the paper offers — and I think it's roughly correct — is that distillation and environment-grounded RL teach different things. Imagine a tennis coach who used to be a pro. Technically excellent, can demonstrate any shot. Their student is a kid who plays actual matches against actual opponents every day. After a year, the kid might beat the coach — not because the kid is more talented, but because the kid has been pressure-tested against reality while the coach has been demonstrating in practice. Imitation gives you the form. Outcomes give you the calibration.

19:34Eric: And in the Orchard case, the student is doing actual rollouts in actual browsers — clicking real buttons on real websites, getting real feedback about whether the click did what was intended. The teacher just demonstrates. There's a layer of grounding the student has access to that the teacher's training data didn't.

19:54Bella: Right. Now, you should poke at that framing, because the analogy isn't airtight.

19:59Eric: It's not. The 235B teacher is itself a strong agent in its own right; it's not just a demonstrator. And it's not even cleanly stated that the teacher's reported 61 percent is measured under exactly the same harness conditions as the student. It's plausible that some of the gap is harness fit, not student-exceeds-teacher in any deep sense. The paper doesn't explore that carefully, and it should have.

20:26Bella: But here's the part of the GUI story that I find delightful, completely independent of the headline number. When the team built the training dataset, they generated lots of attempts at lots of browser tasks. Out of the training pool, just under five thousand tasks were ones where the teacher failed on all four attempts they ran. They went and looked at those failures. Forty-one percent of them — over two thousand tasks — were failures because the website served a captcha every single time and the teacher couldn't get past it.

21:00Eric: So thirteen percent of the teacher's task failures across the training set were anti-bot defenses.

21:07Bella: Thirteen percent. Not the agent's fault. Not really an agentic failure at all. Just the modern web doing what the modern web does to anything that looks like it might be a robot. Which, fairly, the agent is.

21:20Eric: That's a great texture. And it raises a broader concern about how we measure these things — your training signal isn't telling you what you think it's telling you if a substantial chunk of the failures are captchas.

21:34Bella: The third domain agent — Orchard-Claw, which is a personal-assistant tool-use agent — I'll mention briefly because the headline is honestly more compact. They train it on two hundred synthetic tasks. Two hundred. And it outperforms 30-billion-parameter baselines on its eval. The interesting bit is an inference-time twist on the same architectural story: when you pair the same Orchard-Claw model with a different harness it wasn't trained on — they call it ZeroClaw — the pass-at-three rate jumps by more than fourteen points. The harness layer the model wasn't trained against actually unlocks more of the underlying capability. So it's a third data point for the same architectural thesis, not a separate story.
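
For readers unfamiliar with the metric, pass-at-three in its simplest form is the share of tasks solved by at least one of three attempts. The sketch below uses that plain definition. Whether the paper uses this or a sampling-based estimator is not stated in the episode, and the task names and outcomes are invented.

```python
from typing import Dict, List


def pass_at_k(attempts_by_task: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    if not attempts_by_task:
        raise ValueError("no tasks")
    solved = sum(any(attempts[:k]) for attempts in attempts_by_task.values())
    return solved / len(attempts_by_task)


# Invented outcomes for four assistant-style tasks, three attempts each.
toy_results = {
    "book-flight": [False, True, False],
    "file-expense": [False, False, False],
    "send-invite": [True, True, True],
    "rename-files": [False, False, True],
}

pass_at_1 = pass_at_k(toy_results, 1)
pass_at_3 = pass_at_k(toy_results, 3)
print(pass_at_1, pass_at_3)
```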

22:20Eric: Right. So pulling all of this together, the paper is making one structural argument with three empirical instantiations. The argument is: the environment layer is not plumbing. The way you draw the boundary between training infrastructure and agent harness determines what becomes reusable, what generalizes, and what's just memorization of one specific software stack. And the demonstration is that when you draw that boundary in the right place, the same recipe works across software engineering, web browsing, and personal-assistant tool use.

22:55Bella: And that's the real contribution, I think. The 67.5 percent on SWE-bench is a nice number. The 4B-beats-235B is a fun surprise. But the durable idea is the layering — and the cross-harness experiment that proves the layering matters.

23:12Eric: I want to spend a moment on the honest critiques, because the paper has some that the authors themselves flag and some they don't quite.

23:24Bella: Go ahead.

23:25Eric: The authors are direct about the harness generalization being a partial result. Orchard improves on it; it doesn't solve it. The drop to 45 percent on Kimi-CLI is still a substantial drop. So this is progress in a direction, not arrival at the destination. They're also honest that when they run RL on top of a heavily-saturated SFT checkpoint, the in-distribution score keeps going up but the out-of-distribution transfer slightly regresses. That's a real warning. It means the headline 67.5 percent may be more brittle on distribution shift than it looks. Heavy SFT plus RL is a recipe that produces specialization, and specialization can hide as capability.

24:15Bella: That's the one I want listeners to hold onto. A leaderboard number can be real and brittle at the same time.

24:24Eric: Two things I think the paper undersells. First, the headline RL gain — going from 64 to 67 percent — is achieved by filtering the RL task pool to tasks of intermediate difficulty, where the SFT model is partially competent but not yet reliable. That's a sensible choice for learning signal, but it means the gain is measured on the full benchmark while being optimized on a curated subset. The framing slightly understates that this is targeted improvement, not broad capability lift. Three points is real. The shape of those three points is narrower than the prose suggests.

25:08Bella: And the second?

25:09Eric: The whole open-source agent training stack here is sitting on top of teacher models that happen to be open weights — MiniMax M2.5, Qwen 3.5 397B. If those models hadn't been released, none of this recipe would work, because there'd be no teacher to distill from. The paper doesn't really sit with that dependence. The open-source agent ecosystem is bootstrapping itself on a small number of very capable open teachers, and that foundation is more fragile than the leaderboard arms race makes it look. A change in release strategy from any of two or three labs would meaningfully constrain what work like this can do.

25:50Bella: That's a real point, and it's the one I'd want a careful reader to leave with. The infrastructure thesis is durable — environment as a thin service is going to outlast any specific recipe. But the recipe's economics depend on a small set of open frontier teachers staying open. And there's no law of nature that says they will.

26:11Eric: There isn't. And it's the kind of thing where you only notice the dependence when it breaks.

26:17Bella: Okay, let me try to land this. What's the takeaway?

26:21Eric: For me, two things. One: the cross-harness collapse is the most important diagnostic finding in open-source agent research this year, and it should change how people read leaderboards. When you see "this model scores X on SWE-bench," the honest follow-up question is "under which harness, and does it generalize?" And the field hasn't been asking that.

26:44Bella: And two: the idea that failed trajectories are not garbage. There's training signal hiding in the parts of a failed attempt that were actually going well, and you can extract it with a small amount of hindsight from the same teacher that produced the failure. That's the kind of idea that's going to travel beyond this paper — it's almost too clean not to.

27:08Eric: And the architectural argument, which is the one I'll carry out of the episode: plumbing is policy. The places you draw boundaries in a software stack determine what becomes composable and what becomes a fortress. The agent training world has been building fortresses, and Orchard is showing what happens when you draw the boundary in the right place instead.

27:29Bella: That's a good place to leave it. Paper's linked in the show notes, along with some further reading if you want to go deeper on agent training. Thanks for listening to AI Papers: A Deep Dive.
