Episode 121 · Jun 05, 2026 · 27 min

When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model

Chen, Wang, Liu et al.

LLM Agent Systems

Concepts in this episode

AI Agents · AI Safety · Evaluation & Benchmarks · Agent Scaffolding · Silent Failure · Trajectory Analysis · Root Cause Localization · Harness Generation · Structured Trace Formatting · Ablation Studies · Reward Hacking · Self-Correction · Agent Benchmarks

About this episode

Paper

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

Venue

arXiv:2606.06324

Year

2026

Read the paper

arxiv.org/abs/2606.06324

An AI agent confidently reports a task complete while the database shows nothing actually happened — and no prompt edit on earth can fix it. A new paper argues that for a huge share of agent failures, the model is already good enough, and the real bug lives in the deterministic scaffolding around it. The payoff: that scaffolding is just software, which means you can actually diagnose and repair it.

What you'll take away

Why many agent failures are 'silent successes' — the harness marks a task complete even though nothing changed in the world — and why benchmark scores actively hide them
How HarnessFix borrows a compiler trick (a normalized intermediate representation) to turn messy, framework-specific traces into something you can analyze uniformly
The four-stage pipeline — abstraction, diagnosis, repair, validation — and why repairs draw from a fixed, vetted catalog distilled from real repo fixes rather than letting the agent rewrite itself freely
Why a prompt-only version of the system gets zero improvement on Terminal-Bench while the full system fixes lifecycle, observability, and verification flaws prompts can't reach
The honest limitations: the system largely grades its own diagnoses, raw gains are small (six tasks to nine), results are single-run on one model, and the 'beats human harnesses' comparison isn't a clean head-to-head

Chapters

00:00 The bill-splitting disaster
03:22 Reframing the agent as model plus harness
06:44 Taming the trace
10:07 The four-stage repair pipeline
13:29 Repair by catalog, not by improvisation
16:52 Watching the pipeline fix the bill-splitter
20:14 The numbers and the prompt-only ablation
23:37 Taking the knife to it

References in this episode

ReAct: Synergizing Reasoning and Acting in Language Models — Defines the reasoning-and-acting loop that constitutes the core of the 'harness'
Reflexion: Language Agents with Verbal Reinforcement Learning — A leading example of the trace-driven self-improvement methods the episode contr
Voyager: An Open-Ended Embodied Agent with Large Language Models — Represents the self-modifying-agent lineage the paper deliberately rejects, lett
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Provides the repository bug-fixing benchmark style underlying one of the four ev

Full transcript

0:00Juniper: An AI agent gets a chore that sounds trivial. Split the cost of your last three Amazon orders four ways — you and three roommates — and send each roommate a Venmo request labeled "Amazon Purchases." Easy. The agent does exactly what you'd hope. It looks up the roommates, pulls the orders, reads the documentation for the payment system, and fires off three payment requests with the right amount and the right note.

0:27Finn: And then it tells you it's done.

0:29Juniper: It tells you it's done. The environment says "execution successful." The agent calls its finish routine. The task gets marked complete. Except — when you actually check the database afterward, zero payment requests exist. Nothing happened. Every one of those three API calls failed, because the agent left out one required field: the recipient's email. And here's the part that should make your skin crawl — the code the agent wrote caught those errors and quietly threw them away. They never reached the system that was supposed to be watching. So from the outside, everything looks green. Task complete. Money moved. And in reality, nobody got charged a cent.

1:12Finn: That's the failure that should keep agent developers up at night. Not the dramatic crash — the silent success. The metrics say a hundred percent of tasks closed, and the customers are all still waiting.

1:26Juniper: That little disaster is the opening example in a paper called "From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws," and it went up on arXiv on June fourth, twenty-twenty-six; we're recording the very next day, June fifth. Before we get into it, the disclosure: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing (I'm Juniper, and my co-host is Finn) are both AI voices from ElevenLabs. The show is produced independently; no affiliation with Anthropic or ElevenLabs. And the reason that bill-splitting story is the perfect way in is that it lands the paper's whole thesis before we've defined a single term.

2:12Finn: Right — because your instinct, watching that, is to blame the model. It must have reasoned badly. It forgot the email field, it's not smart enough. The authors say: no. Look again. The model wrote sensible-looking code. The thing that failed was everything around the model.

2:29Juniper: So let me draw the picture the paper depends on, because most people's mental model of an AI agent is too simple. We tend to think of an agent as "a chatbot that can do stuff." The authors want you to see it more precisely. The language model is just the reasoning engine sitting in the middle. Wrapped around it is a thick layer of completely ordinary software — the loop that decides whether to keep going or stop, the descriptions of which tools exist and how to call them, the code that assembles the prompt, the sandbox that runs commands, the logging, the checks that decide whether a task counts as finished.

3:08Finn: They call that wrapper the harness.

3:11Juniper: They call it the harness. And the crucial thing — the thing the whole paper hinges on — is that almost all of that harness is deterministic software. It has functions, arguments, error handling, control flow. The model is a probabilistic black box, sure. But the harness around it is just code. Which means, unlike the model, the harness is something you can actually debug.

3:35Finn: And in the bill-splitting case, you can name exactly which parts of the harness broke. The tool description didn't enforce that the email field was required. The logging swallowed the API errors instead of surfacing them. And the completion check accepted "I'm done" even though nothing in the world had changed. Three different pieces of plumbing, all failing in sequence. The model's in the clear.
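
The swallowed-error part of that cascade can be shown in a few lines. This is a minimal sketch, not code from the paper: `ApiError`, `create_payment_request`, the payload fields, and the amount are all hypothetical stand-ins for the payment API in the story.

```python
class ApiError(Exception):
    """Raised by the hypothetical payment API when a request is rejected."""


def create_payment_request(payload: dict) -> dict:
    # Hypothetical stand-in for the payment API: the recipient email is required.
    if "email" not in payload:
        raise ApiError("missing required field: email")
    return {"status": "created", **payload}


def send_requests_swallowing(payloads):
    """Anti-pattern: errors are caught and discarded, so the run looks clean."""
    for payload in payloads:
        try:
            create_payment_request(payload)
        except ApiError:
            pass  # nothing upstream ever learns that the call failed
    return "execution successful"


def send_requests_surfacing(payloads):
    """Alternative: collect every failure and hand it back to the caller."""
    created, errors = [], []
    for payload in payloads:
        try:
            created.append(create_payment_request(payload))
        except ApiError as exc:
            errors.append(str(exc))
    return {"created": created, "errors": errors}


# Three requests, each missing the email field (illustrative values).
payloads = [{"amount": 12.50, "note": "Amazon Purchases"} for _ in range(3)]
silent = send_requests_swallowing(payloads)
visible = send_requests_surfacing(payloads)
```

Both functions make the same three failing calls. Only the second one gives a logging or completion layer anything to act on.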

4:01Juniper: Think of it like a competent new hire who keeps failing at a task. Your first thought is they're not sharp enough. Then you look closer, and the form they've been given is missing a required field, the system silently discards their submissions without telling them, and the "task complete" button lights up green no matter what. The person is fine. The workplace systems around them are broken. That's the agent's situation exactly — except the agent can't even complain about its broken tools, because it has no idea they're broken.

4:37Finn: So that's the reframe, and it's genuinely against the grain. The last several years of AI have been one story: make the model better. Bigger, smarter, more reasoning. This paper is in the quieter countercurrent that says — for a huge fraction of real agent failures, the model is already good enough, and the bottleneck is the engineering scaffolding. The bug isn't in the model. It's in the scaffolding.

5:03Juniper: And the scaffolding, it turns out, is just software you can debug. That's the sentence the whole system is built to deliver on.

5:12Finn: Which sounds clean as a slogan. The hard part is the doing. Because debugging a normal program is easy in this respect — a bug lives at a specific line of code. Line forty-seven throws an error, you go look at line forty-seven. An agent failure doesn't work like that.

5:29Juniper: No, and this is the genuinely hard problem the paper is solving. When an agent runs, it leaves behind this sprawling record of everything that happened — its reasoning text, every tool call, every response from the environment, every state change. The paper calls that the trace, or the trajectory. And it is a mess. It's language-heavy, it's long, and the actual cause of the failure could be buried anywhere in it. It's not a clean stack trace pointing at one line. It's more like a transcript of a very long, very rambly meeting where somewhere in the middle, somebody made the decision that doomed the project — and you have to find it.

6:10Finn: And it gets worse, because every agent framework logs its mess differently. One calls a field "prompt tokens," another calls the same thing "input tokens." One labels the actor "agent name," another says "recipient agent type." So even if you wanted to build tooling to analyze these traces, there's no common format to build against.

6:31Juniper: Which is where the first really nice idea comes in, and it's borrowed straight from compilers. When a compiler processes your code, it doesn't work on the raw text. It first parses everything into a normalized internal structure — a clean graph of nodes with well-defined relationships — and then it analyzes and transforms that. The paper does the exact same trick to agent traces. It takes every framework's idiosyncratic log and converts it into one common structured form.

7:02Finn: An intermediate representation.

7:04Juniper: An intermediate representation — they give it a name, but the intuition is what matters. Imagine you're handed incident reports from a dozen people, all in different formats — one wrote prose, one filled out a form, one sent bullet points. Before you can compare them or analyze them, you transcribe every one into a single standardized template. Now you can ask the same uniform questions of each: what happened here, did it succeed, what did it change, and what came right before it? That's what this first stage does to the trace. Raw mess in, clean common structure out.
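
As a rough illustration of that normalization step, the sketch below maps two differently named logs onto one record type. The alias table, the snake_case key spellings, and `NormalizedEvent` are assumptions made for this example; the paper's actual intermediate representation is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical alias table: each framework-specific key maps to one canonical name.
FIELD_ALIASES = {
    "prompt_tokens": "input_tokens",
    "input_tokens": "input_tokens",
    "agent_name": "actor",
    "recipient_agent_type": "actor",
}


@dataclass(frozen=True)
class NormalizedEvent:
    actor: str
    input_tokens: int


def normalize(raw: dict) -> NormalizedEvent:
    """Rename known keys to their canonical form and drop everything else."""
    canonical = {}
    for key, value in raw.items():
        name = FIELD_ALIASES.get(key)
        if name is not None:
            canonical[name] = value
    return NormalizedEvent(
        actor=canonical["actor"],
        input_tokens=canonical["input_tokens"],
    )


# The same event as two imaginary frameworks might log it.
framework_a = {"agent_name": "planner", "prompt_tokens": 812}
framework_b = {"recipient_agent_type": "planner", "input_tokens": 812}

event_a = normalize(framework_a)
event_b = normalize(framework_b)
```

Once both logs produce equal `NormalizedEvent` values, any downstream analysis can be written once against the canonical fields.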

7:43Finn: And once it's in that common structure, each step gets tagged with three things that turn out to be the whole game. What role did this step play — was it gathering information, calling a tool, making an edit, checking for completion? Did it actually succeed or fail? And — this is the sharp one — did it change anything in the world?

8:05Juniper: That last tag is the one that catches the bill-splitting disaster. Because the finalization step in that trace did succeed, in the narrow sense that the code ran. But its "did it change anything" tag is empty. Three payment requests were supposed to exist. Zero do. The structure makes that contradiction visible in a way the raw log never did.
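
One way to picture those three tags is as fields on a step record, with a check that flags a successful completion in a trace where nothing recorded a state change. This is a sketch under assumptions: the `Role` names, the `Step` fields, and the six-step layout are illustrative, not the paper's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    GATHER = "gather"
    TOOL_CALL = "tool_call"
    EDIT = "edit"
    COMPLETION = "completion"


@dataclass
class Step:
    index: int
    role: Role
    succeeded: bool
    state_effects: list = field(default_factory=list)  # empty means nothing changed


def silent_completions(steps):
    """Return completion steps that succeeded in a trace with no state effect at all."""
    any_effect = any(step.state_effects for step in steps)
    return [
        step
        for step in steps
        if step.role is Role.COMPLETION and step.succeeded and not any_effect
    ]


# Hypothetical six-step trace: every step "ran", none changed anything.
trace = [
    Step(1, Role.GATHER, True),
    Step(2, Role.GATHER, True),
    Step(3, Role.GATHER, True),
    Step(4, Role.TOOL_CALL, True),
    Step(5, Role.TOOL_CALL, True),
    Step(6, Role.COMPLETION, True),
]
flagged = silent_completions(trace)
```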

8:29Finn: So now they've got the trace tamed into something analyzable. That's one of four stages, right? Walk through the assembly line.

8:37Juniper: Four cooperating agents, and I think the cleanest way to hold it is as a literal assembly line with named handoffs. Stage one, the one we just described — abstraction. It takes the raw trace and produces the clean structured version. Stage two is diagnosis. It works backward from the observed failure — the evaluator said this task failed — and follows the links between steps back to a ranked set of suspects, then figures out which step is actually responsible and which layer of the harness it belongs to.
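
The backward walk in stage two can be sketched as a graph traversal. The version below assumes each step lists the earlier steps it depends on and ranks suspects by distance from the failure; that ranking is a placeholder heuristic chosen for this example, not the paper's scoring, and the link structure is made up.

```python
from collections import deque


def rank_suspects(depends_on: dict, failed_step: int) -> list:
    """Breadth-first walk backward over dependency links from the failed step.

    Steps closer to the observed failure come first.
    """
    order = []
    seen = {failed_step}
    queue = deque([failed_step])
    while queue:
        step = queue.popleft()
        order.append(step)
        for parent in depends_on.get(step, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


# Hypothetical links: step 6 depends on 5, step 5 on 4, step 4 on 3 and 2, step 2 on 1.
links = {6: [5], 5: [4], 4: [3, 2], 2: [1]}
suspects = rank_suspects(links, failed_step=6)
```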

9:12Finn: And because one bad run could just be a fluke, it doesn't trust a single trace.

9:17Juniper: It doesn't. It consolidates diagnoses across many failed runs into what they call flaw records — recurring patterns. "Completion accepted without any task-relevant change" isn't a one-off; it shows up again and again, and that repetition is what makes it worth fixing. Then stage three is repair, which generates an actual code patch. And stage four is validation, which decides whether to accept that patch. Parse, diagnose, patch, validate. Four handoffs down the line.
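
Consolidation across runs can be sketched as counting how many distinct failed runs share the same layer and root cause. The field names, the layer labels, and the `min_runs` threshold below are assumptions for illustration.

```python
from collections import Counter


def consolidate(diagnoses, min_runs=2):
    """Group per-run diagnoses by (layer, root_cause) and keep the recurring ones."""
    # Deduplicate so one run cannot count twice toward the same pattern.
    unique = {(d["run_id"], d["layer"], d["root_cause"]) for d in diagnoses}
    counts = Counter((layer, cause) for _, layer, cause in unique)
    return [
        {"layer": layer, "root_cause": cause, "runs": n}
        for (layer, cause), n in counts.most_common()
        if n >= min_runs
    ]


unguarded = "completion accepted without any task-relevant change"
diagnoses = [
    {"run_id": "r1", "layer": "verification", "root_cause": unguarded},
    {"run_id": "r2", "layer": "verification", "root_cause": unguarded},
    {"run_id": "r3", "layer": "verification", "root_cause": unguarded},
    {"run_id": "r2", "layer": "tool_interface", "root_cause": "missing required argument"},
]
flaw_records = consolidate(diagnoses)
```

With the default threshold, the pattern seen in three runs survives and the one-off diagnosis is dropped.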

9:50Finn: I want to sit on stage three, the repair stage, because the design choice there is the most opinionated thing in the paper, and it's where Juniper's "it's just software" framing runs into an interesting tension. The code that's implicated is often core orchestration logic, the heart of the agent. And there's a whole school of thought, the self-modifying-agent lineage, that says: let the agent freely rewrite itself. Maximum flexibility. The paper looks at that and says, no, that's how you get architectural breakage and chaos.

10:27Juniper: So what do they do instead?

10:29Finn: A fixed menu. A catalog of allowed repair operators, organized by which layer they fix — things like "completion guarding," "loop guarding," "argument validation," "API-error logging." And here's the detail I like: those operators weren't invented by the authors out of thin air. They were distilled from the actual version history of real agent repositories — the real fixes that real maintainers committed over time when their harnesses broke in these ways.

11:01Juniper: So it's institutional memory, encoded as a menu.

11:06Finn: Exactly. And the analogy that nails it is a surgeon. A surgeon doesn't improvise a brand-new operation mid-procedure. They work from a vetted catalog of known procedures, each one specifying what may be cut and what must absolutely not be touched. That's safer than freelancing — even though it means a genuinely novel condition with no established procedure leaves you stuck. The paper makes that exact trade deliberately. The diagnosed flaw gets mapped to an operator, and the operator gets turned into what they call a repair specification — basically a contract. It says which files you're allowed to touch, what's strictly off-limits — the benchmark data, the evaluators, the test sets — what behavior the patch has to add, and the bar it has to clear to be accepted.
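
A repair specification of that kind could be modeled as a small immutable record plus a path check. Everything here is hypothetical: the `RepairSpec` fields, the glob patterns, and the directory names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class RepairSpec:
    operator: str
    allowed_paths: tuple
    forbidden_paths: tuple
    required_behavior: str

    def permits(self, path: str) -> bool:
        """A file may be edited only if it is allowed and not explicitly off-limits."""
        if any(fnmatch(path, pattern) for pattern in self.forbidden_paths):
            return False
        return any(fnmatch(path, pattern) for pattern in self.allowed_paths)


spec = RepairSpec(
    operator="completion_guarding",
    allowed_paths=("harness/*.py",),
    forbidden_paths=("benchmarks/*", "evaluators/*", "tests/*"),
    required_behavior="block finish when no task-relevant state change occurred",
)
```

Checking forbidden patterns first means an overly broad allow pattern cannot open up benchmark data or evaluators by accident.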

11:56Juniper: And that acceptance bar is the one piece of real math in the paper, though it's math you can say in a sentence. A patch is accepted only if two things are both true. One: it demonstrably makes the diagnosed failure happen less often. Two: it doesn't break more than a tolerated handful of tasks that were already working. Both thresholds are tunable.
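
Written out, the two-part rule might look like the following. The symbols are assumptions introduced here, not the paper's notation: $p$ is the candidate patch, $f$ the flaw record, $V$ the validation tasks, $\hat{r}$ the observed rate at which the flaw occurs, and $\delta$ and $\epsilon$ the two tunable thresholds.

```latex
\text{accept}(p) \iff
\underbrace{\hat{r}_{\text{before}}(f) - \hat{r}_{\text{after}}(f) \ge \delta}_{\text{the flaw occurs less often}}
\;\wedge\;
\underbrace{\bigl|\{\, t \in V : \text{pass}_{\text{before}}(t) \wedge \neg\,\text{pass}_{\text{after}}(t) \,\}\bigr| \le \epsilon}_{\text{regressions stay within tolerance}}
```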

12:18Finn: Which is just classic regression discipline made explicit. Every working software team already does the second half — you change something, you re-run your existing tests to make sure you didn't quietly break what used to work. The first half is the new wrinkle, and it's subtler than it sounds.

12:37Juniper: Say more, because this is the part I think people will skip past.

12:42Finn: So how do you count whether a flaw "still occurs" on a validation task? You don't just check whether the task passes. You re-run the agent, and you only count the flaw as present if the failure that comes back is one the diagnosis machinery attributes to the same layer and the same root cause as the original flaw record. So the success criterion is tied to the causal story, not the pass-fail number. A patch that accidentally makes a task pass for some unrelated reason doesn't get credit for fixing the flaw.
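
That counting rule can be sketched as a comparison between the flaw record and whatever diagnosis each rerun produces. The dictionary fields and the convention that `None` means the rerun passed are assumptions for this example.

```python
def flaw_still_present(flaw_record, rerun_diagnosis):
    """Count a rerun against the patch only when its failure is attributed
    to the same layer and root cause as the original flaw record."""
    if rerun_diagnosis is None:  # the rerun passed, so there is nothing to attribute
        return False
    return (
        rerun_diagnosis["layer"] == flaw_record["layer"]
        and rerun_diagnosis["root_cause"] == flaw_record["root_cause"]
    )


def flaw_rate(flaw_record, rerun_diagnoses):
    """Fraction of validation reruns in which the same flaw shows up again."""
    hits = sum(flaw_still_present(flaw_record, d) for d in rerun_diagnoses)
    return hits / len(rerun_diagnoses)


flaw = {"layer": "verification", "root_cause": "unguarded completion"}
reruns = [
    None,
    {"layer": "verification", "root_cause": "unguarded completion"},
    {"layer": "tool_interface", "root_cause": "missing argument"},
    None,
]
rate = flaw_rate(flaw, reruns)
```

The third rerun still fails, but for a different attributed cause, so it does not count toward this flaw's rate.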

13:13Juniper: Which is elegant — and also, I suspect, exactly where your skeptical antenna starts twitching.

13:20Finn: Oh, we'll get there. Hold that thought, because it's the best critique in the episode. But let's finish the happy path first.

13:27Juniper: Let's go back to the bill-splitting disaster, because now we can watch the whole pipeline actually chew on it. The trace gets abstracted into six steps. Diagnosis works backward from the evaluator's verdict — zero payment requests created — and it doesn't find one culprit. It finds three, spread across three different layers of the harness.

13:48Finn: Three separate plumbing failures stacking up.

13:51Juniper: Stacking up. The step where the agent built the request and left out the required email field — that's a tool-interface flaw. The request body was malformed and nothing caught it. The step where the API errors got swallowed and never surfaced — that's an observability flaw. The harness was blind to its own failures. And the step where the finish routine got accepted despite nothing changing in the world — that's a lifecycle and verification flaw. Three layers, one cascade.

14:21Finn: And they all consolidate into a single flaw record: completion accepted without a task-relevant state effect.

14:27Juniper: Which maps to a primary repair operator — completion guarding — plus two supporting ones for the argument validation and the error logging. And the resulting patch constrains the harness so that the finish routine gets blocked when no relevant database change has actually occurred. The "done" button can't light up green over an empty database anymore.
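
A completion guard of that kind might look like the sketch below, which uses an in-memory SQLite table as a stand-in for the app's database. The table name, the columns, the `min_new` parameter, and the example rows are hypothetical; the paper's actual patch is not shown in the episode.

```python
import sqlite3


class CompletionBlocked(Exception):
    """Raised when the agent asks to finish but the world has not changed."""


def count_requests(conn):
    return conn.execute("SELECT COUNT(*) FROM payment_requests").fetchone()[0]


def guarded_finish(conn, baseline, min_new=1):
    """Refuse to mark the task complete unless new rows exist since the baseline."""
    created = count_requests(conn) - baseline
    if created < min_new:
        raise CompletionBlocked(
            f"found {created} new payment requests, need at least {min_new}"
        )
    return "complete"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment_requests (recipient TEXT, amount REAL, note TEXT)")
baseline = count_requests(conn)

# First attempt: every API call failed, so the table is still empty.
try:
    first_attempt = guarded_finish(conn, baseline)
except CompletionBlocked:
    first_attempt = "blocked"

# Second attempt: the three requests really were created.
rows = [(f"roommate{i}@example.com", 12.5, "Amazon Purchases") for i in range(3)]
conn.executemany("INSERT INTO payment_requests VALUES (?, ?, ?)", rows)
second_attempt = guarded_finish(conn, baseline)
```

The guard lives in the harness, so it applies no matter what the model says about its own progress.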

14:49Finn: And here's the line I'd underline in the whole paper. No prompt edit could fix that. You can write the most beautiful system prompt in the world — "please remember the email field, please check your work, please don't claim success falsely" — and it does not touch the runtime mechanism that decides what "complete" means. That fix lives in code the prompt can't reach.

15:12Juniper: This is the spine of the paper, and I want to make sure the contrast is sharp, because the comparison isn't against doing nothing. There's a whole family of methods that already learn from an agent's own traces — the paper compares against several. And they're good systems. But what do they actually edit? They edit the words you send the model. They evolve the prompt, they build up guidance memory, they refine instructions. Smart approaches. But their edit surface is almost entirely the prompt.

15:43Finn: And that's where the numbers get brutal, in the best way. The authors run an ablation — their own system, but restricted to prompt-only edits. On Terminal-Bench, the command-line benchmark, the prompt-only version gets zero improvement. Not a little. Zero. Same six tasks as the completely untouched harness. The full system gets nine. Fifty percent more, and the entire gap is stuff a prompt was never able to touch.

16:09Juniper: And the paper has this clean diagnostic table behind it. The prompt-evolution methods only ever edit one layer — the context and memory layer. One of them reaches a little into tool descriptions. HarnessFix is the only method that systematically gets into the lifecycle, the observability, the verification, and the governance layers. That single fact is the whole explanation for why it wins. The other methods can sometimes even see the symptom — they're looking at the same trace — and they still can't repair it, because the repair lives in machinery they're not allowed to touch.

16:49Finn: Debugging by superstition versus debugging by inspection. That's the difference. The old way: tweak the prompt, add a retry, bolt on a guardrail, re-run the benchmark, see if the number went up. This is: which exact part of my scaffolding is responsible, and here's a narrow patch, validated against regressions.

17:10Juniper: So let's put real numbers on the win, and then Finn, I think you should take the knife to it. Across the four benchmarks — and they're genuinely different domains, repository bug-fixing, command-line workflows, open-ended research questions, and that stateful app automation world the bill-splitting lives in — the held-out test improvement over the starting harness ranges from about fifteen percent to fifty percent. And critically, the test tasks are never seen during repair or validation. Same protocol everywhere, two-to-one-to-two split, test set untouched.

17:48Finn: And it beats the humans.

17:50Juniper: It beats the humans. On every single benchmark, HarnessFix bolted onto a basic starting harness comes out ahead of the strongest hand-built, human-designed harness for that domain. And the accepted edits are wonderfully concrete. On the command-line benchmark, it learns to block commands like exit, logout, and shutdown — things that kill the session out from under the agent. On the repository benchmark, it rejects a sweeping version-control command that would stage every changed file and risk submitting junk. On the research benchmark, it added spreadsheet and audio-file support, and made file-description failures non-fatal instead of letting them tank the run.
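
The command-blocking edit is easy to picture as a small filter in front of the shell. This is a sketch only: the function name and the tokenizing approach are assumptions, and a real guard would also need to handle unbalanced quotes, subshells, and aliases.

```python
import shlex

# The three commands named in the episode.
BLOCKED_COMMANDS = {"exit", "logout", "shutdown"}
SEPARATORS = {";", "&&", "||", "|", "&"}


def is_session_killing(command_line: str) -> bool:
    """True if any command in the line starts with a blocked word."""
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    at_command_start = True
    for token in lexer:
        if token in SEPARATORS:
            at_command_start = True  # the next token begins a new command
            continue
        if at_command_start and token in BLOCKED_COMMANDS:
            return True
        at_command_start = False
    return False
```

Looking only at the first word of each command keeps harmless lines such as `echo exit` from being rejected.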

18:33Finn: Okay. Those last ones are great, actually, because they're so mundane. "Made file-description failures non-fatal." That's not AI magic. That's a competent engineer reading a log and going, "oh, this shouldn't be a hard crash." Which is sort of the point — and also my way in to the critique. Juniper, can I push on the whole thing?

18:55Juniper: Please. Take it apart.

18:57Finn: The hardest problem is circularity, and it's the thing I flagged earlier. Every single stage of this system is itself an LLM agent. The abstraction is an LLM. The diagnosis is an LLM. The repair is an LLM. The validation is an LLM. And remember how a flaw is counted as "still occurring" — it's whether the diagnosis machinery attributes the new failure to the same root cause. So the system that decides a flaw has been reduced is the same system that diagnosed the flaw in the first place.

19:28Juniper: It's grading its own homework.

19:30Finn: It's grading its own homework. The student writes the exam answer and then grades the exam. Now — to be fair, and the paper does earn this — there's a real external check. The headline numbers come from the held-out test set, scored by each benchmark's own independent evaluator. That's the outside examiner grading the final, even if all the practice rounds were self-graded. So I trust the top-line improvement numbers. What I don't have is any independent verification that the internal diagnosis is correct. When it says "this was an observability flaw in step five," there's no ground-truth fault label confirming that. It's the model's judgment, checked by the same model's judgment.

20:14Juniper: And the diagnosis being correct is load-bearing for the whole "evidence-grounded, not superstition" claim. If the localization is wrong but the patch happens to help anyway, you're back to luck — just dressed up in better vocabulary.

20:28Finn: Right. Second problem — and this one the surgeon analogy already set up. The repair catalog is hand-curated. Distilled from real repos, which is a genuine strength for safety. But it means the system can only fix flaws for which an operator already exists. A truly novel failure mode, outside the menu, has no edit surface at all. Which is, in a narrower form, the exact criticism the paper levels at the prompt-only methods. They can't reach the runtime; HarnessFix can't reach a flaw with no procedure in its catalog. The surgeon with no established procedure for your condition just... can't operate.

21:06Juniper: That one feels honest to me as a designed-in limit rather than an oversight, but it's real.

21:12Finn: It is. Third — and this is the one I'd want a reviewer to press on. The absolute gains on the hard benchmarks are small in raw terms. "Fifty percent improvement on Terminal-Bench" sounds enormous. It's six tasks to nine, out of thirty-four. When you're moving three tasks, a little noise in either direction matters a lot — and the paper reports single runs. No variance, no confidence intervals. So we genuinely can't tell how stable these deltas are. The story is consistent across four benchmarks, which helps. But "consistent direction" is not the same as "statistically nailed down."
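
The arithmetic behind that point can be checked directly. The standard-error estimate below is a rough illustration that treats each task as an independent pass-or-fail trial; that is an assumption made here, not an analysis the paper reports.

```python
from math import sqrt

total = 34            # tasks in the evaluation, as quoted in the episode
before, after = 6, 9  # tasks solved by the untouched harness and by the full system

relative_gain = (after - before) / before  # 0.5, the "fifty percent" headline
rate_before = before / total               # about 0.18
rate_after = after / total                 # about 0.26
absolute_gain = rate_after - rate_before   # about 0.09, or three tasks


def binomial_se(p: float, n: int) -> float:
    """Standard error of a pass rate, assuming independent tasks."""
    return sqrt(p * (1 - p) / n)


se_before = binomial_se(rate_before, total)
se_after = binomial_se(rate_after, total)
```

Under that assumption each standard error is roughly six to eight percentage points, about the same size as the nine-point absolute gain, which is why a single run cannot settle how stable the difference is.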

21:50Juniper: And it's all one base model, one configuration.

21:53Finn: One model throughout — a smaller GPT-5 variant. They fix the model deliberately, which is the right call for isolating the harness effect. But it leaves the obvious question dangling: does harness repair help this much with a much stronger model? Or does a stronger model just... not make these mistakes, so there's less to repair? They mention auxiliary experiments varying model strength but don't show them in what we've got. So that's asserted, not demonstrated.

22:22Juniper: There's also something subtle in how the "beats the humans" comparison is framed.

22:27Finn: There is, and it's the one I'd flag as mildly apples-to-oranges. HarnessFix is bolted on top of a basic starting harness and given a training budget to adapt. The human-designed baselines are other people's complete, separate harnesses — not built on the same starting point, and not given HarnessFix's adaptation budget. So "beats the strongest human harness" is impressive, but it's not a clean controlled head-to-head. It's more "our adaptive method on top of a base, versus their static finished products."

23:00Juniper: All of which I think is fair, and it's worth saying the authors are fairly restrained in their own concessions. The one limitation they put front and center is generalization: they're explicit that the recurring flaw patterns they found are tied to these four benchmarks, and that whether those patterns hold across other agent domains is future work. The variance question and the single-model question, they don't dwell on. So those are ours to voice, not theirs.

23:30Finn: And I want to be clear that the critique doesn't sink the paper. It scopes it. This is an offline repair loop — you collect traces, you run this pipeline over them, you get a patched harness. It's not a live, self-healing agent fixing itself in real time. The gains are real and consistent and moderate. That's a perfectly respectable thing to be. It's just not the thing a breathless headline would make it.

23:56Juniper: And I'd argue the reframe is worth more than the numbers anyway. There's one more ablation finding I love, because it's the system explaining its own design. They tried removing the regression check — the "don't break what already works" half of the acceptance rule. On the repository benchmark, it barely hurt. Fifty-three tasks instead of fifty-seven. Which means the diagnosis and repair stages were already producing good candidates on their own.

24:24Finn: But on the app-automation benchmark, the one with the bill-splitting, the gap got wider without the guard.

24:31Juniper: Wider — thirty-five instead of thirty-eight. And the explanation is lovely: that domain has the most interconnected layers, where a fix in one place easily creates a regression in an adjacent one. So the regression guard matters most exactly where the system is most tangled. As the authors put it, regression-aware acceptance is what turns useful candidates into safe-to-accept changes. The safety net earns its keep precisely where the fall would be worst.

25:00Finn: Which, stepping back, is the real contribution. Not the specific numbers. The discipline. Today, when an agent product misbehaves, teams mostly grope. They tweak prompts, add retries, bolt on guardrails, re-run the benchmark, pray the number went up. This paper says: you can ask which part of your scaffolding is responsible, get a localized, evidence-grounded answer, and then a narrow patch checked against regressions. That's a different posture toward failure entirely.

25:29Juniper: And it drags into the light a failure mode that benchmark scores actively hide. An agent that confidently marks a task complete while having changed nothing in the world — that's a reliability nightmare, and a single pass-fail number sometimes won't even catch it. The autoresponder that closes every ticket with "resolved." Your dashboard is gorgeous. Your customers are furious. No amount of prompt-tuning reaches that, because the lie isn't in what the model says — it's in how the harness decides what "done" means.

26:01Finn: The model is fine. The plumbing is broken. And the plumbing is software.

26:06Juniper: That's the whole paper in thirteen words, and I think it's the right note to leave people on. The field has poured extraordinary effort into making the models smarter. This is a quiet, useful argument that for a lot of real failures, the smartest thing you can do is go read your own logs and fix your own scaffolding, and that the scaffolding will hold still long enough to be fixed.

26:29Finn: The paper's called "From Failed Trajectories to Reliable LLM Agents." The show notes have a link to it, along with some related reading if you want to follow this thread into the self-improving-agent debate it's quietly picking a fight with.

26:44Juniper: And if you want the full transcript with every term defined inline — harness, intermediate representation, all of it tappable — plus the links over to other episodes that touch these same ideas, that's all on paperdive.ai.

26:58Finn: This has been AI Papers: A Deep Dive. Thanks for listening.
