Episode 121 · Jun 05, 2026 · 27 min

When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model

Chen, Wang, Liu et al.

LLM Agent Systems
Paper: From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Venue: arXiv:2606.06324
Year: 2026
Read the paper: arxiv.org/abs/2606.06324

An AI agent confidently reports a task complete while the database shows nothing actually happened, and no prompt edit on earth can fix it. A new paper argues that for a huge share of agent failures, the model is already good enough, and the real bug lives in the deterministic harness around it. The payoff: that scaffolding is just software, which means you can actually diagnose and repair it.

What you'll take away

  • Why many agent failures are 'silent successes', where the harness marks a task complete even though nothing changed in the world, and why benchmark scores actively hide them
  • How HarnessFix borrows a compiler trick (a normalized intermediate representation) to turn messy, framework-specific traces into something you can analyze uniformly
  • The four-stage pipeline (abstraction, diagnosis, repair, validation) and why repairs draw from a fixed, vetted catalog distilled from real repo fixes rather than letting the agent rewrite itself freely
  • Why a prompt-only version of the system gets zero improvement on the command-line benchmark while the full system fixes lifecycle, observability, and verification flaws prompts can't reach
  • The honest limitations: the system largely grades its own diagnoses, raw gains are small (six tasks to nine), results are single-run on one model, and the 'beats human harnesses' comparison isn't a clean head-to-head

Chapters

  1. 00:00 The bill-splitting disaster
  2. 03:22 Reframing the agent as model plus harness
  3. 06:44 Taming the trace
  4. 10:07 The four-stage repair pipeline
  5. 13:29 Repair by catalog, not by improvisation
  6. 16:52 Watching the pipeline fix the bill-splitter
  7. 20:14 The numbers and the prompt-only ablation
  8. 23:37 Taking the knife to it


0:00Juniper: An AI gets a chore that sounds trivial. Split the cost of your last three Amazon orders four ways — you and three roommates — and send each roommate a Venmo request labeled "Amazon Purchases." Easy. The agent does exactly what you'd hope. It looks up the roommates, pulls the orders, reads the documentation for the payment system, and fires off three payment requests with the right amount and the right note.

0:27Finn: And then it tells you it's done.

0:29Juniper: It tells you it's done. The environment says "execution successful." The agent calls its finish routine. The task gets marked complete. Except, when you actually check the database afterward, zero payment requests exist. Nothing happened. Every one of those three calls failed, because the agent left out one required field: the recipient's email. And here's the part that should make your skin crawl: the code the agent wrote caught those errors and quietly threw them away. They never reached the system that was supposed to be watching. So from the outside, everything looks green. Task complete. Money moved. And in reality, nobody got charged a cent.
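
The pattern described here, where agent-written code catches an error and discards it, can be sketched in a few lines. This is an illustration under stated assumptions, not code from the paper: `send_payment_request`, the in-memory `DATABASE` list, and the roommate names are all hypothetical stand-ins.

```python
# Hypothetical stand-ins: none of these names come from the paper or a real payment API.
DATABASE = []  # pretend store of created payment requests


def send_payment_request(payload: dict) -> None:
    """Toy payment API that rejects requests lacking the recipient's email."""
    if "email" not in payload:
        raise ValueError("missing required field: email")
    DATABASE.append(payload)


def agent_code_swallowing_errors(roommates: list, amount: float) -> str:
    """Mimics the failing pattern: every error is caught and thrown away."""
    for name in roommates:
        try:
            send_payment_request(
                {"name": name, "amount": amount, "note": "Amazon Purchases"}
            )
        except Exception:
            pass  # the error never reaches whatever is supposed to be watching
    return "execution successful"


status = agent_code_swallowing_errors(["Ana", "Ben", "Cy"], 25.0)
print(status, "| requests in database:", len(DATABASE))
```

The reported status and the actual state disagree, which is exactly the gap the rest of the episode is about.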

1:12Finn: That's the failure that should keep developers up at night. Not the dramatic crash — the silent success. The metrics say a hundred percent of tasks closed, and the customers are all still waiting.

1:26Juniper: That little disaster is the opening example in a paper called "From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws," and it went up on arXiv on June fourth, twenty-twenty-six. We're recording the very next day, June fifth. Before we get into it, the disclosure: this episode is AI-generated. The script was written by an Anthropic AI model, and the two voices you're hearing (I'm Juniper, and my co-host is Finn) are both AI voices from Eleven Labs. The show is produced independently; no affiliation with Anthropic or Eleven Labs. And the reason that bill-splitting story is the perfect way in is that it lands the paper's whole thesis before we've defined a single term.

2:12Finn: Right — because your instinct, watching that, is to blame the model. It must have reasoned badly. It forgot the email field, it's not smart enough. The authors say: no. Look again. The model wrote sensible-looking code. The thing that failed was everything around the model.

2:29Juniper: So let me draw the picture the paper depends on, because most people's mental model of an AI agent is too simple. We tend to think of an agent as "a chatbot that can do stuff." The authors want you to see it more precisely. The language model is just the reasoning engine sitting in the middle. Wrapped around it is a thick layer of completely ordinary software: the loop that decides whether to keep going or stop, the descriptions of which tools exist and how to call them, the code that assembles the prompt, the component that runs commands, the logging, the checks that decide whether a task counts as finished.

3:08Finn: They call that wrapper the harness.

3:11Juniper: They call it the harness. And the crucial thing, the thing the whole paper hinges on, is that almost all of that harness is deterministic software. It has functions, arguments, error handling, control flow. The model is probabilistic, sure. But the harness around it is just code. Which means, unlike the model, the harness is something you can actually debug.

3:35Finn: And in the bill-splitting case, you can name exactly which parts of the harness broke. The tool description didn't enforce that the email field was required. The logging swallowed the errors instead of surfacing them. And the completion check accepted "I'm done" even though nothing in the world had changed. Three different pieces of plumbing, all failing in sequence. The model's in the clear.

4:01Juniper: Think of it like a competent new hire who keeps failing at a task. Your first thought is they're not sharp enough. Then you look closer, and the form they've been given is missing a required field, the system silently discards their submissions without telling them, and the "task complete" button lights up green no matter what. The person is fine. The workplace systems around them are broken. That's the agent's situation exactly, except the agent can't even complain about its broken tools, because it has no idea they're broken.

4:37Finn: So that's the reframe, and it's genuinely against the grain. The last several years of AI have been one story: make the model better. Bigger, smarter, more reasoning. This paper is in the quieter countercurrent that says: for a huge fraction of real failures, the model is already good enough, and the bottleneck is the engineering around it. The bug isn't in the model. It's in the scaffolding.

5:03Juniper: And the harness, it turns out, is just software you can debug. That's the sentence the whole system is built to deliver on.

5:12Finn: Which sounds clean as a slogan. The hard part is the doing. Because debugging a normal program is easy in this respect: a bug lives at a specific line of code. Line forty-seven throws an error, you go look at line forty-seven. An agent failure doesn't work like that.

5:29Juniper: No, and this is the genuinely hard problem the paper is solving. When an agent runs, it leaves behind this sprawling record of everything that happened: its reasoning text, every tool call, every response from the environment, every state change. The paper calls that the trace, or the trajectory. And it is a mess. It's language-heavy, it's long, and the actual cause of the failure could be buried anywhere in it. It's not a clean error pointing at one line. It's more like a transcript of a very long, very rambly meeting where somewhere in the middle, somebody made the decision that doomed the project, and you have to find it.

6:10Finn: And it gets worse, because every framework logs its mess differently. One calls a field "prompt tokens," another calls the same thing "input tokens." One labels the actor "agent name," another says "recipient agent type." So even if you wanted to build tooling to analyze these traces, there's no common format to build against.

6:31Juniper: Which is where the first really nice idea comes in, and it's borrowed straight from compilers. When a compiler processes your code, it doesn't work on the raw text. It first parses everything into a normalized internal structure, a clean graph of nodes with well-defined relationships, and then it analyzes and transforms that. The paper does the exact same trick to traces. It takes every framework's idiosyncratic log and converts it into one common structured form.

7:02Finn: An intermediate representation.

7:04Juniper: An intermediate representation. They give it a name, but the intuition is what matters. Imagine you're handed incident reports from a dozen people, all in different formats: one wrote prose, one filled out a form, one sent bullet points. Before you can compare them or analyze them, you transcribe every one into a single standardized template. Now you can ask the same uniform questions of each: what happened here, did it succeed, what did it change, and what came right before it? That's what this first stage does to the trace. Raw mess in, clean common structure out.
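
As a rough illustration of "transcribing into a single template," here is a minimal sketch of renaming framework-specific fields into one vocabulary. The alias table, the key spellings, and the target names `input_tokens` and `actor` are assumptions made for the example; the episode does not give the paper's actual schema.

```python
# Hypothetical alias table: the real field names and target schema are not given in the episode.
FIELD_ALIASES = {
    "prompt_tokens": "input_tokens",
    "input_tokens": "input_tokens",
    "agent_name": "actor",
    "recipient_agent_type": "actor",
}


def normalize_record(raw: dict) -> dict:
    """Rename framework-specific keys to one common vocabulary; keep unknown keys as-is."""
    return {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}


# Two logs describing the same event in different framework dialects.
framework_a = {"prompt_tokens": 812, "agent_name": "planner"}
framework_b = {"input_tokens": 812, "recipient_agent_type": "planner"}

print(normalize_record(framework_a) == normalize_record(framework_b))
```

Once both records normalize to the same shape, the same analysis code can run over either framework's traces.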

7:43Finn: And once it's in that common structure, each step gets tagged with three things that turn out to be the whole game. What role did this step play — was it gathering information, calling a tool, making an edit, checking for completion? Did it actually succeed or fail? And — this is the sharp one — did it change anything in the world?

8:05Juniper: That last tag is the one that catches the bill-splitting disaster. Because the finalization step in that trace did succeed, in the narrow sense that the code ran. But its "did it change anything" tag is empty. Three payment requests were supposed to exist. Zero do. The structure makes that contradiction visible in a way the raw log never did.
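
The three tags (role, outcome, state effect) lend themselves to a small data structure. The sketch below is an assumption-laden illustration: the `Step` class, its field names, and the role labels are invented for the example and are not the paper's representation.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One normalized trace step. Field names are illustrative, not the paper's schema."""
    index: int
    role: str          # e.g. "gather", "tool_call", "edit", "completion_check"
    succeeded: bool    # did the step run without a reported error?
    state_changes: list = field(default_factory=list)  # what it changed in the world


def is_silent_success(steps: list) -> bool:
    """Flag a trace whose completion step succeeded while no step changed any state."""
    completed = any(s.role == "completion_check" and s.succeeded for s in steps)
    changed_anything = any(s.state_changes for s in steps)
    return completed and not changed_anything


trace = [
    Step(0, "gather", True),
    Step(1, "tool_call", True),         # code ran, but no payment request was created
    Step(2, "completion_check", True),  # finish routine accepted anyway
]
print(is_silent_success(trace))
```

With the state-effect tag recorded per step, the contradiction becomes a simple query rather than something buried in a raw log.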

8:29Finn: So now they've got the trace tamed into something analyzable. That's one of four stages, right? Walk through the assembly line.

8:37Juniper: Four cooperating agents, and I think the cleanest way to hold it is as a literal assembly line with named handoffs. Stage one, the one we just described, is abstraction. It takes the raw trace and produces the clean structured version. Stage two is diagnosis. It works backward from the observed failure (the evaluator said this task failed) and follows the links between steps back to a ranked set of suspects, then figures out which step is actually responsible and which layer of the harness it belongs to.
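
One way to picture "working backward along the links" is a breadth-first walk over step dependencies. This is only a sketch: the dependency map is made up, and ranking suspects by distance from the failure is an assumption chosen for the example, not the paper's ranking method.

```python
from collections import deque


def rank_suspects(depends_on: dict, failed_step: int) -> list:
    """Walk dependency links backward from the failed step.

    Returns step ids ordered by distance from the failure (closest first).
    Distance-based ranking is an assumption made for this sketch.
    """
    distance = {failed_step: 0}
    queue = deque([failed_step])
    while queue:
        current = queue.popleft()
        for parent in depends_on.get(current, []):
            if parent not in distance:
                distance[parent] = distance[current] + 1
                queue.append(parent)
    # Ties at the same distance are broken in favor of the later step.
    return sorted(distance, key=lambda step: (distance[step], -step))


# Hypothetical six-step trace: each step lists the earlier steps it depends on.
links = {5: [4], 4: [3], 3: [1, 2], 2: [0], 1: [0]}
print(rank_suspects(links, failed_step=5))
```

A real diagnosis stage would then weigh each suspect against evidence in the trace; the walk only narrows where to look.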

9:12Finn: And because one bad run could just be a fluke, it doesn't trust a single trace.

9:17Juniper: It doesn't. It consolidates diagnoses across many failed runs into what they call flaw records — recurring patterns. "Completion accepted without any task-relevant change" isn't a one-off; it shows up again and again, and that repetition is what makes it worth fixing. Then stage three is repair, which generates an actual code patch. And stage four is validation, which decides whether to accept that patch. Parse, diagnose, patch, validate. Four handoffs down the line.
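
Consolidating per-run diagnoses into recurring patterns can be sketched as simple counting. The diagnosis tuples and the `min_runs` threshold below are hypothetical; the episode does not say how the paper decides that a pattern recurs often enough.

```python
from collections import Counter

# Hypothetical per-run diagnoses as (harness layer, cause) pairs.
diagnoses = [
    ("lifecycle", "completion accepted without task-relevant change"),
    ("observability", "tool error swallowed"),
    ("lifecycle", "completion accepted without task-relevant change"),
    ("tool_interface", "required argument not validated"),
    ("lifecycle", "completion accepted without task-relevant change"),
]


def consolidate(diags: list, min_runs: int = 2) -> list:
    """Keep only diagnoses that recur in at least `min_runs` failed runs."""
    counts = Counter(diags)
    return [
        (layer, cause, n)
        for (layer, cause), n in counts.most_common()
        if n >= min_runs
    ]


flaw_records = consolidate(diagnoses)
print(flaw_records)
```

One-off diagnoses drop out, and what remains is the short list of patterns worth a patch.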

9:50Finn: I want to sit on stage three, the repair stage, because the design choice there is the most opinionated thing in the paper, and it's where Juniper's "it's just software" framing gets interesting tension. The code that's implicated is often core orchestration logic, the heart of the harness. And there's a whole school of thought, the self-modifying-agent lineage, that says: let the agent freely rewrite itself. Maximum flexibility. The paper looks at that and says, no, that's how you get architectural breakage and chaos.

10:27Juniper: So what do they do instead?

10:29Finn: A fixed menu. A catalog of allowed repair operators, organized by which layer they fix: things like "completion guarding," "loop guarding," "argument validation," "error logging." And here's the detail I like: those operators weren't invented by the authors out of thin air. They were distilled from the actual version history of real repositories, the real fixes that real maintainers committed over time when their harnesses broke in these ways.

11:01Juniper: So it's institutional memory, encoded as a menu.

11:06Finn: Exactly. And the analogy that nails it is a surgeon. A surgeon doesn't improvise a brand-new operation mid-procedure. They work from a vetted catalog of known procedures, each one specifying what may be cut and what must absolutely not be touched. That's safer than freelancing, even though it means a genuinely novel condition with no established procedure leaves you stuck. The paper makes that exact trade deliberately. The diagnosed flaw gets mapped to an operator, and the operator gets turned into what they call a repair specification, basically a contract. It says which files you're allowed to touch, what's strictly off-limits (the benchmark data, the evaluator, the test sets), what behavior the patch has to add, and the bar it has to clear to be accepted.
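
The "contract" idea can be made concrete with a small sketch. Everything named here is hypothetical: the `RepairSpec` fields, the directory layout, and the glob patterns are invented to illustrate an allow-list plus a deny-list, and are not taken from the paper.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class RepairSpec:
    """Illustrative contract for one patch; all field names are hypothetical."""
    operator: str
    allowed_globs: tuple     # files the patch may touch
    forbidden_globs: tuple   # files that are strictly off-limits
    required_behavior: str


def violations(spec: RepairSpec, touched_files: list) -> list:
    """Return every touched file that is forbidden or outside the allowed set."""
    bad = []
    for path in touched_files:
        forbidden = any(fnmatch(path, g) for g in spec.forbidden_globs)
        allowed = any(fnmatch(path, g) for g in spec.allowed_globs)
        if forbidden or not allowed:
            bad.append(path)
    return bad


spec = RepairSpec(
    operator="completion guarding",
    allowed_globs=("harness/*.py",),                            # made-up layout
    forbidden_globs=("benchmark/*", "evaluator/*", "tests/*"),  # made-up layout
    required_behavior="block finish when no task-relevant state change occurred",
)
print(violations(spec, ["harness/loop.py", "evaluator/score.py"]))
```

Checking a candidate patch against the spec before it is ever run keeps the repair stage from "fixing" a failure by editing the thing that measures it.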

11:56Juniper: And that acceptance bar is the one piece of real math in the paper, though it's math you can say in a sentence. A patch is accepted only if two things are both true. One: it demonstrably makes the diagnosed failure happen less often. Two: it doesn't break more than a tolerated handful of tasks that were already working. Both thresholds are tunable.
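
The two-condition acceptance rule fits in a single function. This is a sketch under assumptions: the parameter names and the default thresholds are placeholders, since the episode only says that both thresholds are tunable.

```python
def accept_patch(
    flaw_rate_before: float,
    flaw_rate_after: float,
    newly_broken_tasks: int,
    min_reduction: float = 0.0,
    max_regressions: int = 0,
) -> bool:
    """Accept only if the flaw occurs less often AND regressions stay within tolerance.

    Both thresholds are tunable; the defaults are placeholders, not the paper's values.
    """
    reduces_flaw = (flaw_rate_before - flaw_rate_after) > min_reduction
    within_tolerance = newly_broken_tasks <= max_regressions
    return reduces_flaw and within_tolerance


print(accept_patch(0.4, 0.1, newly_broken_tasks=0))  # flaw reduced, nothing broken
print(accept_patch(0.4, 0.1, newly_broken_tasks=2))  # flaw reduced, but two regressions
print(accept_patch(0.4, 0.4, newly_broken_tasks=0))  # no reduction at all
```

Only the first call passes both conditions; raising `max_regressions` to 2 would let the second one through.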

12:18Finn: Which is just classic regression discipline made explicit. Every working software team already does the second half — you change something, you re-run your existing tests to make sure you didn't quietly break what used to work. The first half is the new wrinkle, and it's subtler than it sounds.

12:37Juniper: Say more, because this is the part I think people will skip past.

12:42Finn: So how do you count whether a flaw "still occurs" on a validation task? You don't just check whether the task passes. You re-run the agent, and you only count the flaw as present if the failure that comes back is one the diagnosis machinery attributes to the same layer and the same cause as the original flaw record. So the success criterion is tied to the causal story, not the pass-fail number. A patch that accidentally makes a task pass for some unrelated reason doesn't get credit for fixing the flaw.
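
The attribution-matching part of that counting rule can be sketched as follows. The tuple format and the example causes are hypothetical, and the sketch covers only the "same layer, same cause" match, not how the diagnosis itself is produced.

```python
def flaw_still_present(flaw_record: tuple, rerun_diagnosis) -> bool:
    """True only when the re-run failed for the same (layer, cause) as the flaw record.

    `rerun_diagnosis` is None when the task passed, otherwise a (layer, cause) tuple.
    """
    return rerun_diagnosis == flaw_record


def flaw_rate(flaw_record: tuple, rerun_diagnoses: list) -> float:
    """Fraction of validation re-runs in which the original flaw recurs."""
    hits = sum(flaw_still_present(flaw_record, d) for d in rerun_diagnoses)
    return hits / len(rerun_diagnoses)


record = ("lifecycle", "completion without state change")
reruns = [
    None,                                              # task now passes
    ("lifecycle", "completion without state change"),  # same flaw recurs
    ("tool_interface", "malformed arguments"),         # fails, but for another reason
    None,
]
print(flaw_rate(record, reruns))
```

Note that the third re-run still fails, yet it does not count against the patch, because the failure is attributed to a different layer and cause.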

13:13Juniper: Which is elegant — and also, I suspect, exactly where your skeptical antenna starts twitching.

13:20Finn: Oh, we'll get there. Hold that thought, because it's the best critique in the episode. But let's finish the happy path first.

13:27Juniper: Let's go back to the bill-splitting disaster, because now we can watch the whole pipeline actually chew on it. The trace gets abstracted into six steps. Diagnosis works backward from the evaluator's verdict (zero payment requests created) and it doesn't find one culprit. It finds three, spread across three different layers of the harness.

13:48Finn: Three separate plumbing failures stacking up.

13:51Juniper: Stacking up. The step where the agent built the request and left out the required email field: that's a tool-interface flaw. The request body was malformed and nothing caught it. The step where the errors got swallowed and never surfaced: that's an observability flaw. The harness was blind to its own failures. And the step where the finish routine got accepted despite nothing changing in the world: that's a lifecycle and verification flaw. Three layers, one cascade.

14:21Finn: And they all consolidate into a single flaw record: completion accepted without a task-relevant state effect.

14:27Juniper: Which maps to a primary repair operator, completion guarding, plus two supporting ones for the argument validation and the error logging. And the resulting patch constrains the harness so that the finish routine gets blocked when no relevant database change has actually occurred. The "done" button can't light up green over an empty database anymore.
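
A completion guard of the kind described might look roughly like this. The function name, the state dictionaries, and the `payment_requests` key are assumptions for illustration; the actual patch in the paper is not shown in the episode.

```python
class CompletionBlocked(Exception):
    """Raised when the agent tries to finish without a task-relevant state change."""


def guarded_finish(state_before: dict, state_after: dict, relevant_keys: list) -> str:
    """Allow the finish routine only if at least one relevant key changed."""
    changed = [k for k in relevant_keys if state_before.get(k) != state_after.get(k)]
    if not changed:
        raise CompletionBlocked("no task-relevant state change; finish rejected")
    return "complete"


before = {"payment_requests": 0}

try:
    guarded_finish(before, {"payment_requests": 0}, ["payment_requests"])
except CompletionBlocked as err:
    print("blocked:", err)

print(guarded_finish(before, {"payment_requests": 3}, ["payment_requests"]))
```

The guard lives in the runtime path that decides what "complete" means, which is why no wording in the prompt could substitute for it.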

14:49Finn: And here's the line I'd underline in the whole paper. No prompt edit could fix that. You can write the most beautiful prompt in the world ("please remember the email field, please check your work, please don't claim success falsely") and it does not touch the runtime mechanism that decides what "complete" means. That fix lives in code the prompt can't reach.

15:12Juniper: This is the spine of the paper, and I want to make sure the contrast is sharp, because the comparison isn't against doing nothing. There's a whole family of methods that already learn from an agent's own traces, and the paper compares against several. And they're good systems. But what do they actually edit? They edit the words you send the model. They evolve the prompt, they build up guidance memory, they refine instructions. Smart approaches. But their edit surface is almost entirely the prompt.

15:43Finn: And that's where the numbers get brutal, in the best way. The authors run an ablation: their own system, but restricted to prompt-only edits. On the command-line benchmark, the prompt-only version gets zero improvement. Not a little. Zero. Same six tasks as the completely untouched baseline. The full system gets nine. Fifty percent more, and the entire gap is stuff a prompt was never able to touch.

16:09Juniper: And the paper has this clean diagnostic table behind it. The prompt-evolution methods only ever edit one layer, the context and memory layer. One of them reaches a little into tool descriptions. HarnessFix is the only method that systematically gets into the lifecycle, the observability, the verification, and the governance layers. That single fact is the whole explanation for why it wins. The other methods can sometimes even see the symptom, since they're looking at the same trace, and they still can't repair it, because the repair lives in machinery they're not allowed to touch.

16:49Finn: Debugging by superstition versus debugging by inspection. That's the difference. The old way: tweak the prompt, add a retry, bolt on a guardrail, re-run the benchmark, see if the number went up. This is: which exact part of my harness is responsible, and here's a narrow patch, validated against regressions.

17:10Juniper: So let's put real numbers on the win, and then Finn, I think you should take the knife to it. Across the four benchmarks (and they're genuinely different domains: repository bug-fixing, command-line workflows, open-ended research questions, and that stateful app automation world the bill-splitting lives in) the held-out test improvement over the starting harness ranges from about fifteen percent to fifty percent. And critically, the test tasks are never seen during repair or validation. Same protocol everywhere, two-to-one-to-two split, test set untouched.

17:48Finn: And it beats the humans.

17:50Juniper: It beats the humans. On every single benchmark, HarnessFix bolted onto a basic starting harness comes out ahead of the strongest hand-built, human-designed harness for that domain. And the accepted edits are wonderfully concrete. On the command-line benchmark, it learns to block commands like exit, logout, and shutdown, things that kill the session out from under the agent. On the repository benchmark, it rejects a sweeping version-control command that would stage every changed file and risk submitting junk. On the research benchmark, it added spreadsheet and audio-file support, and made file-description failures non-fatal instead of letting them tank the run.
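
The command-blocking edit is the easiest of these to picture in code. The sketch below is deliberately simplified and is not the paper's patch: it only looks at the first word of each command in a `;` or `&&` chain and ignores subshells, pipes, aliases, and quoting.

```python
BLOCKED_COMMANDS = {"exit", "logout", "shutdown"}  # the three named in the episode


def is_blocked(command_line: str) -> bool:
    """True if any command in a simple ';' or '&&' chain would end the session.

    Simplification: subshells, pipes, aliases, and quoting are not handled.
    """
    normalized = command_line.replace("&&", ";")
    for part in normalized.split(";"):
        tokens = part.split()
        if tokens and tokens[0] in BLOCKED_COMMANDS:
            return True
    return False


for cmd in ["ls -la", "exit", "make test && shutdown -h now", "echo exit"]:
    print(repr(cmd), "->", "blocked" if is_blocked(cmd) else "allowed")
```

A harness would run a check like this before handing the command to the shell, returning an explanatory message to the model instead of executing it.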

18:33Finn: Okay. Those last ones are great, actually, because they're so mundane. "Made file-description failures non-fatal." That's not AI magic. That's a competent engineer reading a log and going, "oh, this shouldn't be a hard crash." Which is sort of the point — and also my way in to the critique. Juniper, can I push on the whole thing?

18:55Juniper: Please. Take it apart.

18:57Finn: The hardest problem is circularity, and it's the thing I flagged earlier. Every single stage of this system is itself an LLM agent. The abstraction is an LLM. The diagnosis is an LLM. The repair is an LLM. The validation is an LLM. And remember how a flaw is counted as "still occurring": it's whether the diagnosis machinery attributes the new failure to the same cause. So the system that decides a flaw has been reduced is the same system that diagnosed the flaw in the first place.

19:28Juniper: It's grading its own homework.

19:30Finn: It's grading its own homework. The student writes the exam answer and then grades the exam. Now, to be fair, and the paper does earn this, there's a real external check. The headline numbers come from the held-out test set, scored by each benchmark's own independent evaluator. That's the outside examiner grading the final, even if all the practice rounds were self-graded. So I trust the top-line improvement numbers. What I don't have is any independent verification that the internal diagnosis is correct. When it says "this was an observability flaw in step five," there's no ground-truth fault label confirming that. It's the model's judgment, checked by the same model's judgment.

20:14Juniper: And the diagnosis being correct is load-bearing for the whole "evidence-grounded, not superstition" claim. If the localization is wrong but the patch happens to help anyway, you're back to luck — just dressed up in better vocabulary.

20:28Finn: Right. Second problem, and this one the surgeon analogy already set up. The repair catalog is hand-curated. Distilled from real repos, which is a genuine strength for safety. But it means the system can only fix flaws for which an operator already exists. A truly novel failure mode, outside the menu, has no edit surface at all. Which is, in a narrower form, the exact criticism the paper levels at the prompt-only methods. They can't reach the runtime; HarnessFix can't reach a flaw with no procedure in its catalog. The surgeon with no established procedure for your condition just... can't operate.

21:06Juniper: That one feels honest to me as a designed-in limit rather than an oversight, but it's real.

21:12Finn: It is. Third, and this is the one I'd want a reviewer to press on: the absolute gains on the hard benchmarks are small in raw terms. "Fifty percent improvement on the command-line benchmark" sounds enormous. It's six tasks to nine, out of thirty-four. When you're moving three tasks, a little noise in either direction matters a lot, and the paper reports single runs. No variance, no confidence intervals. So we genuinely can't tell how stable these deltas are. The story is consistent across four benchmarks, which helps. But "consistent direction" is not the same as "statistically nailed down."
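
To see why three tasks out of thirty-four is fragile, it helps to run the arithmetic. The sketch below uses the counts quoted in the episode; the Wilson score interval and the choice to treat tasks as independent trials are assumptions made here for illustration, not an analysis from the paper.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


TOTAL_TASKS = 34       # figure quoted in the episode
before, after = 6, 9   # tasks solved by the baseline and by the full system

relative_gain = (after - before) / before
low_b, high_b = wilson_interval(before, TOTAL_TASKS)
low_a, high_a = wilson_interval(after, TOTAL_TASKS)

print(f"pass rate: {before / TOTAL_TASKS:.1%} -> {after / TOTAL_TASKS:.1%}")
print(f"relative gain: {relative_gain:.0%}")
print(f"baseline interval: {low_b:.2f} to {high_b:.2f}")
print(f"full-system interval: {low_a:.2f} to {high_a:.2f}")
```

Under these assumptions the two intervals overlap, which is the numerical version of the point that a consistent direction is not yet a statistically settled effect.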

21:50Juniper: And it's all one base model, one configuration.

21:53Finn: One model throughout — a smaller variant. They fix the model deliberately, which is the right call for isolating the effect. But it leaves the obvious question dangling: does harness repair help this much with a much stronger model? Or does a stronger model just... not make these mistakes, so there's less to repair? They mention auxiliary experiments varying model strength but don't show them in what we've got. So that's asserted, not demonstrated.

22:22Juniper: There's also something subtle in how the "beats the humans" comparison is framed.

22:27Finn: There is, and it's the one I'd flag as mildly apples-to-oranges. HarnessFix is bolted on top of a basic starting harness and given a training budget to adapt. The human-designed baselines are other people's complete, separate harnesses, not built on the same starting point, and not given HarnessFix's adaptation budget. So "beats the strongest human harness" is impressive, but it's not a clean controlled head-to-head. It's more "our adaptive method on top of a base, versus their static finished products."

23:00Juniper: All of which I think is fair, and worth saying the authors are fairly contained about their own concessions. The one limitation they put front and center is generalization — they're explicit that the recurring flaw patterns they found are tied to these four benchmarks, and that whether those patterns hold across other domains is future work. The variance question and the single-model question, they don't dwell on. So those are ours to voice, not theirs.

23:30Finn: And I want to be clear that the critique doesn't sink the paper. It scopes it. This is an offline repair loop: you collect traces, you run this over them, you get a patched harness. It's not a live, self-healing agent fixing itself in real time. The gains are real and consistent and moderate. That's a perfectly respectable thing to be. It's just not the thing a breathless headline would make it.

23:56Juniper: And I'd argue the reframe is worth more than the numbers anyway. There's one more finding I love, because it's the system explaining its own design. They tried removing the regression check — the "don't break what already works" half of the acceptance rule. On the repository benchmark, it barely hurt. Fifty-three tasks instead of fifty-seven. Which means the diagnosis and repair stages were already producing good candidates on their own.

24:24Finn: But on the app-automation benchmark, the one with the bill-splitting, the gap got wider without the guard.

24:31Juniper: Wider — thirty-five instead of thirty-eight. And the explanation is lovely: that domain has the most interconnected layers, where a fix in one place easily creates a regression in an adjacent one. So the regression guard matters most exactly where the system is most tangled. As the authors put it, acceptance is what turns useful candidates into safe-to-accept changes. The safety net earns its keep precisely where the fall would be worst.

25:00Finn: Which, stepping back, is the real contribution. Not the specific numbers. The discipline. Today, when an agent product misbehaves, teams mostly grope. They tweak prompts, add retries, bolt on guardrails, re-run the benchmark, pray the number went up. This paper says: you can ask which part of your harness is responsible, get a localized, evidence-grounded answer, and then get a narrow patch checked against regressions. That's a different posture toward failure entirely.

25:29Juniper: And it drags into the light a failure mode that benchmark scores actively hide. An agent that confidently marks a task complete while having changed nothing in the world is a reliability nightmare, and a single pass-fail number sometimes won't even catch it. The autoresponder that closes every ticket with "resolved." Your dashboard is gorgeous. Your customers are furious. No amount of prompt-tuning reaches that, because the lie isn't in what the model says. It's in how the harness decides what "done" means.

26:01Finn: The model is fine. The plumbing is broken. And the plumbing is software.

26:06Juniper: That's the whole paper in thirteen words, and I think it's the right note to leave people on. The field has poured extraordinary effort into making the models smarter. This is a quiet, useful argument that for a lot of real failures, the smartest thing you can do is go read your own logs and fix your own harness, and that the scaffolding will hold still long enough to be fixed.

26:29Finn: The paper's called "From Failed Trajectories to Reliable LLM Agents." The show notes have a link to it, along with some related reading if you want to follow this thread into the self-improving-agent debate it's quietly picking a fight with.

26:44Juniper: And if you want the full transcript with every term defined inline, all of it tappable, plus the links over to other episodes that touch these same ideas, that's all on paperdive.ai.

26:58Finn: This has been AI Papers: A Deep Dive. Thanks for listening.