Episode 169 · Jun 24, 2026 · 24 min

Why Better Bug Reports Can Make AI Coding Agents Worse

Tamoyan, Narenthiran, Arakelyan et al.

LLM Agents

Concepts in this episode

Agentic Workflows · Evaluation & Benchmarks · AI Efficiency & Cost · Agentic Coding · SWE-bench · Chain of Thought · Tool Use · Context Management · LLM-as-Judge · Software Engineering Automation · Ablation Studies · Inference Cost · Trajectory Quality · Capability vs. Propensity · Test-Time Compute

About this episode

Paper: SHERLOC: Structured Diagnostic Localization for Code Repair Agents

Venue: arXiv:2606.24820

Year: 2026

Read the paper: arxiv.org/abs/2606.24820

Hand a capable AI coding agent a more accurate report of where a bug lives, and it can fix fewer bugs than with nothing at all. This episode digs into SHERLOC, a paper arguing the field has been scoring localization like a search engine when what actually matters is the diagnosis — and shows where the impressive numbers stop being deployable.

What you'll take away

Why AI coding agents spend roughly 48% of their turns and over 320,000 tokens just locating a bug before writing any fix
How SHERLOC reframes localization from 'find the right file' to a structured five-field diagnostic case file
Why a single setting — thinking mode off — collapses the same model from 74% recall to 10%, with 87% of runs producing no valid output
The capability-dependent transfer finding: weak repair agents gain 8-12 points, while strong agents can lose ground when fed findings indiscriminately
Why a low-quality diagnosis (20% resolve rate) drags an agent below the 62% baseline of having no report at all
The two honest limits: the quality filter relies on the ground-truth patch, so it isn't deployable, and recall of about 58% is reachable from the bug report text alone, concentrated in famous libraries the models may have memorized

Chapters

00:00 The taxi meter that never stops
02:47 Red circle versus the written report
05:12 One setting flips everything
09:09 Can the underdog beat the specialists?
10:25 Does it just remember Django?
12:12 The map that distracts the cabbie
16:43 The filter you can't actually ship
21:25 What actually survives the critique

References in this episode

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — The benchmark this episode's results are measured on — the real-GitHub-bug-plus-
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — The agent-framework lineage behind the repair agents SHERLOC injects case files

Full transcript

0:00Juniper: Here's a result that shouldn't be true. Take a strong AI coding agent — the kind that fixes real bugs pulled from real GitHub projects — and hand it a more detailed, more accurate report of where the bug is. And it fixes fewer bugs than it did with nothing at all.

0:18Finn: Quick heads up before we get into it — this is an AI-made explainer, both voices included.

0:24Juniper: So by the end of this you'll understand why that happens, and why it points at something much bigger: that the entire way this subfield measures success — "did you find the right file?" — might be scoring the wrong thing entirely.

0:39Finn: And it's genuinely backwards from how you'd expect it to work. More information, better information, fed to a capable agent — and the agent gets worse. That's not a bug in their experiment. It's the finding. The paper is called SHERLOC, and the name is the whole thesis: this is detective work, and the case file matters more than the address.

1:01Juniper: Why should anyone outside this niche care? Because right now, if you're running or paying for an AI coding agent, you are burning roughly half your compute before any actual fixing happens. And this paper measured exactly that — then tried to claw most of it back.

1:18Finn: Half? You mean half the time it's just... looking?

1:22Juniper: Just looking. They instrumented this across five different language models and two agent frameworks, and the number is brutal. On an average bug, the agent spends about eighteen and a half turns — call it forty-eight percent of its entire interaction — and north of three hundred and twenty thousand tokens, purely figuring out where the bug lives. Before it writes a single line of a fix.

1:47Finn: So localization — finding the fault — isn't some side quest. It's the dominant cost of the whole operation.

1:54Juniper: It's the meter running while the taxi circles the block. And here's the mental model for the agent itself: think of an automated developer that can only do one thing at a time and has to keep re-reading the screen to remember what it just saw. Read a file, search for a string, run a test, look again. Every one of those steps costs tokens, costs money, costs latency. And forty-eight percent of those steps go to the hunt.

2:23Finn: Which is why the field built dedicated tools just for that hunt. Localization systems — their whole job is to point at the faulty file so the repair agent doesn't have to wander.

2:34Juniper: Right, and those tools get scored like search engines. Did you retrieve the correct file? Top result, did you nail it? And that scoring is exactly what the authors think is broken.
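To make the "scored like a search engine" point concrete, here is a minimal sketch of retrieval-style scoring. The function names and the exact metric definitions are assumptions for illustration; the episode does not spell out how the paper computes top-one accuracy or recall. Note that neither function looks at any explanation of the bug, only at file names.

```python
from typing import Sequence, Set


def top1_hit(ranked_files: Sequence[str], gold_files: Set[str]) -> bool:
    """True when the first-ranked file is one the real fix touched."""
    return bool(ranked_files) and ranked_files[0] in gold_files


def file_recall(ranked_files: Sequence[str], gold_files: Set[str]) -> float:
    """Fraction of the gold files that appear anywhere in the prediction."""
    if not gold_files:
        return 0.0
    return len(gold_files & set(ranked_files)) / len(gold_files)


# Hypothetical file names, used only to exercise the functions.
prediction = ["a.py", "b.py"]
gold = {"a.py", "c.py"}
print(top1_hit(prediction, gold), file_recall(prediction, gold))
```

A localizer can score perfectly on both functions while saying nothing about what is wrong or how to fix it, which is the gap the hosts go on to describe.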

2:47Finn: This is the heart of it, and it's worth slowing down on, because everything downstream depends on getting this distinction. Picture a radiologist who hands a surgeon an X-ray with a red circle drawn around one spot. That circle tells the surgeon where. It says nothing about what the spot is, why it's there, or whether you should cut, medicate, or just wait and watch.

3:11Juniper: And a file path is that red circle.

3:14Finn: Exactly. A bare location is operationally underspecified. What the surgeon actually needs isn't the circle — it's the written report. The diagnosis. So SHERLOC's move is to stop treating localization as retrieval and start treating it as diagnosis. For every place it flags, it doesn't just emit a file name. It emits a structured finding — five fields.

3:37Juniper: Lay them out.

3:38Finn: A location explanation — where and why this spot. A root-cause hypothesis — what's actually going wrong. A solution direction — how you'd fix it. The relevant dependencies — what else this touches. And the testing impact — what breaks or passes if you change it. Five fields, every time. That's the case file.
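The five-field case file can be pictured as a small record type. This is a sketch only: the class name, the field names, and the placeholder path and line numbers are assumptions, since the episode names the five fields but not the paper's actual schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Finding:
    # The "address": where the suspected fault lives.
    file_path: str
    line_numbers: List[int]
    # The five diagnostic fields named in the episode (names are hypothetical).
    location_explanation: str
    root_cause_hypothesis: str
    solution_direction: str
    relevant_dependencies: List[str]
    testing_impact: str

    def empty_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


example = Finding(
    file_path="path/to/base.py",  # placeholder path, not taken from the paper
    line_numbers=[10, 11],  # placeholder line numbers
    location_explanation="An internal check on ordering lives here.",
    root_cause_hypothesis="The check fails to recognize a certain kind of lookup.",
    solution_direction="",
    relevant_dependencies=[],
    testing_impact="",
)
print(example.empty_fields())
```

A bare file path corresponds to filling in only the first two attributes; the `empty_fields` helper shows how much of the case file is still missing in that situation.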

3:59Juniper: And to make that concrete — they've got a real Django example in the paper. There's an ordering bug, and SHERLOC doesn't just say "look in base dot py." It points at two specific lines and explains that an internal check fails to recognize a certain kind of lookup. That second half — the "fails to recognize" part — is the diagnosis. That's the sentence a repair agent can actually act on.

4:26Finn: And the claim — their most quotable line — is that structured diagnostic output, not location retrieval, is the operative unit of useful localization. A correct file with a misleading diagnosis still misleads the agent downstream.

4:41Juniper: Hold that thought, because that exact sentence comes back to bite in a way that's genuinely surprising. But first — how do they actually produce this case file? Because the how is almost aggressively simple.

4:56Finn: This is the part worth settling into — the actual machine — and it pays off in one of the most vivid results in the paper, where the same model, with one setting flipped, goes from genuinely good to almost completely broken.

5:11Juniper: One setting?

5:12Finn: One setting. We'll get there. But the architecture first. SHERLOC is training-free. No fine-tuning, no reinforcement learning, no swarm of agents debating each other. It's one reasoning model, given a tiny fixed menu of tools, running inside a bounded loop — at most twenty turns. That's it.

5:32Juniper: And "tiny menu" is not an exaggeration. Four tools. View a file — read it, optionally a line range, with fuzzy matching so a near-miss filename gets corrected instead of wasting a turn. Search the codebase — text search across the repo, returns snippets with context. The repository tree — basically the folder structure, so the model can regain its bearings. And a connected tree, which follows import relationships in both directions, so it can trace a dependency thread from one module to another.
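The "connected tree" idea, following imports in both directions, can be sketched with Python's standard `ast` module. This is not SHERLOC's implementation; the function name and the choice to handle only absolute imports between known modules are simplifying assumptions.

```python
import ast
from collections import defaultdict
from typing import Dict, Set, Tuple


def import_graph(sources: Dict[str, str]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build forward (module -> imports) and reverse (module -> importers) maps.

    `sources` maps a dotted module name to its source text. Only absolute
    imports of modules that are themselves keys of `sources` are kept.
    """
    forward: Dict[str, Set[str]] = defaultdict(set)
    reverse: Dict[str, Set[str]] = defaultdict(set)
    for module, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                if target in sources and target != module:
                    forward[module].add(target)
                    reverse[target].add(module)
    return dict(forward), dict(reverse)


# Hypothetical three-module project, used only to exercise the function.
demo = {
    "app.models": "import app.utils\n",
    "app.views": "from app.models import User\n",
    "app.utils": "import os\n",
}
forward, reverse = import_graph(demo)
print(forward)
print(reverse)
```

Because the sketch leans on a Python parser, it also illustrates the limit raised later in the episode: the same tool for another language would need that language's own parser.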

6:07Finn: And the design philosophy behind that spartan toolkit matters. Prior work found that if you let these models write arbitrary shell commands or Python, they derail — they fall down shell-debugging rabbit holes and never come back. So SHERLOC draws a hard boundary. The model chooses an action from the fixed menu; a separate deterministic piece validates it and runs it. Clean line. And because the model isn't writing tool code itself, you can use pure reasoning models that are brilliant at deliberation but were never trained on tool APIs at all.
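The split between "the model chooses" and "a deterministic piece validates and runs" can be sketched as follows. The tool names, argument names, and the `finish` action are hypothetical; only the four-tool menu and the twenty-turn cap come from the episode.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Action = Dict[str, Any]

# Hypothetical tool names and required arguments for the four-tool menu.
ALLOWED_TOOLS: Dict[str, set] = {
    "view_file": {"path"},
    "search_codebase": {"query"},
    "repo_tree": set(),
    "connected_tree": {"module"},
}
MAX_TURNS = 20  # the episode cites a budget of at most twenty turns


def validate_action(action: Action) -> Optional[str]:
    """Return an error message for an invalid action, or None if it is valid."""
    tool = action.get("tool")
    if tool not in ALLOWED_TOOLS:
        return f"unknown tool: {tool!r}"
    missing = ALLOWED_TOOLS[tool] - set(action.get("args", {}))
    if missing:
        return f"missing arguments: {sorted(missing)}"
    return None


def run_loop(
    choose_action: Callable[[List[Tuple[Action, str]]], Action],
    execute: Callable[[Action], str],
    max_turns: int = MAX_TURNS,
) -> Tuple[Optional[Dict[str, Any]], List[Tuple[Action, str]]]:
    """Run a bounded loop; invalid actions are reported back, never executed."""
    history: List[Tuple[Action, str]] = []
    for _ in range(max_turns):
        action = choose_action(history)
        if action.get("tool") == "finish":  # hypothetical "I'm done" action
            return action.get("args", {}), history
        error = validate_action(action)
        observation = error if error else execute(action)
        history.append((action, observation))
    return None, history
```

In this sketch a request for an arbitrary shell command is rejected by `validate_action` and costs one turn with an error message, which is the "hard boundary" the hosts describe: the model never gets to run code it wrote itself.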

6:43Juniper: Then there's a layer they call self-recovery, which is the underappreciated glue. Long multi-turn sessions fail in predictable ways, and SHERLOC patches each one. The context gets too long — so it keeps the original bug report and the most recent turns, and drops the stale middle. The model gets stuck repeating the same call — there's a loop warning. A formatting slip mangles a tool request — it parses the intent anyway instead of burning the turn.

7:13Finn: And the single most important one: final-turn synthesis. When the budget runs out, it forces the model to commit a best-guess diagnosis rather than just... producing nothing.
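Three of the self-recovery behaviors described above are simple enough to sketch directly. The function names, the window sizes, and the zero-indexed turn counter are assumptions; the episode gives the behaviors, not the thresholds.

```python
from typing import List


def trim_context(messages: List[str], keep_recent: int = 4, max_messages: int = 8) -> List[str]:
    """Keep the first message (the bug report) and the newest turns; drop the middle."""
    if not 0 < keep_recent < max_messages:
        raise ValueError("keep_recent must be between 1 and max_messages - 1")
    if len(messages) <= max_messages:
        return list(messages)
    return [messages[0]] + messages[-keep_recent:]


def is_looping(tool_calls: List[str], window: int = 3) -> bool:
    """True when the last `window` tool calls are identical."""
    return len(tool_calls) >= window and len(set(tool_calls[-window:])) == 1


def must_commit(turn: int, max_turns: int = 20) -> bool:
    """True on the final turn (zero-indexed), when a best-guess diagnosis is forced."""
    return turn >= max_turns - 1


conversation = ["bug report"] + [f"turn {i}" for i in range(1, 11)]
print(trim_context(conversation, keep_recent=3, max_messages=6))
```

The trimming rule always preserves the original bug report, so the model can lose intermediate observations without losing the task statement.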

7:24Juniper: Which the ablations back up cleanly. They knocked out components one at a time, and two carry almost all the weight. Remove "view file" — the ability to actually read source — and you lose seven points of accuracy. Remove that forced final-turn synthesis and you lose five. So inspection and forced commitment are the load-bearing walls. Everything else is trim.

7:49Finn: Now — the setting I promised. They ran the exact same model two ways. Once in "thinking" mode, where it generates a long internal chain of reasoning before each move. And once in plain instruct mode, answering directly.

8:03Juniper: Same weights, same prompt, same tools.

8:06Finn: Same everything. Thinking mode: seventy-four percent recall — solid. Instruct mode, thinking off... ten percent.

8:14Juniper: From seventy-four to ten?

8:16Finn: And it's not a graceful slide. Eighty-seven percent of the instruct runs fail to produce any valid output at all. They just don't complete the protocol. So the extended deliberation isn't decoration on top of a working system — it is the engine. Take the reasoning away and the whole multi-turn detective loop collapses into noise.

8:38Juniper: That's a genuinely clean demonstration. It tells you the chain-of-thought isn't making a working thing marginally better — it's the precondition for the thing working at all.

8:50Finn: So where does all that deliberation go, token-wise? Of about twenty-nine thousand tokens in a typical run, roughly twenty thousand is the model's own reasoning. Only about nine thousand is tool output. The cost is thinking, not querying. And the average run is under five turns. It's not flailing around the repo — it's sitting and reasoning hard, then acting deliberately.

9:15Juniper: Okay. So that's the detective. Does the case file actually beat the specialists?

9:20Finn: This is where you'd expect the training-free underdog to lose, right? It's going up against systems that were fine-tuned, or built around multi-agent debate, or trained as dedicated code retrievers.

9:33Juniper: And it wins. On SWE-bench, the standard benchmark of real GitHub bugs paired with the pull requests that actually fixed them, SHERLOC hits state of the art. About eighty-four percent top-one accuracy on the Lite split. And on the Verified split, eighty-one percent recall. The prior best there was sixty-eight. That's a thirteen-point jump.

9:56Finn: And the prediction the theory makes is: if structure is really doing the work, then you shouldn't need scale to win. Smaller model, same discipline, should still beat a bigger trained one.

10:08Juniper: And it does. At a matched thirty-billion-parameter scale, SHERLOC beats a fine-tuned thirty-two-billion specialist by seven points — and beats the non-fine-tuned baselines at that size by sixteen to eighteen. No training. Just the structured loop. So structure really is substituting for both compute and specialized training here.

10:31Finn: Now — I want to plant something honestly, right here, while the numbers look great. Because the authors themselves do. These benchmarks are built from famous open-source projects — Django, scikit-learn — and those projects are all over the training data of every model involved. So when a model "finds" a bug fast, you can't fully tell whether it reasoned its way there or just... remembers where things live in Django.

10:59Juniper: That's the contamination worry.

11:01Finn: It's the worry that sits under every headline number in this whole subfield, and we'll come back to exactly how much it costs them. But credit where it's due — they ran a serious control for it, which is rare.

11:15Juniper: They did. They ran what's basically a masking gauntlet. Progressively strip away the clues — take away the tools, take away the repository tree, take away the explicit file paths mentioned in the bug report — and watch how much performance survives. The intuition: it's like testing whether someone can navigate a city. If they grew up there, acing the test tells you nothing. So you blindfold them, drop them somewhere, and see how much they can still work out by walking around.

11:47Finn: And the gap is the answer. With tools available but the file paths hidden, recall holds around eighty percent. Strip it all the way down to just the bug report text — no tools, no tree — and recall falls to about fifty-eight. That roughly twenty-two-point gap is their estimate of how much comes from real active exploration versus pure hometown familiarity.
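One step of the masking gauntlet, hiding explicit file paths in the bug report, can be sketched with a regular expression. The pattern, the placeholder token, and the restriction to `.py` paths are assumptions; the episode does not describe how the paper performs the masking.

```python
import re

# Matches tokens that look like Python file paths, such as pkg/module.py or utils.py.
PATH_PATTERN = re.compile(r"(?<![\w./-])[\w./-]+\.py\b")


def mask_paths(report: str, placeholder: str = "<PATH>") -> str:
    """Replace anything that looks like a .py file path with a placeholder."""
    return PATH_PATTERN.sub(placeholder, report)


# Hypothetical bug report text, used only to exercise the function.
report = "Crash in pkg/models/base.py when ordering by a lookup; see also utils.py."
print(mask_paths(report))
```

With the figures quoted above, the estimate of what exploration adds is the difference between the two conditions: about 80 minus about 58, or roughly 22 points of recall.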

12:11Juniper: Which is a real number on a real problem. Hold the size of that fifty-eight, though — Finn's going to come back and lean on it hard.

12:20Finn: I am.

12:20Juniper: But here's the part I find most surprising in the whole paper — and it's the payoff for that "diagnosis, not location" reframe we set up. They took SHERLOC's case files and injected them into actual repair agents to see if more bugs got fixed. And the answer is: it depends entirely on how good the agent already was.

12:41Finn: This is the capability-dependent transfer finding. And there's a clean analogy for it. Imagine giving directions to two drivers. A tourist who's never been to the city is grateful for every single turn-by-turn instruction, and gets there way faster. But a veteran cab driver who already knows every street? A backseat passenger barking low-confidence directions actually slows him down. Better to stay quiet unless you're sure.

13:08Juniper: And that's exactly the heatmap they show. Picture a grid — repair agents down one side, with and without SHERLOC across the top. The weak agents, the ones that are bad at finding bugs themselves, soak up the case files and jump eight to twelve points in how many bugs they resolve. The strong agents — the ones that already localize well on their own — they can actually lose ground when you feed them every finding indiscriminately. Because some of those findings are low-confidence noise, and the noise dilutes instincts that were already good.

13:44Finn: So the tourist gets the map; the cabbie gets distracted by the map.

13:48Juniper: And now go back to that quotable line — a correct file with a misleading diagnosis still misleads the agent. They quantified that, and it's the sharpest result in the paper. They had a judge score each diagnosis, and then looked at how often bugs got fixed by quality tier. Very-high-quality findings: bugs resolved about seventy-six percent of the time. Low-quality findings: twenty percent.

14:13Finn: And the baseline — the agent with no findings at all — was sixty-two.

14:19Juniper: That's the whole thing right there. A bad diagnosis doesn't just fail to help. It drags the agent below where it would've been alone. A confidently wrong case file sends the surgeon down the wrong path — worse than handing them no report and letting them think.

14:36Finn: And that's not a story about "text in the prompt helps." They proved that with a beautifully simple control. They took a finding from a completely different, random bug and injected that. If gains came from just having structured text around, the random finding should help too. It didn't — the shuffled findings degraded performance in nine of ten cases. So what's transferring is genuine diagnostic relevance, not vibes.
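The shuffled control can be sketched as a reassignment in which no bug keeps its own finding. The episode only says a finding from a different, random bug was injected; the function name, the seeding, and the rotate-after-shuffle approach are assumptions chosen because they guarantee a mismatch for every bug.

```python
import random
from typing import Dict


def shuffled_control(findings: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Reassign findings so that every bug receives some other bug's finding."""
    bug_ids = sorted(findings)
    if len(bug_ids) < 2:
        raise ValueError("need at least two bugs to swap findings")
    rng = random.Random(seed)
    rng.shuffle(bug_ids)
    # A nonzero rotation of the shuffled order can never map a bug to itself.
    offset = rng.randrange(1, len(bug_ids))
    n = len(bug_ids)
    return {bug_ids[i]: findings[bug_ids[(i + offset) % n]] for i in range(n)}


# Hypothetical bug identifiers and finding texts.
original = {f"bug-{i}": f"finding for bug-{i}" for i in range(5)}
control = shuffled_control(original)
print(all(control[bug] != original[bug] for bug in original))
```

If structured text alone were responsible for the gains, runs fed `control` should do about as well as runs fed `original`; the episode reports that they mostly did worse.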

15:05Juniper: And there's a statistic holding the whole transfer claim up — a moderate but real correlation, around point four-five, between how good a diagnosis is and whether the bug actually gets fixed. Not perfect. But strong enough that filtering on quality works. And the single strongest sub-component of that quality? Not the location accuracy. The solution actionability — the fix direction. Which is the deepest confirmation of the thesis: what transfers downstream is the how-to-fix-it, not the where.

15:38Finn: And then the efficiency side, which is almost a free lunch for the strong agents. One of the big models — a four-hundred-eighty-billion-parameter backbone — its localization token spend dropped from about a hundred-eighty-nine thousand tokens down to sixty-five thousand. A sixty-six percent cut. And its resolve rate didn't move at all. Stayed flat.
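The sixty-six percent figure follows directly from the two token counts quoted above. A short check, with a hypothetical helper name:

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Token counts as quoted in the episode: about 189,000 down to about 65,000.
print(round(percent_reduction(189_000, 65_000), 1))
```

The result is about 65.6, which rounds to the sixty-six percent quoted.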

16:02Juniper: So the strong agent doesn't get more accurate, but it stops paying to scout. It pays the localization tax once, cheaply, with the standalone detective — and then just... arrives ready to work. Across the board, that's roughly a third fewer tokens on searching, a quarter fewer overall, and about six more points of bugs fixed on average.

16:24Finn: So the clean takeaway is: don't feed findings uniformly. Weak agents want the diagnosis. Strong agents want the filtered diagnosis, or just the efficiency. The right move is selective injection — give the report when it's reliable, otherwise let the agent trust itself.
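Selective injection can be sketched as a simple policy. Everything here is an assumption: the class, the 0-to-1 quality score, and the threshold. In particular, the quality score stands in for an estimate that, as the hosts go on to discuss, the paper only obtains from a judge that sees the ground-truth patch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredFinding:
    text: str
    quality: float  # hypothetical 0..1 quality estimate


def select_for_injection(
    findings: List[ScoredFinding],
    agent_is_strong: bool,
    strong_threshold: float = 0.8,  # hypothetical cutoff
) -> List[ScoredFinding]:
    """Weak agents receive every finding; strong agents only the high-quality ones."""
    if not agent_is_strong:
        return list(findings)
    return [f for f in findings if f.quality >= strong_threshold]


# Hypothetical findings, used only to exercise the policy.
candidates = [
    ScoredFinding("confident diagnosis", 0.9),
    ScoredFinding("low-confidence guess", 0.3),
]
print(len(select_for_injection(candidates, agent_is_strong=True)))
print(len(select_for_injection(candidates, agent_is_strong=False)))
```

When nothing clears the threshold, the strong agent receives an empty list and falls back on its own localization, which matches the "otherwise let the agent trust itself" advice.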

16:41Juniper: Which sounds like a solved problem. And this is where I think you've been waiting to pounce, Finn.

16:47Finn: I have. Because that filter — the thing that makes the transfer reliable, that prevents the bad diagnoses from poisoning strong agents — is not deployable as presented. And this is the steelman, the place where the impressive numbers hit their actual limit.

17:03Juniper: Walk through it.

17:04Finn: The filter works by having a judge score each diagnosis and screen out the bad ones. But the judge is shown the ground-truth patch. The answer key. It's like grading a student's problem-solving by letting the grader peek at the solved exam first. It tells you what's possible if you had a perfect oracle. It does not tell you what you can do at test time — when, by definition, you don't have the fix yet. That's the whole reason you're running the agent.

17:32Juniper: The authors are explicit about that, to be fair.

17:35Finn: They are, and that's exactly why I trust the rest of the paper. But the cleanest version of their "we prevent negative transfer" story leans on information you wouldn't have in production. Until somebody builds a patch-free way for the system to judge its own findings, the quality-filtered results read as an analysis of what's achievable — not a working pipeline you can ship.

17:58Juniper: And the contamination point you planted earlier — that's the second edge.

18:02Finn: That's the bigger one. By their own measurement, about fifty-eight percent of localization recall on the Verified set is reachable from the bug report text alone — and it's concentrated in the famous libraries. Their masked control bounds the problem, but it doesn't eliminate it, and they say so. A truly clean test would need a held-out repository distribution — a city the model genuinely never visited — and that just isn't part of this benchmark.

18:30Juniper: You can even see it in the per-project breakdown.

18:33Finn: You can, and it's stark. With the paths masked, recall on scikit-learn is eighty-five percent, on requests eighty-seven. But on a couple of less famous projects? Thirty-three percent. The model clearly "knows" the popular libraries from pretraining. So a fair skeptic can say: the headline numbers partly reflect how well these models memorized Django, and we don't know how much of SHERLOC's edge survives on a genuinely novel codebase. The whole field shares this problem — but sharing it doesn't make the numbers transfer.

19:06Juniper: And there's a third thing on the cost side.

19:09Finn: There is. The very best numbers come from a two-hundred-thirty-five-billion-parameter model running up to twenty reasoning turns. That's a far heavier per-bug serving cost than the thirty-two-billion fine-tuned baselines it's being compared against. The authors honestly point to the thirty-billion row as the fair, like-for-like comparison — but the abstract leads with the giant. So the headline slightly oversells the practical advantage. And the whole downstream story is Python, two frameworks. That import-tracing tool needs a language-specific parser. JavaScript, Java, C++ — untested. "Drop-in across any backbone, any language" is aspirational, not demonstrated.

19:51Juniper: I'll concede all of that. What I won't concede is the conceptual core — and I don't think your critique touches it. Whether or not the absolute numbers are inflated by memorization, the relative finding stands: across all those controlled comparisons, diagnosis quality predicts repair success, and bare location doesn't. The shuffled control rules out the "text in the prompt" explanation. So even on a held-out codebase where everyone's numbers drop, the case-file-beats-address reframe should still hold.

20:24Finn: That I'll grant you. The reframe survives the contamination worry, because it's a claim about relationships, not absolutes. The deployable pipeline is where I keep my reservation — and I think it stays open. They've shown what an ideal filter could do. Nobody's yet shown the filter you can actually run.

20:45Juniper: Fair. And honestly, the fact that we can have this argument cleanly is because the paper handed us the tools to. They ran the masking controls. They flagged the non-deployable judge in their own words. They steered readers to the cost-fair row. That's about ten thousand GPU-hours of work, and a lot of it went into trying to disprove themselves. That's what good contamination-aware evaluation looks like, and it's rarer than it should be.

21:13Finn: Agreed. The epistemic hygiene here is the model, not just the method.

21:18Juniper: So let me land where this actually leaves us. The durable result isn't "SHERLOC tops the leaderboard." Leaderboards move. The durable result is the reframe: for years, localization has been scored like a search engine — did you fetch the right document. And this paper makes a strong case that that's measuring the wrong unit entirely. What changes a repair agent's behavior isn't the address. It's the diagnosis — the why and the how-to-fix. And once you accept that, a pile of received wisdom flips. More accurate isn't automatically more useful. A confident wrong answer is worse than no answer. And the same case file that rescues a weak agent can sandbag a strong one.

22:04Finn: Which is the lesson that travels furthest beyond code. Don't consume findings uniformly. The value of information depends on who's receiving it and how sure you are it's right.

22:16Juniper: So here's the question for you. Should the field keep pushing localization to be smarter — better detectives writing better case files, the way this paper does — or is the deeper signal that we should stop scoring these systems on retrieval accuracy at all, and start scoring them on whether the diagnosis actually changes the fix? Pick one. We read the replies.

22:40Finn: If you want to go deeper, the full annotated version of this episode is on paperdive.ai, with every technical term tap-to-define and links to the related papers grouped by theme, including the SWE-bench and contamination work we leaned on, plus our weekly and monthly roundups.

22:59Juniper: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Finn and I are both AI voices from ElevenLabs, and our producer isn't affiliated with either company. The paper is SHERLOC, on structured diagnostic localization for code repair agents, published June 23rd, 2026. We recorded this the very next day.

23:22Finn: So the next time an agent spends half its budget circling the block — remember, the fix wasn't a better map. It was sending a detective ahead to write the case file.