Episode 132 · Jun 11, 2026 · 30 min

The Agent Failed — But Did the Instructions Deserve to Be Followed?

Gautam, Radhakrishna, Gulwani

LLM Agent Systems · Prompt Engineering · Skill Libraries
Paper: SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Venue: arXiv:2606.10546
Year: 2026
Read the paper: arxiv.org/abs/2606.10546

When human experts write instruction documents for AI agents, pass rates jump sixteen points. When the model writes its own, the improvement is exactly zero, even though the documents look great. Microsoft's paper diagnoses why, with a fault-attribution trick that separates 'the instructions were bad' from 'the agent ignored good instructions', and takes an honest look at how much the fix actually buys.

What you'll take away

  • Why LLM-authored skills score zero improvement despite looking fluent and detailed — the valuable content is failure-derived trivia the model can't generate from general knowledge
  • How SkillAxe runs the agent twice (with and without the skill) and uses the skill's own stated rules as the grading rubric, so no external answer key is needed
  • The fault-attribution principle: an identical failure demands opposite repairs depending on whether the violated rule was precise enough to deserve following
  • The surprising decomposition: refined skills added nothing to per-attempt correctness; the entire gain came from helping the agent finish tasks at all, suggesting skills are institutional knowledge, not extra IQ
  • In the streaming experiment, refinement bought compression and discoverability (22 skills loaded twice as often) rather than accuracy: the naive 69-skill library hit the same 52% pass rate
  • Where the headline claims weaken: under the benchmark's native scoring SkillAxe closes only ~11% of the gap to human skills, confidence intervals are wider than the effect, and the authors' own 'fair grader' is an LLM judge they built themselves

Chapters

  1. 00:00 The zero-improvement puzzle
  2. 03:47 What a skill is, and why one bit of feedback isn't enough
  3. 07:35 The two-run differential diagnosis
  4. 11:23 Trigger geometry: measuring targeting with embeddings
  5. 15:11 Fault attribution: whose fault was the wrong shade of yellow?
  6. 18:58 Results: skills don't make agents smarter, they keep them from tripping
  7. 22:46 The flywheel experiment: compression, not accuracy
  8. 26:34 The steelman critique and the wrong-way flywheel


0:00 Bella: There's a benchmark that was built to measure one very specific thing. If you hand an AI a well-written instruction document, what the field calls a "skill," how much better does the agent get at its job? And when human experts write those documents, the answer is: a lot better. Pass rates jump by about sixteen percentage points. That's a big lever. So researchers asked the obvious follow-up: can the language model just write its own skill documents? And the answer came back zero. Not a small improvement. Zero. The model produces documents that are fluent, detailed, completely plausible-looking, and that help exactly as much as handing the agent nothing at all. The paper that digs into why, and what to do about it, is called "SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement," from a team at Microsoft. It went up on arXiv on June ninth, twenty-twenty-six, and we're recording two days later.

1:06 Tyler: And before we chase that zero, quick ground rules. The voice you just heard is Bella. I'm Tyler. We are both AI voices from Eleven Labs, the script for this episode was written by Anthropic's Fable 5, so the whole show is AI-generated, and the producer has no affiliation with Anthropic or with Eleven Labs. Which gives this episode a slightly funny mirror-in-a-mirror quality, because it's AI-generated content about why AI-generated content fails.

1:37 Bella: And the failure turns out to be much more interesting than "the model writes bad documents." Because the documents aren't obviously bad. If you read one, it looks great. It has sections, it has examples, it has confident step-by-step guidance. It just doesn't move the needle. So the puzzle isn't really about writing quality at all. It's about feedback. Nobody, human or machine, is getting told why a skill failed.

2:06 Tyler: Let's back up one step, though, because I think "skill" is doing a lot of work here and not everyone has the same picture. We're not talking about code, and we're not talking about training the model.

2:19 Bella: Right, a skill is documentation. Imagine a file sitting in a folder the agent can browse, titled something like "How to fill out Word templates correctly." It has a short description so the agent knows when to pull it up, some trigger conditions, and then a body of instructions. If you've heard of agent skills, this is exactly that pattern. The mental model I'd give people is an employee handbook that the intern may or may not consult. The agent reads it at runtime, and crucially, the agent keeps full discretion. It can ignore the skill. It can misread it. It can never open it at all.

2:58 Tyler: And that discretion is the whole ballgame, isn't it? Because if the agent is free to ignore the document, then when a task fails, you genuinely don't know what happened. Was the document wrong? Did the agent skip it? Did the document never get loaded in the first place?

3:17 Bella: That's exactly the diagnosis the authors land on. The only signal anyone has today is task-level pass or fail. One bit. And that one bit collapses at least four completely different failure modes into a single undifferentiated "it didn't work." Maybe the instructions were genuinely wrong. Maybe the skill fired on the wrong kind of task. Maybe the agent ignored perfectly good guidance. Or maybe the skill only covered one of several valid ways to solve the problem, and the agent happened to take a different road. Four diseases, one symptom. And here's the part that makes naive refinement actively dangerous, not just slow. If your improvement loop treats every failure as "the skill must be bad," you end up rewriting rules that were already correct, because the agent botched them, not because they were wrong. You don't just fail to improve. You erode your good content over time.

4:15 Tyler: That's worth sitting with, because it's an instance of a really old problem in learning systems: credit assignment. When something goes wrong in a multi-part system, which part deserves the blame? Get that wrong and your fixes compound the damage. And the reason this version of the problem is hard is that there's no oracle anywhere. Compare it to the famous skill-library work before this: Voyager, the Minecraft agent that built up a library of reusable abilities. Voyager could improve its skills because the environment told it the truth: the code either ran or it crashed, the goal was either achieved or it wasn't. Free verification. But "format this spreadsheet for the client" has no oracle. Nobody labels the output in deployment. No test suite, no game score, no reward function. So the question this paper is really asking is: can an agent diagnose and repair its own instruction documents using nothing but its own behavior? It's the difference between coaching a student with an answer key versus coaching them by watching two of their attempts and reasoning about what changed.

5:32 Bella: And that framing, watching two attempts, is literally the core move. Here's the insight in one sentence: a skill document already contains its own evaluation criteria. The skill says when it should activate. The skill says what rules to follow. The skill implies what a good outcome looks like. So you don't need an external answer key. You run the agent on the task twice, once bare and once with the skill injected, and you grade the difference against the skill's own stated rules.

6:06 Tyler: Which is a genuinely elegant move, because the thing being evaluated supplies the rubric for its own evaluation. Though I'll flag now, and we'll come back to this: every one of those grading steps is an LLM making a judgment call. There's no ground truth hiding anywhere in this system. Keep that in your back pocket.

6:29 Bella: Noted, and we will absolutely come back to it. So with those two runs in hand, SkillAxe diagnoses the difference along four dimensions, and the best way to hear them is as a funnel, like a doctor running a differential diagnosis instead of just being told "the patient didn't get better, change the treatment." Question one: did the skill help at all? An LLM judge looks at the with-skill output and the without-skill output, in randomized order so it doesn't know which is which, and decides: better, worse, or tie, and by how much. That's the outer-loop signal. A small regression means do some local rewriting. A catastrophic one means the skill's whole strategy might be wrong. Question two: did the skill fire on the right tasks in the first place? Maybe the instructions are fine but the targeting is off.

7:23 Tyler: Like prescribing the right drug to the wrong patient.

7:26 Bella: Exactly. Question three is the big one: did the agent actually follow the instructions, and if not, whose fault was that? And question four: does the skill cover all the legitimate ways the agent might approach this task, or just one of them? Four questions, four different repairs. Everything gets collected into what the authors call an improvement brief, a structured document of findings, and an LLM uses that brief to rewrite the skill. So instead of "the task failed, try again," the rewriter gets "your trigger description is too narrow, rule three is too vague to follow, and you don't cover the formula-based approach at all."
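
The blinded comparison behind question one can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names and verdict labels are hypothetical, and the judge itself (an LLM call in the paper) is left out, so only the shuffling and un-shuffling logic is shown.

```python
import random

def blind_pair(with_skill: str, without_skill: str, rng: random.Random):
    """Shuffle the two outputs so the judge cannot tell which run used the skill.

    Returns (first, second, key), where key names the slot holding the
    with-skill output. Hypothetical helper, not taken from the paper.
    """
    if rng.random() < 0.5:
        return with_skill, without_skill, "first"
    return without_skill, with_skill, "second"

def unblind(verdict: str, key: str) -> str:
    """Translate the judge's verdict ('first', 'second' or 'tie') back to skill terms."""
    if verdict == "tie":
        return "tie"
    return "skill_better" if verdict == key else "skill_worse"
```

The judge only ever sees `first` and `second`; the key stays with the harness, so any preference the judge has for a position cannot systematically favor the with-skill run.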

8:08 Tyler: Let's take the targeting one first, since it's the most geometric. How does a system measure whether a skill fires on the right tasks without ever deploying it?

8:19 Bella: With embeddings, and if that word is new, here's the only picture you need. An embedding model converts a phrase into a point on a giant semantic map, where phrases with similar meanings land near each other regardless of exact wording. "Build a budget spreadsheet" and "create an expense tracker" sit close together. "Summarize this PDF" sits far away. Once you have that map, you can do geometry on meaning. So picture each skill as a store in a mall, with signage. The sign says who should come in ("create spreadsheets from scratch") and who shouldn't ("not for chart images"). Plot all those phrases on the map and a good skill carves out a clear catchment area with crisp boundaries. A bad skill is a store whose welcome sign only attracts a sliver of its intended customers, or whose "no entry" sign hangs confusingly close to the front door, so the wrong shoppers keep wandering in. SkillAxe measures exactly those three things: is the activation zone wide enough, are the exclusions far enough away, and is any single ambiguous phrase sitting dangerously right on the boundary.

9:31 Tyler: There's one technical wrinkle in there I actually think is clever enough to spell out. The word "spreadsheet" appears in both the welcome sign and the no-entry sign: "create a spreadsheet" versus "do not trigger for spreadsheet chart images." A naive embedding of just the word would put both in the same spot. But they embed each phrase with its surrounding sentence, so the negation context pulls the excluded "spreadsheet" toward a different region of the map. Same word, different neighborhoods, because the sentence around it changes what it means.
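
To make the "geometry on meaning" concrete, here is a toy margin calculation. It is a sketch under loud assumptions: the two-dimensional vectors are invented by hand to stand in for real embeddings, and `margin` is a hypothetical score (best match to a welcome phrase minus best match to an exclusion phrase), not the paper's exact metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def margin(query, include, exclude):
    """Positive when the query sits closer to the welcome sign than to the no-entry sign."""
    best_in = max(cosine(query, v) for v in include)
    best_out = max(cosine(query, v) for v in exclude)
    return best_in - best_out

# Hand-made stand-ins for embeddings (assumption: a real system would call an embedding model).
include = [(1.0, 0.1)]        # "create spreadsheets from scratch"
exclude = [(0.1, 1.0)]        # "not for chart images"
budget_task = (0.9, 0.2)      # "build a budget spreadsheet"
chart_task = (0.2, 0.9)       # "export this chart as an image"

budget_margin = margin(budget_task, include, exclude)  # positive: should trigger
chart_margin = margin(chart_task, include, exclude)    # negative: should not trigger
```

A margin near zero corresponds to the ambiguous phrase sitting right on the boundary; a wide positive margin is the crisp catchment area Bella describes.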

10:08 Bella: And the trigger analysis produced one of my favorite incidental findings in the whole paper. When they measured human-written skills on these metrics, it turns out humans almost never write exclusion clauses at all: the negative-specificity score for human skills is essentially zero. Humans write "this is for X" and just trust the reader to infer what it's not for. LLMs have the opposite vice: they over-enumerate scenarios but draw fuzzy boundaries. After refinement, SkillAxe's skill descriptions ended up with roughly three times wider discrimination margins than the human-written ones, better at matching the right task than the experts.

10:50 Tyler: Okay. Now the centerpiece. Question three, compliance and fault attribution, because this is the idea I'd want every listener to walk away with even if they forget everything else.

11:03 Bella: So here's how it works. An LLM reads the skill document and extracts its rules as a grading rubric, and notice, the rubric comes from the skill itself, not from any external label. Each rule gets a severity weight. Then each rule gets judged against file-level evidence: the actual formulas in the spreadsheet, the actual formatting codes, the structural diff between input and output. Not the agent's claims about what it did. The receipts. And then comes the question the paper hangs everything on. When a rule was violated, SkillAxe doesn't just mark it failed. It asks: was this rule precise enough to deserve following? The worked example is perfect. The skill says to highlight certain cells in yellow, and the agent uses the wrong shade, a pale buttery yellow instead of pure yellow. Same observed behavior in both scenarios. Now, scenario one: the rule specified the exact six-character color code for pure yellow. The agent had everything it needed and got it wrong anyway. That's the agent's fault: the rule is fine, preserve it. Scenario two: the rule just said "use a yellow background." Vague. The agent guessed, plausibly, and guessed wrong. That's the skill's fault: sharpen the rule, add the code. Identical failure. Opposite repairs. And the only way to know which repair is right is to ask whether the instruction deserved to be followed.

12:36 Tyler: The recipe version of this is how it clicked for me. A dish comes out burnt. If the recipe said "bake at three-fifty for twenty-five minutes" and the cook eyeballed it, you don't rewrite the recipe: the recipe was fine, your cook is unreliable. If the recipe said "bake until done," the recipe eats the blame. And the deeper point is what happens to a kitchen that gets this wrong: if you rewrite every recipe after every bad dish, you will eventually ruin all your good recipes. That's the failure mode SkillAxe is built to prevent. Mechanically, the way they encode it is a kind of partial pardon. A rule's credit starts with how well the agent adhered to it, and then the portion of the failure that was the agent's doing gets restored to the skill. So a well-written rule that the agent ignored doesn't drag the skill's score down. And the output is two separate numbers instead of one: a compliance score that says how well the agent executed, and a skill score that says how well the instructions were written. Low compliance means the agent failed. Low skill score means rewrite the document. Those used to be one indistinguishable signal.
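
The partial pardon Tyler describes can be written down as a small scoring sketch. Everything here is an assumption made for illustration: the field names, the severity-weighted averaging, and the formula "skill credit = adherence + agent's share of the shortfall" are one reading of the episode's description, not the paper's stated equations.

```python
from dataclasses import dataclass

@dataclass
class RuleJudgment:
    rule: str
    severity: float     # hypothetical weight: how much this rule matters
    adherence: float    # 0..1, how well the agent followed the rule
    agent_fault: float  # 0..1, share of the shortfall blamed on the agent

def scores(judgments):
    """Return (compliance, skill_score) as severity-weighted averages."""
    total = sum(j.severity for j in judgments)
    compliance = sum(j.severity * j.adherence for j in judgments) / total
    # Partial pardon: the agent's share of the miss is credited back to the skill.
    skill = sum(
        j.severity * (j.adherence + (1 - j.adherence) * j.agent_fault)
        for j in judgments
    ) / total
    return compliance, skill

# Same observed failure (wrong shade of yellow), opposite attributions.
precise = [RuleJudgment("fill with the exact color code", 1.0, adherence=0.0, agent_fault=1.0)]
vague = [RuleJudgment("use a yellow background", 1.0, adherence=0.0, agent_fault=0.0)]

precise_scores = scores(precise)  # compliance 0.0, skill 1.0: keep the rule
vague_scores = scores(vague)      # compliance 0.0, skill 0.0: rewrite the rule
```

Both cases report zero compliance, which matches the identical observed behavior; only the skill score separates "preserve the rule" from "sharpen the rule".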

14:00 Bella: And there's a real case in the results where this mattered. A court-form-filling task was scoring fifty percent, and the compliance diagnostics caught that the agent was ignoring field-ordering constraints that were actually in the skill. The fix wasn't to rewrite correct rules. It was to make the ones being ignored harder to miss. After refinement, that task went to a perfect score. The fourth diagnostic is quicker but it carries a nice conceptual reframe. A spreadsheet task might be solvable four different ways: a data library, a different spreadsheet-manipulation library, Windows automation, or plain in-sheet formulas. The skill is not a program; it can't force the agent down one road. The paper's phrase is that skills act as "guidance over a broader space of possible agent behaviors." So if the agent legitimately takes road C and your skill only documents road A, the skill is operationally useless even though every sentence in it is correct. SkillAxe enumerates the plausible solution paths and checks that each one has some chunk of skill text covering it. A skill that brilliantly covers one of four valid paths scores low, by design.
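
The path-coverage idea reduces to a simple ratio once the paths and their covering text are known. The sketch below assumes that mapping is handed in as a plain dictionary; in the paper both the path enumeration and the matching are done by an LLM, and the section labels here are made up.

```python
def path_coverage(paths, covered_by):
    """Fraction of plausible solution paths that at least one skill section covers."""
    if not paths:
        return 0.0
    covered = [p for p in paths if covered_by.get(p)]
    return len(covered) / len(paths)

# The four roads from the episode's spreadsheet example.
paths = [
    "data library",
    "spreadsheet-manipulation library",
    "Windows automation",
    "in-sheet formulas",
]

# Hypothetical skill that documents only the first road, however well.
narrow_skill = {"data library": ["Section 2: loading and writing sheets"]}

narrow_score = path_coverage(paths, narrow_skill)  # 0.25
```

A skill that is flawless on one road of four still scores 0.25 here, which is the "scores low, by design" behavior.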

15:19 Tyler: Which encodes a real opinion about what a skill should be. Not a script. A map of the territory, with notes on every major route. So, results. Bella, give us the headline, and then I want to take over, because what's underneath the headline is the most interesting empirical finding in the paper.

15:39 Bella: The headline: on that benchmark, one single iteration of this loop (run twice, diagnose, rewrite once) lifts LLM-authored skills from about thirty-two and a half percent pass rate to about forty-two. A twenty-eight percent relative gain, closing somewhere between half and two-thirds of the gap to human-written skills, depending on how you grade. We'll get to that "depending" later, because you'll have things to say about it.

16:08 Tyler: Oh, I will. But first, the decomposition. The authors split pass rate into two factors: coverage times quality. Coverage is: did the agent finish at all, did it produce something evaluable rather than crashing or stalling out? Quality is: given that it finished, was the output actually right? And here's the result. Among tasks that completed, correctness with skills and correctness with no skills at all are identical. Fifty-seven point one percent. Both. To the decimal. The skills contributed nothing, nothing, to whether a completed answer was correct. The entire gain, all of it, came from coverage. Without skills, the agent finished under half its tasks. With refined skills, almost three-quarters. Skills don't make the agent smarter. They keep it from tripping on the way to the finish line.
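
The coverage-times-quality split is easy to reproduce on toy numbers. The counts below are invented for illustration and are not the paper's data: they are chosen only so that quality comes out to 4/7 (about 57.1%) in both conditions while coverage moves from 40% to 70%.

```python
def decompose(n_tasks, n_completed, n_passed):
    """Split pass rate into coverage (finished at all) times quality (right, given finished)."""
    coverage = n_completed / n_tasks
    quality = n_passed / n_completed if n_completed else 0.0
    return coverage, quality, coverage * quality

# Hypothetical counts, not taken from the paper.
bare = decompose(n_tasks=70, n_completed=28, n_passed=16)     # coverage 0.40
refined = decompose(n_tasks=70, n_completed=49, n_passed=28)  # coverage 0.70
```

With quality held fixed, the ratio of the two pass rates equals the ratio of the two coverages, which is the sense in which the whole gain "came from coverage".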

17:06 Bella: That genuinely reframes what these documents are for, doesn't it.

17:11 Tyler: It does, and there's a story in the appendix that makes it visceral. A task involves filling in a Word template, replacing placeholders like "candidate name" wrapped in double curly braces. The no-skill agent does the obvious thing: search for the placeholder, replace it. And it silently fails. Because Word, internally, doesn't store that placeholder as one string. It splits text across multiple internal chunks ("candi" might live in one chunk and "date name" in another), so a search for the full placeholder never matches anything. The agent isn't being dumb. It's missing a piece of arcane procedural trivia that no amount of reasoning ability reconstructs from first principles. The refined skill explicitly warns about this and says: do replacement at the paragraph level instead. The analogy I'd reach for is the brilliant consultant who flies in to fix your company's finances and fails, not because her analysis was wrong, but because she couldn't badge into the building, didn't know the file server needs VPN, and the third-floor printer jams unless you use tray two. She didn't need more intelligence. She needed the orientation packet. That's what skills turned out to be: institutional knowledge, not extra IQ.
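
The split-placeholder trap is easy to reproduce without Word at all. In this sketch a paragraph is modeled as a plain list of strings standing in for Word's internal chunks; the function names are hypothetical, and a real implementation would operate on the document's run objects and would need to decide what to do about per-chunk formatting.

```python
def replace_per_chunk(chunks, placeholder, value):
    """Naive approach: search each chunk on its own. Misses placeholders split across chunks."""
    return [c.replace(placeholder, value) for c in chunks]

def replace_per_paragraph(chunks, placeholder, value):
    """Join the chunks, replace once, and put the result back in the first chunk.

    Simplification: formatting that differed between chunks is collapsed.
    """
    joined = "".join(chunks)
    if placeholder not in joined:
        return list(chunks)
    return [joined.replace(placeholder, value)] + [""] * (len(chunks) - 1)

# The placeholder is split mid-word, as in the episode's example.
paragraph = ["Dear {{candi", "date name}}, welcome aboard."]

naive = replace_per_chunk(paragraph, "{{candidate name}}", "Ada")
fixed = replace_per_paragraph(paragraph, "{{candidate name}}", "Ada")
```

The naive version returns the paragraph unchanged and raises no error, which is exactly why the failure is silent; the paragraph-level version yields "Dear Ada, welcome aboard."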

18:31 Bella: And that quietly pushes against a big assumption in agent research: that performance gaps are reasoning gaps, that the model just isn't smart enough yet. This says a large share of agent failure, at least for capable models, is the printer-jams-unless-tray-two layer of the world. Which is also why the LLM-authored skills scored zero originally, right? A model writing a skill from its general knowledge produces fluent advice about things the agent already knows. The valuable content is exactly the stuff nobody knows until something fails.

19:07 Tyler: Which is the perfect setup for act two, because act two is about learning from failures at scale. Take it away.

19:14 Bella: So the second experiment moves from "can you fix one skill for one task" to "can this run as a production flywheel." The setting uses Excel Copilot as the agent, and worth noting, this is a Microsoft team, this is home turf for them. They start with no skills at all and stream two hundred training tasks past the system. For each incoming task, SkillAxe decides: is this new territory that deserves a brand-new skill, is it close to an existing skill that should be improved, or is it a near-duplicate worth skipping? Improvements per skill are capped so no skill overfits to a handful of examples. And the routing does something organic and kind of lovely: related tasks naturally cluster into broad skills. The top five skills end up absorbing three-quarters of all the routed tasks. Compare the naive baseline, which writes one narrow skill per failure pattern and sprawls out to sixty-nine documents. SkillAxe ends with twenty-two. Then the honest test. On held-out tasks, the skills are optional: the agent sees a manifest of names and descriptions and decides for itself whether to load anything, exactly like real deployment. The bare agent passes sixteen percent of tasks. With the SkillAxe library: fifty-two. More than tripled.
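
The three-way routing decision can be sketched as a small function. The episode does not say how the decision is made, so the similarity score, both thresholds, and the cap value below are all assumptions picked for illustration.

```python
def route(similarity_to_nearest, improvements_so_far, *,
          new_below=0.4, duplicate_above=0.9, cap=3):
    """Decide what to do with an incoming task (all thresholds are hypothetical).

    similarity_to_nearest: 0..1 similarity between the task and the closest
        existing skill (0.0 when the library is empty).
    improvements_so_far: how many times that closest skill was already refined.
    """
    if similarity_to_nearest < new_below:
        return "create"    # new territory: write a brand-new skill
    if similarity_to_nearest >= duplicate_above:
        return "skip"      # near-duplicate of what the skill already covers
    if improvements_so_far >= cap:
        return "skip"      # cap reached: avoid overfitting one skill
    return "improve"       # close to an existing skill: refine it

decisions = [
    route(0.0, 0),    # empty library
    route(0.6, 1),    # related task, skill still under the cap
    route(0.6, 3),    # related task, cap already hit
    route(0.95, 0),   # near-duplicate
]
```

Routing related tasks to "improve" rather than "create" is what lets a few broad skills absorb most of the stream instead of one narrow document per failure.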

20:45 Tyler: And now the number I find most interesting, because it complicates the headline. The naive sixty-nine-skill library also hits fifty-two percent. Exactly. Both libraries fix the identical set of eighteen tasks.

21:00 Bella: Right, so on accuracy, refinement bought nothing there; the lift came from having skills at all. What SkillAxe bought is compression and discoverability. Twenty-two skills doing the work of sixty-nine, and (this is the part I love) the refined skills get loaded by the agent nearly twice as often. About thirty-six percent of queries versus twenty. The toolbox analogy: a junk drawer of sixty-nine unlabeled items and a toolbox of twenty-two well-labeled compartments can contain the same tools, but you reach for the toolbox. The agent recognizes what each refined skill is for, so it actually uses them.

21:42 Tyler: And to the authors' credit, they say this themselves, almost word for word: the primary benefit here is compression, not accuracy. Which matters operationally, by the way, not just aesthetically. That manifest of descriptions gets injected into every prompt. Fewer, sharper skills means lower cost, lower latency, and documentation that actually gets read. Anyone who's maintained a wiki knows that sixty-nine documents nobody opens is worth less than twenty-two that people do.

22:17 Bella: Okay, Tyler, you've been holding your tongue since the disclosure. The critique. What should a careful reader push on?

22:25 Tyler: Several things, and all of them come from the paper's own numbers, which I want to stress: this is a paper honest enough to hand you its own counterarguments. First and biggest: the headline gap-closing claim depends entirely on which grading protocol you read. The abstract says SkillAxe closes forty-seven to sixty-seven percent of the gap to human-authored skills. But the benchmark ships with its own native scoring: human-written test suites, continuous rewards, the most conservative measure and the one most comparable to prior work. Under that scoring, SkillAxe gets twenty-seven point three. No skills gets twenty-four point nine. Human skills get forty-six. Do the arithmetic and SkillAxe closes about eleven percent of the gap. Eleven, not sixty-seven.
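
Tyler's arithmetic can be checked directly. The three scores are the native-scoring figures quoted above; the function name is just a label for the usual "fraction of gap closed" ratio.

```python
def gap_closed(baseline, method, ceiling):
    """Fraction of the baseline-to-ceiling gap that the method recovers."""
    return (method - baseline) / (ceiling - baseline)

# Native-scoring numbers as quoted in the episode.
no_skill = 24.9
skillaxe = 27.3
human = 46.0

native_gap = gap_closed(no_skill, skillaxe, human)  # 2.4 / 21.1, about 0.114
```

That is roughly 11% of the gap, against the 47 to 67% range that the abstract reports under the other grading protocols.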

23:19 Bella: So where do the flattering numbers come from?

23:22 Tyler: From a binary pass-rate framing plus an alternative grader the authors built themselves. They call it the fair grader: a model that sees the task, the inputs, and the rendered outputs, the actual spreadsheet as an image, but not the test suite, and asks: did the agent substantively complete the task? And look, their motivation is legitimate. The native test suites really are over-rigid. They reject answers over exact decimal precision, specific metadata fields, formulas that don't trigger recalculation. Solving the task a different valid way gets you a zero. The teacher-grading analogy is apt: a rubric-reading human is more humane to legitimate alternatives than a multiple-choice answer key. But it's also more subjective. And when a paper introduces its own judge and then looks best under that judge's scoring, your eyebrow should go up. The authors flag LLM-judge bias as a limitation themselves. They know.

24:28 Bella: There's also the evaluation-context issue, right? On the benchmark, each skill gets refined using the very task it's then evaluated on.

24:38 Tyler: Yes, and to be precise about what that does and doesn't mean. No ground-truth labels leak in; the loop never sees the answer key. But the refinement loop does get to watch the agent attempt the test task and adjust the skill accordingly. That's a meaningfully easier setting than improving a skill for tasks you haven't seen. The streaming experiment exists precisely to answer that with a real train-test split, but as we said, in that experiment the refinement added zero accuracy over naive skills. And then statistical power. Two trials per task. Confidence intervals of plus-or-minus eight to ten percentage points, which is wider than the SkillAxe-over-raw-LLM effect itself under native scoring. The authors literally write that a third trial seed will tighten things. There's also a genuinely odd wrinkle buried in the fair-grader numbers: raw LLM skills score below no-skill there. And among completed tasks, the fair-grader quality with SkillAxe is seventy-five percent versus ninety-one without any skills. Skills push the agent to attempt more tasks, but each attempt finishes at lower per-attempt quality. "Closes the gap" glosses right over that texture.

26:05 Bella: So the summary would be something like: the diagnostic framework is genuinely novel and the fault-attribution idea is sound, the controlled improvement is real but modest under the strictest scoring, and the deployment win is compression and usability rather than accuracy.

26:26 Tyler: That's where I land. And I'd add: the authors' own limitations section gets most of the way there, which raises my trust rather than lowering it.

26:37 Bella: Let's do those, because they're substantive. First, rule-level refinement can't catch structural misalignment. If a skill teaches a fundamentally wrong strategy, it can be perfectly internally consistent. The agent follows every rule, compliance looks great, and the dish is still wrong because the recipe was for the wrong dish. The funnel has no question that catches that. Second, skills are evaluated one at a time. In a real library, multiple skills can be active simultaneously and conflict with each other, and that is not modeled at all. Third, the judge problem we've already covered: both compliance grading and the fair grader are LLMs, with known positional and other biases. And fourth, the one I find most sobering: the broader-impacts section concedes that imperfect diagnostics could reinforce wrong procedural guidance over repeated cycles. The flywheel can spin the wrong way. If the judge misattributes fault even occasionally, the loop confidently bakes errors into persistent documents that every future run inherits. They explicitly recommend human review before deploying auto-refined skills in production.

27:53 Tyler: Which is the right call, and it connects to why I think the fault-attribution idea is the piece of this paper that outlives the benchmarks. The principle generalizes way beyond skills: any self-improvement loop that treats every failure as evidence against the artifact being improved will erode its correct content over time. That applies to prompt optimizers, to memory systems, to agents editing their own tool descriptions, to anything that maintains a persistent artifact from noisy outcome signals. Separating "the instructions were bad" from "the executor ignored good instructions" is a design principle, not a spreadsheet trick.

28:36 Bella: And it pairs with the key line of the paper, the one worth quoting straight: the central question is not only whether the agent followed the skill, but whether the skill deserved to be followed in the first place. There's something almost managerial about it. Good managers don't rewrite the process every time an employee makes a mistake. They ask whether the process was followable.

29:01 Tyler: The other thing that sticks with me is the reframing you landed earlier, Bella: skills as institutional knowledge rather than intelligence. Because it suggests the path to better agents in messy real-world domains isn't only bigger models. It's agents that accumulate the orientation packet (the Word-splits-placeholders trivia, the printer-tray-two facts) from their own failures, automatically. The paper's flywheel framing is "every failed run makes the next one more likely to succeed." Aspirational, given everything we just said about judges and wrong-way flywheels. But for the first time there's a concrete mechanism behind the slogan: not just try-again, but diagnose which of four things broke, and fix that one.

29:49 Bella: Fluency without function was the disease; differential diagnosis is the treatment. Whether the treatment scales past one iteration and survives its own judges is the next paper. If you want to dig in yourself, the paper and a few related reads are linked in the show notes. And for the full transcript, with every piece of jargon tappable for a definition, plus links to other episodes that share these ideas, head to paperdive.ai.

30:18 Tyler: Thanks for spending the commute with us.

30:21 Bella: This has been AI Papers: A Deep Dive. See you next time.