Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
About this episode
A team at Microsoft moved a single Markdown file between two completely different agent systems and watched spreadsheet performance jump sixty points — no retraining, no code changes. The trick is treating the prompt as a parameter and applying actual optimizer discipline: learning rates, validation gates, rejected-edit buffers, momentum. It's the difference between a chef scribbling in margins and a real test kitchen.
What you'll take away
- Why prior LLM self-revision systems mostly fail: they look like optimizers but are missing the structural ingredients — bounded step size, validation gates, persistent failure memory — that make neural net training reliable
- How a strict validation gate plus bounded edits combine to keep the rejected-edit buffer meaningful, and why removing the long-horizon machinery costs 22 points on the spreadsheet benchmark
- What the trained skill documents actually contain — specific procedural rules like 'write evaluated static values instead of relying on Excel recalculation' that fill a gap between pretrained knowledge and task instances
- Why the cross-harness transfer result (Codex to Claude Code, +60 points) is the cleanest evidence that the method captures domain knowledge rather than harness-specific syntax
- The selection-bias risk in the validation gate the paper doesn't fully address, plus the method's hard dependency on a reliable scalar reward signal
- Why small models gain disproportionately from trained skills — and the economic implication of training once on a frontier model then deploying on cheaper ones
Chapters
- 00:00 The sixty-point transfer result
- 04:49 The chef versus the test kitchen
- 06:58 Five pieces borrowed from neural net training
- 10:27 What the trained skills actually say
- 13:56 Reading the empirical claims carefully
- 17:25 The edit economy and why compactness matters
- 20:54 Steelman critiques
- 24:23 What changes if this framing catches on
References in this episode
- TextGrad: Automatic 'Differentiation' via Text — The textual-optimization framework SkillOpt directly benchmarks against, and a c
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning — The other main baseline SkillOpt is measured against — useful for seeing what re
- Reflexion: Language Agents with Verbal Reinforcement Learning — An early and influential entry in the self-revision ecosystem the episode situat
- Self-Refine: Iterative Refinement with Self-Feedback — Another canonical precursor in the LLM-self-improvement line the episode argues
Full transcript
0:00Juniper: A team at Microsoft and a few Chinese universities took a skill they'd trained inside one AI agent system, dropped it into a completely different agent system from a different vendor — different tool names, different file conventions, different command surface — and watched performance on a hard spreadsheet benchmark jump from twenty-two percent to almost eighty-two. Sixty points of improvement. Same text file. No retraining. No code changes. They just moved the document over.
0:31Finn: And the document, by the way, is a Markdown file. A few hundred to a couple thousand tokens. Something a human can read in three minutes and understand.
0:40Juniper: That's the result that made me want to do this episode. The paper went up on arXiv on May twenty-second, twenty-twenty-six, and we're recording three days later. Quick ground rules before we dig in: what you're hearing is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Juniper, that's Finn, and we're both AI voices from ElevenLabs. Neither company is involved in producing the show. The paper is "SkillOpt: Executive Strategy for Self-Evolving Agent Skills," and the reason that sixty-point transfer number matters is that it's evidence the file isn't just a clever set of commands for one specific tool. It's encoding something more abstract, and the rest of the paper is an argument about how to produce documents like that on purpose.
1:29Finn: Okay, so where do we start? Because there are two threads tangled together here. There's the empirical thread — a frankly absurd benchmark sweep where the method wins or ties on every single one of fifty-two evaluation slots. And there's the conceptual thread, which is the part I actually find more interesting: the idea that you can take the discipline of training neural networks and apply it, almost literally, to editing a text document.
1:58Juniper: Let's start with the conceptual one, because the empirics only make sense once you've got the framing. So — the setup. You have a frontier language model. Could be GPT-5.5, could be Claude, could be something open-source. You want it to do well on some task that requires a bit of procedural knowledge — write Excel formulas correctly, navigate a household-simulation environment, answer search-grounded questions. You have three options for adapting the model. Option one: fine-tune the weights. For closed models, you can't. For open ones, it's slow and expensive. Option two: hand-write a really good system prompt or skill document. Brittle. You're guessing. Option three — and this is the interesting one — you let the model rewrite its own instructions. There's a whole small ecosystem of papers in the last two years doing this. Reflexion, Self-Refine, TextGrad, GEPA, a bunch of "EvoSkill" and "AutoSkill" variants. The pitch is basically: have the LLM look at its own failures and propose better prompts.
3:05Finn: And the dirty secret of that whole subfield, which this paper kind of openly says out loud, is that those systems mostly don't work that well. They improve things a little, sometimes, and other times they make things worse, and there's no real way to know in advance which it'll be.
3:24Juniper: Right. And the SkillOpt authors look at that landscape and they ask a really pointed question. They say: these systems are all in the rough shape of an optimizer. There's a parameter — the prompt. There's feedback — task scores. There's an iteration loop. So why do none of them behave like an actual optimizer in the engineering sense? Where's the learning rate? Where's the validation set? Where's the momentum? Where's the rejection of bad updates?
3:54Finn: This is the move I want to highlight, Juniper, because it's not just a rhetorical flourish. They're not saying "wouldn't it be cute to use deep-learning vocabulary." They're saying: there is a specific list of structural ingredients that make neural network training reproducible and reliable, and the LLM-self-revision literature is missing every single one of them. Let's put them back in. One by one. With text instead of weights.
4:21Juniper: And here's the analogy I keep coming back to. Imagine the difference between a chef who occasionally scribbles ideas in the margin of a recipe — sometimes the dish gets better, sometimes worse, and there's no log of what didn't work — versus a test kitchen. A test kitchen tries a variation, tastes it against a control batch, throws it out if it scored worse, and keeps a notebook of failed experiments so nobody re-runs them. Both are "iterating on the recipe." Only one is actually training the recipe.
4:53Finn: That's the whole episode, basically. The chef versus the test kitchen.
4:57Juniper: So let me walk through the test kitchen. There are five pieces, and each one is borrowed from how you train a neural network. Before the pieces, there's the setup itself. SkillOpt has two language models, and you have to keep them straight. There's a student, which is the model you actually want to do the task. And there's an optimizer, a separate model whose job is to watch the student work and propose edits. The student runs the task. The optimizer rewrites the document. They never swap roles. The student never modifies its own instructions. The optimizer never executes the task.
5:33Finn: Think coach and athlete. The coach watches the matches and updates the playbook. The athlete reads the playbook and goes to compete. The athlete never watches their own footage and rewrites their own strategy in the middle of the game.
5:48Juniper: Exactly. And the playbook is what travels. If the coach can't come to the next tournament, the playbook still works. That's why the transfer results are possible — but we'll come back to that. So, piece one of the optimizer machinery: bounded step size. A learning rate. In neural network training, your learning rate controls how far you move at each step. Too big, you overshoot and oscillate. Too small, you crawl. SkillOpt has a literal numerical cap on how many edits the optimizer is allowed to apply to the skill document in a single round. Default is four. Then it decays down to two as training goes on. Big strokes early, small refinements late. Exactly the same shape as the cosine schedules you'd use training a transformer.
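The edit cap described here can be sketched as a small schedule function. This is a minimal illustration, not the paper's code: the episode only states a default of four edits decaying to two with a cosine-like shape, so the exact formula, the function names `edit_budget` and `apply_budget`, and the rounding are assumptions.

```python
import math


def edit_budget(step: int, total_steps: int, start: int = 4, end: int = 2) -> int:
    """Maximum number of edits allowed at `step` (0-indexed) of `total_steps`.

    Assumption: cosine decay from `start` to `end`; the actual schedule
    used by SkillOpt may differ.
    """
    if total_steps <= 1:
        return start
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1.0 to 0.0
    return round(end + (start - end) * cosine)


def apply_budget(proposed_edits: list, budget: int) -> list:
    """Keep only the first `budget` proposed edits for this round."""
    return proposed_edits[:budget]


# Budgets over an 11-step run: large strokes early, small refinements late.
budgets = [edit_budget(s, 11) for s in range(11)]
```

Capping the number of edits, rather than the number of characters changed, is one simple way to bound how far the document can move in a single round.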
6:35Finn: And the reason this matters is more subtle than it sounds. If you let the optimizer rewrite the document however much it wants each round, you destroy the value of something a little further down the list. Which is (Juniper, this is your favorite part, I think) the rejected-edit buffer.
6:53Juniper: It is my favorite part. So. Piece two: the validation gate. In machine learning, you split your data three ways. You train on one chunk, you select between candidate models on a second chunk — the validation split — and you report on a third chunk you never touched. SkillOpt does this in an unusually strict way. Every time the optimizer proposes a set of edits, you apply them, you get a candidate skill, and you run that candidate on a held-out validation split. If the candidate strictly improves the score, you accept it. If it ties or regresses, you reject it. No drift. No "well it's about the same, let's go with the new version."
7:35Finn: This is the piece I think most of the prior work was missing, and it's almost embarrassing in hindsight. You have an LLM that's eloquent. It writes a beautifully-argued explanation of why this new edit will help. The eloquence is convincing. The edit makes things worse. Without a gate, the bad edit lands anyway because the diagnosis sounded reasonable.
7:57Juniper: Right. The analogy here is a code review with a strict CI gate. Your patch has to make at least one test pass that wasn't passing before. The reviewer's prose argument doesn't count. Most self-improving LLM systems are like a team that merges every plausible-looking patch because the author argued convincingly. SkillOpt is like a team that won't merge anything red.
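The strict gate can be written in a few lines. The sketch below assumes a scoring function `score(skill, task)` that returns a number per task; that function, and the names `mean_score` and `gate`, are hypothetical stand-ins for whatever grader a benchmark provides.

```python
from typing import Callable, Sequence, Tuple


def mean_score(skill: str, tasks: Sequence[str],
               score: Callable[[str, str], float]) -> float:
    """Average score of a skill document over a set of tasks."""
    return sum(score(skill, task) for task in tasks) / len(tasks)


def gate(current: str, candidate: str, val_tasks: Sequence[str],
         score: Callable[[str, str], float]) -> Tuple[str, bool]:
    """Accept the candidate only if it strictly improves the validation score."""
    current_score = mean_score(current, val_tasks, score)
    candidate_score = mean_score(candidate, val_tasks, score)
    accepted = candidate_score > current_score  # ties and regressions are rejected
    return (candidate if accepted else current), accepted


# Toy scorer (an assumption for illustration): a task is solved when the
# skill text mentions it.
def toy_score(skill: str, task: str) -> float:
    return 1.0 if task in skill else 0.0


val_tasks = ["alpha", "beta"]
tie_result = gate("alpha", "alpha plus eloquent prose", val_tasks, toy_score)
better_result = gate("alpha", "alpha beta", val_tasks, toy_score)
```

In this toy run the eloquent candidate ties and is rejected, while the candidate that actually solves a second task is accepted. The optimizer's written justification never enters the decision.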
8:19Finn: Okay, and now the rejected-edit buffer, which is where it gets clever.
8:23Juniper: Piece three. When an edit gets rejected, SkillOpt doesn't just throw it away. It stores it. The edit, and the score drop it caused, get appended to a buffer. And the next time the optimizer is asked to propose new edits, it sees that buffer. So the optimizer learns, in-context, "don't propose this kind of change again, it cost us four points last round."
8:45Finn: And this is where the bounded step size becomes structurally important. Because the buffer is only useful if the next version of the skill document is close enough to the previous one that the negative feedback still applies. If you let the optimizer rewrite the whole document each round, the rejected edits from last round are commentary on a document that no longer exists. The buffer becomes irrelevant. The cap on step size is what keeps the document close enough to itself across iterations that the memory of failure stays meaningful.
9:17Juniper: That's the load-bearing point. The two pieces are coupled. Bounded edits plus persistent failure memory. Neither one works without the other.
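One way to picture the rejected-edit buffer is as a small append-only log that gets rendered into the optimizer's prompt. The class names, the `max_items` cap, and the prompt wording below are all assumptions made for illustration; the episode only says that the edit and its score change are stored and shown to the optimizer on later rounds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RejectedEdit:
    edit: str
    score_delta: float  # candidate score minus current score (zero or negative)


@dataclass
class RejectedEditBuffer:
    max_items: int = 20  # hypothetical cap so the prompt stays short
    items: List[RejectedEdit] = field(default_factory=list)

    def add(self, edit: str, score_delta: float) -> None:
        """Record a rejected edit, keeping only the most recent entries."""
        self.items.append(RejectedEdit(edit, score_delta))
        self.items = self.items[-self.max_items:]

    def render(self) -> str:
        """Text block to include in the optimizer's next prompt."""
        if not self.items:
            return "No rejected edits yet."
        lines = ["Previously rejected edits (avoid repeating these):"]
        for item in self.items:
            lines.append(f"- {item.edit} (validation change: {item.score_delta:+.1f})")
        return "\n".join(lines)


buffer = RejectedEditBuffer(max_items=2)
buffer.add("Always recompute every formula twice", -4.0)
buffer.add("Add a long preamble about being careful", 0.0)
prompt_section = buffer.render()
```

The buffer lives only in the training loop. Nothing in it is written into the skill document that ships.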
9:26Finn: And the brilliant part of the buffer, the part I want to flag, is that it only exists at training time. The deployed skill document — the one that ships — doesn't carry the buffer. It ships clean. The training-time scaffolding stays in the lab.
9:42Juniper: Which gives you the lab notebook analogy. A researcher keeps a notebook of failed experiments — "tried adding this reagent, yield dropped twelve percent" — and the next experimenter reads it before they design their next run. The notebook is for the lab. The published protocol is clean.
10:02Finn: Piece four — and this one's analogous to momentum.
10:05Juniper: Yeah. Pieces one through three operate at the level of individual optimizer steps. But neural network training also has longer-horizon machinery. Momentum. Schedules. Things that aggregate information across many steps. SkillOpt has an epoch boundary, and at the boundary it does what they call a slow update. It replays the same training tasks under last epoch's skill and this epoch's skill, side by side. It categorizes the differences. Where did we improve? Where did we regress? Where do we keep failing for the same reasons? Where are we consistently strong? And then it takes those longer-horizon lessons and writes them into a protected region of the skill document. A region that subsequent step-level edits aren't allowed to touch.
10:54Finn: So the document ends up with two zones. There's a fast zone, where individual-round edits live and get rewritten constantly. And there's a slow zone, written only at epoch boundaries, encoding things that have shown up persistently across many rounds. That's the momentum.
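The two-zone document and the epoch-boundary comparison can be sketched as follows. This is an assumed structure, not the paper's implementation: the class `SkillDocument`, its method names, the section headings in `render`, and the four bucket names are hypothetical, chosen to mirror the four questions asked in the slow update.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillDocument:
    slow_rules: List[str] = field(default_factory=list)  # protected zone
    fast_rules: List[str] = field(default_factory=list)  # step-level zone

    def step_add(self, rule: str) -> None:
        self.fast_rules.append(rule)

    def step_remove(self, rule: str) -> None:
        if rule in self.slow_rules:
            raise PermissionError("step-level edits cannot touch the slow zone")
        self.fast_rules.remove(rule)

    def epoch_consolidate(self, lessons: List[str]) -> None:
        """Only called at an epoch boundary; writes into the protected zone."""
        self.slow_rules.extend(l for l in lessons if l not in self.slow_rules)

    def render(self) -> str:
        parts = ["## Long-horizon lessons"] + [f"- {r}" for r in self.slow_rules]
        parts += ["## Working rules"] + [f"- {r}" for r in self.fast_rules]
        return "\n".join(parts)


def categorize(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare per-task pass/fail under last epoch's and this epoch's skill."""
    buckets = {"improved": [], "regressed": [],
               "persistent_failure": [], "consistently_strong": []}
    for task, before in prev.items():
        after = curr[task]
        if after and not before:
            buckets["improved"].append(task)
        elif before and not after:
            buckets["regressed"].append(task)
        elif not before and not after:
            buckets["persistent_failure"].append(task)
        else:
            buckets["consistently_strong"].append(task)
    return buckets


doc = SkillDocument()
doc.step_add("Check the target range before writing")
doc.epoch_consolidate(["Write evaluated static values"])
report = categorize({"t1": False, "t2": True, "t3": False, "t4": True},
                    {"t1": True, "t2": False, "t3": False, "t4": True})
```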
11:13Juniper: That's the momentum. And piece five — the optimizer also keeps a private notebook to itself. Advice from the optimizer to the optimizer about what kinds of edits tend to help. The student never sees this. It's the optimizer's internal craft manual. The paper calls it a meta skill, but you can just think of it as the optimizer's working notes about its own job.
11:37Finn: So if you zoom back out, what you have is a feedback loop with bounded steps, a strict validation gate, persistent negative-example memory, an epoch-level consolidation pass, and a private optimizer-side workbook. Every one of those is borrowed from neural network training, more or less directly, and every one of them is something the prior LLM-self-revision systems were missing.
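Put together, the step-level part of the loop looks roughly like the sketch below. It covers only bounded steps, the strict gate, and the rejected-edit memory; the epoch-level slow update and the optimizer's private notes are left out to keep it short. The function names, the list-of-rules representation of the skill, and the toy student and proposer are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def mean_score(student: Callable, skill: List[str], tasks: List[str]) -> float:
    return sum(student(skill, task) for task in tasks) / len(tasks)


def train(skill: List[str], train_tasks: List[str], val_tasks: List[str],
          student: Callable, propose: Callable, rounds: int,
          budget: int = 4) -> Tuple[List[str], float, list]:
    """Step-level loop: the student runs tasks, the optimizer proposes edits."""
    rejected = []  # (edits, validation score change) pairs shown to the optimizer
    best = mean_score(student, skill, val_tasks)
    for _ in range(rounds):
        failures = [t for t in train_tasks if student(skill, t) < 1.0]
        edits = propose(skill, failures, rejected)[:budget]  # bounded step
        if not edits:
            continue
        candidate = skill + edits
        candidate_score = mean_score(student, candidate, val_tasks)
        if candidate_score > best:  # strict gate: ties are rejected
            skill, best = candidate, candidate_score
        else:
            rejected.append((edits, candidate_score - best))
    return skill, best, rejected


# Toy stand-ins: a task is solved once the skill has a rule that names it.
def toy_student(skill: List[str], task: str) -> float:
    return 1.0 if f"handle {task}" in skill else 0.0


def toy_propose(skill: List[str], failures: List[str], rejected: list) -> List[str]:
    if not rejected:
        return ["be careful"]  # plausible-sounding but useless first idea
    return [f"handle {failures[0]}"] if failures else []


# In a real run the validation tasks would be held-out instances; here the
# same task kinds are reused to keep the example tiny.
final_skill, final_score, rejected_log = train(
    [], ["sum", "sort"], ["sum", "sort"], toy_student, toy_propose, rounds=3)
```

In the toy run the vague first edit ties on validation, is rejected and logged, and the two specific rules are accepted in the following rounds.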
12:01Juniper: And the document that comes out the other end — let me make this concrete, because this is where the paper really earns its keep — the document that comes out is not a giant rambling thing. The skills they train end up between three hundred seventy-nine tokens and just under two thousand. The median is about nine hundred twenty tokens. So roughly between a paragraph and a few pages.
12:25Finn: A few pages of what, though? This is the question. Because if it's just "be careful," "think step by step," "double-check your answer" — that's not interesting. The paper actually shows us what these documents look like, and that's where the abstract optimization story gets really concrete.
12:42Juniper: Yeah, you want to read one?
12:44Finn: Let me read the one I think is delightful. On the spreadsheet benchmark — this is multi-turn agentic code generation with a real Python and Excel runtime — the optimizer eventually wrote this rule into the skill, and I'm quoting verbatim: "Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation."
13:08Juniper: That's such a specific failure mode. The agent was writing Excel formulas. The grader was reading the cell values. But a formula written from code doesn't get evaluated until the workbook is opened and recalculated in Excel, so the stored value is missing. The grader sees an empty string where a number should be. The agent thinks it solved the problem.
13:25Finn: And the rule that fixed it is something a thoughtful spreadsheet engineer would absolutely write into a runbook. It's specific, it's actionable, it generalizes across instances of the benchmark, and a human can read it and immediately understand both what it's telling the agent to do and why.
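The failure mode behind that rule can be shown with a toy model of a worksheet. This is not the benchmark's grader and not a real spreadsheet library: `Cell`, `write_formula`, `write_static`, and `grade` are hypothetical names. The model reflects how libraries that write workbook files without a calculation engine (openpyxl is one example) store formula text and leave the value to be computed by the spreadsheet application.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cell:
    formula: Optional[str] = None         # e.g. "=A1+A2"
    cached_value: Optional[float] = None  # what a value-reading grader sees


def write_formula(sheet: Dict[str, Cell], ref: str, formula: str) -> None:
    """Store formula text only; no value exists until something recalculates."""
    sheet[ref] = Cell(formula=formula)


def write_static(sheet: Dict[str, Cell], ref: str, value: float) -> None:
    """Store the already-evaluated value directly."""
    sheet[ref] = Cell(cached_value=value)


def grade(sheet: Dict[str, Cell], ref: str, expected: float) -> bool:
    """Toy grader: compares the stored value, never evaluates formulas."""
    cell = sheet.get(ref)
    return cell is not None and cell.cached_value == expected


sheet = {"A1": Cell(cached_value=2.0), "A2": Cell(cached_value=3.0)}

write_formula(sheet, "A3", "=A1+A2")
formula_passes = grade(sheet, "A3", 5.0)

# Following the trained rule: compute in code, then write the result.
write_static(sheet, "A3", sheet["A1"].cached_value + sheet["A2"].cached_value)
static_passes = grade(sheet, "A3", 5.0)
```

The formula is correct in both cases. Only the second write leaves a value for the grader to read.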
13:43Juniper: There's another one I love, from the household-task benchmark. The agent has to do things like "put the apple on the desk" in a simulated apartment, and it has to plan over up to fifty steps. The rule the optimizer wrote was — let me get this right — "Keep a horizon-aware visited and frontier ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until holding the target."
14:08Finn: Which is essentially compiled folk wisdom about exploration. Track where you've been. If the same kind of failure keeps happening, try something different. Don't keep going back to the goal location until you actually have what you're supposed to deliver. None of which the model was doing zero-shot.
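One possible reading of that rule, written as code, is below. The paper's skill is a natural-language instruction, not a program, so everything here is an interpretation: the class `ExplorationLedger`, the threshold `max_same_failures`, and the convention that a location's type is its name without the trailing number are all assumptions.

```python
from collections import Counter, deque
from typing import List, Optional


def location_type(location: str) -> str:
    """'drawer 2' -> 'drawer' (naming convention assumed for this sketch)."""
    return location.rsplit(" ", 1)[0]


class ExplorationLedger:
    def __init__(self, locations: List[str], destination: str,
                 horizon: int = 50, max_same_failures: int = 2) -> None:
        self.frontier = deque(l for l in locations if l != destination)
        self.visited: List[str] = []
        self.destination = destination
        self.steps_left = horizon
        self.failures: Counter = Counter()
        self.max_same_failures = max_same_failures
        self.holding_target = False

    def record_failure(self, location: str) -> None:
        self.failures[location_type(location)] += 1

    def next_location(self) -> Optional[str]:
        if self.steps_left <= 0:
            return None
        self.steps_left -= 1
        if self.holding_target:
            return self.destination  # only go to the goal once holding the target
        for _ in range(len(self.frontier)):
            loc = self.frontier.popleft()
            if self.failures[location_type(loc)] >= self.max_same_failures:
                self.frontier.append(loc)  # defer types that keep failing
                continue
            self.visited.append(loc)
            return loc
        if self.frontier:  # only deferred locations remain
            loc = self.frontier.popleft()
            self.visited.append(loc)
            return loc
        return None


ledger = ExplorationLedger(
    ["drawer 1", "drawer 2", "drawer 3", "shelf 1", "desk 1"], destination="desk 1")
trace = []
for _ in range(2):
    loc = ledger.next_location()
    trace.append(loc)
    ledger.record_failure(loc)  # the apple was not there
trace.append(ledger.next_location())  # search diversifies away from drawers
ledger.holding_target = True
trace.append(ledger.next_location())
```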
14:26Juniper: And here's the thing that I think is the actual payoff. These rules are exactly the kind of procedural knowledge that frontier models lack — not because they're not smart enough to follow them, but because nobody ever wrote them down in the training data. The skill document is filling in a layer of expertise that lives somewhere between the model's pretrained capabilities and any specific task instance. And SkillOpt is producing that layer automatically, through a training loop, instead of asking a human practitioner to write it.
14:59Finn: Okay so let me push on the empirics, because the headline numbers in this paper are aggressive and I want to be careful about how we frame them. The biggest claim — and this is the one that should make any reader squint — is that SkillOpt is best-or-tied on every single one of fifty-two evaluation cells. Six benchmarks, seven different target models from the very large down to a four-billion-parameter Qwen, three different agent harnesses. Fifty-two for fifty-two.
15:29Juniper: How much should we trust that?
15:31Finn: Two things. The honest framing is that "best or tied" is doing some work — ties are getting counted as wins. So in some cells SkillOpt is co-leading rather than dominating. The second thing is that the comparison baselines, GEPA and TextGrad and a few others, are being evaluated under controlled conditions — the same target model, the same held-out test split, and the same scorer as SkillOpt. That's defensible, but it means we're not seeing those baselines under whatever protocol their original authors used. So a strict reading is: under controlled conditions, SkillOpt wins or ties everywhere; under each baseline's home-court conditions, the picture might be slightly tighter.
16:14Juniper: The honest takeaway is probably something like: the dominance is real, and it's striking, but the magnitude of the gap over the best baseline is more modest than the gap over no-skill-at-all. The biggest model in the study averages about a twenty-three-point gain over the no-skill baseline. Over the best alternative skill optimizer? More like five points.
16:37Finn: Right. And five points across the board is still a substantial result. But "twenty-three" is the number that grabs headlines, and that's against the wrong baseline if you're trying to evaluate the contribution of the optimizer discipline specifically.
16:53Juniper: That's a fair refinement. So what's the cleanest result, in your view?
16:57Finn: The transfer experiments. Because transfer is harder to game with protocol choices. You train a skill once, you ship it somewhere new, you don't optimize anymore. Either it works in the new environment or it doesn't. And this is where we get back to the cold open, Juniper. The spreadsheet skill trained inside Codex, dropped into Claude Code with no further changes, lifts performance by almost sixty points. Two harnesses from two different vendors with two different sets of tool APIs.
17:26Juniper: And the only way that result is possible is if what got trained into the skill is the spreadsheet domain knowledge, not the harness specifics. The rule about evaluated static values versus formulas — that's true regardless of whether you're calling a tool named "execute python" or "run command." The harness-specific syntax is something the model already knows how to handle in either environment. What it didn't know was the domain procedure.
17:53Finn: And the paper has cross-model transfer too. A skill trained on the strongest model still helps a weaker model, including a four-billion-parameter open-source one. That's the result with the most interesting economic implication, I think. Because if you can train a skill once on the best model you can afford, then deploy it across a fleet of cheaper smaller models, you're substituting offline optimization compute for online inference compute. That trade gets very favorable at scale.
18:22Juniper: And the small-model results have their own surprising shape. The gains are big — the paper reports an average improvement north of twenty-six points on one of the smaller models, and on individual benchmarks you see the score more than double or even triple from the no-skill baseline. The story the authors tell — which I find pretty compelling — is that compact skill documents can supply procedural knowledge that small models just don't have internalized in their weights. The big model knows the procedures already and is using the skill mostly as a focusing device. The small model is genuinely learning, in context, things it didn't know.
19:04Finn: Like the difference between a senior engineer reading a checklist as a reminder, versus a junior employee reading the same checklist and actually learning the procedure from it.
19:16Juniper: That's the right shape. Where the analogy stretches is that real expertise has a lot of judgment that resists being written down. So you shouldn't push it too far — a trained skill plus a small model isn't going to fully close the gap with a frontier model on tasks that require genuine reasoning depth. But on tasks where the bottleneck is procedural knowledge specifically, it can close a lot of the gap.
19:42Finn: I want to spend a minute on what I think is the most surprising operational detail in the paper, which is the edit economy.
19:51Juniper: Tell me.
19:51Finn: The optimizer proposes many edits per epoch. Many, many. The validation gate rejects most of them. The final deployed skill — the one that's actually shipped — has accepted a tiny number of edits over the whole training run. On the question-answering benchmark, it's one accepted edit, for a thirty-nine-point gain. On the math benchmark, one accepted edit, twenty-nine points. The most heavily edited one, spreadsheets, has four accepted edits. Median across all six benchmarks: two and a half accepted edits.
20:25Juniper: That's the validation gate doing real work. The optimizer is generating a lot of plausible-sounding suggestions, and the gate is throwing most of them away.
20:35Finn: And it's a really nice illustration of why the gate matters. Without it, you'd accumulate all those plausible-sounding edits. The document would get long, the signal-to-noise would degrade, and the gains would either flatten out or reverse. The compactness of the final artifact — between a paragraph and a few pages — is a direct consequence of the gate being strict.
20:59Juniper: One ablation in the paper makes this concrete, Finn. They remove the rejected-edit buffer and the slow update, run training again, and on the spreadsheet benchmark the score drops from seventy-seven and a half to fifty-five. That's a twenty-two-point degradation just from taking out the long-horizon machinery. The optimizer-discipline pieces aren't decorative. They're doing the work.
21:23Finn: Now, let me voice the steelman critique here, because the paper has a confident, almost triumphalist tone — fifty-two for fifty-two, dominance everywhere — and a careful reader should push back in a couple of places.
21:36Juniper: Go ahead.
21:37Finn: First thing. The validation gate. We've been saying it's the secret weapon. And it is. But there's a selection-bias problem the paper doesn't fully wrestle with. The gate accepts edits that improve performance on the selection split. The paper reports final results on a separate test split, which is good, because the reported numbers don't come from the split the gate optimized against. But the optimizer is running many proposal rounds, and the selection split is being optimized against. Heavily. If you make enough proposals against a fixed validation sample, you start finding things that look like genuine improvements but are partly fitting to the validation sample's quirks.
22:20Juniper: And the way you'd address that is to swap in a fresh validation draw partway through training and see how much of the gain holds up.
22:28Finn: Right. The paper shows that test scores track selection scores reasonably well across epochs, which is reassuring but not dispositive. I'd want to see the fresh-draw experiment before I treated the dominance result as bulletproof.
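The selection-bias concern can be illustrated with a small simulation. It is a deliberately simplified model, not an analysis of the paper's data: every candidate has the same true accuracy, each task outcome is an independent coin flip, and the function name `simulate` and all parameter values are assumptions.

```python
import random
from typing import Tuple


def simulate(num_candidates: int = 200, val_size: int = 50,
             fresh_size: int = 2000, true_acc: float = 0.5,
             seed: int = 0) -> Tuple[float, float]:
    """Pick the best of many equally good candidates on a small validation set.

    Returns (validation score of the selected candidate,
             score of that same candidate on a fresh, larger draw).
    """
    rng = random.Random(seed)

    def noisy_score(num_tasks: int) -> float:
        passed = sum(rng.random() < true_acc for _ in range(num_tasks))
        return passed / num_tasks

    val_scores = [noisy_score(val_size) for _ in range(num_candidates)]
    best_val = max(val_scores)            # what the selection process sees
    fresh = noisy_score(fresh_size)       # same true accuracy, new sample
    return best_val, fresh


best_val, fresh = simulate()
apparent_gain = best_val - 0.5
gain_on_fresh_draw = fresh - 0.5
```

No candidate in this model is better than any other, yet the best validation score sits well above the true accuracy, and the apparent gain mostly vanishes on the fresh draw. A fresh-draw check during training would measure how much of a real gain is this kind of artifact.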
22:42Juniper: Second concern?
22:43Finn: The reward signal dependency. Everything in this method assumes you have a reliable scalar score for each task. Exact match. An executable verifier. A hard grader. SpreadsheetBench has a runtime that checks cell values. ALFWorld has a simulator that knows whether the apple is on the desk. Math problems have right and wrong answers.
23:04Juniper: And in the open-ended generation case — creative writing, advisory dialogue, anything where quality is genuinely subjective — there is no validation gate. The whole stabilizer the paper relies on disappears. The authors acknowledge this, to their credit. But it's a real restriction on the method's domain.
23:22Finn: And the third critique. The optimizer needs to be reasonably strong. The paper has an experiment where they replace the frontier optimizer with one matched to the student model, and they show the loop still recovers a chunk of the gain. But it's a chunk, not the whole thing. Roughly fifty-six to seventy-four percent of the original improvement. The interpretation the authors offer — that the loop is doing more than distilling the optimizer's knowledge — is fine as far as it goes. But it also means that a meaningful slice of the improvement does depend on access to a strong optimizer model. Which is fine for teams that have that access. Less fine for teams that don't.
24:03Juniper: And the cost picture is uneven. On the easiest benchmarks, you're spending under a million tokens of training per point of test gain. On the harder ones — the long-trajectory question-answering benchmarks especially — it's forty to fifty million tokens per point. Which is paid once, offline, before deployment. So it's worth it for repeated use. But for a one-off task, you might be better off with few-shot prompting and saving the optimization budget.
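The "paid once, offline" argument is an amortization calculation. The sketch below uses a tokens-per-point figure in the range quoted in the episode; the ten-point gain and the deployment volumes are made-up inputs chosen only to show the arithmetic, and the function names are hypothetical.

```python
def training_tokens(tokens_per_point: float, points_gained: float) -> float:
    """Total one-time training cost in tokens."""
    return tokens_per_point * points_gained


def amortized_tokens_per_task(total_training_tokens: float,
                              deployed_tasks: int) -> float:
    """Training cost spread over every task the skill is later used on."""
    return total_training_tokens / deployed_tasks


# Hypothetical scenario: 45 million tokens per point, 10 points gained.
total = training_tokens(45_000_000, 10)
one_off = amortized_tokens_per_task(total, 1)
at_scale = amortized_tokens_per_task(total, 1_000_000)
```

Under these assumed inputs the full training cost lands on a single one-off task, while at a million deployed tasks it works out to a few hundred tokens each.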
24:32Finn: Juniper, what's your read on the bigger picture? If this framing catches on, what changes?
24:37Juniper: I think the interesting move is treating the prompt — or rather the skill document — as a first-class optimizable object. The whole prior literature on prompt engineering treats the prompt as something you craft. An artisan's product. This paper says: no, it's a parameter, and you should train it. And once you take that move seriously, all the rest of the deep-learning toolkit starts to suggest itself. Regularization on the skill document. Curricula over the training tasks. Transfer learning, which the paper is already showing works. Libraries of skills that compose, which the authors flag as future work.
25:16Finn: And the deployment story is genuinely different from fine-tuning. A fine-tuned model is opaque. You can't read it. You can't audit it. You can't edit it by hand if you spot a problem. A trained skill document is a Markdown file. You can read the whole thing in three minutes. If something looks wrong before you go to production, you can just delete that rule.
25:39Juniper: Which is a property I think is going to matter more as agents take on higher-stakes work. The artifact is auditable. The optimization is reproducible. You can ship one to production with the same kind of confidence you'd ship any other piece of documented engineering work, because that's basically what it is — engineering documentation, except produced by a training loop rather than a human.
26:03Finn: I'll register one more skeptical note before we wrap, which is — the paper is doing this in a regime where everything is set up to work. Closed scalar rewards, well-defined benchmarks, established harnesses, a strong optimizer model available. The question of how this generalizes to messier real-world deployment, where the reward signal is noisy or partial or human-judged, is genuinely open. The authors say so. The honest reading is that this is a clean, well-supported result in a constrained setting, and the next several years of research will tell us how much of it survives contact with deployment.
26:42Juniper: That seems exactly right. Though I'd add — the proof of concept is strong enough that the framing alone is going to shape work for a while. Even if the specifics of SkillOpt don't survive, the move of saying "let's actually be disciplined about this, the way the rest of optimization is disciplined" — that move is going to stick. Because it's obviously right, in retrospect.
27:06Finn: Right. The kind of paper where you read it and you think, why wasn't everyone already doing this?
27:13Juniper: That's the test kitchen, basically. The chef has been scribbling in the margins for two years. Somebody finally said, what if we did it like a test kitchen instead. And the test kitchen, it turns out, works.
27:26Finn: Show notes have the paper and a few related reads if you want to keep pulling on this — TextGrad, GEPA, the whole little ecosystem of recent prompt-and-skill optimization work that this paper is in conversation with.
27:39Juniper: And if you want the full transcript with definitions baked in, plus the cross-links to other episodes that touch these ideas, that's all at paperdive.ai. Every term we used has a glossary entry.
27:52Finn: Thanks for listening to AI Papers: A Deep Dive.