
Episode 168 · Jun 24, 2026 · 27 min

When Turning Experience Into Code Makes Your AI Agent Dumber

Dai, He, Li et al.

LLM Agents


paperdive.ai


Concepts in this episode

Agentic AI Training Methods Agent Memory Agentic Workflows Tool Use ReAct Agent Self-Correction Iterative Training Ablation Studies Agent Benchmarks Knowledge Distillation Trajectory Quality Emergent Behavior In-Context Learning


About this episode

Paper: Metis: Bridging Text and Code Memory for Self-Evolving Agents

Venue: arXiv:2606.24151

Year: 2026

Read the paper: arxiv.org/abs/2606.24151


An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move — freezing lessons into callable tools — is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

What you'll take away

Why storing an agent's experience as callable code can drop it below an agent with no memory at all — a 22-point collapse the moment it has to generalize
The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
Metis's 'text first, code earned' policy — sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead — and how that lets even failed runs safely count toward codification
The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark — not a law that 'code memory is bad'

Chapters

01:57 The brilliant employee with amnesia
03:01 Text advice or a black-box tool?
04:50 The experiment that fixed every variable
08:43 The 22-point collapse
10:06 Why the confident tool fails hard
13:07 Paving only the paths people walk
18:13 Does the machinery actually pay off?
21:41 The seam in the clean story
24:30 Don't pour the concrete too early

References in this episode

ReAct: Synergizing Reasoning and Acting in Language Models — The think-act-observe loop the episode names as the baseline floor every memory
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents — The exact 457-API simulated benchmark all of the episode's accuracy and token nu
Voyager: An Open-Ended Embodied Agent with Large Language Models — The canonical 'agent builds a reusable skill library of callable code' approach
Generative Agents: Interactive Simulacra of Human Behavior — A contrasting take where agent experience is stored and retrieved as natural-lan

Full transcript


0:00Juniper: An AI agent solves a hard task, learns from it, and — being clever — turns that lesson into a reusable tool so it never has to think it through again. Then that tool makes it worse than if it had learned nothing at all.

0:15Tyler: Quick heads up before we get into it — this is an AI-made explainer, both voices included.

0:21Juniper: And that opening isn't a hypothetical. It's a measured result. There's a setting in this paper where an agent that stored its experience as callable code scored fifty-three percent. The same agent with no memory whatsoever scored sixty-three. The "smart" upgrade cost it ten points.

0:39Tyler: Which is the kind of result that should not happen. The whole premise of giving an agent memory is that remembering helps. So by the end of this you'll understand exactly why the more sophisticated-looking option — turning experience into software — is also the more fragile one, and what the right fix turns out to be.

1:00Juniper: And here's why this matters beyond one benchmark. The next wave of AI agents is supposed to learn on the job — get better the more they're used. This paper is about the single most basic decision in that whole program: when your agent figures something out, how do you store it so future-you actually benefits? Almost everyone has been answering that by taste. This is the team that ran the experiment.

1:26Tyler: The paper is called Metis, out of the Chinese University of Hong Kong and Huawei, and it does two separable things — a diagnostic study that surfaces something genuinely surprising, and a system built as the answer. We'll take them in that order, because the answer only makes sense once you feel the problem.

1:46Juniper: So start with the problem the whole field is circling. Today's language-model agents are stateless. Picture a brilliant employee with severe amnesia — genuinely sharp every morning, and every evening it all evaporates. Whatever an agent figured out solving your task today is gone the moment that conversation scrolls out of its context window. Tomorrow, similar task, it starts from zero. Re-derives, re-explores, makes the same mistakes, burns the same tokens.

2:19Tyler: And the agent we're talking about isn't a chatbot spitting out one block of text. The mental model is a language model sitting at a terminal — it reasons a step, runs a command or calls an API, reads what came back, reasons again, repeats until the task is done. That think-act-observe loop is the ReAct pattern, and it's the floor everything in this paper is measured against. The question the paper asks is narrow and deep at once: what do you feed back into that loop from past tasks, and in what form?

2:55Juniper: Because there's a fork, and almost nobody has examined it. Option one — store the lesson as text. Natural-language notes you paste back into the prompt next time. "When booking a flight, check the user's saved payment methods first." Option two — store it as code. An actual function the agent can call, hand it some arguments, get a result, and skip re-deriving the whole procedure.

3:22Tyler: And the crucial thing — this is subtle and easy to get wrong — the difference isn't prose versus programming. It's how the agent *consumes* the thing. Text is advice the agent reads and interprets. It can adapt it, partially apply it, or ignore it if it doesn't fit. Code is a black box the agent invokes and trusts. Same underlying lesson, two completely different relationships to it. One is a suggestion. The other is a command the agent runs verbatim.

3:54Juniper: And the line the authors keep coming back to is that this choice "is typically made at design time rather than derived from the characteristics of the experience itself." A system architect just decides up front, we're a text shop, or we're a tool-building shop. Nobody asked which one is actually better, and for what.

4:17Tyler: So before we get to their answer, let me flag the thing we'll come back to at the end. What they actually establish is that one *specific flavor* of code memory is brittle — ungated, distilled from a single run, with no guardrails. Whether "code memory" in general is doomed is a bigger claim, and that gap is going to matter. Hold onto it.

4:41Juniper: Fair flag. But let's earn the surprise first. The diagnostic. What makes this study clean is they fixed everything that usually muddies these comparisons. Same benchmark — AppWorld, a simulated world of nine apps exposing four hundred and fifty-seven APIs, where the agent has to actually change the state of the sandbox to pass, not just produce nice-sounding text. Same stream of tasks. And a sharp methodological move — they split the roles across two models. A cheaper model, GPT-4o, is the executor that *uses* the memory. A stronger model, Claude Sonnet 4.6, is the reflector that *builds* the memory.

5:24Tyler: And that split is doing real work. It separates the cost of *constructing* knowledge from the cost of *consuming* it. If one model did both, you couldn't tell which side of the ledger an effect was coming from. So they hold the experiences identical and vary one knob only — stored as text the agent reads, or code the agent calls — and then measure three things.

5:51Juniper: First axis: construction cost. How expensive is it just to build the memory? Text wins, and not by a little. Building the code memory took about five hundred and sixty model rounds of back-and-forth. The text version took two hundred and twenty. Roughly two and a half times the interaction.

6:12Tyler: Which makes total sense once you picture the work. Writing a text note is reading the experience once and jotting it down. Building a code tool is software engineering — draft it, run it in a sandbox, read the API docs, hit an error, debug, retry, until it's trustworthy. The expensive part of shipping code was never the typing. It's the test-and-debug cycle.

6:38Juniper: And the authors are careful here, which I appreciated. In raw tokens the gap was only about one-point-three times — much milder. They explicitly say the *rounds* are the sharper signal, because rounds capture that debugging depth. They don't oversell the token number.

6:57Tyler: Second axis flips it, though.

6:59Juniper: It does. Execution efficiency — how cheaply does the agent run *once it has* the memory. Code wins decisively. A callable tool collapses a whole multi-step procedure into one invocation that skips the reasoning entirely. On the tasks both methods commonly solved, code cut execution tokens by fifty-four percent. Text only cut them thirty-eight. So if you just stopped here, you'd conclude: code is pricey to build, but it pays for itself at runtime. Build it once, run it cheap forever.

7:34Tyler: And that's exactly the intuition the third axis is about to detonate.

7:40Juniper: Third axis: transfer reliability. Does the memory still help on tasks it wasn't built from? And to measure that honestly they ran the same memory under two regimes. Call the first one Oracle. You build the memory from all ninety tasks, you hand-audit it — they even manually deleted one tool they could see was broken — and then you test on those same tasks. Best case. In-sample. The optimistic ceiling.

8:09Tyler: Studying the exact questions that'll be on the exam.

8:13Juniper: Right. The second regime is Streaming, and it's the realistic one. Tasks arrive in order, and each task can only use memory distilled from strictly *earlier* tasks. No peeking ahead. Memory built from the past, applied to a genuinely new present. And the gap between Oracle and Streaming is precisely how much of the memory's benefit is real generalization versus an illusion of fitting the test.
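
The two regimes differ only in which tasks are allowed to feed the memory that a given test task sees. A minimal Python sketch of that difference, assuming tasks are plain strings; the helper names `oracle_splits` and `streaming_splits` are hypothetical and do not come from the paper:

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a task is just an identifier string.
Task = str


def oracle_splits(tasks: Sequence[Task]) -> List[Tuple[Task, List[Task]]]:
    """Oracle: every task is tested against memory built from ALL tasks,
    including the task itself (in-sample, the optimistic ceiling)."""
    return [(task, list(tasks)) for task in tasks]


def streaming_splits(tasks: Sequence[Task]) -> List[Tuple[Task, List[Task]]]:
    """Streaming: task i may only use memory built from tasks 0..i-1."""
    return [(task, list(tasks[:i])) for i, task in enumerate(tasks)]


for task, sources in streaming_splits(["t1", "t2", "t3"]):
    print(task, "may use memory from", sources)
```

Under Streaming the first task sees an empty memory and no task ever sees itself, which is why the Oracle-to-Streaming gap isolates how much of the benefit is real generalization.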

8:39Tyler: So show the bars. Because this is the moment.

8:42Juniper: Under Oracle, code memory looked fantastic — clean, fast, accurate. Then you switch to Streaming. Text memory drops a little — about five and a half points — and stays comfortably ahead of everything, up at seventy-three percent. Code memory drops twenty-two points. And twenty-two points down lands it at fifty-three — below the agent with no memory at all, sitting at sixty-three.

9:09Tyler: So let that fully register. The agent did the expensive, sophisticated thing. It distilled its experience into real reusable software. And the result wasn't a smaller benefit, or a wash. It actively dragged the agent below the version that remembered nothing. Building the tools made it dumber.

9:29Juniper: And that's the headline reversal. Code that looks great in-sample collapses the instant it has to generalize. Text degrades gracefully. The two forms don't just differ in degree — they fail in opposite *characters*.

9:44Tyler: Which is the real question, and it's the best idea in the paper. *Why* does code fail this way? Why doesn't text? And the authors give it a name — the injection asymmetry — and honestly they hand you the perfect analogy for it.

9:59Juniper: Go for it, because this is your thread, Tyler.

10:03Tyler: So think about two coworkers. The first one gives you advice. "Hey, when you handle these refund tickets, double-check the account's currency first." You read that, you weigh it against what's actually in front of you, and you adapt it — or you ignore it if this ticket's different. That's text. It's a suggestion you filter through reality. The second coworker hands you a finished black-box widget and says, "don't think about it, just run this, trust the output." That's code. And here's the asymmetry. If that widget was built from one flawed example, it propagates the defect to every single person who calls it. The paper's phrase is that a faulty tool "propagates its defect to every caller."

10:52Juniper: And it's worse than just passing along a bug.

10:55Tyler: Much worse, and this is the part that got me. The mere *presence* of a confident-looking tool suppresses the agent's own recovery behavior. When the agent reads a text tip and the world looks wrong, it keeps checking, keeps self-correcting — the tip was only ever advice. But when it calls a tool, it trusts the return value. If the tool quietly hands back garbage, or an empty result, the agent takes that at face value and stops investigating. It's the confident colleague who's wrong — and because they sound authoritative, everyone stops double-checking the situation themselves.

11:38Juniper: So the very thing that makes code efficient — that it lets the agent stop reasoning — is the thing that makes it brittle. Skipping the reasoning is the feature *and* the bug.

11:50Tyler: Exactly. Text fails soft because it's consumed as adaptable advice. Code fails hard because it's consumed as trusted control flow. The consumption mode *is* the mechanism. That's the whole insight, and it's why you can't fix code memory just by writing better tools — the fragility is structural to how a black box gets used.
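
A toy illustration of the asymmetry, not the paper's setup: the environment, the defective tool, and every function name below are invented for this sketch. The same flawed lesson is consumed once as a trusted call and once as advice that only shapes the prompt.

```python
from typing import Dict, List

# Toy environment state. All names here are hypothetical.
SAVED_CARDS: Dict[str, List[str]] = {"alice": ["visa"], "bob": []}


def distilled_tool(user: str) -> List[str]:
    """A tool frozen from one run; it kept that run's user as a constant."""
    return SAVED_CARDS["bob"]  # defect: ignores `user`


def look_up_directly(user: str) -> List[str]:
    """The agent's own step-by-step lookup against the live environment."""
    return SAVED_CARDS[user]


def act_with_code_memory(user: str) -> List[str]:
    # The return value is trusted as-is, so an empty result ends the search.
    return distilled_tool(user)


def act_with_text_memory(user: str, tip: str) -> List[str]:
    # The tip only shapes the prompt; the agent still observes the environment.
    _prompt = f"Advice from past tasks: {tip}"
    return look_up_directly(user)


print(act_with_code_memory("alice"))
print(act_with_text_memory("alice", "check saved payment methods first"))
```

The tool returns an empty list without raising, so nothing tells the caller to keep investigating; the text path reaches the right answer because the agent never handed control to the flawed artifact.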

12:13Juniper: Okay. So here's the checkpoint before we build anything. We've got three facts on the table. Text is cheap to build and transfers reliably but runs slow. Code is expensive to build and runs fast but is dangerously brittle under any shift. And the cause of the brittleness is that code is trusted blindly while text is filtered through reality. Now — the system. Metis is just those three facts turned into a policy, and the policy fits in four words: text first, code earned.

12:47Tyler: This next stretch is the design — three choices, each one a direct counter to a failure we just diagnosed — and the payoff is a single rule that decides when an agent should stop thinking and just run the script. Let the diagram carry the structure; we'll carry the why.

13:07Juniper: First choice. Don't treat text as one flat playbook. Metis sorts text experience into three flavors, because they have different jobs. Plans — reusable procedural templates, the "how to do this kind of task" recipes. Facts — environment constraints, the "this app requires a login token" kind of thing. And pitfalls — stored as little trigger-mistake-consequence packets: when *this* situation comes up, *this* is the error you'll be tempted to make, and *here's* what it costs you.

13:41Tyler: And the load-bearing move there is that only *plans* describe something mechanically executable. A fact isn't a procedure. A pitfall isn't a procedure. So only plans are even eligible to become code later. The taxonomy is secretly a filter for what can be crystallized.
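
One way to picture the taxonomy is as three record types, with only one of them eligible for codification. This is a sketch under assumptions: the class and field names are illustrative and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Plan:
    """Reusable procedural template: how to do this kind of task."""
    name: str
    steps: List[str]


@dataclass
class Fact:
    """Environment constraint, such as a required login token."""
    statement: str


@dataclass
class Pitfall:
    """Trigger, mistake, consequence packet."""
    trigger: str
    mistake: str
    consequence: str


MemoryItem = Union[Plan, Fact, Pitfall]


def codifiable(items: List[MemoryItem]) -> List[Plan]:
    """Only plans describe something mechanically executable."""
    return [item for item in items if isinstance(item, Plan)]


memory: List[MemoryItem] = [
    Plan("pay_bill", ["log in", "find saved card", "submit payment"]),
    Fact("this app requires a login token"),
    Pitfall("no saved card", "guessing a card number", "payment is rejected"),
]
print([plan.name for plan in codifiable(memory)])
```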

14:00Juniper: Which is the second choice, and it's the heart of the system. The recurrence gate. A plan does not become code the first time it works. It only gets promoted into a tool after it's been selected across enough distinct tasks to prove it's a stable, recurring pattern.

14:19Tyler: And the analogy the authors basically live inside here is desire paths. You know how cities decide where to lay sidewalks? The smart ones don't pour concrete the moment one person cuts across the grass. They wait and watch where the dirt trails actually form from repeated foot traffic — and then they pave those. Pave too early and you've spent concrete on a route nobody uses, and locked in a layout that might be wrong.

14:47Juniper: That's it exactly. Building a tool costs two and a half times what a note costs, and a tool only pays off if it's reused. So you wait for evidence of repeated use before you pay the construction bill. The threshold is the dial — set it high, you almost never codify; set it low, you're back to building junk eagerly.

15:08Tyler: But wait — here's where I'd expect it to break. If you only count tasks the plan *succeeded* on, you'd codify really slowly, right? You'd be throwing away half your signal.

15:20Juniper: That's the clever bit. The task doesn't have to succeed to count toward the gate. Even a failed run signals "this query belongs to the same procedural family." It still tells you the strategy keeps coming up.
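
The gate described here can be sketched as a counter over distinct tasks per plan. The class name, the `threshold` parameter, and its value are assumptions for illustration; the transcript does not state the threshold the paper uses.

```python
from collections import defaultdict
from typing import Dict, Set


class RecurrenceGate:
    """Promote a plan to a tool only after it was selected on enough distinct tasks."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical dial, not the paper's value
        self._tasks_by_plan: Dict[str, Set[str]] = defaultdict(set)

    def record(self, plan_name: str, task_id: str, succeeded: bool) -> bool:
        """Record a selection and return True once the plan is eligible."""
        # `succeeded` is deliberately ignored: a failed run still shows that
        # the query belongs to the same procedural family.
        self._tasks_by_plan[plan_name].add(task_id)
        return len(self._tasks_by_plan[plan_name]) >= self.threshold


gate = RecurrenceGate(threshold=3)
print(gate.record("pay_bill", "task-1", succeeded=True))
print(gate.record("pay_bill", "task-2", succeeded=False))
print(gate.record("pay_bill", "task-3", succeeded=True))
```

Using a set of task identifiers means a plan selected repeatedly within one task counts once, which matches the "distinct tasks" wording.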

15:34Tyler: Which only works because of the third design choice — and this is the one I think is genuinely sharp. When Metis finally codifies a plan, the codifier *never reads the trajectory*. It doesn't look at the actual run.

15:48Juniper: Say more, because that sounds backwards. The trajectory is the richest record you have.

15:54Tyler: It's too rich. A trajectory is full of incidental junk — failed attempts, task-specific constants, one-off variables, the exact account number from that one run. That's great for diagnosing a single execution and *terrible* as the blueprint for a general black-box tool. Remember the injection asymmetry — a tool distilled from a flawed trajectory propagates the flaw to everyone. So the codifier gets only three things: the abstract plan, the set of *queries* that triggered it, and a live sandbox. The plan gives the procedure, the queries define the variation the tool has to cover, the sandbox grounds it in reality.

16:41Juniper: And that's *why* failed tasks can safely count toward the gate. Since the codifier only consumes the query, not the messy trajectory, a failed run adds coverage of the input space without poisoning the tool with the failure itself.

16:58Tyler: Right — the gate and the query-only codification are the same idea from two sides. One says "wait until it recurs," the other says "and when you build it, build from the pattern, not the mess." Together they're the structural fix for the brittleness we measured.
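
A small sketch of what "the codifier never reads the trajectory" could look like as a data boundary. The types `RunRecord` and `CodifierInput` and the function `build_codifier_input` are hypothetical names chosen for this example, and the sandbox is stubbed as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunRecord:
    query: str
    trajectory: List[str]  # raw steps, including one-off constants and errors
    succeeded: bool


@dataclass
class CodifierInput:
    plan: str                       # the abstract procedure
    queries: List[str]              # the variation the tool has to cover
    sandbox: Callable[[str], str]   # live environment used to ground the tool


def build_codifier_input(
    plan: str, runs: List[RunRecord], sandbox: Callable[[str], str]
) -> CodifierInput:
    """Keep every run's query, successful or not, and drop every trajectory."""
    return CodifierInput(plan=plan, queries=[run.query for run in runs], sandbox=sandbox)


runs = [
    RunRecord("pay my phone bill", ["login ok", "card 4111 found"], True),
    RunRecord("pay my rent", ["login ok", "error: no card"], False),
]
codifier_input = build_codifier_input("pay_bill", runs, sandbox=lambda cmd: "ok")
print(codifier_input.queries)
```

Because the failed run contributes only its query, it widens input coverage without carrying its error into the tool.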

17:18Juniper: There are two smaller pieces worth one sentence each. Dependency closure — when the agent grabs a tool, the system automatically pulls in every helper function that tool calls, and every helper those call, until nothing's missing. Batteries included in the box, so the lightweight model selecting tools never has to reason about internals. And reflection — the whole memory-building process — runs off the critical path, on finished trajectories, asynchronously. So the expensive part never slows down what a user actually waits for.
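
Dependency closure as described is a transitive closure over a call graph. A minimal sketch, assuming the call graph is available as a dictionary from function name to the helpers it calls; the graph contents and the function name `dependency_closure` are made up for illustration.

```python
from typing import Dict, List, Set


def dependency_closure(tool: str, calls: Dict[str, List[str]]) -> Set[str]:
    """Return the tool plus every helper it reaches, directly or indirectly."""
    closure: Set[str] = set()
    stack = [tool]
    while stack:
        name = stack.pop()
        if name in closure:
            continue  # already collected; also guards against cycles
        closure.add(name)
        stack.extend(calls.get(name, []))
    return closure


# Hypothetical call graph.
call_graph = {
    "book_flight": ["get_token", "find_card"],
    "find_card": ["get_token", "list_cards"],
    "unrelated_tool": ["other_helper"],
}
print(sorted(dependency_closure("book_flight", call_graph)))
```

The selecting model asks for `book_flight` only; the helpers arrive with it, and anything unreachable from the requested tool stays out.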

17:58Tyler: So the full object, in one breath: text by default, sorted into plans, facts, and pitfalls; plans get crystallized into tools only after they've proven they recur; and the crystallization works from the clean pattern, not the dirty run. The question now is whether all that machinery actually buys anything.

18:21Juniper: And this is where I want to be careful to frame it as a prediction first. If the theory is right — if text-first-code-earned is actually the correct policy — we should see two things. The agent should get *more accurate*, because the brittle artifacts that dragged it below baseline are gone. And it should get *cheaper to run*, because the tools that survive the gate are exactly the ones worth invoking. More accurate and cheaper at the same time. And that's what they get.

18:53Tyler: Give the numbers.

18:55Juniper: Against a no-memory agent, Metis comes out up to about twenty percent more accurate while cutting execution cost by almost a quarter — twenty-two-point-eight percent on one split. In absolute terms, it's about eight points of accuracy on the split that tests distribution shift, and about eleven points on the split that tests recurring workloads — and notice which gap is bigger.

19:21Tyler: The recurring-workload one. Which is striking, because that's the easy case — and even there, building tools the right way is worth eleven points over remembering nothing.

19:32Juniper: And there's a nice villain-foil in the lineup. One pure text-based baseline, ACE, actually *inflated* execution cost up to three-point-eight times — nearly quadrupled what the agent had to spend — for a tiny accuracy bump. So Metis isn't just beating "no memory." It's beating the failure mode where memory becomes bloat the agent has to wade through.

19:56Tyler: But the result that actually proves the recurrence gate — the desire-path principle — isn't the headline. It's the ablation. And this is the number I'd keep.

20:08Juniper: This is the one that turns the whole philosophy into something you can feel. They built an "Eager" version of Metis — same system, but it codifies after *every* task instead of waiting for recurrence. Pave the path the instant one person walks it. And the Eager version cost forty-seven percent more to build — eleven-point-six million tokens versus seven-point-nine. It scored three and a half points *worse*. And the kicker is in the head-to-head on the identical dev workload: only forty-one percent of Eager's tools were ever invoked, against fifty-six percent for the full gated system. A fifteen-point gap on the same tasks. And given a larger workload, the gated system's tools get exercised even harder — ninety-four percent of them got used.

20:59Tyler: So eager codification spent half again as much money to manufacture a library where, even on its home turf, more than half the shelf is pavement leading nowhere — tools the agent never once calls. The gate isn't primarily a cost-saver. It's a *quality filter*. The recurrence requirement is what keeps the library full of paths people actually walk.
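
The ablation figures quoted above can be checked with a few lines of arithmetic. The inputs are the numbers as stated in the episode; the variable names are just labels for this sketch.

```python
# Build cost in millions of tokens, as quoted in the episode.
eager_build_tokens_m = 11.6
gated_build_tokens_m = 7.9
extra_cost_pct = (eager_build_tokens_m / gated_build_tokens_m - 1) * 100

# Share of tools invoked at least once on the identical dev workload.
eager_invoked_pct = 41
gated_invoked_pct = 56
invocation_gap_points = gated_invoked_pct - eager_invoked_pct
eager_unused_pct = 100 - eager_invoked_pct

print(f"Eager extra build cost: {extra_cost_pct:.0f}%")
print(f"Invocation gap: {invocation_gap_points} points")
print(f"Eager tools never invoked: {eager_unused_pct}%")
```

The ratio works out to roughly 47% more build tokens, and 59% of Eager's tools going unused is what "more than half the shelf" refers to.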

21:24Juniper: And one more contrast I liked, on the retrieval side — every single Metis test task got at least one relevant piece of memory back. With one of the skill-building baselines, about twenty-one percent of tasks retrieved nothing and had to start from scratch. So coverage, not just quality.

21:44Tyler: Okay. Now I have to do the thing this channel does, because the clean story has a seam in it, and it's the one I flagged at the top.

21:53Juniper: The "code is bad" framing.

21:55Tyler: Right. Because the most quotable result here — code memory falls below no-memory — is true, but it's not the whole truth, and the paper itself is honest enough to tell you so. That dramatic collapse came from a deliberately *minimal* code setup: ungated, distilled straight from trajectories, with no guardrails. Metis's own code-only variant — same brittle representation, but wrapped in the full harness — *doesn't* collapse. It stays above baseline. Which means the real claim isn't "code memory is bad." It's "*ungated, trajectory-trained, unvalidated* code memory is bad." That's a much narrower, much more defensible statement, and it's easy to walk out of this paper believing the bumper sticker instead of the truth.

22:46Juniper: That's fair, and I'd add it's one benchmark, one executor model.

22:51Tyler: Which is the rest of it. Everything rides on AppWorld with GPT-4o driving. Whether code memory is *this* brittle on other domains, or with a stronger executor that recovers from tool errors more gracefully, is just untested. Clean results on one benchmark don't always survive contact with the next. And look at the comparison bar — the headline twenty percent is against ReAct with *no memory*, which is the easiest possible opponent. Against the strongest skill-building rival, on the recurring-workload split, it's basically a tie — sixty-six-point-one versus sixty-five-point-five. Metis's genuine, separating advantage is specifically under distribution shift. That's a real and useful result. It's just a narrower one than "we beat everything by twenty percent."

23:44Juniper: I'll concede all of that. The single-benchmark scope is real, and the honest version of the finding is the conditional one — ungated trajectory-based code is the thing that fails. What I'd defend is that the *mechanism* doesn't depend on the benchmark. The injection asymmetry — trusted black box versus filtered advice — that's a property of how the two representations get consumed, not a quirk of AppWorld. The exact twenty-two-point collapse is evidence; the asymmetry is the idea.

24:18Tyler: And on that we agree. The principle travels even if the number doesn't. I just don't want anyone quoting "code makes agents worse" as a law when what's proven is "code, built carelessly, makes agents worse."

24:32Juniper: So here's the thing to actually carry out of this. The real result isn't Metis the system, and it isn't the twenty percent. It's a correction to an engineering instinct almost everyone shares — that "real" reuse means writing a function, hardening a fuzzy procedure into clean code. This paper shows that instinct can backfire: crystallize too early and you don't just fail to help, you build confident artifacts that suppress the agent's own ability to recover when the world shifts. The durable lesson is that the *form* you store knowledge in should follow the *properties* of that knowledge — text while it's still soft and provisional, code only once a pattern has earned it. That's portable to any system where you're tempted to freeze a soft heuristic into rigid logic.

25:24Tyler: So the question for you. Should self-evolving agents get smarter about *when* to crystallize experience into code, the way this paper does — gate it, earn it, keep text as the safe default? Or is the deeper move to drop the black-box-tool idea for learned memory entirely, and keep everything as adaptable advice the agent can always second-guess? Pick a side — we read the replies.

25:50Juniper: If you want to go deeper, the full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, from ReAct to the benchmark itself.

26:04Tyler: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Juniper and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is Metis, on bridging text and code memory for self-evolving agents, out June twenty-third, twenty twenty-six — we're recording the day after.

26:26Juniper: So the next time you're tempted to pour the concrete, watch where the trails actually form first. Pave the paths your agent keeps walking.