Episode 168 · Jun 24, 2026 · 27 min

When Turning Experience Into Code Makes Your AI Agent Dumber

Dai, He, Li et al.

LLM Agents
AI Papers: A Deep Dive
Paper: Metis: Bridging Text and Code Memory for Self-Evolving Agents
Venue: arXiv:2606.24151
Year: 2026
Read the paper: arxiv.org/abs/2606.24151

An AI agent that turned its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move, freezing lessons into callable tools, is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

What you'll take away

  • Why storing an agent's experience as callable code can drop it below an agent with no memory at all: a 22-point collapse the moment it has to generalize
  • The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
  • Metis's 'text first, code earned' policy — sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
  • Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
  • The ablation that proves the gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
  • Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

Chapters

  1. 01:57 The brilliant employee with amnesia
  2. 03:01 Text advice or a black-box tool?
  3. 04:50 The experiment that fixed every variable
  4. 08:43 The 22-point collapse
  5. 10:06 Why the confident tool fails hard
  6. 13:07 Paving only the paths people walk
  7. 18:13 Does the machinery actually pay off?
  8. 21:41 The seam in the clean story
  9. 24:30 Don't pour the concrete too early


0:00Juniper: An AI solves a hard task, learns from it, and — being clever — turns that lesson into a reusable tool so it never has to think it through again. Then that tool makes it worse than if it had learned nothing at all.

0:15Tyler: Quick note before we get into it: this is an AI-made explainer, both voices included.

0:21Juniper: And that opening isn't a hypothetical. It's a measured result. There's a setting in this paper where an agent that stored its experience as callable code scored fifty-three percent. The same agent with no memory whatsoever scored sixty-three. The "smart" upgrade cost it ten points.

0:39Tyler: Which is the kind of result that should not happen. The whole premise of giving an agent memory is that remembering helps. So by the end of this you'll understand exactly why the more sophisticated-looking option, turning experience into software, is also the more fragile one, and what the right fix turns out to be.

1:00Juniper: And here's why this matters beyond one benchmark. The next wave of AI agents is supposed to learn on the job and get better the more they're used. This paper is about the single most basic decision in that whole program: when your agent figures something out, how do you store it so future-you actually benefits? Almost everyone has been answering that by taste. This is the team that ran the experiment.

1:26Tyler: The paper is called Metis, out of the Chinese University of Hong Kong and Huawei, and it does two separable things — a diagnostic study that surfaces something genuinely surprising, and a system built as the answer. We'll take them in that order, because the answer only makes sense once you feel the problem.

1:46Juniper: So start with the problem the whole field is circling. Today's language-model agents are stateless. Picture a brilliant employee with severe amnesia: genuinely sharp every morning, and every evening it all evaporates. Whatever an agent figured out solving your task today is gone the moment that conversation scrolls out of its context window. Tomorrow, similar task, it starts from zero. Re-derives, re-explores, makes the same mistakes, burns the same tokens.

2:19Tyler: And the agent we're talking about isn't a chatbot spitting out one block of text. The mental model is a language model sitting at a terminal: it reasons a step, runs a command or calls an API, reads what came back, reasons again, repeats until the task is done. That think-act-observe loop is the pattern, and it's the floor everything in this paper is measured against. The question the paper asks is narrow and deep at once: what do you feed back into that loop from past tasks, and in what form?

2:55Juniper: Because there's a choice, and almost nobody has examined it. Option one: store the lesson as text. Natural-language notes you paste back into the prompt next time. "When booking a flight, check the user's saved payment methods first." Option two: store it as code. An actual function the agent can call, hand it some arguments, get a result, and skip re-deriving the whole procedure.

3:22Tyler: And the crucial thing, and this is subtle and easy to get wrong: the difference isn't prose versus programming. It's how the agent *consumes* the thing. Text is advice the agent reads and interprets. It can adapt it, partially apply it, or ignore it if it doesn't fit. Code is a black box the agent invokes and trusts. Same underlying lesson, two completely different relationships to it. One is a suggestion. The other is a command the agent runs verbatim.

3:54Juniper: And the line the authors keep coming back to is that this choice "is typically made at design time rather than derived from the characteristics of the experience itself." A system architect just decides up front, we're a text shop, or we're a tool-building shop. Nobody asked which one is actually better, and for what.

4:17Tyler: So before we get to their answer, let me flag the thing we'll come back to at the end. What they actually establish is that one *specific flavor* of code memory is brittle: ungated, built from a single run, with no validation. Whether "code memory" in general is doomed is a bigger claim, and that gap is going to matter. Hold onto it.

4:41Juniper: Fair flag. But let's earn the surprise first. The diagnostic. What makes this study clean is they fixed everything that usually muddies these comparisons. Same benchmark: AppWorld, a simulated world of nine apps exposing four hundred and fifty-seven APIs, where the agent has to actually change the state of the environment to pass, not just produce nice-sounding text. Same stream of tasks. And a sharp methodological move: they split the roles across two models. A cheaper model is the executor that *uses* the memory. A stronger model is the reflector that *builds* the memory.

5:24Tyler: And that split is doing real work. It separates the cost of *constructing* knowledge from the cost of *consuming* it. If one model did both, you couldn't tell which side of the ledger an effect was coming from. So they hold the experiences identical and vary one knob only, stored as text the agent reads or code the agent calls, and then measure three things.

5:51Juniper: First axis: construction cost. How expensive is it just to build the memory? Text wins, and not by a little. Building the code memory took about five hundred and sixty model rounds of back-and-forth. The text version took two hundred and twenty. Roughly two and a half times the interaction.

6:12Tyler: Which makes total sense once you picture the work. Writing a text note is reading the experience once and jotting it down. Building a code tool is software engineering: draft it, run it in a sandbox, read the docs, hit an error, debug, retry, until it's trustworthy. The expensive part of shipping code was never the typing. It's the test-and-debug cycle.

6:38Juniper: And the authors are careful here, which I appreciated. In raw tokens the gap was only about one-point-three times, much milder. They explicitly say the *rounds* are the sharper signal, because rounds capture that debugging depth. They don't oversell the token number.

6:57Tyler: Second axis flips it, though.

6:59Juniper: It does. Execution efficiency: how cheaply does the agent run *once it has* the memory. Code wins decisively. A callable tool collapses a whole multi-step procedure into one invocation that skips the reasoning entirely. On the tasks both methods commonly solved, code cut execution cost by fifty-four percent. Text cut it by only thirty-eight. So if you just stopped here, you'd conclude: code is pricey to build, but it pays for itself at runtime. Build it once, run it cheap forever.

7:34Tyler: And that's exactly the intuition the third axis is about to detonate.

7:40Juniper: Third axis: transfer reliability. Does the memory still help on tasks it wasn't built from? And to measure that honestly they ran the same memory under two regimes. Call the first one Oracle. You build the memory from all ninety tasks, you hand-curate it (they even manually deleted one tool they could see was broken), and then you test on those same tasks. Best case. In-sample. The optimistic ceiling.

8:09Tyler: Studying the exact questions that'll be on the exam.

8:13Juniper: Right. The second regime is Streaming, and it's the realistic one. Tasks arrive in order, and each task can only use memory from strictly *earlier* tasks. No peeking ahead. Memory built from the past, applied to a genuinely new present. And the gap between Oracle and Streaming is precisely how much of the memory's benefit is real generalization versus an illusion of fitting the test.
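The two regimes can be sketched in a few lines of Python. This is only an illustration of the evaluation protocol as described in the episode, not the paper's harness: `build_memory` and `solve` are hypothetical stand-ins, and the toy memory is just the set of task names seen so far.

```python
from typing import Callable, List, Sequence


def oracle_eval(tasks: Sequence[str],
                build_memory: Callable[[Sequence[str]], set],
                solve: Callable[[str, set], bool]) -> List[bool]:
    """In-sample: memory is built from ALL tasks, then tested on the same tasks."""
    memory = build_memory(tasks)
    return [solve(task, memory) for task in tasks]


def streaming_eval(tasks: Sequence[str],
                   build_memory: Callable[[Sequence[str]], set],
                   solve: Callable[[str, set], bool]) -> List[bool]:
    """Realistic: task i may only use memory built from tasks strictly before i."""
    results = []
    for i, task in enumerate(tasks):
        memory = build_memory(tasks[:i])  # no peeking at task i or later
        results.append(solve(task, memory))
    return results


# Toy stand-ins (assumptions): memory is the set of seen task names,
# and a task "passes" only if something identical was seen before.
toy_tasks = ["refund", "booking", "refund"]
toy_build = lambda seen: set(seen)
toy_solve = lambda task, memory: task in memory

oracle_results = oracle_eval(toy_tasks, toy_build, toy_solve)
streaming_results = streaming_eval(toy_tasks, toy_build, toy_solve)
```

In the toy run, Oracle passes every task because the memory already contains it, while Streaming only passes the repeated task. The difference between the two lists is the in-sample illusion the episode describes.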

8:39Tyler: So show the bars. Because this is the moment.

8:42Juniper: Under Oracle, code memory looked fantastic: clean, fast, accurate. Then you switch to Streaming. Text memory drops a little, about five and a half points, and stays comfortably ahead of everything, up at seventy-three percent. Code memory drops twenty-two points. And twenty-two points down lands it at fifty-three, below the agent with no memory at all, sitting at sixty-three.

9:09Tyler: So let that fully register. The agent did the expensive, sophisticated thing. It turned its experience into real reusable software. And the result wasn't a smaller benefit, or a wash. It actively dragged the agent below the version that remembered nothing. Building the tools made it dumber.

9:29Juniper: And that's the headline reversal. Code that looks great in-sample collapses the instant it has to generalize. Text degrades gracefully. The two forms don't just differ in degree — they fail in opposite *characters*.

9:44Tyler: Which is the real question, and it's the best idea in the paper. *Why* does code fail this way? Why doesn't text? And the authors give it a name — the injection asymmetry — and honestly they hand you the perfect analogy for it.

9:59Juniper: Go for it, because this is your thread, Tyler.

10:03Tyler: So think about two coworkers. The first one gives you advice. "Hey, when you handle these refund tickets, double-check the account's currency first." You read that, you weigh it against what's actually in front of you, and you adapt it — or you ignore it if this ticket's different. That's text. It's a suggestion you filter through reality. The second coworker hands you a finished black-box widget and says, "don't think about it, just run this, trust the output." That's code. And here's the asymmetry. If that widget was built from one flawed example, it propagates the defect to every single person who calls it. The paper's phrase is that a faulty tool "propagates its defect to every caller."

10:52Juniper: And it's worse than just passing along a bug.

10:55Tyler: Much worse, and this is the part that got me. The mere *presence* of a confident-looking tool suppresses the agent's own recovery behavior. When the agent reads a text tip and the world looks wrong, it keeps checking, keeps self-correcting, because the tip was only ever advice. But when it calls a tool, it trusts the return value. If the tool quietly hands back garbage, or an empty result, the agent takes that at face value and stops investigating. It's the confident colleague who's wrong, and because they sound authoritative, everyone stops double-checking the situation themselves.

11:38Juniper: So the very thing that makes code efficient, that it lets the agent stop reasoning, is the thing that makes it brittle. Skipping the reasoning is the feature *and* the bug.

11:50Tyler: Exactly. Text fails soft because it's consumed as adaptable advice. Code fails hard because it's consumed as trusted control flow. The consumption mode *is* the mechanism. That's the whole insight, and it's why you can't fix code memory just by writing better tools: the fragility is structural to how a black box gets used.

12:13Juniper: Okay. So here's the recap before we build anything. We've got three facts on the table. Text is cheap to build and transfers reliably but runs slow. Code is expensive to build and runs fast but is dangerously brittle under any shift. And the cause of the brittleness is that code is trusted blindly while text is filtered through reality. Now, the system. Metis is just those three facts turned into a policy, and the policy fits in four words: text first, code earned.

12:47Tyler: This next stretch is the design: three choices, each one a direct counter to a failure we just diagnosed, and the payoff is a single rule that decides when an agent should stop thinking and just run the script. Let the diagram carry the structure; we'll carry the why.

13:07Juniper: First choice. Don't treat text as one flat playbook. Metis sorts text experience into three flavors, because they have different jobs. Plans: reusable procedural templates, the "how to do this kind of task" recipes. Facts: environment constraints, the "this app requires a login" kind of thing. And pitfalls, stored as little trigger-mistake-consequence packets: when *this* situation comes up, *this* is the error you'll be tempted to make, and *here's* what it costs you.

13:41Tyler: And the load-bearing move there is that only *plans* describe something mechanically executable. A fact isn't a procedure. A pitfall isn't a procedure. So only plans are even eligible to become code later. The taxonomy is secretly a filter for what can be crystallized.
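One way to picture the taxonomy doubling as a filter is with three small record types. This is a sketch under assumptions: the class and field names below are illustrative and are not taken from the paper, apart from the plan/fact/pitfall split and the trigger-mistake-consequence shape mentioned in the episode.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Plan:
    """Reusable procedural template: the only kind that describes executable steps."""
    name: str
    steps: Tuple[str, ...]


@dataclass(frozen=True)
class Fact:
    """Environment constraint; descriptive, not a procedure."""
    statement: str


@dataclass(frozen=True)
class Pitfall:
    """Trigger-mistake-consequence packet; a warning, not a procedure."""
    trigger: str
    mistake: str
    consequence: str


MemoryEntry = Union[Plan, Fact, Pitfall]


def eligible_for_codification(entry: MemoryEntry) -> bool:
    # Only plans can ever be crystallized into a tool.
    return isinstance(entry, Plan)


entries = [
    Plan("refund", ("log in", "find order", "issue refund")),
    Fact("this app requires a login"),
    Pitfall("refund ticket", "skipping the currency check", "wrong amount refunded"),
]
eligible = [e for e in entries if eligible_for_codification(e)]
```

Facts and pitfalls stay as text permanently in this sketch; the type system simply never offers them to the codifier.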

14:00Juniper: Which is the second choice, and it's the heart of the system. The gate. A plan does not become code the first time it works. It only gets promoted into a tool after it's been selected across enough distinct tasks to prove it's a stable, recurring pattern.

14:19Tyler: And the analogy the authors basically live inside here is desire paths. You know how cities decide where to lay sidewalks? The smart ones don't pour concrete the moment one person cuts across the grass. They wait and watch where the dirt trails actually form from repeated foot traffic — and then they pave those. Pave too early and you've spent concrete on a route nobody uses, and locked in a layout that might be wrong.

14:47Juniper: That's it exactly. Building a tool costs two and a half times what a note costs, and a tool only pays off if it's reused. So you wait for evidence of repeated use before you pay the construction bill. The threshold is the dial — set it high, you almost never codify; set it low, you're back to building junk eagerly.

15:08Tyler: But wait — here's where I'd expect it to break. If you only count tasks the plan *succeeded* on, you'd codify really slowly, right? You'd be throwing away half your signal.

15:20Juniper: That's the clever bit. The task doesn't have to succeed to count toward the gate. Even a failed run signals "this query belongs to the same procedural family." It still tells you the strategy keeps coming up.
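A minimal sketch of such a gate, assuming it simply counts the distinct tasks on which a plan was selected. The class name, method names, and the threshold value of 3 are hypothetical; the episode does not state the paper's actual threshold.

```python
from collections import defaultdict


class RecurrenceGate:
    """Promote a plan to code only after it was selected on enough distinct tasks."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self._tasks_by_plan = defaultdict(set)

    def record_selection(self, plan_id: str, task_id: str, succeeded: bool) -> None:
        # `succeeded` is deliberately ignored: a failed run still signals that
        # the query belongs to this plan's procedural family.
        self._tasks_by_plan[plan_id].add(task_id)

    def ready(self, plan_id: str) -> bool:
        return len(self._tasks_by_plan.get(plan_id, ())) >= self.threshold


gate = RecurrenceGate(threshold=3)  # hypothetical threshold
gate.record_selection("refund", "task-1", succeeded=True)
gate.record_selection("refund", "task-2", succeeded=False)
gate.record_selection("refund", "task-2", succeeded=True)   # same task, not counted twice
ready_after_two = gate.ready("refund")
gate.record_selection("refund", "task-3", succeeded=False)  # failure still counts
ready_after_three = gate.ready("refund")
```

Raising `threshold` makes codification rarer; setting it to 1 reproduces the eager behaviour of codifying after every task.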

15:34Tyler: Which only works because of the third design choice, and this is the one I think is genuinely sharp. When Metis finally codifies a plan, the codifier *never reads the trajectory*. It doesn't look at the actual run.

15:48Juniper: Say more, because that sounds backwards. The trajectory is the richest record you have.

15:54Tyler: It's too rich. A trajectory is full of incidental junk: failed attempts, task-specific constants, one-off variables, the exact account number from that one run. That's great for diagnosing a single execution and *terrible* as the blueprint for a general black-box tool. Remember the injection asymmetry: a tool from a flawed trajectory propagates the flaw to everyone. So the codifier gets only three things: the abstract plan, the set of *queries* that triggered it, and a live sandbox. The plan gives the procedure, the queries define the variation the tool has to cover, the sandbox grounds it in reality.

16:41Juniper: And that's *why* failed tasks can safely count toward the gate. Since the codifier only consumes the query, not the messy trajectory, a failed run adds coverage of the input space without poisoning the tool with the failure itself.

16:58Tyler: Right — the gate and the query-only codification are the same idea from two sides. One says "wait until it recurs," the other says "and when you build it, build from the pattern, not the mess." Together they're the structural fix for the brittleness we measured.
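The query-only hand-off can be made concrete by looking at what the codifier is allowed to see. In this sketch the record layout, the `CodifierInput` type, and `build_codifier_input` are all assumptions for illustration; only the three inputs (plan, queries, sandbox) come from the episode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CodifierInput:
    plan: str                      # the abstract procedure
    queries: Tuple[str, ...]       # the variation the tool has to cover
    sandbox: Callable[[str], str]  # live environment handle for testing drafts


def build_codifier_input(plan: str,
                         runs: List[Dict[str, object]],
                         sandbox: Callable[[str], str]) -> CodifierInput:
    # Keep each run's query, whether or not the run succeeded.
    # The trajectory and the success flag are dropped on purpose.
    queries = tuple(str(run["query"]) for run in runs)
    return CodifierInput(plan=plan, queries=queries, sandbox=sandbox)


runs = [
    {"query": "refund order for Alice", "trajectory": "tried acct 4471, failed", "succeeded": False},
    {"query": "refund order for Bob", "trajectory": "login, lookup, refund", "succeeded": True},
]
spec = build_codifier_input("log in, find order, issue refund", runs, sandbox=lambda cmd: "ok")
```

The failed run contributes its query to the coverage set, while its account number and dead ends never reach the tool builder.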

17:18Juniper: There are two smaller pieces worth one sentence each. Dependency closure: when the agent grabs a tool, the system automatically pulls in every helper function that tool calls, and every helper those call, until nothing's missing. Batteries included in the box, so the lightweight model selecting tools never has to reason about internals. And reflection, the whole memory-building process, runs off the critical path, on finished trajectories, asynchronously. So the expensive part never slows down what a user actually waits for.
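Dependency closure is a transitive closure over a call graph. A minimal sketch, assuming the library keeps a mapping from each tool to the helpers it calls directly; the function name and the example graph are hypothetical.

```python
from typing import Dict, Iterable, Set


def dependency_closure(tool: str, calls: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the tool plus every helper it reaches, directly or indirectly."""
    closure: Set[str] = set()
    stack = [tool]
    while stack:
        name = stack.pop()
        if name in closure:
            continue  # already included; also guards against cycles
        closure.add(name)
        stack.extend(calls.get(name, ()))
    return closure


# Hypothetical call graph: each key calls the helpers listed in its value.
call_graph = {
    "issue_refund": ["find_order", "login"],
    "find_order": ["login", "parse_date"],
    "parse_date": [],
    "login": [],
    "unrelated_tool": ["login"],
}
needed = dependency_closure("issue_refund", call_graph)
```

Selecting `issue_refund` ships `find_order`, `login`, and `parse_date` with it, and leaves `unrelated_tool` out.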

17:58Tyler: So the full object, in one breath: text by default, sorted into plans, facts, and pitfalls; plans get crystallized into tools only after they've proven they recur; and the crystallization works from the clean pattern, not the dirty run. The question now is whether all that machinery actually buys anything.

18:21Juniper: And this is where I want to be careful to frame it as a prediction first. If the theory is right, if text-first-code-earned is actually the correct policy, we should see two things. The agent should get *more accurate*, because the brittle artifacts that dragged it below baseline are gone. And it should get *cheaper to run*, because the tools that survive the gate are exactly the ones worth invoking. More accurate and cheaper at the same time. And that's what they get.

18:53Tyler: Give the numbers.

18:55Juniper: Against a no-memory baseline, Metis comes out up to about twenty percent more accurate while cutting execution cost by almost a quarter: twenty-two-point-eight percent on one split. In absolute terms, it's about eight points of accuracy on the split that tests distribution shift, and about eleven points on the split that tests recurring workloads. Notice which gap is bigger.

19:21Tyler: The recurring-workload one. Which is striking, because that's the easy case — and even there, building tools the right way is worth eleven points over remembering nothing.

19:32Juniper: And there's a nice villain-foil in the lineup. One pure text-based baseline, ACE, actually *inflated* execution cost up to three-point-eight times, nearly quadrupling what the agent had to spend, for a tiny accuracy bump. So Metis isn't just beating "no memory." It's beating the failure mode where memory becomes bloat the agent has to wade through.

19:56Tyler: But the result that actually proves the gate, the desire-path principle, isn't the headline. It's the ablation. And this is the number I'd keep.

20:08Juniper: This is the one that turns the whole philosophy into something you can feel. They built an "Eager" version of Metis: same system, but it codifies after *every* task instead of waiting for recurrence. Pave the path the instant one person walks it. And the Eager version cost forty-seven percent more to build: eleven-point-six million versus seven-point-nine. It scored three and a half points *worse*. And the kicker is in the head-to-head on the identical dev workload: only forty-one percent of Eager's tools were ever invoked, against fifty-six percent for the full gated system. A fifteen-point gap on the same tasks. And given a larger workload, the gated system's tools get exercised even harder: ninety-four percent of them got used.
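The figures quoted for the Eager comparison are easy to check against each other. The snippet below only re-derives the percentages from the numbers stated in the episode; the variable names are illustrative.

```python
# Build cost, in millions, as quoted in the episode.
eager_cost_m = 11.6
gated_cost_m = 7.9

# Relative extra cost of eager codification, in percent.
extra_cost_pct = (eager_cost_m / gated_cost_m - 1) * 100

# Share of each library's tools that were ever invoked on the dev workload.
eager_used_pct = 41
gated_used_pct = 56

usage_gap_points = gated_used_pct - eager_used_pct
eager_unused_pct = 100 - eager_used_pct
```

The ratio works out to roughly 46.8 percent, which rounds to the forty-seven quoted, and the unused share of the Eager library is 59 percent, which is the "over half" in the summary.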

20:59Tyler: So eager codification spent half again as much money to manufacture a library where, even on its home turf, more than half the shelf is pavement leading nowhere: tools the agent never once calls. The gate isn't primarily a cost-saver. It's a *quality filter*. The recurrence requirement is what keeps the library full of paths people actually walk.

21:24Juniper: And one more contrast I liked, on the retrieval side: every single Metis test task got at least one relevant piece of memory back. With one of the baselines, about twenty-one percent of tasks retrieved nothing and had to start from scratch. So coverage, not just quality.

21:44Tyler: Okay. Now I have to do the thing this channel does, because the clean story has a seam in it, and it's the one I flagged at the top.

21:53Juniper: The "code is bad" framing.

21:55Tyler: Right. Because the most quotable result here, code memory falls below no-memory, is true, but it's not the whole truth, and the paper itself is honest enough to tell you so. That dramatic collapse came from a deliberately *minimal* code setup: ungated, straight from trajectories, with no validation. Metis's own code-only variant, same brittle representation but wrapped in the full pipeline, *doesn't* collapse. It stays above baseline. Which means the real claim isn't "code memory is bad." It's "*ungated, trajectory-trained, unvalidated* code memory is bad." That's a much narrower, much more defensible statement, and it's easy to walk out of this paper believing the bumper sticker instead of the truth.

22:46Juniper: That's fair, and I'd add it's one benchmark, one executor model.

22:51Tyler: Which is the rest of it. Everything rides on AppWorld with a single executor model driving. Whether code memory is *this* brittle on other domains, or with a stronger executor that recovers from tool errors more gracefully, is just untested. Clean results on one benchmark don't always survive contact with the next. And look at the comparison bar: the headline twenty percent is against a baseline with *no memory*, which is the easiest possible opponent. Against the strongest rival, on the recurring-workload split, it's basically a tie: sixty-six-point-one versus sixty-five-point-five. Metis's genuine, separating advantage is specifically under distribution shift. That's a real and useful result. It's just a narrower one than "we beat everything by twenty points."

23:44Juniper: I'll concede all of that. The single-benchmark scope is real, and the honest version of the finding is the conditional one: ungated trajectory-based code is the thing that fails. What I'd defend is that the *mechanism* doesn't depend on the benchmark. The injection asymmetry, trusted black box versus filtered advice, is a property of how the two representations get consumed, not a quirk of AppWorld. The exact twenty-two-point collapse is evidence; the asymmetry is the idea.

24:18Tyler: And on that we agree. The principle travels even if the number doesn't. I just don't want anyone quoting "code makes agents worse" as a law when what's proven is "code, built carelessly, makes agents worse."

24:32Juniper: So here's the thing to actually carry out of this. The real result isn't Metis the system, and it isn't the twenty points. It's a correction to an engineering instinct almost everyone shares: that "real" reuse means writing a function, hardening a fuzzy procedure into clean code. This paper shows that instinct can backfire: crystallize too early and you don't just fail to help, you build confident artifacts that suppress the agent's own ability to recover when the world shifts. The durable lesson is that the *form* you store knowledge in should follow the *properties* of that knowledge: text while it's still soft and provisional, code only once a pattern has earned it. That's portable to any system where you're tempted to harden a soft heuristic into rigid logic.

25:24Tyler: So the question for you. Should self-evolving agents get smarter about *when* to crystallize experience into code, the way this paper does: gate it, earn it, keep text as the safe default? Or is the deeper move to drop the black-box-tool idea for learned memory entirely, and keep everything as adaptable advice the agent can always second-guess? Pick a side. We read the replies.

25:50Juniper: If you want to go deeper, the full annotated version of this episode is on paperdive.ai: every technical term tap-to-define, with links to the related papers grouped by theme, including the benchmark itself.

26:04Tyler: Quick housekeeping: this script was written by Anthropic's Claude, Juniper and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is Metis, on bridging text and code memory for self-evolving agents, out June twenty-third, twenty twenty-six. We're recording the day after.

26:26Juniper: So the next time you're tempted to pour the concrete, watch where the trails actually form first. Pave the paths your agent keeps walking.