Episode 086 · May 27, 2026 · 23 min

Why Frozen-Weight Agents Still Get Worse Over Time

Zhu, Ro, Robertson et al.

AI Agent Evaluation

Concepts in this episode

AI Agents · AI Safety · Evaluation & Benchmarks · Agent Memory · Long-Horizon Agents · Silent Failure · Context Management · Hallucination · Context Fatigue · Trajectory Quality · Causal Intervention · Eval Dissociation

About this episode

Paper: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Venue: arXiv:2605.26302
Year: 2026
Read the paper: arxiv.org/abs/2605.26302

A deployed AI agent's model weights never change — but the agent itself ages, and it ages in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times.

What you'll take away

Agent reliability is a lifespan property, not a benchmark snapshot — the memory store, retrieval, and compaction around a frozen model keep changing every session
Four named failure modes: compression, interference, revision, and maintenance aging — split into accumulation-driven and event-driven families
The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals
Three models with nearly identical error rates can have completely different underlying diseases — and 'add more memory' is the wrong fix for two of them
A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system
Production monitoring tends to track constraint compliance while missing silent precision decay — the agent stops violating rules but also stops knowing the specifics
Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change

Chapters

00:00 Four vignettes, one puzzle
02:05 Reframing reliability as a lifespan property
04:10 The four aging mechanisms
06:30 The counterfactual ladder
08:20 Same score, different disease
10:25 The 4.5x compaction-prompt result
12:30 Silent precision decay
14:35 Why scale doesn't save the running budget
16:41 Honest critique
18:46 Production CLI agents and re-reading
20:51 The sticky note fix

References in this episode

MemGPT: Towards LLMs as Operating Systems — Proposes a hierarchical memory system with explicit paging between context and e
Lost in the Middle: How Language Models Use Long Contexts — Empirical evidence that models fail to utilize information even when it's presen
Generative Agents: Interactive Simulacra of Human Behavior — The Park et al. paper that popularized reflection-and-summarization memory archi
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — The original RAG paper, useful background for the episode's distinction between

Full transcript

0:00Cassidy: Take fifty milligrams of metoprolol twice daily. That's what a user wrote into their personal AI assistant on day one. A few weeks later, they ask the agent what medication they're on. The agent answers, confidently: you take a daily medication. Different user. They've mentioned two colleagues in passing — John Smith in sales, and a different person, John Smyth, no I. Weeks later, they ask the agent to draft an email to Smith. The agent drafts it cleanly. Sends it to john dot smyth at the company domain. A third user cancels their premium subscription. Six weeks on, asks the agent about their plan. Agent says: yes, premium, active through January twenty-twenty-six. And a fourth user has a standing therapy appointment, four pm every Tuesday. After a routine memory cleanup the agent does on itself overnight, they ask what's on the calendar for Tuesday. The agent says: nothing.

1:00Finn: Four failures, four different things going wrong under the hood — and the weights of the model running all four of those agents never changed once. That's the puzzle in the paper we're digging into today, "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems," posted to arXiv on May twenty-fifth, twenty-twenty-six — we're recording two days later. Quick ground rules before we get going: this episode is AI-generated, the script is from Anthropic's Claude Opus 4.7, and you're hearing Cassidy and me, Finn — we're both AI voices from Eleven Labs. Neither company is involved in producing the show. And the reason those four vignettes matter is that the standard mental model of AI reliability can't explain any of them. We benchmark a model, the model gets a number, the model gets deployed, the weights are frozen — and we assume the system running in production is the same system we tested. The authors point out, basically: no, it isn't.

2:05Cassidy: Right. Because a deployed agent isn't just the model. The model is the engine. Around it there's a whole apparatus — a memory store that persists across sessions, some policy that decides what to save and what to throw away, a retrieval step that pulls relevant bits back when you ask a question, and the occasional housekeeping operation that compacts old entries so the store doesn't explode. The weights of the language model can be frozen forever, and every other piece of that apparatus is changing every single time you interact with it. Each session's history gets folded into the prior memory state, and the result becomes the new memory state. The memory store accumulates. Similar entries pile up. Old facts get superseded by new ones — sometimes the agent catches the update, sometimes it doesn't. So the paper's reframe is that reliability isn't a snapshot. It's a lifespan property. The agent you deployed in January isn't the same operational system in March, even though the model file is byte-for-byte identical.

3:09Finn: And that reframe has teeth, because if reliability is a lifespan property, then the question changes from "how good is this agent" to "how good is this agent right now, three months in, and what's the shape of its decline." Which is a question almost nobody is set up to answer. The authors give the phenomenon a name — agent aging — and they argue that it isn't one thing. It's at least four mechanistically distinct things. Each of those vignettes Cassidy just walked through is a different mechanism. Worth naming them.

3:42Cassidy: The metoprolol dose becoming "a daily medication" — that's compression aging. Each session, the prior memory state and the new history get folded together into an updated memory state. That fold is lossy. Specifics — the dose, the dollar amount, the exact name — get blurred into generalities a little more each cycle. Think of it like a photocopy of a photocopy. After enough generations, the image degrades — except this kind of degradation is selective. It tends to eat numbers and proper nouns while leaving the prose around them sounding perfectly fluent. The John Smith and John Smyth confusion — that's interference aging. As similar entries accumulate in the store, retrieval starts pulling the wrong one. Two people whose names share most of their letters; the agent can't tell which memory the current query actually wants. The canceled premium plan — that's revision aging. A fact changed in the world. The new fact got logged. But the old fact didn't get properly overwritten, and the agent's still reaching for the stale version. Especially brutal for derived values — running budgets, running balances, anything where you'd need to actually do arithmetic on the history to keep current. And the Tuesday therapy appointment vanishing after a memory cleanup — that's maintenance aging. The agent ran a routine recompaction. Something it considered low-priority got dropped. It happened to be a recurring weekly commitment.
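
The Smith and Smyth mix-up can be made concrete with a toy lookup. This is a minimal sketch, assuming a retriever that ranks stored entries by plain string similarity using Python's standard `difflib`; the episode does not describe the paper's actual retrieval setup, and the `memory` contents, the `retrieve` helper, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical memory entries; the contents are illustrative only.
memory = {
    "john smyth": "John Smyth, john.smyth@example.com",
    "john smith": "John Smith, sales, john.smith@example.com",
}

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] as computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def retrieve(query: str, threshold: float = 0.8) -> list:
    """Return the keys of every entry similar enough to the query, best first."""
    scored = [(similarity(query, key), key) for key in memory]
    return [key for score, key in sorted(scored, reverse=True) if score >= threshold]

candidates = retrieve("John Smith")
near_miss = similarity("john smith", "john smyth")
```

Both entries clear the threshold here, so the exact match wins only by a small margin. Any retriever that is noisier than exact string matching has little room to keep the two apart, and that margin is what shrinks as similar entries pile up.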

5:12Finn: Four vignettes, four mechanisms. And the authors organize them into two families. Compression and interference get worse as the memory store grows, so those are accumulation-driven — they're functions of state size. Revision and maintenance are triggered by specific events: a fact changing, a cleanup operation firing. Event-driven. That distinction matters because the two families need different kinds of monitoring. You can't catch maintenance aging by watching the curve drift; it happens in a step.

5:43Cassidy: And here's where the real intellectual move happens. Because naming four mechanisms is interesting, but it's only useful if you can tell them apart in a deployed system. If you're looking at an agent that gave a wrong answer, can you actually figure out which mechanism caused it? Just by looking at the wrong answer? That's the question the rest of the paper is built around. And the answer is the cleverest thing in the whole work.

6:08Finn: So picture a doctor. Patient comes in with chest pain. The doctor doesn't yet know if it's the heart, the lungs, or the muscle wall. What do they do? They don't guess. They run a sequence of tests that rule out one organ at a time. Each test that comes back clean shifts the diagnosis to the remaining candidates. Process of elimination, structured. The authors do exactly this for an agent. They call it the counterfactual ladder. Three rungs. Rung one is just the agent as it is. You ask the question, it does its normal thing — writes summaries to memory, retrieves what it thinks is relevant, generates an answer. You measure the accuracy. That's the baseline. Rung two: you keep the agent's memory store exactly as it wrote it. Whatever the agent compressed, however lossy, that's what's in the store. But you swap out the retrieval step. You replace it with an oracle that perfectly fetches whatever the agent did write. If the agent's accuracy jumps up at this rung, the gap between rung one and rung two is your retrieval failure — the right information was in memory, the agent just couldn't find it. That's interference aging showing up in the diagnostic. Rung three: you go further. You don't even use what the agent wrote. You inject the ground-truth answer directly into the prompt. The fact, exactly as it was true. If the agent's accuracy jumps again, the gap between rung two and rung three is your write failure — the information was lost at summarization, never made it into the store at all. But if you give it back, the agent can use it. That's compression aging. And whatever error is left even at rung three — that's pure utilization failure. You handed the model the answer and it still got it wrong. The information was there; it just couldn't reason with it.
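
The arithmetic behind the three rungs can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function and field names are hypothetical, the inputs are accuracies in [0, 1] measured at each rung, and the example numbers are made up to show how identical rung-one scores can hide different profiles.

```python
from dataclasses import dataclass

@dataclass
class LadderProfile:
    read_gap: float         # error recovered by oracle retrieval (rung 1 -> rung 2)
    write_gap: float        # error recovered by injecting ground truth (rung 2 -> rung 3)
    utilization_gap: float  # error left even with the ground truth in the prompt

def ladder_profile(acc_agent: float, acc_oracle_retrieval: float,
                   acc_ground_truth: float) -> LadderProfile:
    """Split the agent's total error into read, write, and utilization parts."""
    return LadderProfile(
        read_gap=acc_oracle_retrieval - acc_agent,
        write_gap=acc_ground_truth - acc_oracle_retrieval,
        utilization_gap=1.0 - acc_ground_truth,
    )

def dominant_stage(profile: LadderProfile) -> str:
    """Name the stage that accounts for the largest share of the error."""
    gaps = {
        "read": profile.read_gap,
        "write": profile.write_gap,
        "utilization": profile.utilization_gap,
    }
    return max(gaps, key=gaps.get)

# Made-up accuracies: three agents with the same rung-one score of 0.30.
profiles = {
    "A": ladder_profile(0.30, 0.35, 0.95),
    "B": ladder_profile(0.30, 0.90, 0.95),
    "C": ladder_profile(0.30, 0.32, 0.35),
}
diagnosis = {name: dominant_stage(p) for name, p in profiles.items()}
```

By construction the three gaps sum to the rung-one error. In real measurements a gap can come out slightly negative from sampling noise, so a profile is better read as a rough shape than as an exact decomposition.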

8:02Cassidy: That's a beautiful piece of methodology, Finn, because it converts a really fuzzy question — what went wrong inside this agent — into a concrete, mechanical procedure. You don't need to understand the model internals. You don't need interpretability tools. You just brute-force the diagnosis by progressively replacing upstream components with perfect versions and watching where the error disappears.

8:27Finn: And the oracle isn't the proposed fix. Nobody's actually going to deploy an oracle retriever. It's a diagnostic instrument. The point is that running the ladder tells you where to spend your engineering effort. If the gap closes at rung two, build better retrieval. If it closes at rung three, fix your compaction policy. If the gap never closes, the model itself is the bottleneck and no amount of memory engineering is going to save you.

8:54Cassidy: And the empirical result that comes out of running this — fourteen models, seven scenarios, a few hundred runs — this is where the paper earns its keep, and it's genuinely surprising. Three different models, three different deployment scenarios. All of them clustered at roughly the same overall error rate — somewhere in the sixty to eighty percent range. Narrow band. If you were looking at a leaderboard, you'd say these models are basically interchangeable on this task. Then you run the ladder. And the breakdowns are completely different. One model's error is almost entirely at the write stage — compression aging. The summarizer is throwing away the details. Another model has the same total error, but it's almost entirely at the read stage — the information is in memory, retrieval can't find it. A third one's error is almost entirely at utilization — perfect retrieval, perfect ground truth in the prompt, model still gets it wrong. Same number on the scoreboard. Three completely different diseases.

9:57Finn: There's an analogy here that, for me, locks this in. Three students all score thirty percent on the same exam. One didn't study at all — didn't write anything to memory. One studied the wrong material — write was fine, but the retrieval is fetching the wrong stuff. The third studied correctly and then panicked on the day and couldn't apply what they knew — memory's fine, utilization failed. Three identical scores. Three completely different remediations. Telling all three of them "study more" only helps one. Telling them "study smarter" only helps another. The third one needs to work on test-day composure, which has nothing to do with the material at all. That's the agent benchmark in microcosm. Same wrong answer, different repair.

10:44Cassidy: And the practical consequence is brutal, because the default reaction in industry when an agent gets worse is to give it more memory. Bigger context window. More aggressive retrieval. The paper shows that for two of those three students, that's exactly the wrong intervention. The one with utilization failure already had the answer handed to them. More memory does nothing. The one with retrieval failure has the information stored but can't fetch it. They don't need more storage; they need a better fetch.

11:16Finn: It's the diagnostic version of what general medicine had to learn over a century. "You seem sick, take more medicine" is not a treatment plan. The paper's argument is that the field of agent engineering is roughly at the pre-diagnostic stage right now. Symptom recognition without root-cause inference.

11:35Cassidy: Okay. The next finding I want to land is — for me — the most actionable single result in the paper. Because it's a one-paragraph fix that produces an enormous effect. The authors run an experiment where they take the same model, same memory architecture, same scenarios. The only thing they change is the compaction prompt — the instruction the model receives when it's writing the end-of-session summary. One version they call lossy. It says, roughly: summarize in at most three hundred words, focus on the most important points. That's it. Pretty standard. The other they call careful. It enumerates four categories of things to preserve verbatim — dollar amounts, dates, names, and technical constraints. That's the entire change. One paragraph of instructions, different. The agent's half-life — how many sessions until it's lost half its day-one accuracy — differs by roughly four and a half times between those two prompts.
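
The half-life metric as described here, sessions until accuracy falls to half its day-one value, is simple to compute from an accuracy curve. A minimal sketch, assuming one accuracy measurement per session; the `half_life` helper and both curves are hypothetical and are not the paper's data.

```python
from typing import List, Optional

def half_life(accuracy_by_session: List[float]) -> Optional[int]:
    """Return the first session number (1-based) whose accuracy is at or
    below half of the session-one accuracy, or None if that never happens."""
    if not accuracy_by_session:
        return None
    threshold = accuracy_by_session[0] / 2
    for session, accuracy in enumerate(accuracy_by_session, start=1):
        if accuracy <= threshold:
            return session
    return None

# Made-up curves for two compaction prompts on the same system.
lossy_curve = [0.90, 0.60, 0.40, 0.30, 0.20]
careful_curve = [0.90, 0.88, 0.85, 0.83, 0.80, 0.75, 0.70, 0.62, 0.50, 0.44]

lossy_half_life = half_life(lossy_curve)
careful_half_life = half_life(careful_curve)
```

A curve that never drops to the threshold returns `None`, which is worth handling explicitly when runs are short.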

12:34Finn: Four and a half times. Same model, same architecture, only the compaction instruction differs.

12:40Cassidy: Same everything. The model wasn't the bottleneck. The compaction policy was. One paragraph of meta-instruction is responsible for a four-and-a-half-times lifespan difference on the same underlying system. It's the lossy-compression problem with a fix. The default summarization throws away specifics because specifics aren't what summarization usually rewards. Summarization rewards fluency and brevity. If you don't explicitly tell the summarizer "preserve the numbers, preserve the names, preserve the dates" — it won't. They're not what summaries are normally about.
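
To make the contrast tangible, here is a sketch of the two instruction styles alongside a check for whether specifics survived a summary. The prompt strings are paraphrases built from the description in this episode, not the paper's wording, and `missing_specifics` is a hypothetical helper that does a plain verbatim substring check.

```python
from typing import List

# Paraphrased from the episode's description; not the paper's exact prompts.
LOSSY_PROMPT = (
    "Summarize the session in at most 300 words. "
    "Focus on the most important points."
)
CAREFUL_PROMPT = (
    "Summarize the session. Preserve the following verbatim, exactly as stated: "
    "dollar amounts, dates, names, and technical constraints."
)

def missing_specifics(summary: str, specifics: List[str]) -> List[str]:
    """Return the specifics that do not appear verbatim in the summary."""
    return [item for item in specifics if item not in summary]

specifics = ["50 mg", "metoprolol", "twice daily"]
vague_summary = "The user takes a daily medication."
exact_summary = "The user takes 50 mg of metoprolol twice daily."

lost_in_vague = missing_specifics(vague_summary, specifics)
lost_in_exact = missing_specifics(exact_summary, specifics)
```

A check like this only catches verbatim loss. It would miss a summary that keeps a number but attaches it to the wrong fact.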

13:16Finn: There's something I want to push on next, though, because it threatens how production monitoring actually works in most companies right now. The decoupling between behavioral compliance and factual accuracy. In the lifestyle-assistant scenario, the authors track two things in parallel. Constraint violation — does the agent ever recommend something that breaks a rule the user set, like exceeding a dietary limit or a budget cap. And constraint precision — does the agent know the specific value the rule refers to. The violation rate stays near zero across the agent's lifespan. The agent never breaks a rule. Looks great on the dashboard. The precision rate falls off a cliff. The agent has forgotten that the budget cap was eight hundred dollars. It thinks the cap is some vague "monthly spending limit." It still doesn't violate the cap — it's being cautious in general. But if you asked it what the cap actually is, it couldn't tell you.

14:17Cassidy: Which means every production system that monitors agent behavior by watching for constraint violations is completely blind to this. The agent is decaying in exactly the way you'd care about — losing the specifics it was supposed to remember — and the monitor reads clean. Green light on the dashboard. Silent precision decay.

14:37Finn: And it's not a quirk of one scenario. The decoupling is structural. Compliance is a behavioral property — you can monitor it by watching outputs. Precision is a factual property — you'd need to ground-truth against what the value actually is. Production systems almost never do the second one, because they don't have a reliable ground-truth source at deployment time.
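
The gap between the two metrics shows up clearly when both are computed over the same probes. A minimal sketch, assuming each probe records what the agent proposed to spend and what it says the cap is; the `ProbeResult` type and the probe values are hypothetical, and the 800 dollar cap is the example from the discussion above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProbeResult:
    recommended_spend: float      # what the agent proposed to spend
    stated_cap: Optional[float]   # the cap the agent reports, None if it cannot say

def violation_rate(results: List[ProbeResult], true_cap: float) -> float:
    """Fraction of probes where the recommendation exceeds the real cap."""
    return sum(r.recommended_spend > true_cap for r in results) / len(results)

def precision_rate(results: List[ProbeResult], true_cap: float) -> float:
    """Fraction of probes where the agent states the exact cap."""
    return sum(r.stated_cap == true_cap for r in results) / len(results)

TRUE_CAP = 800.0
# Made-up probes over an agent's lifespan: cautious throughout, vague after session one.
probes = [
    ProbeResult(recommended_spend=500.0, stated_cap=800.0),
    ProbeResult(recommended_spend=450.0, stated_cap=None),
    ProbeResult(recommended_spend=300.0, stated_cap=None),
    ProbeResult(recommended_spend=200.0, stated_cap=None),
]

violations = violation_rate(probes, TRUE_CAP)
precision = precision_rate(probes, TRUE_CAP)
```

The precision metric needs `TRUE_CAP`, a ground-truth value. That requirement is the reason it is hard to run in production.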

15:01Cassidy: Okay, one more empirical finding before we get to the critique, because it's the cleanest demonstration that scale doesn't always help. The authors set up a scenario where the agent has to track a running budget. Initial value, then a sequence of deltas — spent twelve dollars on this, refunded thirty on that, recurring fifteen on the other. The probe asks: what's the current balance. Tiny model — seven billion parameters. Big model — well over a hundred billion. Both drift. Both produce roughly similar errors on the running total as the session count goes up.

15:36Finn: Which is striking, because intuitively the big model should crush this. More parameters, more reasoning capacity, surely better at arithmetic over a long history.

15:47Cassidy: Right. But the failure isn't a capacity problem. It's a representational one. The agent is storing the history as a text summary. The text summary throws away the arithmetic structure. There's no actual ledger anywhere — just prose summaries of prose summaries, and the running total gets smudged a little more each round. Picture tracking a household budget on a chalkboard. You write the starting number. Then you erase and rewrite as expenses come in. But each erasure smudges what's underneath, and after enough updates the current balance is a fuzzy approximation. A bigger chalkboard doesn't help. The problem is that you're using chalk.

16:27Finn: That's the key. Scale buys you better fluency, better reasoning over what's in front of you. It doesn't buy you a persistent ledger. The persistent ledger isn't in the model. It has to be in the harness around the model.

16:40Cassidy: Which sets up the last constructive piece of the paper. But Finn, give me the honest critique first, because the paper deserves it.

16:48Finn: Sure. A careful reviewer would flag a few things. The scenarios are synthetic. The authors are upfront about this — they generate task streams programmatically rather than drawing from production traces. Their argument is that you have to control the pressure surface to disentangle the four mechanisms, otherwise everything's happening at once and you can't attribute anything cleanly. Fair. But real deployments have correlated structure across these mechanisms that the generators don't capture. The specific half-life numbers shouldn't be read as predicting real-world deployment lifetimes. They're relative comparisons under controlled stress.

17:28Cassidy: That's a fair framing. The numbers are for comparing conditions, not for predicting how many weeks until your production assistant breaks.

17:37Finn: Second. The counterfactual ladder produces what the authors call diagnostic profiles, not unique causal decompositions. In the cleanest case — where memory has a clear retrieval step you can swap out — the three rungs cleanly separate write, read, and utilization. But for some memory architectures, like ones that just produce a single summary blob with no retrievable units, the write and read errors get merged. The diagnostic is honest about this. It's just narrower in scope than the framing sometimes suggests. Third — and this one I think is the most load-bearing critique — the headline aging curves are measured against compaction-based summarization, which is the simplest possible memory architecture. The four-and-a-half-times compaction-prompt gap is dramatic partly because the baseline is so simple. Modern production memory stacks use vector retrieval, graph-based memory, hybrid approaches. We don't yet know how those age. The mechanisms might look qualitatively different at that level of sophistication.

18:39Cassidy: That's the future work the paper explicitly flags. And I'd add — the session counts they test are short relative to real deployment. Most runs go eight to twelve sessions. Some go to thirty. Real production agents might run thousands of interactions over a year. The paper is extrapolating from short horizons to claims about long-lived systems, and the mechanisms could behave qualitatively differently at much larger scales.

19:07Finn: All true. None of this undermines the core contribution, which is the framework. The vocabulary of four mechanisms, the counterfactual diagnostic, the lifespan framing — these are the kind of conceptual tools the field genuinely needed and didn't have. The benchmark is the vehicle for demonstrating the framework. The framework is the thing.

19:29Cassidy: One more finding worth surfacing before we close, because it inverts what you'd expect. The authors test production CLI agents — Claude Code, OpenHands — systems where the agent is managing its own workspace, writing files, reading them back. They find that even when the files in storage are perfect, the agent still gets things wrong at probe time. Why? Because it doesn't re-read enough. Correct answers consistently involve more retrieval activity than incorrect ones. Even more interesting: in the Claude Code family they tested, the flagship model actually does worse on some artifact-fidelity metrics than smaller models in the same family. When they force the flagship to re-read aggressively, the recall gap closes — but the code-writing fidelity barely moves. Because the artifacts are already on disk. Re-reading at probe time can't fix code that was already written poorly.

20:26Finn: So the bigger model reasons better, but writes lower-fidelity outputs to begin with. And no amount of re-reading at probe time fixes what was already badly written. Which is, again, exactly the paper's point. The model is one piece. The pipeline is the rest of it.

20:43Cassidy: And that brings me to the last piece I want to land, because the paper closes on something genuinely hopeful and concrete. They run a small intervention experiment. They take the running-budget scenario — the one where compression keeps destroying the arithmetic — and they add a small structured sidecar alongside the normal memory. A typed-state overlay, described in the appendix, that holds the kind of exact values the prose summaries kept smudging. It's updated at the end of each session, and it isn't subject to lossy compaction. The accumulator error drops by roughly twenty-five to nearly fifty percent on that scenario. No model change. No retraining. No bigger memory. One small piece of structured external state. It's the sticky note on the fridge. You can't trust your memory of the grocery list to survive the drive home, so you write the three items down. The sticky note isn't more intelligence. It's structured external state that bypasses the lossy step. The typed-state overlay is the agent's sticky note for the things that need to be exact.
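
One way to picture the sidecar is as a small typed record that is updated with exact arithmetic and handed to the model next to the prose summary. This is a sketch under stated assumptions, not the overlay from the paper's appendix: the `BudgetState` class, its fields, the starting balance, and the rendered format are all hypothetical.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Tuple

@dataclass
class BudgetState:
    """Exact running balance kept outside the lossy prose summary."""
    balance: Decimal
    history: List[Tuple[Decimal, str]] = field(default_factory=list)

    def apply(self, delta: str, note: str) -> None:
        """Apply one signed change, such as '-12.00' for a purchase."""
        amount = Decimal(delta)
        self.balance += amount
        self.history.append((amount, note))

    def render(self) -> str:
        """Render the exact value as a line to place in the prompt."""
        return f"[typed state] current_balance={self.balance}"

# Hypothetical starting balance, followed by the three kinds of deltas
# mentioned in the running-budget scenario.
state = BudgetState(balance=Decimal("500.00"))
state.apply("-12.00", "purchase")
state.apply("30.00", "refund")
state.apply("-15.00", "recurring charge")

prompt_context = (
    "Prose summary: the user is tracking a monthly budget.\n" + state.render()
)
```

`Decimal` is used instead of floats so the ledger stays exact across many updates. The prose summary can still blur, but the balance no longer depends on it.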

21:48Finn: And that's the whole argument compressed into one fix. The bottleneck isn't the model. The bottleneck is the architecture around the model. And the fix isn't bigger — the fix is structural.

22:00Cassidy: Taken together, what you get from this paper is a real shift in how to think about a deployed agent. Not a fixed thing you benchmark once. A system with a lifespan, multiple failure modes, and a diagnostic procedure that tells you where to repair when it starts to slip. The four vignettes we opened with — the medication dose, the misaddressed email, the canceled plan, the disappeared therapy appointment — those aren't quirky one-offs. They're four named, distinguishable diseases. And now there's a way to tell them apart.

22:32Finn: Which is the kind of paper that doesn't crown a winner. It changes what the leaderboard is even measuring.

22:38Cassidy: The show notes have a link to the paper and some related reading if you want to keep pulling on this thread. And if you want the full transcript with definitions inline, plus the concept pages that link this episode to others we've done on agent reliability, that's all on paperdive.ai.

22:56Finn: Thanks for listening to AI Papers: A Deep Dive.
