Memory is inference, not retrieval
The benchmark work on memory staleness draws the field's sharpest line: having the new fact in the prompt is not enough if nothing marks the old one as superseded. The problem is worst for 'propagated' conflicts, where common-sense reasoning has to retire a belief that no one explicitly revoked. The same model can ace staleness recognition and still fail catastrophically when a question quietly assumes the stale fact, and popular memory frameworks underperform the raw model on exactly these cases. Moving adjudication to write time recovers most, but not all, of the gap E031. Mechanistically, the picture is worse for small backbones: routing competence (choosing add, update, or delete) comes online before content comprehension. The result is a silent-failure regime in which a model confidently overwrites memories it doesn't understand, and end-to-end benchmarks miss it because the JSON stays valid E023.
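A minimal sketch of what write-time adjudication could look like, assuming facts are keyed strings and that each fact lists the keys it quietly assumes. The names `Fact`, `MemoryStore`, and `depends_on` are hypothetical and not taken from the benchmark; in a real system the dependency judgment would come from a model rather than a hand-written field.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    key: str                 # e.g. "user.city"
    value: str
    depends_on: tuple = ()   # keys this fact quietly assumes (hypothetical field)
    superseded: bool = False


@dataclass
class MemoryStore:
    facts: list = field(default_factory=list)

    def write(self, new: Fact) -> list:
        """Adjudicate at write time and return the facts this write retired."""
        live = [f for f in self.facts if not f.superseded]
        # Direct conflict: same key, different value.
        direct = [f for f in live if f.key == new.key and f.value != new.value]
        # Propagated conflict: a live fact assumed the value that just changed.
        # Only one level of dependency is followed in this sketch.
        propagated = [f for f in live if new.key in f.depends_on] if direct else []
        retired = direct + propagated
        for f in retired:
            f.superseded = True
        self.facts.append(new)
        return retired

    def active(self) -> dict:
        """Facts that are still safe to place in the prompt."""
        return {f.key: f.value for f in self.facts if not f.superseded}


store = MemoryStore()
store.write(Fact("user.city", "Berlin"))
store.write(Fact("user.gym", "Berlin Fitness", depends_on=("user.city",)))
retired = store.write(Fact("user.city", "Lisbon"))
print([f.key for f in retired])  # ['user.city', 'user.gym']
print(store.active())            # {'user.city': 'Lisbon'}
```

The gym entry is the propagated case: nobody revoked it, but the move retires it. A question that assumes the old gym then finds no stale fact to lean on.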
Consolidation is a learnable skill
Once memory is more than a log, the interesting question is the slow loop: consolidation. Splitting agent memory into a fast writer and a slow consolidator, with forgetting as the default and a 'thief test' reward that scores each entry by what masking it does to task success, produces memory banks an order of magnitude smaller at higher success, and the consolidation skill transfers zero-shot across domains and backbones E064. Graph-structured experience with a placebo-controlled training signal (run the executor with and without memory, reward only the difference) lets a 3B copilot write a playbook that improves a frozen 32B executor E106. The most radical move dispenses with the notebook entirely: agents distill experience into flashcards and train them into a small writable slice of their own weights mid-conversation. There, the structure of what gets written matters more than the act of writing, with QA flashcards delivering four times the value of raw transcripts E114.
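Both reward signals above reduce to a difference between two runs, which a short sketch can make concrete. It assumes a black-box `evaluate` callable that maps a list of memory entries to a task success rate. That callable, the function names, and the toy evaluator are all hypothetical. The source describes these signals as training rewards, so the direct filtering in `consolidate` is only a stand-in for a learned consolidator.

```python
from typing import Callable, List, Sequence

# Hypothetical black box: memory entries in, task success rate out.
Evaluator = Callable[[Sequence[str]], float]


def thief_scores(entries: Sequence[str], evaluate: Evaluator) -> List[float]:
    """Score each entry by how much success drops when it alone is masked."""
    baseline = evaluate(entries)
    scores = []
    for i in range(len(entries)):
        masked = [e for j, e in enumerate(entries) if j != i]
        scores.append(baseline - evaluate(masked))
    return scores


def placebo_reward(entries: Sequence[str], evaluate: Evaluator) -> float:
    """Reward only the difference between running with and without memory."""
    return evaluate(entries) - evaluate([])


def consolidate(entries: Sequence[str], evaluate: Evaluator,
                threshold: float = 0.0) -> List[str]:
    """Forgetting as the default: keep an entry only if masking it hurts."""
    scores = thief_scores(entries, evaluate)
    return [e for e, s in zip(entries, scores) if s > threshold]


def toy_evaluate(memory: Sequence[str]) -> float:
    # Toy stand-in for real task runs: one entry helps, the rest are noise.
    return 0.75 if "retry on HTTP 429" in memory else 0.25


bank = ["retry on HTTP 429", "user said hello"]
print(thief_scores(bank, toy_evaluate))    # [0.5, 0.0]
print(consolidate(bank, toy_evaluate))     # ['retry on HTTP 429']
print(placebo_reward(bank, toy_evaluate))  # 0.5
```

Leave-one-out scoring costs one evaluation per entry, and it underrates entries that are redundant with each other. A real setup would need to budget for the first and decide how to handle the second.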
Frozen weights still age
Reliability is a lifespan property, not a day-one benchmark score. The memory store, retrieval, and compaction around a frozen model keep changing every session, and agents age through compression, interference, revision, and maintenance failures that look identical in error rates but require opposite repairs. A counterfactual diagnostic ladder separates write, read, and utilization failures without access to model internals, and a one-paragraph change to the compaction prompt that names what must be preserved verbatim extends useful lifespan roughly 4.5x E086. The security corollary belongs here too: persistent memory means an attack can be planted once and fire days later in someone else's session, a risk taken up in the security topic E113.
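One way to picture a counterfactual ladder of this kind is as a sequence of black-box probes, each swapping one stage for an oracle. The sketch below is an illustration under assumptions, not the procedure from the cited work: `diagnose`, `retrieve`, and `answer_ok` are hypothetical names, and exact string matching stands in for whatever check decides that a needed fact is present in the store.

```python
from enum import Enum
from typing import Callable, List, Sequence


class Failure(Enum):
    NONE = "none"
    WRITE = "write"              # the fact never made it into the store
    READ = "read"                # stored, but retrieval missed or buried it
    UTILIZATION = "utilization"  # handed the fact directly, still wrong


def diagnose(
    needed_fact: str,
    store: Sequence[str],
    retrieve: Callable[[Sequence[str]], List[str]],
    answer_ok: Callable[[Sequence[str]], bool],
) -> Failure:
    """Classify one failure using only black-box runs of the agent."""
    # Rung 0: the agent as deployed.
    if answer_ok(retrieve(store)):
        return Failure.NONE
    # Rung 1: bypass memory entirely and hand over the needed fact.
    if not answer_ok([needed_fact]):
        return Failure.UTILIZATION
    # Rung 2: the model can use the fact, so check whether it was stored.
    if needed_fact not in store:
        return Failure.WRITE
    # Stored and usable, so the retrieval stage is what lost it.
    return Failure.READ


fact = "deadline is Friday"
uses_fact = lambda ctx: fact in ctx   # toy agent: right iff the fact is in context
fetch_all = lambda s: list(s)         # toy retriever that returns everything
fetch_none = lambda s: []             # toy retriever that returns nothing

print(diagnose(fact, [], fetch_all, uses_fact))             # Failure.WRITE
print(diagnose(fact, [fact], fetch_none, uses_fact))        # Failure.READ
print(diagnose(fact, [fact], fetch_all, lambda ctx: False)) # Failure.UTILIZATION
```

The three outcomes can show up as the same error rate, yet each points at a different repair: the writer, the retriever, or the prompt that presents memory to the model.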
Episodes anchoring this topic
- When Your AI Assistant Won't Let Go of Old Facts About You
Reframed agent memory from retrieval to inference, with the recognize-but-still-act-on-stale-facts gap.
- When Agent Memory Stops Being a Database and Starts Being a Skill
Established consolidation as a learnable, transferable skill with forgetting as the default.
- Why Frozen-Weight Agents Still Get Worse Over Time
Named the four mechanistically distinct aging modes of frozen-weight agents and the counterfactual ladder that tells them apart.
- Agents That Rewrite Their Own Weights Instead of Just Taking Notes
Moved memory from prompt space into the weights themselves via mid-episode flashcard training.
- Giving Agents a Notebook Instead of New Weights: How ExpGraph Lets Frozen Models Learn
Introduced the placebo-controlled utility reward for experience and showed a small copilot improving a big executor.
- Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
Reported the mechanistic finding that routing competence precedes content comprehension, which predicts silent memory corruption in small backbones.