The Bug Where Smart Assistants Read a Fact and Still Forget It
About this episode
A frontier model can read that you moved to the suburbs and still insist it has no idea where you live — and neither a bigger model nor 24x more memory closes that gap. This paper argues every AI lab that shipped persistent memory in 2026 is treating a behavior problem as a storage problem, and shows the one intervention that actually moves the needle.
What you'll take away
- Why a model can score 92% reading the full conversation but drop to 77% maintaining the same facts from compressed notes — and why that 'supersession gap' is a maintenance problem, not a comprehension one
- The 13-to-1 result showing the failure is real and one-directional, not statistical noise
- Why a bigger model and 24x more memory both fail to close the gap, with the desk-clutter intuition for why extra storage helps and hurts in equal measure
- How a reinforcement-learning reward that targets which version of a fact is *current* nearly doubles held-out accuracy on a small model (9% to 16.7%)
- The training curve that 'switches on' exactly when the behavior is learned — the cleanest evidence it's real learning, not luck
- Why the headline training result is a single-seed proof of mechanism, and the specific cracks (lenient matching, small question counts, one kind of scale) the episode is honest about
Chapters
- 00:00 A fact it read and lost
- 01:45 Why memory becomes a sticky note
- 04:00 Is the failure even real?
- 06:37 Does a smarter model save you?
- 07:38 Can you just buy more memory?
- 11:33 Training the model to keep facts current
- 16:16 The curve that switches on
- 18:06 Where the result actually lands
- 21:44 Train the habit or change the substrate?
References in this episode
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory — The benchmark this episode's diagnosis is built on — its knowledge-update questi
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — The paper that introduced GRPO, the critic-free reinforcement learning method wh
Full transcript
0:00Bella: A top-tier language model gets asked where Rachel moved to. The answer — the suburbs — is sitting right there in the conversation it just had. And the model replies: "I have no information about Rachel." It isn't confused. It read the sentence. It just lost the fact.
0:19Eric: Heads up before we start — this is an AI-made explainer, and the two voices you're hearing are AI as well. Now, that failure has a number attached to it. That same frontier model scores 92% on these update questions when it can re-read the whole conversation. The moment it has to work from its own short-term memory instead, it drops to 77%. That fifteen-point hole is what this whole paper is about.
0:47Bella: And the part that should bother you is what doesn't close that hole. A bigger, smarter model? The gap survives. Twenty-four times more memory to work with? It recovers, and I mean this literally, nothing. By the end you'll understand why both of the obvious fixes fail — and the one intervention that actually moves the needle.
1:09Eric: This is a paper called "Supersede," posted to arXiv on June 25th, 2026, by a researcher named Vedant Patel. And the reason a non-specialist should care is simple: every major AI lab shipped persistent memory into its assistant in 2026. OpenAI's "Dreaming," Claude's memory, Gemini's personalization. The instant you give an assistant memory, you inherit the job of keeping that memory current as the user's life changes. And this paper argues the whole industry is treating that as a storage problem when it's really a behavior the model has to be trained to do.
1:48Bella: Let's nail down the setup, because every result rides on one distinction. A language model reads everything you give it inside a fixed buffer — its working memory for a single interaction, the context window. The trouble is a conversation that runs for months can't fit. The paper mentions over a hundred thousand tokens for a single question in some conditions. Too long, too slow, too expensive.
2:16Eric: So real systems don't keep the whole transcript.
2:19Bella: Right — they have the agent write itself a compressed summary, a notes field, and carry only that forward. The raw history is thrown away. Think of it as the difference between answering a question with the entire email thread open in front of you, versus answering from a sticky note you scribbled three weeks ago. Now here's the catch that makes this hard. When a fact changes — you moved, a price changed, a plan got revised — the agent has to actively notice the change and overwrite its own note. If it doesn't, the stale value just sits there and poisons every answer after it.
2:59Eric: And the paper gives that act a name.
3:01Bella: Supersession. Correctly recognizing that a new value cancels an old one, and using the current one while discarding what's stale. It's not enough to know your friend got married — you have to stop calling her by her maiden name. The accuracy an agent loses when it has to do this from compressed notes instead of full context, that's the supersession gap. And the grounding fact that makes this real: OpenAI's own internal evaluation of updating outdated context self-reports about 75% success. That's up from under 10% in 2024 — huge progress — but it means one in four time-sensitive updates still slips. So the paper asks four questions, in order. Does this failure actually exist as something distinct? Does a bigger model fix it? Does a bigger memory fix it? Does training fix it?
3:54Eric: And the answers, just to set the table, are yes, no, no, and then yes.
3:59Bella: That's the spine. Let's take them one at a time. Question one: is this a real, separate failure, or just the model being bad at reading? The design here is the cleverest part of the whole paper. Patel takes the knowledge-update questions from LongMemEval — a trusted benchmark where a fact is stated in one session and changed in a later one — and runs the exact same questions under two conditions.
4:26Eric: Same questions, that's the key.
4:28Bella: Same questions. In the first condition, full-context, every session is in the model's window — it can re-read anything. That's the ceiling, where memory is never the bottleneck. In the second, bounded-memory, the agent sees one session at a time, maintains a notes field of about 300 characters, and — this is the crucial rule — the raw sessions are never fed back. The only thing that survives is what the agent chose to write down. Because the questions are identical, the difference between the two conditions is purely the cost of maintaining memory. Nothing else moved.
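A minimal sketch of the two conditions, to make the "raw sessions are never fed back" rule concrete. Everything here is illustrative: the function names, the `first_write_wins` stand-in for the model, and the "downtown" starting fact are assumptions, not the paper's code or data. The 300-character cap is the "about 300" figure quoted above.

```python
from typing import Callable, List

NOTES_LIMIT = 300  # characters; the episode says "about 300"


def full_context_prompt(sessions: List[str], question: str) -> str:
    """Ceiling condition: every session is visible at answer time."""
    return "\n\n".join(sessions) + "\n\nQuestion: " + question


def bounded_memory_prompt(
    sessions: List[str],
    question: str,
    update_notes: Callable[[str, str], str],
) -> str:
    """Bounded condition: only the self-written notes reach answer time."""
    notes = ""
    for session in sessions:
        # The agent sees its old notes plus ONE session, then rewrites the notes.
        notes = update_notes(notes, session)[:NOTES_LIMIT]
        # The raw session is dropped here and never shown again.
    return "Notes: " + notes + "\n\nQuestion: " + question


def first_write_wins(notes: str, session: str) -> str:
    """Toy stand-in for a model that never overwrites its first note."""
    return notes if notes else session


# Hypothetical two-session history (the first fact is invented for illustration).
sessions = ["Rachel lives downtown.", "Rachel moved to the suburbs."]
question = "Where does Rachel live now?"

full = full_context_prompt(sessions, question)
bounded = bounded_memory_prompt(sessions, question, first_write_wins)
```

With this toy updater, the current value is present in `full` but absent from `bounded`, so no amount of reading skill at answer time can recover it. Swapping in a different `update_notes` changes only the maintenance policy, which is the variable the two-condition design isolates.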
5:05Eric: It's two students taking the same exam on a long novel. One has the book open on the desk and can flip to any page. The other read it once and may only consult the half-page of notes they jotted while reading. Ask them both where the main character ends up living. If the note-taker never wrote down the move, they're stuck — not because they're dumber, but because the fact never made it onto the card.
5:32Bella: Exactly that. And the result is clean. On the frontier model, full-context got thirteen questions right that bounded-memory got wrong. The reverse — bounded winning where full-context lost — happened once. Thirteen to one.
5:46Eric: Thirteen to one. That's not noise.
5:49Bella: That's the whole point of the statistical test they run — a check on whether the errors all fall the same direction or could be a coin flip. Thirteen versus one is overwhelmingly one-directional. Highly significant. So the gap is real, and it's specifically about maintenance, not comprehension.
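The episode does not name the test. One standard check of this kind is the exact McNemar (sign) test on the discordant pairs, so the sketch below assumes that test and uses only the thirteen-to-one split quoted above.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant question is a fair coin flip
    between "only condition A right" and "only condition B right".
    """
    n = b + c
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 13 questions only full-context got right, 1 only bounded-memory got right.
p = mcnemar_exact_p(13, 1)
print(f"p = {p:.4f}")  # prints "p = 0.0018"
```

Questions that both conditions got right, or both got wrong, do not enter the calculation. Only the fourteen disagreements carry information about direction.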
6:08Eric: Although there's a wrinkle in your student analogy that the paper actually leans into. The agent wrote its own notes as it was reading. So it's both the note-taker and the test-taker. The failure isn't just bad recall — it's failing to anticipate, in the moment, that this fact would matter later and needed overwriting.
6:29Bella: Which is why "just understand better" won't save you. And that sets up question two. The default reflex in this field is: capability problems get absorbed by scale. Make the model bigger and the rough edges smooth out. So — does a stronger model close the gap?
6:48Eric: This is where I'd have bet yes, honestly.
6:51Bella: Most people would. Patel runs it across a small model and the frontier model. Watch the two lines. Full-context accuracy climbs and saturates — the small model already reads at 82%, the frontier model tops out around 92. As a pure reader of context, the model basically solved it. But the bounded-memory line? The small model sits at 63, and the frontier model crawls to 77. It improves, but it never catches up, and the gap stays wide open.
7:22Eric: So a better reader of context is still a poor maintainer of memory. The skill that scaled is not the skill that's failing.
7:31Bella: That's the sentence. Comprehension scaled. Maintenance didn't. So scaling the model is off the table.
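The arithmetic behind "never catches up", using only the four accuracy figures quoted above (the dictionary layout is just for illustration):

```python
# Accuracy in percent, as quoted in the episode.
accuracy = {
    "small": {"full_context": 82, "bounded_memory": 63},
    "frontier": {"full_context": 92, "bounded_memory": 77},
}

# Supersession gap = full-context accuracy minus bounded-memory accuracy.
gaps = {
    name: scores["full_context"] - scores["bounded_memory"]
    for name, scores in accuracy.items()
}

for name, gap in gaps.items():
    print(f"{name}: gap of {gap} points")
```

On these figures the gap is 19 points for the small model and 15 for the frontier model. It narrows, but it does not come close to zero.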
7:38Eric: Which brings us to the experiment I think is the sharpest single piece of reasoning in the paper — question three. The intuitive objection at this point is obvious: the memory is just too small. Three hundred characters? Of course it's failing. Give it more room.
7:57Bella: And you can't just make the memory bigger and call it done, because two things move together when you do that — how long the conversation is, and how compressed the notes are relative to it. Patel pulls them apart.
8:12Eric: Step one: grow the conversation alone. Stretch it twenty-four times longer — from a couple of sessions to around forty-eight — keeping the memory the same size. Accuracy collapses. Sixty-eight percent down to twenty-eight. A forty-point drop.
8:30Bella: So memory pressure clearly matters.
8:32Eric: It does. Step two — and this is the test. Now give the agent twenty-four times more memory to match the longer conversation. Same ratio it started with. The intuitive prediction is that accuracy climbs back toward sixty-eight. So what's the recovery?
8:50Bella: Go on.
8:51Eric: Zero. Twenty-eight percent to twenty-eight percent. Exactly nothing. Twenty-four times the memory bought no recovery at all.
9:00Bella: Okay, but hold on — my first instinct is that the model just ignored the extra space. It wrote the same short note and left the rest blank.
9:09Eric: That's the natural read, and it's wrong — which is what makes this beautiful. They checked. Every single one of the twenty-five answers changed between the two conditions. All twenty-five. The extra memory was absolutely being used. It just helped on some questions and hurt on exactly as many others.
9:29Bella: So the net was a wash.
9:31Eric: A perfect wash. Picture a desk that keeps getting more cluttered. A bigger desk doesn't help you find the one current phone number if you also keep more old, crossed-out, half-relevant scraps lying around. More surface area is more room for the right note and more room for stale notes to mislead you. The failure tracks the length of the conversation, not the compression ratio. You cannot buy your way out with capacity.
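A small sketch of how the experiment pulls length and compression apart. Two numbers here are assumptions: the baseline of 2 sessions (the episode says "a couple") and a 300-character baseline notes budget (the figure quoted for the bounded-memory condition, which may not be the exact budget used in this experiment). The accuracies and the 25-question count are from the episode.

```python
SCALE = 24
BASE_SESSIONS = 2        # assumption: "a couple of sessions"
BASE_NOTES_CHARS = 300   # assumption: the "about 300" budget quoted earlier
N_QUESTIONS = 25

# name -> (sessions, notes budget in characters, accuracy in percent)
conditions = {
    "baseline": (BASE_SESSIONS, BASE_NOTES_CHARS, 68),
    "longer conversation": (BASE_SESSIONS * SCALE, BASE_NOTES_CHARS, 28),
    "longer conversation, 24x notes": (
        BASE_SESSIONS * SCALE,
        BASE_NOTES_CHARS * SCALE,
        28,
    ),
}

summary = {}
for name, (n_sessions, notes_chars, accuracy_pct) in conditions.items():
    summary[name] = {
        # Compression ratio: how much note space each session gets.
        "chars_per_session": notes_chars / n_sessions,
        # Accuracy converted back to a count of correct answers.
        "correct": accuracy_pct * N_QUESTIONS // 100,
    }
    print(name, summary[name])
```

Under these assumptions the first and third conditions have the same notes-per-session ratio, yet their accuracies differ by forty points. The second and third conditions share the same conversation length and the same accuracy. That pairing is what supports "length, not compression ratio".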
9:58Bella: And that's the practically damning part, right? Because growing the notes forever, in proportion to a conversation that never stops growing — that's not even a viable strategy at scale. You'd be back to storing everything.
10:13Eric: Which is the thing you bounded the memory to avoid in the first place. So three obvious escapes are now closed. The failure is real, a bigger model doesn't fix it, a bigger memory doesn't fix it. Everything that was supposed to make this go away quietly, didn't.
10:30Bella: So what's left is the interesting claim — that supersession isn't a comprehension skill at all, it's a learned policy. A habit the agent has to be optimized to perform: which value to overwrite, which to discard, which to keep. And the last quarter of the paper is Patel building a tool to train exactly that, then showing the axis actually moves. The technical core is next — reinforcement learning, a custom reward, and a method called GRPO — and it pays off in a training curve that switches on at precisely the moment the model learns the behavior, which is the cleanest evidence in the paper that this is real learning and not luck.
11:15Eric: Before we climb into that, let me be honest about where it lands, because it matters for how you read it. This training result is what Patel himself calls a proof of mechanism, not a finished product. One small model, one run. Keep that flag up — we'll come back to exactly how much weight it can bear.
11:37Bella: Fair. So let's build the tool. The diagnosis becomes a reward. Reinforcement learning, at cocktail-party depth, is just this: let the model try something, score the result, and nudge it toward whatever earned a higher score. Practice, score, adjust, repeat. The reward function is simply the rule that hands out the score.
12:00Eric: And the whole novelty is what they choose to reward. Think of training a dog with treats — except the treat isn't for fetching a newspaper, it's specifically for fetching the newest newspaper. The reward here is: a point if your final answer carries the current value of the fact, a penalty if you assert a value you should know is stale. It rewards temporal currency directly.
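One way such a reward could look, as a hedged sketch rather than the paper's implementation. The function name, the reward magnitudes (+1, -1, 0), and the rule that a current-value match takes precedence are all assumptions. The substring matching mirrors the lenient matcher the episode describes later.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is lenient."""
    return " ".join(text.lower().split())


def supersession_reward(
    answer: str,
    current: str,
    stale: Iterable[str],
    penalty: float = -1.0,  # assumed magnitude, not from the paper
) -> float:
    """Score an answer by whether it carries the current value of a fact."""
    ans = normalize(answer)
    if normalize(current) in ans:
        return 1.0      # the live value is present
    if any(normalize(old) in ans for old in stale):
        return penalty  # a superseded value was asserted
    return 0.0          # neither current nor stale, e.g. "I don't know"


print(supersession_reward("She lives in the suburbs now.", "suburbs", ["downtown"]))
print(supersession_reward("Rachel lives downtown.", "suburbs", ["downtown"]))
print(supersession_reward("I have no information about Rachel.", "suburbs", ["downtown"]))
```

In this sketch an answer that mentions both values ("she moved from downtown to the suburbs") still earns the full point. A stricter variant could penalize that too; which is right depends on how the environment defines "asserting" a stale value.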
12:27Bella: Why does "directly" matter so much? Isn't that what other memory work does?
12:32Eric: No, and this is the cleanest way to place the paper. There's a small active line of work on using reinforcement learning to improve agent memory. One line rewards final task success. Another rewards whether the retrieved evidence was relevant. Useful — but they're proxies. They reward what to answer. None of them reward which version of the fact is the live one. Patel's line for it: prior work learns what to answer; this environment learns which version is current. The supersession signal is the reward, not a stand-in for it.
13:07Bella: And there was a precursor — a "forgetting-aware" metric that penalized leaning on outdated memory.
13:13Eric: Right, but that existed as a score you'd apply to a frozen model after the fact. A benchmark you could fail. The move here is turning that measurement into a training target you can actually optimize against. That's the shift — from a number you observe to an objective you push on.
13:32Bella: Now, there's a data decision underneath this that I think is genuinely instructive, because it explains why this bug went under-measured for so long. Patel first built the obvious thing: a generator of templated update timelines. "I live in X… now I live in Y." Clean, controlled. And with the full history in context, frontier models score 100% on those.
13:55Eric: A hundred. So the easy version is useless as a test.
13:58Bella: Useless as a diagnostic. It's a spell-checker test where every misspelling is already flagged in bright red — it tells you nothing about whether the proofreader can catch errors on their own. The model just scans for the last mention of the fact and wins. Real conversations bury the update in paraphrase and small talk, with no flag waving. That's why the diagnosis had to run on messy real data — and why the field probably missed this. The clean benchmark was hiding the bug.
14:30Eric: But here's the move that makes that synthetic data earn its keep. It's too easy for a frontier model reading full history — but for a small model that has to maintain the fact in bounded memory, it's an unsolved behavior. So Patel demotes it from a test to a curriculum. Training practice.
14:48Bella: So the structure of the training experiment is a real transfer test, not a tautology.
14:54Eric: That's the design. Train the model on synthetic episodes — six to eight sessions, one fact introduced and later superseded, buried among distractor sessions. Then evaluate on the real, held-out LongMemEval conversations it never saw in training. Two different worlds. If it transfers, you've learned a policy, not memorized a template.
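A sketch of what a generator for such training episodes might look like. The structure (six to eight sessions, one fact introduced and later superseded, distractors in between) and the "I live in X, now I live in Y" template come from the episode. The `Episode` class, the city list, and the distractor sentences are hypothetical placeholders.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    sessions: List[str]
    question: str
    current: str  # value the answer should carry
    stale: str    # superseded value the answer should not assert


# Hypothetical filler; the real environment's content is not described here.
CITIES = ["Austin", "Denver", "Lisbon", "Osaka", "Toronto"]
DISTRACTORS = [
    "Can you suggest a pasta recipe?",
    "Help me draft an email to my landlord.",
    "What is a good stretching routine?",
    "Remind me how compound interest works.",
]


def make_episode(rng: random.Random) -> Episode:
    n_sessions = rng.randint(6, 8)
    old, new = rng.sample(CITIES, 2)
    # Pick two distinct positions; the earlier one introduces the fact.
    intro_idx, update_idx = sorted(rng.sample(range(n_sessions), 2))
    sessions = []
    for i in range(n_sessions):
        if i == intro_idx:
            sessions.append(f"I live in {old}.")
        elif i == update_idx:
            sessions.append(f"Now I live in {new}.")
        else:
            sessions.append(rng.choice(DISTRACTORS))
    return Episode(sessions, "Where do I live?", current=new, stale=old)


episode = make_episode(random.Random(0))
for number, text in enumerate(episode.sessions, start=1):
    print(number, text)
```

Because the held-out evaluation uses real LongMemEval conversations rather than these templates, a gain there cannot come from memorizing the template wording.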
15:16Bella: And the method is GRPO. The one thing worth knowing about it?
15:19Eric: Traditional reinforcement learning trains a second model — a critic — to judge how good each answer was. GRPO skips the critic. It generates a whole batch of answers to the same question and grades each one against the batch average. Better-than-average gets reinforced, worse-than-average gets discouraged. Instead of hiring an external examiner, you grade each student's essay against the rest of the class that day. It pairs perfectly with a clean automatic scoring rule — no judge model needed, the environment scores itself programmatically.
15:55Bella: And it has an emergent property that turns out to matter.
15:58Eric: It does. Once every answer in the batch is already correct, there's no spread left — every essay nailed it. The relative signal goes to zero. There's nothing to learn from, so training stops itself. Keep that in mind, it'll matter in a second.
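The group-relative scoring can be sketched in a few lines. This follows the general GRPO idea of centering each reward on its group mean and scaling by the group's standard deviation. The small `eps` term and the use of the population standard deviation are assumptions of this sketch, not details confirmed for the paper's run.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled answer relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: two answers carried the current value, two did not.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Solved group: every answer is correct, so there is no spread.
solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(solved)
```

In the mixed group the correct answers get positive advantages and the wrong ones negative. In the solved group every advantage is zero, so there is no gradient signal left, which is the self-terminating behavior described above.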
16:15Bella: So here's the payoff, and I want to frame it as a prediction first. If supersession really is a trainable policy, then training on these episodes should push held-out accuracy up — and the gain should appear right when the model has actually acquired the behavior, not before. So, the small open model — a 3-billion-parameter model, because you can't fine-tune the frontier ones — starts at 9% on the real held-out set. After training: 16.7%.
16:47Eric: Roughly nine to roughly seventeen. Nearly double.
16:51Bella: Nearly double. In raw counts, that's going from seven of seventy-eight questions right to thirteen. And now the part that's the actual evidence. Watch the curve. On the synthetic training episodes, the reward climbs from 0.66 to 0.97 and then the run self-terminates around step 175 — because, exactly as Eric said, the batch got solved and the signal vanished. Meanwhile the held-out accuracy on real data stays flat through step 100 — flat, nothing — and then it switches on. Twelve-point-eight, then sixteen-point-seven. It rises monotonically, and it turns on precisely when the model has acquired the behavior.
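A quick check of the held-out numbers. The counts 7 and 13 out of 78 are stated above. The middle count of 10 is an inference from the quoted 12.8%, not a figure given in the episode.

```python
HELD_OUT_QUESTIONS = 78

# Correct answers at each point; "first rise" is inferred from 12.8%.
correct = {"base": 7, "first rise": 10, "final": 13}

percent = {
    name: round(100 * count / HELD_OUT_QUESTIONS, 1)
    for name, count in correct.items()
}

# How many percentage points a single question is worth at this sample size.
points_per_question = round(100 / HELD_OUT_QUESTIONS, 2)

# Relative improvement from base to final.
ratio = correct["final"] / correct["base"]

print(percent)
print(points_per_question)
print(round(ratio, 2))
```

The counts reproduce 9.0%, 12.8%, and 16.7%, and the final count is about 1.86 times the base count. Each question is worth roughly 1.28 points, so the whole gain is six questions.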
17:35Eric: That timing alignment is the whole argument. If this were a lucky seed or the test harness leaking, you wouldn't see the held-out gain wait, and then arrive in lockstep with the learning. The shape of the curve is doing the work that a single number can't.
17:53Bella: So the four-question arc closes. Does it exist — yes. Bigger model — no. Bigger memory — no. Training — yes. The two axes that should have helped didn't move, and the one built to be trained, moved.
18:08Eric: And now I want to spend the flag I planted, because this is where I think honesty earns the channel its keep. That "nearly doubles" headline rests on a single training run. One seed. Statistical significance of the training gain is not formally established — we're talking about thirteen correct answers out of seventy-eight. Patel says multi-seed runs are in ongoing work, which is the right thing to be doing, but it means "trainable" is, right now, suggestive rather than settled.
18:41Bella: Though the monotonic curve is a real defense against the lucky-seed worry.
18:45Eric: It is, and I'll grant it — it's genuine evidence of learning. But it's not a substitute for repetition. And there are two more cracks worth naming. First, the grading. Both the base and trained models are scored by a deliberately lenient automatic matcher — substring matching, with a fallback. Patel's defense is correct as far as it goes: since base and trained are graded identically, the delta is fair. But the absolute numbers — nine percent, sixteen-point-seven — should be read as "matcher-graded," and with counts this small, the result is sensitive to a handful of borderline calls.
19:24Bella: And the second crack?
19:26Eric: The scale experiment — the gorgeous twenty-eight-to-twenty-eight result. When the conversation grows twenty-four times, what's actually growing is the distractor sessions around the tracked fact. The conversation gets longer, but the number of updates to that fact doesn't. Patel flags this himself, to his credit — growing surrounding noise and growing the chain of updates both stress memory, but they're not the same stressor. So the clean "it's scale, not size" headline is technically about one specific kind of scale: more noise, not more revisions. And that null result — the no-recovery finding — is measured on just twenty-five questions. The forty-point collapse is robust at that size; the precise "exactly zero recovery" is thinly powered.
20:15Bella: So the steelman is: the diagnosis is strong, well-controlled, and significant — the gap is real, it survives a bigger model, it survives a bigger memory. The training conclusion is the genuinely new direction, but it's a first data point, not a closed case.
20:33Eric: That's where I land. And I don't think Patel would argue — the paper says as much. Sixteen-point-seven percent on a 3-billion model is nowhere near deployable. The thing being claimed isn't "we solved agent memory." It's "here is the axis nobody was optimizing, and here is the first concrete evidence it moves." Those are very different claims, and the paper is careful to make the smaller one.
20:58Bella: And it ships as a tool, not just a finding. The environment, the trained model, the dataset — released open, on standard reinforcement-learning infrastructure. So anyone building a production memory system inherits both a diagnosis and something they can train against, instead of only a benchmark they can fail.
21:19Eric: Which is the part I think outlives the specific numbers. Before this, "keep your facts current" lived in the comprehension column — assumed to be a side effect of a smart enough model. This relocates it to the memory-policy column. A behavior you have to deliberately optimize, the same way you'd train any other skill.
21:39Bella: So here's the real takeaway, bigger than the method. The reflex across this whole field is that memory failures get absorbed by the next bigger model or solved by allocating more storage. This paper closes off both of those bets with controls — and what's left is uncomfortable. Keeping an assistant's facts current isn't something scale hands you for free. It's a policy the model has to learn: which version of a fact is live, and which to throw away. Competence and current memory are two different things — your assistant can fully understand that you moved, and still answer from the old address.
22:18Eric: And that reframes the work in front of every lab that shipped memory this year. The question stops being "how big should the notes field be" and starts being "what are we training the agent to do with it."
22:32Bella: So here's what I'd put to you. Given that bigger and more both failed — is the right path the one this paper takes, teaching the model a currency policy with a reward that targets which fact is live? Or is the self-written sticky note itself the wrong design, and we should move to a memory architecture that tracks versions and timestamps explicitly, instead of asking a model to remember to overwrite itself? Train the habit, or change the substrate — tell us which way you lean, and why.
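To make the "change the substrate" option concrete, here is a minimal sketch of a memory that tracks versions explicitly. It is purely hypothetical: the class names, the key scheme, and the use of a session index as a timestamp are assumptions for illustration, and nothing like this is described in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FactVersion:
    value: str
    session_index: int  # stand-in for a timestamp


@dataclass
class VersionedMemory:
    """Keeps every version of a fact; 'current' is the most recent write."""

    history: Dict[str, List[FactVersion]] = field(default_factory=dict)

    def write(self, key: str, value: str, session_index: int) -> None:
        # Nothing is overwritten; a new version is appended.
        self.history.setdefault(key, []).append(FactVersion(value, session_index))

    def current(self, key: str) -> Optional[str]:
        versions = self.history.get(key)
        if not versions:
            return None
        return max(versions, key=lambda v: v.session_index).value


memory = VersionedMemory()
memory.write("rachel.home", "downtown", session_index=1)  # invented starting fact
memory.write("rachel.home", "the suburbs", session_index=5)
print(memory.current("rachel.home"))
```

This makes "which version is live" a lookup instead of a habit, but it moves the hard part elsewhere: something still has to notice, in messy conversation, that a sentence is an update to `rachel.home` and not a new, unrelated fact.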
23:06Eric: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, including LongMemEval and the forgetting-aware metric this builds on, plus our weekly and monthly roundups.
23:23Bella: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Eric and I are both AI voices from ElevenLabs, and the producer isn't affiliated with Anthropic or ElevenLabs. The paper is "Supersede," by Vedant Patel, posted June 25th, 2026; we recorded this on June 29th.
23:43Eric: The model knew Rachel moved. It just never updated the note. Figuring out how to teach that one habit — that's the work that's left.