Episode 192 · Jul 02, 2026 · 22 min

A 32B Open Model Matched Frontier Systems By Learning to Take Notes

Wu, Zhu, Zhang et al.

LLM Agents Memory Management
Paper: AutoMem: Automated Learning of Memory as a Cognitive Skill
Venue: arXiv:2607.01224
Year: 2026
Read the paper: arxiv.org/abs/2607.01224

A mid-sized open model pulled level with frontier proprietary systems on grueling long-horizon games without getting one bit smarter: it just learned to manage its own memory. AutoMem treats note-taking as a trainable skill and uses a stronger model to review hundred-thousand-step transcripts no human could read. You'll come away with a concrete case that on long tasks, memory discipline may beat raw scale, and a sharp sense of where that claim wobbles.

What you'll take away

  • What 'learning to take notes' means as a trainable skill: knowing what to write down, when to check notes, and how to organize so future-you can find things
  • The trick that makes memory auditable: turning read/write/search into first-class logged actions in the trace
  • The map fix: an append-only file bloating at 138 characters per step, cut to 6 with a coordinate-keyed upsert, letting the agent survive thousands of steps instead of hundreds
  • Why better memory paradoxically makes the model read less: up to 30% fewer input tokens per step
  • The headline comparison: a scaffolded 32B beats the same-family 72B on all three games and lands near frontier proprietary systems
  • The critique: how much of the gain is 'the agent learned a skill' versus a stronger model writing better code and filtering data for a smaller one

Chapters

  1. 00:18 Fix the notebook, not the brain
  2. 01:48 Memory as skill, not plumbing
  3. 04:28 How do you see a memory decision?
  4. 05:39 The map that went from drowning to saving
  5. 08:52 Training the note-taker without touching the player
  6. 11:58 Did the reflex actually take?
  7. 13:12 Note-taking beats doubling the parameters
  8. 16:01 Where the claim gets shaky
  9. 19:26 Where would you spend your next dollar?


0:00Juniper: A 32-billion-parameter open model just pulled level with frontier proprietary systems on three punishingly long games, and it didn't get smarter to do it. It learned to take better notes.

0:11Finn: Quick note before we start: this is an AI-made explainer, both voices included. And that opening line is the whole paper in one breath. Same base model, same reasoning, no scale-up. They just taught the agent how to manage its own memory, and doubled, tripled, in one case nearly quadrupled its performance. By the end of this you'll understand exactly what "learning to take notes" means as a trainable skill, and why a mid-sized open model reaching frontier level this way is a bigger deal than the number alone suggests.

0:45Juniper: And here's the part that should make you squint. The way they train this involves reading the agent's entire play-through to find where its memory went wrong. In one of these games a single episode runs to a hundred thousand steps. No human is reading that. So the obvious question is: how do you supervise a skill when the evidence of doing it badly is buried in a transcript nobody can read?

1:10Finn: Which is exactly why this matters beyond one paper. For a few years now the default answer to "make agents better at long tasks" has been: bigger model, longer reasoning. This is a bet that the real bottleneck on long-horizon work isn't the thinking. It's the agent losing track of what it already knows and already tried. Fix the notebook, not the brain.

1:33Juniper: The paper's called AutoMem: Automated Learning of Memory as a Cognitive Skill, out of Stanford, posted July first, 2026. So let's start with why memory is a bottleneck at all. A language model doesn't actually remember anything between one step and the next. Everything it can reason about right now has to fit inside its context window. Think of it as the scratch paper you're allowed on your desk at once. It's a fixed size. And a long task blows right past it. So when a task runs to tens of thousands of steps, older material has to get thrown out or crushed down.

2:10Finn: Right, and the standard fix has been to give the agent a filing cabinet next to the desk. A retrieval database, say, or a vector store. The field basically treats memory as plumbing: a fixed mechanism you design in, bolt on, and then leave alone.

2:26Juniper: And that's the assumption AutoMem cracks open. Because in people, memory management isn't hardware. It's a skill, one cognitive scientists study in its own right: knowing what's worth writing down, when to go back and check something, how to organize your notes so future-you can actually find things.

2:47Finn: The grad-student-versus-veteran gap.

2:49Juniper: Exactly that. A first-year and a twenty-year researcher both take notes. But the veteran's notes are radically more useful: they don't re-copy what they already have, they key things so they're findable, they know what to skip. That's not raw intelligence. It's a learned skill about managing your own memory. And the paper's premise is: an AI agent can climb that same curve. So memory stops being a gadget you install and becomes a habit you practice.

3:19Finn: Which is a lovely framing, but it smuggles in the hard problem you flagged. If memory is a skill you improve through feedback, what's the feedback? Because a memory mistake doesn't announce itself. You fail to record a coordinate at step 50, and it doesn't bite you until step 800 when you're lost and re-exploring ground you already covered.

3:41Juniper: And that's the wall. The learning signal for this skill is, for practical purposes, beyond human review. Nobody's auditing a hundred-thousand-step trace to find the one bad write. So before we get to the clever part, I want to flag the tension we'll come back to at the end: the reviewer that solves this problem is itself a stronger model than the one being trained. Hold that thought, because how much of the gain is "the agent learned a skill" versus "a stronger model wrote better code for it" is the sharpest question in the paper.

4:19Finn: Noted. So — how do you even see a memory decision to review it?

4:24Juniper: This is the move everything else rests on. Normally an agent has task actions: move north, attack, craft a stone pickaxe. AutoMem takes memory operations (read, write, search, append, create a file) and drops them into the same menu. The exact same decision step that could pick "go east" can instead pick "append to the dungeon map" or "search my inventory notes."

4:49Finn: So a memory decision becomes a logged event in the trace, like any other move.

4:55Juniper: That's the unlock. Once "write to my map" is a first-class action, it's visible. It's in the transcript. What got written, what got searched, what got buried under duplicates: all of it is now an auditable event instead of something happening invisibly inside the machinery. The agent runs two little routines each step. One asks "what's worth recording about what just happened?" That's logging. The other asks "what do I need to recall to act right now?" That's planning. Both are out in the open.
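
A minimal sketch of the idea just described, memory operations sitting in the same action menu as game moves and landing in the same log. The `Agent` class, its method names, and the `dungeon_map` file name are hypothetical illustrations, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    files: Dict[str, List[str]] = field(default_factory=dict)
    trace: List[dict] = field(default_factory=list)

    def step(self, action: str, **args) -> object:
        """Dispatch one action and log it; memory ops and game moves share one menu."""
        handlers = {"move": self._move, "write": self._write, "search": self._search}
        result = handlers[action](**args)
        self.trace.append({"step": len(self.trace), "action": action, "args": args})
        return result

    def _move(self, direction: str) -> str:
        return f"moved {direction}"  # placeholder for a real environment call

    def _write(self, file: str, text: str) -> None:
        self.files.setdefault(file, []).append(text)

    def _search(self, file: str, query: str) -> List[str]:
        return [line for line in self.files.get(file, []) if query in line]


agent = Agent()
agent.step("move", direction="east")
agent.step("write", file="dungeon_map", text="(3, 4): floor")
hits = agent.step("search", file="dungeon_map", query="(3, 4)")

# Every memory decision is now an ordinary, reviewable log entry.
memory_events = [e for e in agent.trace if e["action"] in {"write", "search"}]
```

Because the write and the search appear in `agent.trace` next to the move, a reviewer reading the log can see what was stored and what was looked up at each step.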

5:31Finn: And that's what makes the reviewer possible at all. You can't critique what you can't see; you can critique a log.

5:38Juniper: Right. Now, the technical core is two nested loops, and they pay off in the cleanest concrete example in the paper: a map file that goes from drowning the agent to saving its life. Picture two concentric feedback cycles sharing one agent in the middle. The outer loop rewrites the agent's tools. The inner loop retrains a piece of the agent's brain. Renovate the kitchen, then train the cook.

6:06Finn: Let's take the kitchen first — the tools.

6:09Juniper: The first loop is what they call scaffold optimization. It targets structure: the prompts, the file formats, the actual operations available. And the reviewer here is a strong model (they use Opus) handed the entire episode trace plus the agent's own code. Not a score. The whole execution log. It reads it like a senior engineer reading your commit history and going, "here, at step 50, you buried a useful value under duplicates, and that's why you got lost at step 800."

6:43Finn: The distinction between that and a reward number is the whole reason it works, isn't it. A final score says "you got forty percent." Useless for repair — it doesn't tell you where.

6:56Juniper: That's the crux. A scalar reward can't locate a memory bug. A reviewer reading the full trace can. So watch what it actually does. This is the map. In NetHack, the roguelike that runs to a hundred thousand steps, the starting agent keeps a file called dungeon map, and it's append-only. Every single time the agent walks past a tile it's seen before, it writes a brand-new line: "there is floor here." Revisit it ten more times, ten more lines.

7:28Finn: So it's the travel journal where every time you pass the same café you write "there is a café here" again.

7:35Juniper: And after a week the journal has ten thousand entries and you can't find anything. The useful stuff is drowning. The reviewer reads the trace, diagnoses exactly that, and introduces a new operation: an upsert on the map, keyed by coordinate. Now revisiting a tile updates the one line for that spot instead of adding another. On screen you can watch the file go from this bloated wall of duplicates to one clean entry per location. Per-step growth of that map file drops from 138 characters to 6. More than ninety-five percent smaller.
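
To make the contrast concrete, here is a small sketch of an append-only map next to a coordinate-keyed upsert. The function names and the in-memory list and dict are assumptions for illustration; the paper's file format is not given in the episode.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def append_only(log: List[str], coord: Coord, tile: str) -> None:
    """Original behavior as described: every visit adds a new line."""
    log.append(f"{coord}: {tile}")


def upsert(tiles: Dict[Coord, str], coord: Coord, tile: str) -> None:
    """Revised behavior: one entry per coordinate, overwritten on revisit."""
    tiles[coord] = tile


# Ten visits to the same floor tile, then one visit to a door.
visits = [((3, 4), "floor")] * 10 + [((3, 5), "door")]

log: List[str] = []
tiles: Dict[Coord, str] = {}
for coord, tile in visits:
    append_only(log, coord, tile)
    upsert(tiles, coord, tile)
```

After the loop, the append-only log holds one line per visit while the upserted map holds one entry per location, which is the shape of the shrinkage described above.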

8:10Finn: And the payoff isn't tidiness for its own sake.

8:14Juniper: No. That one fix lets the agent survive thousands of steps where it used to die within a few hundred. The map stops burying the information the agent needs to not get lost. And every rewrite like this has to earn its place: the revised agent replays the same fixed random seeds, and the change is kept only if average progress actually improves. It runs a handful of rounds until code revision has nothing left to give.
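
The acceptance rule described here, replay on fixed seeds and keep the change only if average progress improves, can be sketched as follows. `run_episode` is a hypothetical stand-in for a real game rollout, and the skill values are made up for the example.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def run_episode(skill: float, seed: int) -> float:
    """Hypothetical rollout: returns progress in [0, 1], with seeded noise."""
    rng = random.Random(seed)
    return min(1.0, max(0.0, skill + rng.uniform(-0.05, 0.05)))


def keep_revision(
    old_agent: Callable[[int], float],
    new_agent: Callable[[int], float],
    seeds: Sequence[int],
) -> bool:
    """Accept the revision only if mean progress on the same seeds strictly improves."""
    old_avg = mean(old_agent(s) for s in seeds)
    new_avg = mean(new_agent(s) for s in seeds)
    return new_avg > old_avg


SEEDS = [0, 1, 2, 3, 4]  # fixed, so both agents face identical randomness


def baseline(seed: int) -> float:
    return run_episode(0.3, seed)


def revised(seed: int) -> float:
    return run_episode(0.5, seed)


accepted = keep_revision(baseline, revised, SEEDS)
```

Holding the seeds fixed means the comparison reflects the code change and not luck in the rollouts.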

8:42Finn: So loop one squeezes structure dry. But there's a ceiling on what a tool can do, right? You can build the most beautiful labeled-drawer kitchen in the world and the cook can still grab the wrong pan.

8:56Juniper: That's exactly the seam into loop two, Finn — and it's your half.

9:01Finn: So here's the limit of the scaffold. You can write a prompt that says "check your existing notes before you write a new one." You cannot make the model actually do it. A prompt is an instruction you're told to follow; it isn't a reflex. Loop two bakes the reflex into the weights. And the way they train it is the part I keep coming back to, because it's easy to misread. When they "train a memory specialist," the reviewer, again, is not inventing correct answers and teaching them. It reads a pool of episodes and picks out which of the agent's own responses were good memory decisions. Every training example is verbatim text the agent itself produced. The stronger model is a filter on the smaller model's behavior, not a teacher writing new answers. It's reinforcing the agent's own better instincts.
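
The filter-not-teacher distinction can be sketched in a few lines: the reviewer only selects from the agent's own turns and never rewrites them. The `Turn` type and the `toy_reviewer` rule are hypothetical; the real reviewer is a strong model making a judgment, not a string check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    step: int
    response: str  # verbatim text the agent itself produced


def select_examples(pool: List[Turn], reviewer_approves: Callable[[Turn], bool]) -> List[Turn]:
    """Keep approved turns unchanged; nothing is rewritten or generated."""
    return [turn for turn in pool if reviewer_approves(turn)]


def toy_reviewer(turn: Turn) -> bool:
    # Hypothetical rule standing in for the reviewer's judgment:
    # approve responses that search before they write.
    return turn.response.startswith("search")


pool = [
    Turn(50, "write dungeon_map '(3, 4): floor'"),
    Turn(51, "search dungeon_map '(3, 4)' then upsert"),
    Turn(52, "search inventory 'pickaxe'"),
]
training_set = select_examples(pool, toy_reviewer)
```

Every item in `training_set` is an element of `pool`, so the training text is the agent's own output by construction.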

9:50Juniper: That distinction is load-bearing for the whole "it learned a skill" claim. If the teacher were generating the answers, this would obviously just be one model copying another.

10:00Finn: Right. The claim only holds because the source material is the agent's own output. Now, the training itself uses parameter-efficient fine-tuning: instead of retraining the whole 32-billion-parameter model, you bolt on a small adapter and tune only that. Cheap, and it doesn't disturb the base model. And the reviewer jointly picks both the training data and the recipe, because a dataset and a training configuration have to match each other.

10:25Juniper: Here's where I want to slow you down, because this is the piece that's easy to muddle. You've now got a trained memory specialist. But the still has to, you know, play the game. So which model is driving?

10:39Finn: Both, and this is the architecturally slippery bit, so let me use the picture. At runtime there are two instances sharing one running conversation. The base model, untouched and still great at picking moves, commits the actual world actions. The adapter-tuned specialist handles the memory: the reading, the searching, the writing. The clean way to hold it: one hand takes notes, the other plays the game. You've got a chess player with a trained assistant beside them whose only job is to keep and consult the notes. You retrain the note-taker to be excellent, and you never touch the player's game sense. So the player never gets one bit worse at chess.
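
A rough sketch of the runtime split described here: two callables share one transcript, one handles memory and the other commits the move. The ordering within a step (memory first, then action) and all names are assumptions for illustration.

```python
from typing import Callable, List

# A model reads the shared transcript and returns one line of output.
Model = Callable[[List[str]], str]


def run_step(transcript: List[str], note_taker: Model, player: Model) -> str:
    """One turn: the specialist handles memory, then the base model picks the move."""
    transcript.append("memory: " + note_taker(transcript))
    move = player(transcript)
    transcript.append("action: " + move)
    return move


def toy_note_taker(transcript: List[str]) -> str:
    return f"search dungeon_map (seen {len(transcript)} lines)"


def toy_player(transcript: List[str]) -> str:
    # The player sees the note-taker's latest entry because the transcript is shared.
    return "go east" if transcript[-1].startswith("memory:") else "wait"


shared: List[str] = ["obs: a corridor leads east"]
move = run_step(shared, toy_note_taker, toy_player)
```

Only `note_taker` would be swapped for a trained specialist in this picture; `player` stays as it was.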

11:21Juniper: And that's why the memory gains stack cleanly instead of trading off. Because the base model's action selection is never in the training loop, improving the notes can't corrupt the moves.

11:33Finn: The one honest caveat on the analogy: it's not two separate people. It's two instances of the same underlying model sharing one transcript. So the hand-off is tighter and more seamless than passing notes across a table. But the "trained note-taker, untouched player" intuition is exactly right about what's trained and what isn't.

11:54Juniper: So does the reflex actually take? Because the scaffold could already prompt "consult before you write." The question is whether training turns that prompt into a habit.

12:05Finn: This is the cleanest evidence in the paper that something was learned, not just told. They measure the ratio of memory writes to searches. A high ratio means the agent is writing blindly, dumping new content without checking what it already has. If training instilled the habit, that ratio should fall: the agent should search its files before appending. And it falls in every environment. In one environment it drops from 4.66 down to 1.31, a seventy-two percent cut. The behavior the scaffold could only encourage becomes a behavior the model has.
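
A short sketch of how a write-to-search ratio could be computed from a logged action list, plus the arithmetic behind the quoted drop. The action labels are assumed; only the 4.66 and 1.31 figures come from the episode.

```python
from collections import Counter
from typing import Iterable


def write_search_ratio(actions: Iterable[str]) -> float:
    """Memory writes per memory search; infinite if the agent never searches."""
    counts = Counter(actions)
    if counts["search"] == 0:
        return float("inf")
    return counts["write"] / counts["search"]


# Toy trace: four writes, two searches, one game move.
toy_ratio = write_search_ratio(["write"] * 4 + ["search"] * 2 + ["move"])

# Figures quoted in the episode.
before, after = 4.66, 1.31
relative_drop = (before - after) / before  # about 0.72
```

A ratio near 1 means roughly one search per write, which is what "check before you append" looks like in a log.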

12:40Juniper: So, recap, because we've built the whole object now. Memory becomes visible by making it an action. Loop one rewrites the tools until code revision is tapped out; that's the map fix. Loop two retrains a small memory specialist on the agent's own best decisions, and parks it beside an untouched player. The question left is the one that pays for all of it: how much does this actually move the needle?

13:07Finn: So let's put a number on it. Set the prediction first: if better memory really is a high-leverage bottleneck, then optimizing memory alone, with the weights untouched during the scaffold phase, should move performance a lot, not a little. And it does. Crafter, the shorter survival game, goes from twenty-five percent to forty-seven; call it doubled. A second game more than triples. NetHack nearly quadruples. Then the proficiency training adds another layer on top of each.

13:38Juniper: And that's before we get to the comparison that actually made me sit up.

13:42Finn: This is the one. Take the same model family and just double the size: 72 billion parameters instead of 32. The scaffolded 32B beats the 72B by a wide margin on all three games. Better note-taking beat more than doubling the parameter count. And it lands right around the level of the frontier proprietary models, within a few points on these tasks. A mid-sized open model, matching frontier proprietary systems, purely on memory discipline.

14:12Juniper: There's a second result I find almost more telling than the headline, because it's counterintuitive. You'd expect better memory to mean better retrieval: the agent finds what it stored. But the fingerprints of the optimization show something else. Steps where the agent is just stuck or pacing back and forth drop by a third to two-thirds. Redundant writes drop up to eighty-plus percent. And per-step input tokens shrink by up to thirty percent.

14:41Finn: Wait — shrink? Better memory makes the model read less?

14:46Juniper: That's the irony. You'd assume a richer memory system means more to carry, more to attend to every step. It's the reverse. When the map is deduplicated and the notes are lean, there's less garbage in the context — so the model has less to wade through, not more. Good note-taking doesn't just help you remember. It lightens the load. The tidy notebook is smaller than the messy one.

15:14Finn: And you can watch it in the play-throughs, which are the most concrete thing in the paper. In Crafter the base agent loops endlessly gathering wood: two achievements out of twenty-two. The evolved scaffold crafts stone tools, builds a furnace, mines iron: twelve. The trained specialist does all that and remembers to feed itself: thirteen. In NetHack the base agent dies at experience level one within a few hundred steps. The scaffold version survives around seven thousand steps and reaches level two. The trained one survives far longer and reaches level four.

15:53Juniper: So the story holds together beautifully. Which is exactly when I want you to push on it, because you've been sitting on the reservation since the top.

16:04Finn: I have. And I want to state it in its strongest form, because the paper is good enough to deserve the real objection, not a soft one. Start with NetHack. The multiplier is "nearly quadrupled," and that's true. But the absolute numbers are 0.42 percent to 1.57. Everyone is failing hard at NetHack; even the frontier models top out around two to seven percent. A four-x gain on a task where this agent still finishes under two percent is a much weaker claim than the same multiplier on Crafter, which actually gets to fifty. And the abstract folds those very different regimes into one tidy "two-to-four-x." On Crafter, I buy it completely. On NetHack, the honest read is: everybody's drowning, and this agent drowns slightly less.
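
The objection turns on the gap between a relative multiplier and an absolute change. A quick calculation with the figures quoted in the episode (the `gain` helper is just for illustration):

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (multiplier, absolute change in percentage points)."""
    return after / before, after - before


# Progress percentages as quoted in the episode.
nethack_mult, nethack_abs = gain(0.42, 1.57)
crafter_mult, crafter_abs = gain(25.0, 47.0)

print(f"NetHack: x{nethack_mult:.2f}, +{nethack_abs:.2f} points")
print(f"Crafter: x{crafter_mult:.2f}, +{crafter_abs:.2f} points")
```

The larger multiplier belongs to the game with the much smaller absolute movement, which is why a single "two-to-four-x" summary hides the difference between the two regimes.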

16:56Juniper: That's fair. The results on the other two games carry the thesis; NetHack is more "moved a tiny needle."

17:03Finn: And then the deeper one, the tension you planted at the start. The engine of every improvement here is Opus. Opus reads the traces and rewrites the code. Opus curates the training data. So you can tell two stories about the gain. Story one: the 32B model learned a memory skill. Story two: capability leaked downhill. A stronger model wrote better code and filtered better data for a smaller one, and we dressed it in cognitive-skill language.

17:31Juniper: The paper's rebuttal is the filter-not-teacher point, though: the training data is the agent's own verbatim output.

17:38Finn: And that rebuttal genuinely covers loop two. The specialist is trained on the 32B's own decisions, selected, not generated. Fair. But it does not cover loop one. The scaffold rewrites (the map upsert, the new operations, the schema) are pure reviewer authorship. That's a stronger model writing code for a weaker one. And a lot of the headline gain lives in that first loop. So "the agent learned a skill" is cleanest for the training half and shakiest for the structure half.

18:09Juniper: And there's the "we only touched memory" wrinkle underneath that.

18:13Finn: Right, that's the sharpest version. They claim they optimized memory alone. But the scaffold edits also do things like block impossible craft actions and reject repeated do-nothing moves. That's shaping task strategy directly, not just memory. They frame it as a feature: better memory improves behavior as a side effect. A skeptic reads it the other way: some of those edits are task-strategy interventions wearing a memory label. Add that they tune a separate scaffold and specialist for each game, with hand-picked stopping points and different configs, so what's demonstrated is bespoke per task, and they concede generalization across environments is untested.

18:55Juniper: I'll concede the shape of all of that. That ambiguity is real and it isn't unique to this paper: any time a strong model improves a weak one, you can't fully separate "it learned" from "capability flowed downhill." The per-game tuning is a genuine limit they own. And the "only memory" claim is looser than the framing implies.

19:18Finn: But, and this is where I don't get to keep the last word, even granting every one of those, the reviewer-of-full-traces idea survives. Because the thing that's actually new here isn't the memory result. It's the method: hand a strong model an entire execution log, up to a hundred thousand steps, a scale no human can review, and let it diagnose and revise where things went wrong. Memory is just the first thing they pointed it at.

19:47Juniper: And that's the real takeaway, bigger than the algorithm. For years the reflex has been: long tasks failing? Bigger model. This paper is evidence that on long-horizon work, how an agent manages its notes can be higher-leverage than how big its brain is. The bottleneck may sit in information management, not reasoning, which means the next gains might come from better memory discipline and cheaper models that keep good records, not from another scale-up. A 32B open model matching frontier systems on note-taking alone is what that reframing looks like when it's true.

20:24Finn: So here's where I'll leave you. If you were building a long-horizon agent tomorrow, where do you put your next dollar? Into a bigger, smarter model? Or into teaching a smaller one to manage its own memory, and using a strong reviewer to audit the traces humans can't read? Those pull in genuinely different directions. Tell us which way you lean, and what breaks first in the version you'd ship.

20:50Juniper: The full annotated version of this episode is on paperdive.ai, with every technical term tap-to-define and links to the related work grouped by theme, including the memory-as-action papers this one builds on.

21:06Finn: Quick housekeeping: this script was written by one of Anthropic's models, Juniper and I are AI voices from Eleven Labs, and the producer isn't affiliated with Anthropic or Eleven Labs. The paper is AutoMem: Automated Learning of Memory as a Cognitive Skill, out of Stanford, posted July first, 2026, and we're recording the day after.

21:30Juniper: The trick was never a bigger brain. It was knowing what to write down. See you in the next one.