Episode 192 · Jul 02, 2026 · 22 min

A 32B Open Model Matched Frontier Systems By Learning to Take Notes

Wu, Zhu, Zhang et al.

LLM Agents · Memory Management

AI Papers: A Deep Dive

paperdive.ai

Concepts in this episode

Agentic AI · Training Methods · AI Efficiency & Cost · Agent Memory · Long-Horizon Tasks · Agentic RL · LoRA · Agent Scaffolding · Trajectory Analysis · Context Management · LLM-as-Judge · Iterative Refinement · Supervised Fine-Tuning · Knowledge Distillation · Tool Use

About this episode

Paper: AutoMem: Automated Learning of Memory as a Cognitive Skill

Venue: arXiv:2607.01224

Year: 2026

Read the paper: arxiv.org/abs/2607.01224

A mid-sized open model pulled level with Claude Opus and Gemini on grueling long-horizon games without getting one bit smarter — it just learned to manage its own memory. AutoMem treats note-taking as a trainable skill and uses a frontier model to audit hundred-thousand-step transcripts no human could read. You'll come away with a concrete case that on long tasks, memory discipline may beat raw scale — and a sharp sense of where that claim wobbles.

What you'll take away

What 'metamemory' means as a trainable skill: knowing what to write down, when to check notes, and how to organize so future-you can find things
The trick that makes memory auditable — turning read/write/search into first-class logged actions in the trajectory
The map fix: an append-only file bloating at 138 characters per step, cut to 6 with a coordinate-keyed upsert, letting the agent survive thousands of steps instead of hundreds
Why better memory paradoxically makes the model read less — up to 30% fewer input tokens per step
The headline comparison: a scaffolded 32B beats the same-family 72B on all three games and lands near Claude Opus and Gemini
The steelman critique: how much of the gain is 'the agent learned a skill' versus a frontier model writing better code and filtering data for a smaller one

Chapters

00:18 Fix the notebook, not the brain
01:48 Memory as skill, not plumbing
04:28 How do you see a memory decision?
05:39 The map that went from drowning to saving
08:52 Training the note-taker without touching the player
11:58 Did the reflex actually take?
13:12 Note-taking beats doubling the parameters
16:01 Where the claim gets shaky
19:26 Where would you spend your next dollar?

Full transcript

0:00Juniper: A 32-billion-parameter open model just pulled level with Claude Opus and Gemini on three punishingly long games — and it didn't get smarter to do it. It learned to take better notes.

0:11Finn: Quick heads up before we start — this is an AI-made explainer, both voices included. And that opening line is the whole paper in one breath. Same weights, same reasoning, no scale-up. They just taught the agent how to manage its own memory — and doubled, tripled, in one case nearly quadrupled its performance. By the end of this you'll understand exactly what "learning to take notes" means as a trainable skill, and why a mid-sized open model reaching frontier level this way is a bigger deal than the number alone suggests.

0:45Juniper: And here's the part that should make you squint. The way they train this skill involves reading the agent's entire play-through to find where its memory went wrong. In one of these games a single episode runs to a hundred thousand steps. No human is reading that. So the obvious question is — how do you supervise a skill when the evidence of doing it badly is buried in a transcript nobody can read?

1:10Finn: Which is exactly why this matters beyond one paper. For a few years now the default answer to "make agents better at long tasks" has been: bigger model, longer reasoning. This is a bet that the real bottleneck on long-horizon work isn't the thinking — it's the agent losing track of what it already knows and already tried. Fix the notebook, not the brain.

1:33Juniper: The paper's called AutoMem: Automated Learning of Memory as a Cognitive Skill, out of Stanford, posted July first, 2026. So let's start with why memory is a bottleneck at all. A language model doesn't actually remember anything between one step and the next. Everything it can reason about right now has to fit inside its context window. Think of it as the scratch paper you're allowed on your desk at once. It's a fixed size. And a long task blows right past it. So when a task runs to tens of thousands of steps, older material has to get thrown out or crushed down.

2:10Finn: Right, and the standard fix has been to give the agent a filing cabinet next to the desk. A retrieval database, a vector store, a scratchpad. The field basically treats memory as plumbing — a fixed mechanism you design in, bolt on, and then leave alone.

2:26Juniper: And that's the assumption AutoMem cracks open. Because in people, memory management isn't hardware. It's a skill. Cognitive scientists have a name for it — metamemory. Knowing what's worth writing down, when to go back and check something, how to organize your notes so future-you can actually find things.

2:47Finn: The grad-student-versus-veteran gap.

2:49Juniper: Exactly that. A first-year and a twenty-year researcher both take notes. But the veteran's notes are radically more useful — they don't re-copy what they already have, they key things so they're findable, they know what to skip. That's not raw intelligence. It's a learned skill about managing your own memory. And the paper's premise is: an AI agent can climb that same curve. So memory stops being a gadget you install and becomes a habit you practice.

3:19Finn: Which is a lovely framing, but it smuggles in the hard problem you flagged. If memory is a skill you improve through feedback — what's the feedback? Because a memory mistake doesn't announce itself. You fail to record a coordinate at step 50, and it doesn't bite you until step 800 when you're lost and re-exploring ground you already covered.

3:41Juniper: And that's the wall. The learning signal for this skill is, for practical purposes, beyond human review. Nobody's auditing a hundred-thousand-step trajectory to find the one bad write. So before we get to the clever part, I want to flag the tension we'll come back to at the end — the reviewer that solves this problem is itself a stronger model than the one being trained. Hold that thought, because how much of the gain is "the agent learned a skill" versus "a frontier model wrote better code for it" is the sharpest question in the paper.

4:19Finn: Noted. So — how do you even see a memory decision to review it?

4:24Juniper: This is the move everything else rests on. Normally an agent has task actions — move north, attack, craft a stone pickaxe. AutoMem takes memory operations — read, write, search, append, create a file — and drops them into the same menu. The exact same decision step that could pick "go east" can instead pick "append to the dungeon map" or "search my inventory notes."

4:49Finn: So a memory decision becomes a logged event in the trajectory, like any other move.

4:55Juniper: That's the unlock. Once "write to my map" is a first-class action, it's visible. It's in the transcript. What got written, what got searched, what got buried under duplicates — all of it is now an auditable event instead of something happening invisibly inside the machinery. The agent runs two little routines each step: one asks "what's worth recording about what just happened?" — that's logging — and one asks "what do I need to recall to act right now?" — that's planning. Both are out in the open.

5:31Finn: And that's what makes the reviewer possible at all. You can't critique a black box; you can critique a log.
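
A minimal sketch of the "memory as a logged action" idea described above. The class names, the `dungeon_map` file name, and the two memory operations shown are illustrative assumptions, not the paper's actual interface; the point is only that memory operations pass through the same step function as task actions, so they show up in the trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str   # "task" or "memory"
    name: str   # e.g. "move_east", "append", "search"
    args: dict = field(default_factory=dict)


@dataclass
class Agent:
    files: dict = field(default_factory=dict)        # filename -> list of lines
    trajectory: list = field(default_factory=list)   # every action, in order

    def step(self, action):
        # Every action, memory or task, is logged before it is executed.
        self.trajectory.append(action)
        if action.kind == "memory":
            return self._memory(action)
        return None  # a task action would be sent to the environment here

    def _memory(self, action):
        lines = self.files.setdefault(action.args["file"], [])
        if action.name == "append":
            lines.append(action.args["text"])
            return None
        if action.name == "search":
            return [line for line in lines if action.args["query"] in line]
        raise ValueError(f"unknown memory op: {action.name}")


agent = Agent()
agent.step(Action("task", "move_east"))
agent.step(Action("memory", "append", {"file": "dungeon_map", "text": "(3,4) floor"}))
hits = agent.step(Action("memory", "search", {"file": "dungeon_map", "query": "(3,4)"}))

# The memory decisions are now ordinary entries a reviewer can read back.
memory_events = [a.name for a in agent.trajectory if a.kind == "memory"]
```

Because the log holds task and memory actions in one ordered list, a reviewer can line up a write at one step with a failure many steps later.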

5:38Juniper: Right. Now, the technical core is two nested loops, and they pay off in the cleanest concrete example in the paper: a map file that goes from drowning the agent to saving its life. Picture two concentric feedback cycles sharing one agent in the middle. One loop rewrites the agent's tools. The other retrains a piece of the agent's brain. Renovate the kitchen, then train the cook.

6:06Finn: Let's take the kitchen first — the tools.

6:09Juniper: The first loop is what they call scaffold optimization. It targets structure — the prompts, the file formats, the actual operations available. And the reviewer here is a strong model — they use Claude Opus — handed the entire episode trace plus the agent's own code. Not a score. The whole execution log. It reads it like a senior engineer reading your commit history and going, "here, at step 50, you buried a useful value under duplicates, and that's why you got lost at step 800."

6:43Finn: The distinction between that and a reward number is the whole reason it works, isn't it. A final score says "you got forty percent." Useless for repair — it doesn't tell you where.

6:56Juniper: That's the crux. A scalar reward can't locate a memory bug. A reviewer reading the full trace can. So watch what it actually does — this is the map. In NetHack, the roguelike that runs to a hundred thousand steps, the starting agent keeps a file called dungeon map, and it's append-only. Every single time the agent walks past a tile it's seen before, it writes a brand-new line: "there is floor here." Revisit it ten more times, ten more lines.

7:28Finn: So it's the travel journal where every time you pass the same café you write "there is a café here" again.

7:35Juniper: And after a week the journal has ten thousand entries and you can't find anything. The useful stuff is drowning. The reviewer reads the trace, diagnoses exactly that, and introduces a new operation: an upsert on the map, keyed by coordinate. Now revisiting a tile updates the one line for that spot instead of adding another. On screen you can watch the file go from this bloated wall of duplicates to one clean entry per location. Per-step growth of that map file drops from 138 characters to 6. More than ninety-five percent smaller.
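
A toy comparison of an append-only map with a coordinate-keyed upsert, as a rough illustration of the fix described above. The line format, the function names, and the back-and-forth walk are made up for the example; it does not reproduce the paper's 138-to-6 figure.

```python
def append_only(log, coord, tile):
    # Every visit adds a new line, even for a tile already recorded.
    log.append(f"{coord}: {tile}")


def upsert(tiles, coord, tile):
    # One entry per coordinate; a revisit overwrites instead of appending.
    tiles[coord] = tile


# An agent pacing back and forth over the same three tiles.
walk = [(0, 0), (0, 1), (0, 2), (0, 1), (0, 0), (0, 1), (0, 2)] * 50

log, tiles = [], {}
for coord in walk:
    append_only(log, coord, "floor")
    upsert(tiles, coord, "floor")

append_chars = sum(len(line) for line in log)
upsert_chars = sum(len(f"{c}: {t}") for c, t in tiles.items())
```

In this toy walk the append-only file grows with every step, while the upsert file stops growing once each tile has been seen once.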

8:10Finn: And the payoff isn't tidiness for its own sake.

8:14Juniper: No. That one fix lets the agent survive thousands of steps where it used to die within a few hundred. The map stops burying the information the agent needs to not get lost. And every rewrite like this has to earn its place: the revised agent replays the same fixed random seeds, and the change is kept only if average progress actually improves. The loop runs a handful of rounds until code revision has nothing left to give.
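
The acceptance rule just described (replay the same seeds, keep the change only if mean progress improves) can be sketched as follows. `run_episode` is a hypothetical stand-in that returns a noisy score; the seed list and the quality numbers are arbitrary and not taken from the paper.

```python
import random
from statistics import mean

FIXED_SEEDS = [0, 1, 2, 3, 4]  # assumed: the same seeds are reused for every comparison


def run_episode(scaffold_quality, seed):
    """Stand-in for a real episode: returns a progress score in [0, 1]."""
    rng = random.Random(seed)
    noise = rng.uniform(-0.05, 0.05)  # same seed gives the same noise
    return max(0.0, min(1.0, scaffold_quality + noise))


def accept_revision(current, candidate, seeds=FIXED_SEEDS):
    """Keep the candidate only if mean progress on the same seeds improves."""
    current_score = mean(run_episode(current, s) for s in seeds)
    candidate_score = mean(run_episode(candidate, s) for s in seeds)
    return candidate_score > current_score


kept = accept_revision(current=0.30, candidate=0.40)
rejected = accept_revision(current=0.30, candidate=0.25)
```

Holding the seeds fixed means a revision is compared against the old scaffold on identical episodes, so a lucky draw cannot pass for an improvement.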

8:42Finn: So loop one squeezes structure dry. But there's a ceiling on what a tool can do, right? You can build the most beautiful labeled-drawer kitchen in the world and the cook can still grab the wrong pan.

8:56Juniper: That's exactly the seam into loop two, Finn — and it's your half.

9:01Finn: So here's the limit of the scaffold. You can write a prompt that says "check your existing notes before you write a new one." You cannot make the model actually do it. A prompt is an instruction you're told to follow; it isn't a reflex. Loop two bakes the reflex into the weights. And the way they train it is the part I keep coming back to, because it's easy to misread. When they "train a memory specialist," the reviewer — Opus again — is not inventing correct answers and teaching them. It reads a pool of episodes and picks out which of the agent's own responses were good memory decisions. Every training example is verbatim text the agent itself produced. The stronger model is a filter on the smaller model's behavior, not a teacher writing new answers. It's reinforcing the agent's own better instincts.

9:50Juniper: That distinction is load-bearing for the whole "it learned a skill" claim. If the teacher were generating the answers, this would obviously just be one model copying another.
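
A small sketch of the "filter, not teacher" distinction. The reviewer here is a toy rule standing in for an LLM judge, and the field names (`context`, `response`) are assumptions; what matters is that the reviewer only returns keep or drop and never edits the text that becomes training data.

```python
def select_training_examples(episodes, is_good_memory_decision):
    """Keep only the agent's own responses that the reviewer approves.

    The reviewer returns True or False; it never rewrites the response text.
    """
    selected = []
    for episode in episodes:
        for step in episode:
            if is_good_memory_decision(step):
                # Verbatim copy of what the agent itself produced.
                selected.append(
                    {"prompt": step["context"], "completion": step["response"]}
                )
    return selected


def toy_reviewer(step):
    # Hypothetical rule: prefer steps that check notes before writing.
    return step["response"].startswith("search")


episodes = [[
    {"context": "at (3,4)", "response": "search dungeon_map (3,4)"},
    {"context": "at (3,4)", "response": "append dungeon_map (3,4) floor"},
]]
examples = select_training_examples(episodes, toy_reviewer)
agent_outputs = {step["response"] for ep in episodes for step in ep}
```

Every completion in `examples` is a member of `agent_outputs`, which is the property the "it learned a skill" claim leans on.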

10:00Finn: Right — the claim only holds because the source material is the agent's own output. Now, the training itself uses LoRA — instead of retraining the whole 32-billion-parameter model, you bolt on a small adapter and tune only that. Cheap, and it doesn't disturb the base model. And the reviewer jointly picks both the training data and the recipe, because a dataset and a training configuration have to match each other.
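
To make "a small adapter" concrete, here is the standard LoRA parameter count for a single weight matrix: the update is a product of two thin matrices, so the trainable parameters scale with the rank rather than with the full matrix. The layer size and rank below are illustrative only; the episode does not state which dimensions or rank the paper uses.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in a rank-r adapter for one d_out x d_in matrix.

    LoRA learns an update B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in); the original matrix stays frozen.
    """
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the full weight matrix."""
    return d_in * d_out


# Illustrative sizes, not the paper's configuration.
d_in = d_out = 4096
rank = 16

adapter = lora_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
fraction = adapter / full
```

With these example sizes the adapter holds under one percent of the parameters of the matrix it modifies.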

10:25Juniper: Here's where I want to slow you down, because this is the piece that's easy to muddle. You've now got a trained memory specialist. But the agent still has to, you know, play the game. So which model is driving?

10:39Finn: Both — and this is the architecturally slippery bit, so let me use the picture. At runtime there are two instances sharing one running conversation. The frozen base model — untouched, still great at picking moves — commits the actual world actions. The LoRA-tuned specialist handles the memory: the reading, the searching, the writing. The clean way to hold it: one hand takes notes, the other plays the game. You've got a chess player with a trained assistant beside them whose only job is to keep and consult the notes. You retrain the note-taker to be excellent — and you never touch the player's game sense. So the player never gets one bit worse at chess.

11:21Juniper: And that's why the memory gains stack cleanly instead of trading off. Because the frozen model's action skill is never in the training loop, improving the notes can't corrupt the moves.

11:33Finn: The one honest caveat on the analogy — it's not two separate people. It's two instances of the same underlying model sharing one transcript. So the hand-off is tighter and more seamless than passing notes across a table. But the "trained note-taker, untouched player" intuition is exactly right about what's frozen and what isn't.
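
A schematic of the runtime split described above: two instances appending to one shared transcript, one restricted to memory operations and one to world actions. Both classes are hypothetical stubs, and the order in which they act within a step is an assumption made for the sketch.

```python
class FrozenPlayer:
    """Stand-in for the untouched base model: only picks world actions."""

    def act(self, transcript):
        return {"role": "player", "action": "move_east"}


class MemorySpecialist:
    """Stand-in for the adapter-tuned instance: only issues memory operations."""

    def act(self, transcript):
        return {"role": "memory", "action": "search dungeon_map"}


def run_step(transcript, player, specialist):
    # Both instances read and extend the same running transcript.
    transcript.append(specialist.act(transcript))  # recall or record first (assumed order)
    transcript.append(player.act(transcript))      # then commit a world action
    return transcript


transcript = []
player, specialist = FrozenPlayer(), MemorySpecialist()
for _ in range(2):
    run_step(transcript, player, specialist)

roles = [entry["role"] for entry in transcript]
```

Only the specialist would carry trained adapter weights in this picture; the player's behavior is never part of the training loop.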

11:54Juniper: So does the reflex actually take? Because the scaffold could already prompt "consult before you write." The question is whether training turns that prompt into a habit.

12:05Finn: This is the cleanest evidence in the paper that something was learned, not just told. They measure the ratio of memory writes to searches. A high ratio means the agent is writing blindly — dumping new content without checking what it already has. If training instilled the habit, that ratio should fall — the agent should search its files before appending. And it falls in every environment. In NetHack it drops from 4.66 down to 1.31 — a seventy-two percent cut. The behavior the scaffold could only encourage becomes a behavior the model has.
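
The write-to-search ratio and the quoted drop can be checked with a few lines. Which operations count as a "write" is an assumption here (both `write` and `append`), the toy log is invented, and the function assumes the log contains at least one search.

```python
def write_search_ratio(actions):
    """Memory writes per search; higher means more writing without checking."""
    writes = sum(1 for a in actions if a in ("write", "append"))
    searches = sum(1 for a in actions if a == "search")
    return writes / searches  # assumes at least one search in the log


def percent_drop(before, after):
    return 100 * (before - after) / before


toy_log = ["append", "append", "search", "write", "append", "search"]
ratio = write_search_ratio(toy_log)

# The NetHack figures quoted in the episode: 4.66 before training, 1.31 after.
drop = percent_drop(4.66, 1.31)
```

The quoted NetHack figures work out to a drop of just under 72 percent, which matches the number given in the episode.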

12:40Juniper: So — checkpoint, because we've built the whole object now. Memory becomes visible by making it an action. Loop one rewrites the tools until code revision is tapped out — that's the map fix. Loop two retrains a small memory specialist on the agent's own best decisions, and parks it beside a frozen player. The question left is the one that pays for all of it: how much does this actually move the needle?

13:07Finn: So let's put a number on it. Set the prediction first: if better memory really is a high-leverage bottleneck, then optimizing memory alone — weights untouched during the scaffold phase — should move performance a lot, not a little. And it does. Crafter, the shorter survival game, goes from twenty-five percent to forty-seven — call it doubled. MiniHack more than triples. NetHack nearly quadruples. Then the proficiency training adds another layer on top of each.

13:38Juniper: And that's before we get to the comparison that actually made me sit up.

13:42Finn: This is the one. Take the same base model family and just double the size — Qwen 72-billion instead of 32. The scaffolded 32B beats the 72B by a wide margin on all three games. Better note-taking beat more than doubling the parameter count. And it lands right around Claude Opus 4.5's level, within a few points of Gemini 3.1 Pro Thinking on these tasks. A mid-sized open model, matching frontier proprietary systems — purely on memory discipline.

14:12Juniper: There's a second result I find almost more telling than the headline, because it's counterintuitive. You'd expect better memory to mean better recall — the agent finds what it stored. But the fingerprints of the scaffold optimization show something else. Steps where the agent is just stuck or pacing back and forth drop by a third to two-thirds. Redundant writes drop up to eighty-plus percent. And per-step input tokens shrink by up to thirty percent.

14:41Finn: Wait — tokens shrink? Better memory makes the model read less?

14:46Juniper: That's the irony. You'd assume a richer memory system means more to carry, more to attend to every step. It's the reverse. When the map is deduplicated and the notes are lean, there's less garbage in the context — so the model has less to wade through, not more. Good note-taking doesn't just help you remember. It lightens the load. The tidy notebook is smaller than the messy one.

15:14Finn: And you can watch it in the play-throughs, which are the most concrete thing in the paper. In Crafter the base agent loops endlessly gathering wood — two achievements out of twenty-two. The evolved scaffold crafts stone tools, builds a furnace, mines iron — twelve. The trained specialist does all that and remembers to feed itself — thirteen. In NetHack the base agent dies at experience level one within a few hundred steps. The scaffold version survives around seven thousand steps and reaches level two. The trained one survives far longer and reaches level four.

15:53Juniper: So the story holds together beautifully. Which is exactly when I want you to push on it, because you've been sitting on the reservation since the top.

16:04Finn: I have. And I want to state it in its strongest form, because the paper is good enough to deserve the real objection, not a soft one. Start with NetHack. The multiplier is "nearly quadrupled," and that's true. But the absolute numbers are 0.42 percent to 1.57. Everyone is failing hard at NetHack; even the frontier models top out around two to seven percent. A four-x gain on a task where the best score anyone reaches is in the single digits is a much weaker claim than the same multiplier on Crafter, which actually gets close to fifty. And the abstract folds those very different regimes into one tidy "two-to-four-x." On Crafter, I buy it completely. On NetHack, the honest read is: everybody's drowning, and this agent drowns slightly less.
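
The gap between the relative and absolute readings is easy to see by computing both from the figures quoted in the episode (Crafter 25 to 47 percent, NetHack 0.42 to 1.57 percent). The helper name is arbitrary.

```python
def gains(before, after):
    """Return the relative multiplier and the absolute gain in percentage points."""
    return {"multiplier": after / before, "points": after - before}


crafter = gains(25.0, 47.0)
nethack = gains(0.42, 1.57)
```

NetHack has the larger multiplier (about 3.7x against about 1.9x), but Crafter gains 22 percentage points where NetHack gains a little over one.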

16:56Juniper: That's fair. The Crafter and MiniHack results carry the thesis; NetHack is more "moved a tiny needle."

17:03Finn: And then the deeper one — the tension you planted at the start. The engine of every improvement here is Claude Opus. Opus reads the traces and rewrites the code. Opus curates the training data. So you can tell two stories about the gain. Story one: the 32B model learned a memory skill. Story two: capability leaked downhill — a frontier model wrote better code and filtered better data for a smaller one, and we dressed it in metamemory language.

17:31Juniper: The paper's rebuttal is the filter-not-teacher point, though — the training data is the agent's own verbatim output.

17:38Finn: And that rebuttal genuinely covers loop two. The specialist is trained on the 32B's own decisions, selected, not generated. Fair. But it does not cover loop one. The scaffold rewrites — the map upsert, the new operations, the schema — those are pure Opus authorship. That's a frontier model writing code for a weaker one. And a lot of the headline gain lives in that first loop. So "the agent learned a skill" is cleanest for the training half and shakiest for the structure half.

18:09Juniper: And there's the "we only touched memory" wrinkle underneath that.

18:13Finn: Right — that's the sharpest version. They claim they optimized memory alone. But the scaffold edits also do things like block impossible craft actions and reject repeated do-nothing moves. That's shaping task strategy directly, not just memory. They frame it as a feature — better memory improves behavior as a side effect. A skeptic reads it the other way: some of those edits are task-strategy interventions wearing a memory label. Add that they tune a separate scaffold and specialist for each game, with hand-picked stopping points and different configs — so what's demonstrated is bespoke per task, and they concede generalization across environments is untested.

18:55Juniper: I'll concede the shape of all of that. The distillation ambiguity is real and it isn't unique to this paper — any time a strong model improves a weak one, you can't fully separate "it learned" from "capability flowed downhill." The per-game tuning is a genuine limit they own. And the "only memory" claim is looser than the framing implies.

19:18Finn: But — and this is where I don't get to keep the last word — even granting every one of those, the reviewer-of-full-traces idea survives. Because the thing that's actually new here isn't the memory result. It's the method: hand a strong model an entire execution log — up to a hundred thousand steps, a scale no human can audit — and let it diagnose and revise where things went wrong. Memory is just the first thing they pointed it at.

19:47Juniper: And that's the real takeaway, bigger than the algorithm. For years the reflex has been: long tasks failing? Bigger model. This paper is evidence that on long-horizon work, how an agent manages its notes can be higher-leverage than how big its brain is. The bottleneck may sit in information management, not reasoning — which means the next gains might come from better memory discipline and cheaper models that keep good records, not from another scale-up. A 32B open model matching frontier systems on note-taking alone is what that reframing looks like when it's true.

20:24Finn: So here's where I'll leave you. If you were building a long-horizon agent tomorrow — where do you put your next dollar? Into a bigger, smarter base model? Or into teaching a smaller one to manage its own memory, and using a strong reviewer to audit the traces humans can't read? Those pull in genuinely different directions — tell us which way you lean, and what breaks first in the version you'd ship.

20:50Juniper: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related work grouped by theme, from the metamemory research to the memory-as-action papers this one builds on.

21:06Finn: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Juniper and I are AI voices from Eleven Labs, and the producer isn't affiliated with Anthropic or Eleven Labs. The paper is AutoMem — Automated Learning of Memory as a Cognitive Skill, out of Stanford, posted July first, 2026, and we're recording the day after.

21:30Juniper: The trick was never a bigger brain. It was knowing what to write down. See you in the next one.