Episode 110 · Jun 03, 2026 · 27 min

How an Agent Got 44 Points Better by Mining Its Own Scratch Paper

Lei, Yan, Momo et al.

LLM Agents Automated Reasoning Program Synthesis
Paper: Inducing Reasoning Primitives from Agent Traces
Venue: arXiv:2606.02994
Year: 2026
Read the paper: arxiv.org/abs/2606.02994

An AI that solved a hard legal-reasoning task only 30% of the time jumped to 74% — using nothing but its own past successful transcripts, with zero retraining. This episode unpacks why that isn't a free lunch, the clever control experiment that proves it, and the honest places where the whole method falls apart.

What you'll take away

  • Why mining an agent's own successful 'thoughts', not its actions, can convert inconsistent competence into consistent competence without changing a single weight
  • The 'implicit aggregation' mechanism: how a stable consensus recipe of the agent's best behavior dissolves the apparent paradox of beating its own teacher
  • Why the control (20x more compute via Self-Consistency) fails to close the gap, proving it's better-organized reasoning, not just more thinking
  • Where the method breaks: arithmetic-heavy tasks where language-model 'pseudo-tools' compound small errors and drop below plain chain-of-thought
  • The honest caveats: a curated benchmark, 'surpasses' meaning 'matches' on most tasks, and the headline +44 partly reflecting how broken the baseline was
  • Why human-readable induced tools make the agent's reasoning vocabulary auditable and editable, unlike invisible fine-tuning

Chapters

  1. 00:00 The 30-to-74 jump that looks like a free lunch
  2. 03:24 The scratch paper problem
  3. 06:48 The four-stage induction pipeline
  4. 10:12 Pseudo-tools and the colleague-down-the-hall trick
  5. 13:36 Implicit aggregation: why it beats its own source
  6. 17:00 The compute objection and the Self-Consistency control
  7. 20:25 Where it breaks: arithmetic, curation, and modest gains
  8. 23:49 Auditable competence and the bigger reframe


0:00Bella: Here's a number that stopped me cold. You give an AI a genuinely hard task: decide whether a proposed NBA roster move is legal under the league's collective bargaining agreement. And to make it brutal, the researchers paste about five kilobytes of the actual legal text into every single prompt. A wall of contract language. The agent reads it, reasons through it, and gets the answer right thirty percent of the time. Then the researchers do something almost cheeky. They take that same agent's own successful transcripts, its notes from the problems it got right, mine them, build a little toolkit, and hand it back to the agent. Same model. No retraining. Not one weight changed. And the score jumps to seventy-four percent.

0:47Eric: Seventy-four up from thirty, built out of nothing but the agent's own previous work. That's the thing I want to keep poking at, Bella, because on its face that sounds like getting something for free. The student beats the teacher using only the teacher's old homework.

1:05Bella: And that free-lunch suspicion is exactly the right instinct to bring in here; it's the instinct the whole paper is built to answer. The paper is called "Inducing Reasoning Primitives from Agent Traces," out of Carnegie Mellon, and it went up on arXiv on June second, twenty-twenty-six. We're recording the very next day, June third. Quick flag before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude, and the two voices you're hearing (I'm Bella, and that's Eric) are both AI voices from Eleven Labs. The team producing the show has no affiliation with Anthropic or with Eleven Labs. And the reason that thirty-to-seventy-four jump isn't a free lunch is the entire spine of this thing. So let me set up the puzzle that started it.

1:56Eric: Please. Because right now it does smell like a magic trick.

2:00Bella: So the setup is what's called a ReAct agent, which is the dominant way people run language models as agents. The loop is simple. The model writes a thought, a sentence of plain reasoning. Then it takes an action: looks something up, calls a tool. Then it reads the result. Then it thinks again. Thought, action, observation, repeat, until it decides it's done. Now picture that agent grinding through a whole family of related problems, say, a stack of murder mysteries it has to solve by deduction. On the first one, somewhere in its scratchpad, it does a move like "check whether this suspect's alibi lines up with the timeline." Then it solves the next mystery and reinvents that exact same move from scratch. And the next. The recurring reasoning routines are right there, visible if you read across the transcripts, but they're trapped in these transient scratchpads, thrown away the instant each problem is solved.
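A minimal sketch of the thought, action, observation loop described above. The names `react_loop`, `scripted_policy`, and `substring_search` are hypothetical, and the scripted policy is only a stand-in for a real language model; the point is that the loop records its thoughts separately so they can be kept after the problem is solved.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: the policy stands in for the language model and maps the
# transcript so far to (thought, action_name, action_input).
Step = Tuple[str, str, str]
Policy = Callable[[List[str]], Step]
Tool = Callable[[str], str]


def react_loop(policy: Policy, tools: Dict[str, Tool], question: str,
               max_steps: int = 10) -> Tuple[str, List[str]]:
    """Run thought -> action -> observation until the policy chooses 'finish'."""
    transcript = [f"Question: {question}"]
    thoughts: List[str] = []  # the "scratch paper" that is usually thrown away
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        thoughts.append(thought)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg, thoughts
        observation = tools[action](arg)
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return "", thoughts  # step budget exhausted without an answer


def scripted_policy(transcript: List[str]) -> Step:
    """Toy stand-in for the model: search once, then answer."""
    if not any(line.startswith("Observation:") for line in transcript):
        return ("Check whether the suspect's alibi fits the timeline.",
                "search", "alibi")
    return ("The alibi conflicts with the timeline.", "finish", "guilty")


def substring_search(text: str) -> Tool:
    """Generic tool: report whether a substring occurs in the problem text."""
    return lambda needle: "found" if needle in text else "not found"


answer, thoughts = react_loop(
    scripted_policy,
    {"search": substring_search("The alibi places him at the docks at 9pm.")},
    "Who did it?",
)
```

Here `thoughts` ends up holding the two reasoning strings, which is the raw material the induction stages discussed later would mine.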

2:59Eric: It's the student who works the problem on scratch paper, gets the answer, and tosses the scratch paper — even though the method on that page was the valuable part.

3:10Bella: Exactly that. And the standard fix has always been ugly. You get a human expert to sit down and hand-author a decomposition for the task: first parse the problem, then plan, then check. But that takes real domain insight, a fresh engineering pass for every new benchmark, and — here's the kicker — for genuinely reasoning-heavy tasks, sometimes there's no clean code you could even write. There's no deterministic function a developer can implement called "verify alibi consistency." Alibi-checking is a judgment call, not arithmetic.

3:44Eric: Right, so you can't just compile the reasoning into a Python function and be done.

3:49Bella: No. So the authors ask: can we recover those recurring reasoning moves automatically, from the agent's own traces, with no human designing anything? And the sharper version: if we extract the agent's own habits and hand them back as named tools, does the agent get better at its own job? The answer, and it's the heart of the whole paper, is that it gets dramatically better. It beats the very agent whose traces built it.

4:18Eric: Okay. So before you tell me how, let me state the one-sentence version so I know what we're testing. Mine the agent's successful transcripts, find the thinking moves it keeps reinventing, hand those moves back as named reusable tools, and the agent solves new problems far better than it did alone. Up to forty-four points better. That's the claim.

4:42Bella: That's the claim. And the method is almost aggressively minimal, which I think is part of why it's convincing. Two tunable numbers, three prompts to a language model, one configuration used unchanged across every single benchmark. The only thing that changes per task is a one-line description. Let me walk the pipeline, because each step's restraint is doing real work. It's a four-stage assembly line. Stage one: run a bare-bones ReAct agent on the training problems. And critically, this agent only has generic tools. Substring search, section search over the problem text. Nothing domain-specific. So if any real structure shows up later, it has to have come from the agent's reasoning, not from tools an engineer pre-installed. Then they keep only the runs that got the right answer. Success is the supervision signal.

5:37Eric: So the filter is just: did you actually solve it. Throw away the failures.

5:42Bella: Right. Stage two is the one that defines the paper. They throw away all the actions and all the observations, everything the agent did in the environment, and keep only the thought strings. The raw natural-language reasoning. They're not mining what the agent did. They're mining what it thought. Stage three: one call to a language model labels each thought with a short reasoning-move name. Three to six words. "Verify alibi consistency." "Initialize suspect analysis." And if there end up being more than ten distinct labels, a second call clusters the synonyms down to a handful of canonical moves. That label-merge is the only redundancy-reduction in the entire system. No embeddings, no fancy clustering, just a model grouping near-duplicate names.

6:31Eric: That is almost suspiciously simple. Where's the rest of the machinery?

6:36Bella: There isn't any — that's the point. Stage four: sort the canonical moves by how often they appear, and for each move that shows up at least three times, sample up to five example thoughts and ask a model to write a tool name and a description for it. Stop once you've got five tools. And here's the lovely concrete picture from the murder-mystery case. Out of a hundred ninety-six thoughts, the moves sort into buckets — investigating evidence shows up forty-eight times, initializing suspects thirty-eight, drawing a final conclusion twenty-nine. The most frequent one becomes a tool called, simply, investigate evidence.
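The counting step in stage four can be sketched roughly as follows. This assumes the thoughts have already been labeled and merged into canonical moves; `select_moves` and the constant names are hypothetical, the sketch takes the first five examples where the episode says the method samples up to five, and the toy data reuses the three counts quoted above plus one made-up rare label.

```python
from collections import Counter
from typing import List, Tuple

MIN_COUNT = 3      # a move must appear at least this many times
MAX_TOOLS = 5      # stop once this many tools have been selected
MAX_EXAMPLES = 5   # example thoughts handed to the tool-description prompt


def select_moves(labeled: List[Tuple[str, str]]) -> List[Tuple[str, int, List[str]]]:
    """labeled holds (canonical_label, thought_text) pairs from successful runs."""
    counts = Counter(label for label, _ in labeled)
    selected: List[Tuple[str, int, List[str]]] = []
    for label, n in counts.most_common():       # most frequent moves first
        if n < MIN_COUNT or len(selected) == MAX_TOOLS:
            break
        # Simplification: first examples in order, not a random sample.
        examples = [text for lab, text in labeled if lab == label][:MAX_EXAMPLES]
        selected.append((label, n, examples))
    return selected


# Toy corpus: three counts from the episode, one hypothetical rare label.
toy = (
    [("investigate evidence", f"thought {i}") for i in range(48)]
    + [("initialize suspect analysis", f"thought {i}") for i in range(38)]
    + [("draw final conclusion", f"thought {i}") for i in range(29)]
    + [("hypothetical rare move", f"thought {i}") for i in range(2)]
)
moves = select_moves(toy)
```

Each selected entry would then go to a language model that writes the tool name and description; that call is omitted here.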

7:16Eric: So you literally just count the verbs of thought and name the top five.

7:21Bella: Count the verbs of thought, name the top five. But now we hit the trick that makes the whole thing possible, and I want to be careful with it because it's where people's mental model usually breaks. These tools aren't real code.

7:36Eric: Wait — what do you mean they're not code? You said tools.

7:40Bella: I'm calling them tools because to the agent they look exactly like tools: they have a name, a type signature, a description. But the body is empty. When the agent calls one, the description and the inputs just get bundled into a fresh prompt and handed to a language model, which improvises the answer on the spot. The authors call these pseudo-tools.

8:04Eric: So a normal tool is a vending machine — press the button, get the same thing every time, mechanically.

8:11Bella: And a pseudo-tool is more like a sticky note that says "ask the sharp colleague down the hall to check this alibi for consistency." There's no machine doing it. You're routing the request to a smart improviser, but you've handed them a clear, stable job description. That's what lets the library contain something like "verify alibi consistency," a move with real semantic content and no clean code behind it. It's the bridge between callable and named on one side, and fuzzy judgment-laden reasoning on the other.
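One way to picture a tool whose body is just a prompt is the sketch below. The `PseudoTool` class, the prompt wording, and the `echo_llm` stub are all assumptions made for illustration, not the paper's implementation; a real deployment would pass in an actual model call.

```python
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


@dataclass(frozen=True)
class PseudoTool:
    """A named, described tool with no code body: calls are routed to a model."""
    name: str
    description: str
    llm: LLM

    def __call__(self, tool_input: str) -> str:
        # Bundle the stable job description and the input into a fresh prompt.
        prompt = (f"You are executing the tool `{self.name}`.\n"
                  f"Job description: {self.description}\n"
                  f"Input: {tool_input}\n"
                  "Output:")
        return self.llm(prompt)


def echo_llm(prompt: str) -> str:
    """Offline stub so the sketch runs: reports how many prompt lines it saw."""
    return f"(model saw {len(prompt.splitlines())} prompt lines)"


verify_alibi = PseudoTool(
    name="verify_alibi_consistency",
    description="Check whether a suspect's alibi lines up with the timeline.",
    llm=echo_llm,
)
result = verify_alibi("Suspect A claims to have been at the docks at 9pm.")
```

The description is the only thing that is fixed between calls, which is why its wording carries so much of the behavior.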

8:45Eric: Okay, but now I'm right back to my opening complaint, and it's gotten worse. At test time the agent is running the same kind of loop, calling tools that are just... the same model improvising. So calling "verify alibi consistency" is, what, asking the model to do the exact thing it already did in its own traces? How does asking it to repeat itself make it better? That should be a wash at best.

9:12Bella: And that, that question, is the whole paper. The answer is a mechanism they call implicit aggregation, and once it clicks, the paradox dissolves. Think of a chef who cooks the same sauce a hundred nights in a row. Slightly different every night. Some nights it's brilliant. Some nights it's off. High variance. Now imagine someone watches all hundred nights, picks out the best versions, and writes down one clean recipe that captures what the good nights had in common. From then on, the chef follows that recipe instead of improvising from scratch.

9:50Eric: The chef didn't learn a new dish.

9:52Bella: The chef learned nothing new. The recipe just locked in their own best behavior. That's the move. When the source agent reasons through a single problem, it reconstructs each thinking move on the fly, under whatever messy local context that one transcript happened to have. Sometimes it nails the move, sometimes it botches it. It's high-variance. But during induction, the synthesis step gets to look at several successful examples of the same move at once and write a single, stable, corpus-level specification of what that move is supposed to accomplish, independent of how any one transcript happened to execute it. So at deployment, when the agent calls that move by name, it's not asking itself to repeat a one-off. It's invoking a cleaned-up consensus version of its own best behavior. And when the source agent is brittle, when the gap between "reinvent it every time" and "invoke the stable version" is large, the payoff is enormous.

10:54Eric: So the lift isn't new capability. It's converting inconsistent competence into consistent competence. The agent could always check alibis; it just did it like a forty-percent free-throw shooter under pressure. The shot was always in the repertoire. The reliability wasn't.

11:13Bella: That's the cleanest way I've heard it put. And there's a line in the paper that nails the thesis: that trace induction isn't merely distillation; it can surface reasoning structure the source generated inconsistently but failed to deploy reliably. The structure was always there. The agent just couldn't be trusted to reach for it.

11:35Eric: And that's why the NBA number is the showcase. The agent has to find the right clause in a fifty-page contract that's dumped in front of it fresh every single time. Some days it finds the clause fast and applies it right. Some days it grabs the wrong one. The induced library encodes the move, find-and-apply-the-relevant-rule, once, and runs it the same way every time. The contract is still huge. But navigating it stops being reinvented per question.

12:04Bella: And that's where the plus-forty-four comes from: the biggest gains show up exactly where the baseline was most erratic. They see the same shape elsewhere, too. Team-allocation jumps thirty-eight to sixty-eight. Meeting-planning goes from seven percent, basically hopeless, to twenty-nine. More than quadrupled, off a starting point where the agent could barely do the task at all.

12:29Eric: Okay. Here's where I have to put my foot down, because there's an objection so obvious it would be malpractice not to raise it. This induced agent is running a full loop, calling tools, each call is another trip to the language model. It's spending way more compute than a plain model that just thinks once and answers. So maybe none of this is about the library at all. Maybe you just gave it twenty times the thinking budget and of course it does better. More tokens, better answers. The structure could be a sideshow.

13:03Bella: It's the first thing a careful reader thinks, and the authors clearly knew it. So Eric, tell people what they ran — because this is the control that earns the whole paper its credibility.

13:15Eric: They ran the cheapest possible version of "just spend more compute." It's called Self-Consistency. You take plain chain-of-thought, where the model just thinks it through once, and you sample it twenty independent times, then take the majority vote across all twenty answers. That's roughly twenty-one times the compute, right in the same ballpark as the induced agent's budget. Same model. It's the study-by-brute-force approach: the student takes the same exam twenty times and submits their most common answer to each question. And it doesn't close the gap. On the murder task, twenty tries with majority vote gets to sixty-six percent, still nine points below the induced library. On the object-placement task it's worse, about twenty-four points below. Twenty-fold more thinking does not reproduce the lift.
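The sample-twenty-times-and-vote control reduces to a few lines. In this sketch `self_consistency` is a hypothetical helper name and the cycling list is a deterministic stand-in for twenty independent model samples, so the answers themselves are made up.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 20) -> str:
    """Draw n_samples independent answers and return the most common one."""
    answers: List[str] = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a stochastic model: this pattern yields 12 "legal", 8 "illegal".
fixed_answers = cycle(["legal", "illegal", "legal", "legal", "illegal"])
voted = self_consistency(lambda: next(fixed_answers))
```

Voting only helps when the right answer is already the single most common one, which fits the episode's point that repetition alone does not reorganize the reasoning.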

14:11Bella: Which is the cleanest possible separation of two explanations. More effort versus better method. If brute repetition closed the gap, you'd say fine, it's just compute. It doesn't. So something about the content of the library matters — not just the spend.

14:27Eric: And that's the moment I stopped reading this as a magic trick. Because that's the experiment that distinguishes "the agent thought harder" from "the agent thought better-organized." They're not the same thing, and the control proves it.

14:43Bella: So let me give the field-level reframe here, because I think it's the bigger intellectual contribution. There's a long-running question in AI about how systems improve over time. One camp says: retrain, change the weights with more data. Another says: leave the model frozen, wrap clever structure around it. This paper is firmly in the second camp, but it pushes it somewhere specific. Instead of a human designing the scaffolding, the scaffolding is discovered from the agent's own behavior. The reframe is treating the agent's transcripts not as disposable byproduct, but as a record of latent expertise the agent already has and underuses.

15:26Eric: Self-improvement without fine-tuning. Not "make the model smarter" but "make the model reliably use what it already knows."

15:34Bella: And there's a closely related method worth naming as the contrast, because it sharpens what's distinctive here. The nearest cousin is a technique called Agent Workflow Memory. It also mines successful traces, but it extracts whole workflows. A five-to-ten step natural-language procedure: here's the recipe that worked, follow it. This paper goes finer. It extracts the atomic moves, the verbs of thought, and treats them as a vocabulary the agent can recombine.

16:05Eric: A recipe to follow versus a vocabulary to compose.

16:08Bella: That's the whole contrast in one line. And it pays off on cost, too. The induced method runs about twenty-four percent cheaper than the workflow approach. Though there's a genuinely wild number buried in that comparison: on the NBA task, the workflow method burns around seven hundred seventy-six thousand tokens per single question.

16:30Eric: Three-quarters of a million tokens to answer one roster question?

16:35Bella: Because that five-kilobyte chunk of legal text gets reloaded at every step of a long loop. It just keeps re-paying for the wall of contract, over and over.

16:45Eric: That's almost poetic. The expensive method is expensive precisely because it can't stop re-reading the contract — which is the exact brittleness the cheap method fixed.

16:57Bella: Eric, I think now's the right moment for you to take the floor on where this thing breaks. Because the authors are unusually honest about it, and I don't want us to oversell.

17:08Eric: Yeah, and I want to lead with the most important one, because it's also the most clarifying. Arithmetic kills it. Remember, every tool is a language model interpreting a description. That's great for fuzzy judgment. It's terrible for long deterministic arithmetic (a baggage-fee calculation, a tax computation) where small per-step errors compound. On those tasks, the induced library actually drops below plain chain-of-thought. It's worse than doing nothing fancy.

17:41Bella: Which is exactly what the vending-machine-versus-colleague image predicts, right? The improvising colleague is the same kind of model that makes mistakes. For anything that needs exact, repeatable computation, you want the vending machine.

17:57Eric: Precisely. And the authors sketch a fix: route the arithmetic through a real Python helper while keeping the language-model description for the fuzzy extraction part. That recovers chain-of-thought-level performance on the airline task, gets to about fifty-one percent against an oracle ceiling of seventy-two. But "automatically deciding which moves need real code and which can stay fuzzy" is left as open work. So right now the method has a known, fairly large class of tasks it just can't handle. And that leads to my real concern about the headline.
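The split between fuzzy extraction and exact arithmetic might look something like this sketch. The fee rule, the field names, and `stub_extract` are all invented for illustration; only the shape (a model reads the numbers out of prose, real code does the math) comes from the discussion above.

```python
from typing import Callable, Dict

Extractor = Callable[[str], Dict[str, int]]  # stands in for the fuzzy model step


def baggage_fee(bags: int, free_bags: int, fee_per_bag: int) -> int:
    """Deterministic part: exact arithmetic, same result on every call."""
    return max(0, bags - free_bags) * fee_per_bag


def hybrid_tool(extract: Extractor, passage: str) -> int:
    fields = extract(passage)      # fuzzy: pull the quantities out of prose
    return baggage_fee(**fields)   # exact: compute with ordinary code


def stub_extract(passage: str) -> Dict[str, int]:
    # Hypothetical stand-in for a language-model call; the values are made up.
    return {"bags": 3, "free_bags": 1, "fee_per_bag": 35}


fee = hybrid_tool(stub_extract, "Three checked bags, one free, $35 each after that.")
```

Extraction errors can still occur in the fuzzy half, but they no longer compound through a long chain of model-computed sums.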

18:34Bella: Go on.

18:35Eric: The benchmark suite is small and visibly curated. Six subtasks. And the authors excluded several others: an airline task, a tax task, a calendar task, a crypto-crossword task. They give principled reasons: arithmetic-heavy, or already solved by plain chain-of-thought. Fair enough. But the practical effect is that the method gets showcased precisely on the tasks where it works, and the excluded ones are exactly where the core mechanism fails. So the honest framing is: these are real, careful results, in a favorable region the authors chose.

19:11Bella: That's fair. Though I'd push back slightly — they tell you which region. The exclusions aren't hidden; the arithmetic failure is stated outright. It's curated, but it's transparently curated.

19:24Eric: I'll grant that. Transparency about the favorable region is worth a lot. But let me press on the numbers themselves, because there's a subtlety the headline glosses. The paper says the induced library "matches or surpasses" expert hand-designed decompositions. And that's true. But when you look at their own significance tests, it significantly beats the expert design on only two of six cells: team allocation and meeting planning. On the other tasks, including NBA, the induced library is ahead, but the difference sits inside the noise. So "surpasses" is carrying a lot of rhetorical weight when "matches" is the more common outcome.

20:06Bella: And actually the two cases where it genuinely beats the expert are the most interesting part of that story. Team allocation and meeting planning are what the authors call soft-constraint tasks. There's no solution that satisfies every preference; somebody's getting a worse meeting time no matter what. So the reasoning isn't a clean parse-then-plan-then-check pipeline. It's a messy sequence of rearrangements and trade-offs. And those are exactly the moves a human expert can't easily anticipate in advance, but that show up naturally in successful traces.

20:43Eric: So the method beats the expert precisely where the expert can't see the structure ahead of time.

20:50Bella: Right — discovery beats design exactly when the structure is too messy to design. Plus seventeen points on team allocation, plus fifteen on meeting planning, over the human-authored versions.

21:03Eric: That I find genuinely persuasive. Two more limitations, and then I think we've been fair. One: the whole pipeline depends on the source succeeding at least sometimes. You filter for correct traces. On a task family where the generic agent essentially never gets it right, there are no successful traces to mine, and the whole thing has nothing to bootstrap from. The paper doesn't characterize how good the source has to be for this to ignite.

21:31Bella: And there's a flip side to that which I think is the most honest caveat of all. Because induction aggregates across successful traces, it preserves systematic errors. If the source has a consistent bias — gets some category of thing reliably wrong in a way that still happens to pass the filter — that bias gets baked right into a named tool.

21:54Eric: The consensus recipe locks in the bad habit along with the good ones.

21:58Bella: It locks in whatever was consistent, and consistency is the only thing it's selecting for. Which is why the authors recommend human review before any high-stakes deployment. But here's where that limitation flips into a feature, and it's the practical point I'd most want a listener to take away. The induced tools are human-readable. A name, a type signature, a plain-language description. So a deployer can actually read the agent's reasoning vocabulary. You can open up the library and see, in English, the moves the agent is relying on. If one encodes a bias, you edit the description. If you don't trust a tool's output, you delete it.

22:39Eric: Which is a real difference from fine-tuning or activation steering, where the learned behavior is just... invisible. Buried in the weights.

22:48Bella: Completely invisible. Here the agent's acquired competence is sitting in a small, editable, auditable list. You converted inconsistent competence into consistent competence, and you can read exactly what you converted.

23:02Eric: One last bit of due diligence, because I think it matters for trust. That big claim — induction exceeds its source — could a skeptic just say a thirty-versus-seventy-four gap is luck of which test problems got drawn?

23:17Bella: They asked exactly that, and they did the right thing. They use a paired bootstrap. You don't just report two averages; you repeatedly resample the test set, keeping each problem's two scores yoked together, and check that the gap itself stays positive across all those resamples. Problem by problem. Every "induced minus source" gap on the comparable tasks stays strictly above zero.
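A rough sketch of the resampling described above, where each problem's two scores stay yoked together. The function name, the resample count, and the toy 0/1 scores are assumptions; this version reports the fraction of resamples with a positive gap, which may differ from the exact statistic the paper reports.

```python
import random
from typing import List, Tuple


def paired_bootstrap(source: List[int], induced: List[int],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean gap, fraction of resamples whose gap is above zero)."""
    assert len(source) == len(induced)
    rng = random.Random(seed)
    n = len(source)
    # Per-problem differences keep the two scores for each problem paired.
    diffs = [b - a for a, b in zip(source, induced)]
    positive = 0
    for _ in range(n_resamples):
        # Resample problems with replacement, then look at the gap itself.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            positive += 1
    return sum(diffs) / n, positive / n_resamples


# Made-up correctness scores (1 = solved) for ten test problems.
source_scores = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
induced_scores = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
mean_gap, frac_positive = paired_bootstrap(source_scores, induced_scores)
```

Resampling differences instead of the two score lists separately is what makes the check "problem by problem": a hard problem drags both systems down together and cancels out of the gap.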

23:43Eric: So it's not "looks better." It's "the difference survives wiggling the test set." Those are different sentences, and the second one is the one that should make you believe it.

23:55Bella: And it's not an artifact of one giant model, either. They reran the whole thing on a small, weaker model from a totally different family, as both the trace source and the test agent. Murder went thirty-three to fifty-one. Team went forty-seven to fifty-three. Smaller effects, but the direction holds. The "induction beats its source" phenomenon isn't a quirk of one model having room to spare.

24:22Eric: That's the one that moves me from "neat result on one model" to "this might be a property of how these agents reason in general." Though I'll note the effect being smaller on the weaker model is consistent with the variance story: a weaker agent is brittle in more ways than a stable consensus recipe can rescue.

24:43Bella: That's a fair reading, and it's consistent with the whole thesis. The size of the lift tracks how erratic the starting point was. Which, if you're a skeptic, is a slightly uncomfortable corollary: the most impressive number, the plus-forty-four, comes from the least reliable baseline. The effect size is partly a measure of how broken the starting agent was.

25:08Eric: Right. The headline number is real, the mechanism is real — but the magnitude is partly a story about how low the floor was, not just how high the ceiling is.

25:18Bella: Which is honest, and it's why I keep coming back to the restraint of the thing as the real takeaway. In a field absolutely full of elaborate, many-stage pipelines, the recipe here is two numbers, three prompts, one fixed configuration across every benchmark, and zero retraining. And that minimal recipe matches or beats decompositions that human experts sat down and hand-engineered. The structure was latent in the agent's own behavior the whole time. Nobody had to understand the domain in advance to get it out.

25:52Eric: And the one-line description they keep using captures it: the recovered structure isn't a recapitulation of the supervision signal. It's an emergent property of how a language-model agent reasons across instances of a task family. The agent was always doing these moves. It just never got to keep them.

26:11Bella: The scratch paper was always worth keeping. This is just the first method that picks it up off the floor, reads it, and hands it back as something the agent can actually reach for. That reframe, transcripts as latent expertise instead of disposable exhaust, I think outlasts any particular benchmark number in the paper.

26:32Eric: And it leaves you with a genuinely strange picture of self-improvement. Not a smarter model. The same agent, finally trusting its own best instincts on purpose instead of by accident.

26:44Bella: That's the paper. "Inducing Reasoning Primitives from Agent Traces," from the team at Carnegie Mellon. If you want to dig in yourself, the paper and a few related reads are in the show notes.

26:57Eric: And if you want to keep going, paperdive.ai has the full transcript with every technical term defined inline, plus the concept pages that link this episode over to the others we've done on agents and self-improvement.

27:11Bella: Thanks for spending it with us. This has been AI Papers: A Deep Dive.