Episode 028 · May 08, 2026 · 23 min

Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization

Gandhi, Chakraborty, Wang et al.

LLM Agents

paperdive.ai

Concepts in this episode

LLM Agents, Training Methods, AI Efficiency & Cost, Recursive Agent Optimization, Reinforcement Learning, Credit Assignment, Multi-Agent Systems, Long Context, Test-Time Compute, LLM-as-Judge, Task Decomposition, Agentic RL, Policy Gradient

About this episode

Paper: Recursive Agent Optimization
Venue: arXiv:2605.06639
Year: 2026
Read the paper: arxiv.org/abs/2605.06639

A 30-billion-parameter open model keeps pace with Claude Sonnet 4 and OpenAI's o3 on a long-context benchmark — not by being bigger, but by learning to spawn copies of itself and delegate. A new paper argues recursion shouldn't be a scaffold wrapped around frozen models; it should be a primitive the weights are actually trained to use, and the results suggest a different axis for scaling agents than bigger models or longer context.

What you'll take away

Why RAO's central move — putting recursive delegation inside the RL loop instead of around a frozen model — is the whole intellectual contribution
How rewarding average (not summed) child success teaches the model when delegation is worth it, not just how to do it
The phase-transition result on hard crafting tasks: 0% to 88% with the same 4B base model, generalizing past its training depth
How a 30B recursive agent lands within about five points of Sonnet 4 and o3 on Oolong-Real (roughly 32% versus 37%) despite a 32K-token context window and inputs up to about seven times longer
Why the same trained agent runs about 85% of its sub-agent calls concurrently on independent sub-tasks but only 1.5% on chained ones
The honest costs: RAO is up to 18x slower in wall clock on some tasks, models are trained per task family, and the strongest results come from benchmarks whose structure suits the method

Chapters

00:00 The setup: an agent that can spawn itself
02:17 The Kyoto travel example
04:35 Scaffold versus trained behavior
06:53 Local rewards and the 'average child success' trick
09:11 Baselines and variance reduction
11:28 TextCraft-Synth: phase transition on hard tasks
13:46 Oolong-Real: matching frontier models with a smaller window
16:04 Deep Dive: when recursion can't parallelize
18:22 The steelman critique
20:39 What this says about scaling

References in this episode

ADaPT: As-Needed Decomposition and Planning with Language Models — An inference-time recursive decomposition system that the RAO paper positions it
Toolformer: Language Models Can Teach Themselves to Use Tools — The canonical example of the 'train models to use scaffolds, don't just prompt t
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The original prompting trick that later became trained reasoning behavior — the
Tree of Thoughts: Deliberate Problem Solving with Large Language Models — An earlier vision of branching, tree-structured reasoning at inference time, use

Full transcript

0:00Juniper: Picture a thirty-billion-parameter open-weights model going head to head with Claude Sonnet 4 and OpenAI's o3 on a long-context benchmark. And keeping up. Not because the small model is secretly bigger or smarter — but because it learned to hire copies of itself.

0:19Finn: That paper landed on arXiv on May seventh, twenty-twenty-six; we're recording the next day. Quick note up front: the show you're hearing is AI-generated. I'm Finn, that's Juniper, and we're both AI voices from ElevenLabs. The script came from Anthropic's Claude Opus 4.7, and the producer isn't affiliated with either company. The paper, in full, is "Recursive Agent Optimization." And the reason a thirty-billion-parameter model hangs with frontier systems on long context turns out to say something kind of strange about what scale is actually buying us right now.

0:59Juniper: So the paper has a tidy acronym, RAO, for Recursive Agent Optimization, and a five-author lineup out of Carnegie Mellon and Amazon AGI Labs: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, and Graham Neubig. The setup is going to sound familiar at first. You've got an LLM agent. It can think out loud, it can call tools, it can write Python. Standard stuff. The twist is one extra capability: the agent can spawn another instance of itself. Hand it a sub-task. Give it a fresh, blank context window. Wait for it to come back with a result. And the spawned copy can do the same thing: spawn its own children, who can spawn theirs. So a single task, at runtime, dynamically grows a tree of agents. Same model at every node. Different problems at every node.

1:55Finn: And before we go anywhere near the math, the picture I want listeners to have in their head is the example the authors lead with — Figure 1 in the paper. It's a Kyoto travel-planning task. Plan a three-day trip. The root agent looks at that and decides: this splits naturally. So it spawns two children in parallel. One goes off to research cherry blossom timing. The other handles itinerary logistics. The logistics child then spawns three of its own children — find a quiet temple, find a kid-friendly afternoon activity, find dinner near the Gion district. And the temple sub-task spawns yet another child to actually run the searches. Three levels deep. Branching wherever the work splits. Each node working in its own context window, returning a structured result to its parent. The shape of the tree was not pre-specified. The model decided.

2:52Juniper: One thing worth saying about how this is actually plumbed together, because the simplicity is part of the point. The agent is what's called a Python REPL agent — it interleaves chain of thought with executing actual Python code. The whole "spawn a child" capability is just one new function the agent can call. Hand it a goal, and it goes. And because that function is asynchronous, when the agent wants to launch three children in parallel it just uses ordinary Python concurrency — asyncio dot gather, the same thing any working Python developer reaches for. There's no special orchestrator. The recursion falls out of one async function and ordinary control flow.
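A minimal sketch of the plumbing described here, in Python. The names `spawn_agent` and `DECOMPOSITION` are hypothetical, and the lookup table stands in for decisions a trained model would make on its own; the only part taken from the description is that spawning is one async function and parallel children are launched with `asyncio.gather`. The sub-goals mirror the Kyoto example.

```python
import asyncio

# Hypothetical decomposition table standing in for the model's own choices.
DECOMPOSITION = {
    "plan Kyoto trip": ["research blossom timing", "plan logistics"],
    "plan logistics": [
        "find quiet temple",
        "find kid-friendly activity",
        "find dinner near Gion",
    ],
    "find quiet temple": ["run temple searches"],
}


async def spawn_agent(goal: str, depth: int = 0) -> dict:
    """Solve `goal` in a fresh context, delegating sub-goals to child agents."""
    subgoals = DECOMPOSITION.get(goal, [])
    if not subgoals:
        await asyncio.sleep(0)  # placeholder for this node's own model and tool calls
        return {"goal": goal, "depth": depth, "children": []}
    # Launch every child concurrently; each child may spawn children of its own.
    children = await asyncio.gather(
        *(spawn_agent(sub, depth + 1) for sub in subgoals)
    )
    return {"goal": goal, "depth": depth, "children": list(children)}


def count_nodes(tree: dict) -> int:
    """Total number of agents in the execution tree."""
    return 1 + sum(count_nodes(child) for child in tree["children"])


def max_depth(tree: dict) -> int:
    """Depth of the deepest agent in the execution tree."""
    return max([tree["depth"]] + [max_depth(child) for child in tree["children"]])


tree = asyncio.run(spawn_agent("plan Kyoto trip"))
print(count_nodes(tree), max_depth(tree))
```

There is no orchestrator object in this sketch: the tree shape comes entirely from which calls each node chooses to make.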

3:36Finn: Which matters because here is where the entire intellectual contribution of the paper lives, in one distinction. Systems that do this kind of recursive delegation already exist. Claude Code has subagents. OpenAI's Codex does it. Academic systems like ADaPT and THREAD do it. In every one of those cases, recursion is a scaffold that wraps around a frozen model. The model itself was never taught when delegating helps, what to put in the brief, or how to combine results. It's the senior engineer who got promoted to manager without ever being trained how to manage. She's brilliant at solo work. She has no idea how to write a useful ticket.

4:15Juniper: What RAO does is the management training. It puts the recursive scaffold *inside* the reinforcement learning loop, so the model's weights actually learn to use it. And that one move — moving recursion from inference-time to training-time — is what unlocks the rest.

4:31Finn: That move is easy to gloss over because the diagrams look the same. The execution tree at inference looks identical whether the model was trained for it or not. The difference is not in the picture; it's in the weights. So let's be concrete about what "training for it" actually means. You roll out the agent on a task. It produces this whole tree — root, children, grandchildren. Every node generates its own trajectory of thoughts and actions and code. Every trajectory gets a score. And then you do reinforcement learning: nudge the weights so high-scoring trajectories become more likely. The tricky question is: what score do you give each node?

5:11Juniper: The obvious answer is to hand every node the root task's reward. Did the trip plan come together? Great, everyone gets a gold star. Did it fall apart? Everyone gets dinged. The problem is credit assignment. A child agent who nailed her search for a quiet Kyoto temple should not be punished because some unrelated branch of the tree mishandled the dinner reservation. You're sending noise into her gradient. You're teaching her to do worse at the thing she was actually good at.

5:45Finn: So RAO does something more careful. Each node gets a local reward — two terms. Term one: did this node solve its own assigned task? That's a direct success signal — exact verification when there's a real verifier in the environment, and an LLM-as-judge when there isn't. Term two — and this is the design choice that does the real work — a bonus tied to how well this node's immediate children did. *Average* success rate, not the sum.

6:15Juniper: That word "average" is doing a lot. Imagine you're at a company. You're evaluated on two things: did you do your own job well, and did the people you delegated to do their jobs well. If we evaluated you on the *sum* of your delegations succeeding, you'd learn to spawn ten sub-projects, hope four of them work, and farm bonus reward off the count. Mean kills that incentive. Spawn ten, get seven flops, the bonus drops. Spawn one, that one nails it, the bonus is high. It rewards delegation *quality*, not delegation *quantity*. And that's what the paper means when it talks about teaching the model not just *how* to delegate but *whether* delegation helps in the first place.
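To make the mean-versus-sum point concrete, here is a small sketch of a two-term local reward. The function name `local_reward`, the additive form, and the `bonus_weight` value are assumptions for illustration; the episode does not give the paper's exact formula.

```python
from statistics import mean
from typing import Callable, Sequence


def local_reward(
    own_success: float,
    child_successes: Sequence[float],
    bonus_weight: float = 0.5,  # assumed weighting, not taken from the paper
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Own-task success plus a bonus for how the immediate children did."""
    if not child_successes:
        return own_success  # a node that did not delegate gets no bonus term
    return own_success + bonus_weight * aggregate(child_successes)


wide = [1.0] * 3 + [0.0] * 7  # spawn ten, seven flop
narrow = [1.0]                # spawn one, it succeeds

# With the mean, careful delegation beats spray-and-pray.
print(local_reward(1.0, narrow) > local_reward(1.0, wide))
# With the sum, the incentive flips toward spawning more children.
print(local_reward(1.0, wide, aggregate=sum) > local_reward(1.0, narrow, aggregate=sum))
```

Under the sum, three successes out of ten outscore one out of one, which is exactly the count-farming incentive the average removes.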

7:02Finn: There's a second piece of math worth flagging — quickly, because the intuition is more interesting than the formula. When you do this kind of policy-gradient reinforcement learning, you don't just push toward "high reward." You push toward "higher than expected reward." So you need a baseline — a reference point that turns a raw score into "better or worse than typical." The standard trick is leave-one-out: roll out the same task several times, and for each rollout, the baseline is the average of all the *other* rollouts. The thing RAO does that's slightly weird is they apply that root-task baseline to *every* node in the tree, including sub-agents working on different problems.

7:43Juniper: Including the sub-sub-agents who aren't even doing the same kind of work as the root?

7:48Finn: Yes. Every worker — manager, contractor, sub-contractor — gets graded against how the typical root-level CEO-style attempt at this task tends to go. It's an unusual choice. The authors prove it's mathematically unbiased, with an argument that hinges on independent rollouts. The practical reason they do it is cost. They don't have to construct comparison groups for every possible sub-task. The baseline is one shared reference point for the whole tree. They also note, candidly, that this probably isn't the lowest-variance baseline you could use. They're using it because it's clean and tractable, not because it's optimal.
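A rough sketch of the shared baseline, under stated assumptions: each rollout is represented by its root-level reward plus a flat list of node rewards, and the advantage is simply node reward minus baseline. The function names and data layout are hypothetical.

```python
def leave_one_out_baselines(root_rewards: list[float]) -> list[float]:
    """For each rollout, the mean root reward of all the OTHER rollouts."""
    n = len(root_rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(root_rewards)
    return [(total - r) / (n - 1) for r in root_rewards]


def node_advantages(rollouts: list[dict]) -> list[list[float]]:
    """Subtract each rollout's root-level baseline from every node in its tree."""
    baselines = leave_one_out_baselines([r["root_reward"] for r in rollouts])
    return [
        [node_reward - baseline for node_reward in r["node_rewards"]]
        for r, baseline in zip(rollouts, baselines)
    ]


# Three rollouts of the same root task; node_rewards lists every node in the tree.
rollouts = [
    {"root_reward": 1.0, "node_rewards": [1.0, 1.5, 0.5]},
    {"root_reward": 0.0, "node_rewards": [0.0, 1.0]},
    {"root_reward": 0.0, "node_rewards": [0.0]},
]
print(leave_one_out_baselines([r["root_reward"] for r in rollouts]))
print(node_advantages(rollouts))
```

Note that the baseline for a rollout never includes that rollout's own reward, which is where the independence argument mentioned above comes in.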

8:26Juniper: There's a third gear in the optimization that I'll just gesture at. They reweight trajectories so the leaves of the tree don't drown out the root. A tree that branches widely at depth three might have eighty leaf trajectories and five root ones. If you just pool them, the gradient becomes mostly leaf-shaped. They correct for that with depth-level weighting. Two sentences of intuition; the equations are doing the same job.
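One possible reading of depth-level weighting, sketched below: give every depth level the same total weight, split evenly among the trajectories at that level. This equal-share scheme is an assumption chosen for illustration; the episode does not specify the paper's actual weights.

```python
from collections import Counter


def depth_level_weights(depths: list[int]) -> list[float]:
    """One weight per trajectory so that every depth level sums to the same total."""
    counts = Counter(depths)
    num_levels = len(counts)
    return [1.0 / (num_levels * counts[d]) for d in depths]


# Five root trajectories (depth 0) and eighty leaf trajectories (depth 3).
depths = [0] * 5 + [3] * 80
weights = depth_level_weights(depths)
root_share = sum(w for w, d in zip(weights, depths) if d == 0)
leaf_share = sum(w for w, d in zip(weights, depths) if d == 3)
print(round(root_share, 6), round(leaf_share, 6))
```

Without the reweighting, the leaves would carry 80 of 85 trajectories' worth of gradient; with it, the two levels contribute equally.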

8:51Finn: So the recipe is: a single shared model that can spawn copies of itself in an async Python harness. A local reward at every node mixing own-task success with average child success. And a cleanly chosen baseline for variance reduction. Same parameters trained across the whole tree. Worth pausing on the model sizes here, because the paper actually uses different bases for different experiments. For the synthetic crafting benchmark and for the multi-hop research benchmark, the base model is a four-billion-parameter Qwen instruct model, the small one. The thirty-billion-parameter Qwen model, the one I keep mentioning, only shows up on the long-context Oolong-Real benchmark. So when we get to the headline numbers, keep that split in mind. Now, what does it actually buy you?

9:41Juniper: This is where it gets fun. The first benchmark is something the authors built themselves — a synthetic Minecraft-style crafting task they call TextCraft-Synth. You can dial the recursion depth up and down to control difficulty. So you can train on medium-difficulty crafts and then test on much harder ones. This is the four-billion-parameter setup, to be clear. Across all tasks, with an eight-thousand-token context window, the single non-recursive agent — same four-billion base model — solves about twenty-four percent. The recursive agent solves ninety-five. And when you zoom in on the *hard* tasks specifically, the ones where the single agent's context simply isn't big enough to hold the necessary state, it goes from zero percent to eighty-eight.

10:30Finn: Zero to eighty-eight on the same four-billion base model — that's a phase transition, not an improvement curve.

10:38Juniper: And the model was only trained on medium tasks. Hard tasks are out of distribution. The recursive agent generalizes to them by recursively decomposing — the depth of the trees it generates grows with task difficulty. Easy tasks max out around one to three levels. Medium, three to six. Hard, five to ten. The training cap was depth six. At inference, on the hardest tasks, the agent goes to depth ten on its own. It learned not just *how* to delegate, but *how much*.

11:11Finn: OK but the TextCraft-Synth result has an asterisk on it, Juniper: it's a benchmark the authors designed themselves, with deliberately tunable recursive structure. A skeptic is going to say: of course a recursive method wins on a benchmark engineered to expose recursive structure. Fair point. The more interesting result is on an external benchmark called Oolong-Real. Oolong-Real is a long-context aggregation task: synthesize information across very long Dungeons and Dragons session transcripts. Inputs run from fifty-five thousand tokens up to two hundred and twenty thousand. The recursive agent's training context is capped at thirty-two thousand. Some of these inputs are seven times longer than what the agent has ever seen during training. And this is the experiment that uses the thirty-billion-parameter Qwen model, the sparse mixture-of-experts variant, as the base.
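The arithmetic behind "seven times longer" can be checked in a few lines. The token counts below are the round figures quoted in the episode; `min_chunks` and its `reserved_tokens` parameter are hypothetical, added only to show that any room set aside for instructions and results pushes the chunk count higher.

```python
import math

TRAIN_CONTEXT_TOKENS = 32_000    # round figure quoted in the episode
SHORTEST_INPUT_TOKENS = 55_000
LONGEST_INPUT_TOKENS = 220_000


def min_chunks(input_tokens: int, window_tokens: int, reserved_tokens: int = 0) -> int:
    """Fewest chunks needed so each chunk fits in one agent's context window."""
    usable = window_tokens - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than window_tokens")
    return math.ceil(input_tokens / usable)


print(LONGEST_INPUT_TOKENS / TRAIN_CONTEXT_TOKENS)               # length ratio
print(min_chunks(SHORTEST_INPUT_TOKENS, TRAIN_CONTEXT_TOKENS))   # shortest input
print(min_chunks(LONGEST_INPUT_TOKENS, TRAIN_CONTEXT_TOKENS))    # longest input
```

So even the shortest input cannot fit in a single window, and the longest needs at least seven child agents before any overhead is counted.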

12:14Juniper: Which on its face should be a disaster.

12:16Finn: It's not. The recursive agent scores about thirty-two percent average reward. Claude Sonnet 4 — given the full input in its native long context — scores thirty-seven. OpenAI's o3 scores thirty-seven. GPT-5-mini scores thirty-five. A thirty-billion-parameter open model, trained to chunk and delegate, lands in the same neighborhood as frontier closed models brute-forcing the long context. On the same task. With less than a sixth of the input window the frontier models are using.

12:51Juniper: Now, Finn, that comparison deserves some honest framing. The single-agent baseline they report against — the same thirty-billion-parameter model without recursion — has a thirty-two-thousand-token context, and the inputs are bigger than that. Of course it loses. The paper acknowledges this is a test of whether recursion overcomes context limits. It is not a test of whether recursion beats a fairly-resourced single agent of the same size. But the frontier comparison is genuinely a frontier comparison. Sonnet and o3 had the full input. The recursive thirty-billion model did not. And it kept up.

13:30Finn: There's a moment in the training curve on Oolong-Real that I want to flag, because it's one of those small storytelling details that tells you something about how the system actually works. Around training steps forty to eighty, the model briefly learns the wrong strategy. It starts copying the entire long input into the root agent's context — defeating the whole point of delegation, exhausting its window, failing the task. The training curve dips. And then, without intervention, it climbs back out. It rediscovers chunking. It relearns to delegate. The wrong strategy is unstable under the reward signal, and the right one re-emerges on its own.

14:11Juniper: There's something quietly satisfying about that. The reward landscape is shaped such that the bad strategy is a local distraction, not a trap.

14:20Finn: Now, third benchmark, and it tells a very different story about parallelism. Deep Dive is a multi-hop research benchmark, and we're back to the four-billion base model here. To answer the question, you need to do search hop one, then use that answer to formulate hop two, then use *that* to formulate hop three. The sub-problems are sequentially dependent. You cannot parallelize a relay race. You can only parallelize an assembly line.

14:48Juniper: And this is where one of my favorite numbers in the paper shows up. On TextCraft-Synth, where sub-problems are mostly independent (gather wood, gather stone, in parallel), almost eighty-five percent of the recursive agent's sub-agent calls happen concurrently. On Deep Dive, where sub-problems chain, only one and a half percent are concurrent. The agent figured out which regime it was in. Same model. Same harness. It learned to fan out when work could fan out, and to serialize when it couldn't.
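One way such a concurrency percentage could be computed from logged call times is sketched below. The definition used here (a call counts as concurrent if its time interval overlaps any other call's) is an assumption; the episode does not say how the paper measures it.

```python
def concurrent_fraction(calls: list[tuple[float, float]]) -> float:
    """Fraction of (start, end) calls that overlap in time with another call."""
    if not calls:
        return 0.0

    def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
        # Intervals that merely touch at an endpoint do not count as overlapping.
        return a[0] < b[1] and b[0] < a[1]

    concurrent = sum(
        any(overlaps(call, other) for j, other in enumerate(calls) if j != i)
        for i, call in enumerate(calls)
    )
    return concurrent / len(calls)


fan_out = [(0.0, 5.0), (0.0, 4.0), (1.0, 6.0)]  # independent sub-tasks launched together
chained = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]  # each hop waits for the previous one
print(concurrent_fraction(fan_out), concurrent_fraction(chained))
```

A fan-out workload scores near one on this measure and a strictly chained workload scores zero, which is the contrast between the two benchmarks.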

15:22Finn: And the consequence is that on hard TextCraft-Synth tasks, the recursive agent is roughly two and a half times faster in wall clock than the single agent, even though it's executing nearly three times as many total steps. The parallelism wins. But on Deep Dive specifically, on tasks both agents solve, the recursive version is about *eighteen times* slower than the single agent.

15:48Juniper: Wait — slower how?

15:50Finn: Slower in wall clock. It's executing way more total work: more agent invocations, more tokens, more reasoning. And it's getting more questions right. Success rate on Deep Dive (again, four-billion base model) goes from twenty-four percent for the single agent to forty percent for the recursive one. So it's a real capability gain. But the cost is real too. You are paying for the right answers in compute.

16:18Juniper: Which I think is the right place to start the honest accounting. Because the wall-clock speedup on parallelizable tasks is a beautiful headline. It's true. But the fuller picture is that recursion is a lever that trades model size and context length against agent invocations and total compute. Sometimes the trade is overwhelmingly worth it. Sometimes you're just buying correctness with a lot more compute.

16:46Finn: And the paper is — I want to give them credit here — pretty unusually candid about this. The conclusion explicitly raises the cost question. They write something like: how should we design surrogate sampling procedures when full recursive rollouts are too expensive to complete inside the RL loop? They're flagging that even *they* found training expensive. Their efficiency claims are measured in optimization steps, not in compute. Compute-equivalent comparisons against single-agent training would be a fairer test, and that test is not in the paper.

17:23Juniper: This is one of three or four limitations the authors lay out themselves, and they do it cleanly. Worth running through. One: RAO works best on hard or long-horizon tasks. They include an appendix experiment on an easier email-search benchmark, and on that one, recursive and single agents converge to similar performance with enough training. Recursion is not free. It's a tool whose value depends on task structure. Two: each model is trained for one task family. The TextCraft-Synth model is its own model. The Oolong-Real model is its own model. There is no generalist recursive agent here. Cross-domain generalization is named as future work. Three: their experiments mostly involve sub-tasks that are smaller versions of the parent task. Real software engineering involves *qualitatively* different sub-tasks — debugging is not a smaller version of synthesis. That heterogeneity is untested.

18:24Finn: All right, Juniper, let me put my skeptic's hat on for a minute, because I think the steelman has a couple of edges the authors don't quite address. The strongest version starts with how the benchmarks were chosen. TextCraft-Synth is theirs. They built it specifically to have controllable recursive depth and crafting structure. So the strongest evidence for their method comes from a benchmark designed to expose the structure their method exploits. Oolong-Real is external — that's a real point in their favor — but Oolong-Real is *specifically* a long-context aggregation task. Chunking is the obvious strategy. The frontier-model comparison rests on that one task type. The result is impressive on its own terms. But "thirty-billion model matches frontier" is not the same as "thirty-billion model matches frontier in general." It's "thirty-billion model matches frontier on the kind of task where chunking is the obvious right answer."

19:23Juniper: Yeah. And that connects to something subtle in the reward design. On Oolong-Real and Deep Dive, the reward signal for sub-agent success is coming from GPT-5-mini acting as a judge. Which means the recursive agent is being trained on a signal that reflects what GPT-5-mini thinks good delegation looks like. If GPT-5-mini systematically over-rewards patterns that resemble its own behavior (and there's some evidence judges do that), then the training signal isn't *quite* measuring what the paper claims it measures. That's not damning. It's a confound worth naming.

19:59Finn: A weaker but real critique: the leave-one-out baseline applied across the tree, the one we walked through earlier — the authors themselves note that this probably isn't the lowest-variance choice. They picked it for tractability. And the paper does not compare it against alternative variance-reduction schemes. So we don't know if a better baseline would matter a lot, a little, or not at all.

20:24Juniper: All right — and I think with all of that voiced, the broader picture comes out cleaner. There's been a dominant story about scaling AI agents over the last couple of years, and that story has two axes: bigger models, longer context windows. Both are real levers. Both are getting expensive. And both show diminishing returns on tasks where the bottleneck isn't *information* but *organizing the work*. This paper is a vote in a different direction. Train the model to manage itself. Make recursion a primitive the weights actually understand. And what you get back are capabilities that you genuinely cannot get from scale alone — solving tasks longer than your context window, generalizing from medium training to hard inference, and exploiting parallelism on decomposable problems.

21:14Finn: There's a line in the paper's conclusion that I think captures the meta-point well. Inference-time scaffolds should not merely be designed around models. Models should be trained to use them. That's a generalizable principle, and we've already seen it play out twice. Chain of thought started as a prompting trick — let's think step by step — and became reasoning RL. Tool use started as a scaffold and became trained behavior. Recursive delegation is the next item on that list. RAO is one instance. The principle generalizes.

21:48Juniper: It also reframes what "test-time compute scaling" can mean. The current dominant version is: spend more compute per query by making the model think longer in a single context. RAO is a different version: spend more compute per query by spawning a tree of agents, each thinking in its own scope. You're trading model invocations for context per invocation. And it turns out that trade unlocks behaviors the other knob can't reach.

22:16Finn: The question I'm left holding, Juniper, is the cross-domain one. Right now, every RAO model is a specialist. Train it for crafting, it's good at crafting. Train it for D-and-D transcripts, it's good at D-and-D transcripts. The really interesting frontier — the one Claude Code and Codex are already in production for — is generalist agents that delegate across many kinds of work. RAO doesn't quite get there yet. But it tells us, pretty convincingly, that whatever does get there will need to have been *trained* for delegation, not just prompted into it.

22:51Juniper: Show notes have a link to the paper and to related materials if this episode caught you. From AI Papers: A Deep Dive — thanks for listening.
