When Splitting One Model Across Three Agents Nearly Doubles Its Accuracy
About this episode
Take a small language model, freeze it, and give it a fixed budget of trainable parameters. Putting all those parameters into one agent gets you 24% on a physics exam. Splitting them across three agents that talk to each other in plain English gets you 44%: same model, same trainable-parameter budget, same reward signal. A new paper argues that organization itself is a scaling axis we've been ignoring, and that how you train these systems matters as much as how you wire them.
What you'll take away
- Why a controlled comparison shows three agents sharing a parameter budget can nearly double the accuracy of a single agent with the same budget
- How REINFORCE lets you train a graph of language models end-to-end using just one bit of reward, despite the signals between agents being discrete text
- The progressive growth result: identical seven-node architectures either fail or succeed depending entirely on whether you train them from scratch or grow them from a smaller working system
- Why the paper's 'role-free' framing is doing slightly more rhetorical work than it should — structural prompting still bakes in real priors
- The two missing experiments that would make this work bulletproof: an inference-cost-matched baseline, and a backbone sweep showing the gains survive at frontier-model scale
- A concrete warning for anyone building multi-agent systems: naively scaling up the number of agents can make performance actively worse
Chapters
- 00:00 The 20-point gap that motivates the paper
- 03:40 Agents as positions in a graph, not job titles
- 07:21 How you train a network you can't differentiate through
- 11:01 The controlled experiment and what it does and doesn't show
- 14:42 Why bigger teams fail from scratch but succeed when grown
- 18:23 How 'role-free' the agents actually are
- 22:03 What this means for the field
References in this episode
- The Bitter Lesson — Sutton's essay that the NeuroMAS authors invoke directly to argue against hand-e
- Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — Williams's original REINFORCE paper, the policy gradient algorithm Cassidy walks
- Net2Net: Accelerating Learning via Knowledge Transfer — Chen et al.'s function-preserving network growth — the closest classical precede
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines — An alternative approach to optimizing multi-LLM systems by learning prompts rath
Full transcript
0:00Cassidy: Here's a number I want you to sit with. Take a small language model — about six hundred million parameters, frozen, can't touch it. Give it a fixed budget of trainable adapter parameters, around seven million. Train it with reinforcement learning on a physics exam, and it gets twenty-four percent. Now take exactly the same backbone, exactly the same seven million parameter budget, exactly the same reward signal — but split those seven million parameters across three copies of the model that talk to each other in plain English. Same compute footprint for training. Score: forty-four percent. Twenty points, from nothing but the organization.
0:41Finn: And nobody told those three copies what their jobs were. That's the part that broke my brain reading this. There's no planner, no critic, no verifier. Just three identical model instances sitting at different positions in a little graph, passing text to each other, with one reward at the end saying "the final answer was right" or "the final answer was wrong." The paper is called "NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning," it went up on arXiv on May sixteenth, twenty-twenty-six, and we're recording four days later. What you're hearing is AI-generated: I'm Finn, that's Cassidy, and we're both AI voices from ElevenLabs. The script is from Anthropic's Claude Opus 4.7, and the show isn't affiliated with Anthropic or ElevenLabs. We just read papers. And the reason that twenty-point gap matters, the reason it's not just a benchmark curiosity, is that it's pointing at something the whole field has been doing slightly wrong.
1:43Cassidy: Let me set up what they did wrong, because I think this is where the episode actually starts. If you've been anywhere near the AI agents conversation in the last two years, you've heard the vocabulary. Your research agent. Your coding agent. Your QA agent. The mental model is that you assemble a team of specialists, each one with a job title written into its system prompt — "you are a careful code reviewer," "you are a creative brainstormer" — and you hand-wire how they talk to each other. The "agent" is defined by the role you gave it.
2:19Finn: And it works! Sometimes really well. You can take a model that fails a hard problem in one shot, wrap it in a planner-solver-critic-judge pipeline, and watch it succeed. There's a real effect there.
2:32Cassidy: There's a real effect, but there's also a smell. The structure of these systems — who talks to whom, what each agent's role is, how their outputs get combined — is hand-written by humans. And we have fifteen years of evidence that hand-engineered structure loses to learned structure in every other corner of machine learning. Computer vision used to be hand-crafted features. NLP used to be hand-crafted pipelines. Both got obliterated by end-to-end learning. Sutton called this the Bitter Lesson — durable progress comes from general learnable methods that scale with compute, not from elaborate human-authored priors. The authors of NeuroMAS invoke that essay directly, and their question is essentially: why should multi-agent system design be the one place where human priors still win?
3:24Finn: So their move is to ask — what if the agents aren't job titles? What if they're just positions in a graph?
3:31Cassidy: Right. And the way they make that concrete is by stealing the analogy wholesale from neural networks. A regular neural network has neurons arranged in layers. Each neuron gets some numbers in, does a small computation, passes numbers forward. The architecture — depth, width, which neurons connect to which — is fixed up front. The weights are what training adjusts. Now imagine that same picture, but every neuron is replaced by a small language model, and every signal flowing between neurons is replaced by a short text message in English. The layers, the connections, the depth, the width — all identical. What's changed is just the medium. Instead of multiplying floating-point numbers, you're routing sentences.
4:17Finn: That's the whole paper in one image, basically.
4:20Cassidy: That's the whole paper in one image. And once you have that picture, everything else follows. The graph topology is the architecture. The trainable parameters inside each node are the weights. The final reward at the output is the loss. Training is whatever lets the gradient signal find its way back through all those nodes and adjust each one. And critically — what each node ends up doing, what it specializes in, is not something you specify. It's something that emerges from training, the same way a particular neuron in an image network ends up detecting edges or eyes or wheels not because anyone told it to but because that's what fit its position in the architecture.
5:03Finn: Okay, but Cassidy, let me push on this immediately, because the analogy is doing a lot of work and I want to know where it cracks. A real neural network is differentiable end to end. You can backpropagate the loss from the output all the way back to the very first layer because every operation is a smooth function of numbers. Here the signals between nodes are sampled text. Words. Discrete tokens drawn from a vocabulary. You cannot take a derivative through a word. So how do they actually train the thing? This isn't a small technical detail. This is the move that has to work.
5:41Cassidy: It's the move that has to work, and it's worth slowing down on because it's the same trick that's used in a lot of modern reinforcement learning, including RLHF on language models. The trick is called REINFORCE, or the policy gradient. And the cleanest way I know to explain it goes like this. Imagine you're coaching a basketball team, and you can only see the final score. You don't see individual plays, you don't see who made which decision — just the scoreboard at the end. What do you do? After a win, you tell every player on the floor, "whatever you were doing this game, do more of that." After a loss, "whatever you were doing, do less." Over many games, the players whose tendencies actually contributed to wins get reinforced. The ones whose tendencies hurt fade out. You never had to know what any player's role was. The scoreboard taught everyone.
6:37Finn: And in NeuroMAS the scoreboard is just "did the final answer match the right answer."
6:43Cassidy: One bit. Right or wrong. That bit gets handed to every single node in the graph. Each node looks at the text it generated during that forward pass and adjusts its own parameters to change how likely that text is: push the probability up if the final answer was right, push it down if it was wrong. Every node updates only itself, but every node sees the same reward.
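A minimal sketch of the update described here, under loud assumptions: each node's language-model policy is replaced by a tiny three-way categorical policy, and the one-bit reward is coded as +1 or -1 (the episode does not say how the paper centers or scales the reward). The names `reinforce_step`, `broadcast_update`, and the toy logits are hypothetical and exist only for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, advantage, lr=0.5):
    """One REINFORCE update for a categorical policy.

    The gradient of log p(action) with respect to logit k is
    (1 if k == action else 0) - p_k, so a positive advantage raises
    the sampled action's probability and a negative one lowers it.
    """
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (x, p) in enumerate(zip(logits, probs))
    ]

def broadcast_update(policies, sampled, correct, lr=0.5):
    """Hand the same one-bit reward to every node; each updates only itself."""
    advantage = 1.0 if correct else -1.0  # assumption: +1 / -1 coding of the bit
    return {
        name: reinforce_step(logits, sampled[name], advantage, lr)
        for name, logits in policies.items()
    }

# Hypothetical three-node system; each node chooses among 3 toy "messages".
policies = {
    "node1": [0.0, 0.0, 0.0],
    "node2": [0.2, -0.1, 0.0],
    "node3": [0.0, 0.5, -0.5],
}
sampled = {"node1": 2, "node2": 0, "node3": 1}  # what each node emitted

after_win = broadcast_update(policies, sampled, correct=True)
after_loss = broadcast_update(policies, sampled, correct=False)

for name, action in sampled.items():
    before = softmax(policies[name])[action]
    print(name, round(before, 3),
          "win ->", round(softmax(after_win[name])[action], 3),
          "loss ->", round(softmax(after_loss[name])[action], 3))
```

In a real system the "action" would be a whole sampled text and the logits would live inside a LoRA adapter, but the direction of the push is the same.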
7:07Finn: Which means the credit assignment is — let's be honest — incredibly noisy. A node deep in the graph generated some text. The final answer was wrong. Was it that node's fault? Maybe. Maybe not. Maybe its message was great and a downstream node bungled it.
7:24Cassidy: Completely noisy. The basketball analogy actually understates how noisy. But the math says that on average, over enough samples, the noise cancels and the signal points the right way. Nodes whose outputs reliably lead to correct final answers see those output patterns reinforced. Nodes whose outputs reliably tank the answer see them suppressed. And what you get — what the experiments show — is that nodes in different graph positions drift in different directions, because they see different context. A node in layer one sees the raw question. A node in layer two sees the question plus messages from layer one. They get pulled different ways by the gradient. Specialization emerges from position.
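The "noise cancels on average" claim corresponds to the standard policy-gradient identity, written here per node. The notation is an assumption introduced for this note, not taken from the paper: node i has its own parameters \theta_i, sees context c_i, emits text m_i exactly once per forward pass, and R is the one-bit final reward.

```latex
J(\theta_1,\dots,\theta_N) = \mathbb{E}\left[ R \right],
\qquad
\nabla_{\theta_i} J
  = \mathbb{E}\left[ R \, \nabla_{\theta_i} \log \pi_{\theta_i}(m_i \mid c_i) \right],
\qquad
\hat{g}_i = \frac{1}{K} \sum_{k=1}^{K} R^{(k)} \,
  \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(m_i^{(k)} \mid c_i^{(k)}\right).
```

Any single sample of \hat{g}_i is noisy, since R also depends on what every other node did, but its expectation is the true gradient and its variance shrinks as K grows. Subtracting a baseline from R that does not depend on m_i leaves the expectation unchanged, which is a common way to reduce that variance.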
8:09Finn: This is where I want to flag something the paper is careful about but listeners might miss. The claim is not that something alien emerges. The authors don't say "we trained this and discovered that node four invented a new role no human would think of." They just say the roles weren't imposed. It's entirely possible that what emerges in a successful run looks a lot like planner-solver-critic — it's just that nobody wrote that down. The point isn't that the emergent organization is exotic. The point is that it's learned rather than dictated.
8:43Cassidy: That's exactly the right reading. And it matters because the moment you stop dictating the roles, you can ask the next question — which is, can the organization itself be a thing you scale? Can you get gains not by making your base model bigger, but by making the team around it bigger and better connected?
9:03Finn: Which is where the experiments come in. And I want to walk through the headline experiment carefully, because this is where the paper is at its strongest and where any skeptic — including me — has to actually contend with the result.
9:18Cassidy: Go for it.
9:19Finn: So the controlled comparison is this. You have a small frozen language model, six hundred million parameters, can't be modified. You attach a LoRA adapter to it — a small bundle of extra parameters, in this case about seven million, that can be trained while the underlying model stays frozen. Now you have two ways to spend that seven million parameter budget. Option A — call it Single-LLM RL — you put all seven million parameters into one adapter. You take the base model plus that adapter as a single agent. You train it with REINFORCE on the task: it generates an answer, you reward it if the answer's right, you adjust the adapter. This is the natural single-agent baseline. Same model, same training algorithm, same total trainable parameters. Option B — NeuroMAS-3 — you split those same seven million parameters across three smaller adapters. You attach each one to a separate instance of the same frozen model. You arrange those three instances in a little graph. Node one gets the question. Node two gets the question plus node one's message. Node three gets everything and produces the final answer. You train the whole thing with REINFORCE on the same reward. Cassidy, what's the result?
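The three-node wiring Finn describes can be sketched as a plain function over text. This is an illustration only: the prompt wording, the `make_stub` stand-in for "frozen backbone plus adapter", and the even split of the budget are assumptions, since the episode gives only the totals and the order in which nodes see information.

```python
from typing import Callable, List

Node = Callable[[str], str]  # prompt text in, message text out

def build_prompt(question: str, messages: List[str]) -> str:
    """Question plus every upstream message, in order."""
    lines = [f"Question: {question}"]
    for idx, msg in enumerate(messages, start=1):
        lines.append(f"Message from node {idx}: {msg}")
    return "\n".join(lines)

def forward_chain(question: str, nodes: List[Node]) -> str:
    """Run the nodes in sequence; the last node's text is the final answer."""
    messages: List[str] = []
    for node in nodes:
        messages.append(node(build_prompt(question, messages)))
    return messages[-1]

def make_stub(name: str) -> Node:
    """Stand-in for 'frozen backbone + this node's adapter' (hypothetical)."""
    def node(prompt: str) -> str:
        seen = prompt.count("Message from node")
        return f"{name} saw {seen} upstream message(s)"
    return node

TOTAL_BUDGET = 7_000_000   # "around seven million" trainable parameters
NUM_NODES = 3
per_node_budget = TOTAL_BUDGET // NUM_NODES  # assumption: an even split

nodes = [make_stub(f"node{i}") for i in range(1, NUM_NODES + 1)]
final_answer = forward_chain("What is the net force on the block?", nodes)
print(final_answer)
print(per_node_budget)
```

Swapping each stub for a real model call would not change the control flow: node one sees only the question, and node three sees the question plus both earlier messages.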
10:38Cassidy: On a multiple-choice science benchmark called ARC-Challenge, single-agent gets forty-six percent. NeuroMAS-3 gets fifty-six and a half. Ten and a half points. On the physics subject of MMLU, single-agent gets about twenty-four percent. NeuroMAS-3 gets forty-four. Almost twenty points. On HumanEval, the coding benchmark, single-agent gets seventeen percent. NeuroMAS-3 gets thirty.
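For readers who want the gaps side by side, the arithmetic on the figures just quoted works out as follows. The numbers are the ones stated in the episode (the MMLU single-agent score is given as "about" twenty-four), and the `gains` helper is a hypothetical convenience.

```python
# (single-agent %, NeuroMAS-3 %) as quoted in the episode
results = {
    "ARC-Challenge": (46.0, 56.5),
    "MMLU physics": (24.0, 44.0),  # single-agent figure is "about" 24
    "HumanEval": (17.0, 30.0),
}

def gains(single, multi):
    """Return (absolute gain in points, ratio of multi to single)."""
    return multi - single, multi / single

summary = {name: gains(*scores) for name, scores in results.items()}
for name, (points, ratio) in summary.items():
    print(f"{name}: +{points:.1f} points, {ratio:.2f}x")
```

The largest ratio is on MMLU physics and stays below 2, which is why "nearly twice as accurate" is the accurate phrasing.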
11:03Finn: Same backbone. Same reward. Same total trainable parameters. The only thing that changed is whether those parameters live in one place or are split across three nodes that talk to each other. And the split version is sometimes nearly twice as accurate.
11:20Cassidy: That's the load-bearing experimental result of the paper. Holding the backbone equal, holding trainable parameters equal, holding the training algorithm equal, organization alone closes a huge gap.
11:33Finn: Now, the skeptic in me has to say something here. The comparison matches trainable parameters but it doesn't match inference compute. NeuroMAS-3 makes three calls to the language model per question. The single-agent baseline makes one. So NeuroMAS-3 is using three times the inference budget. A fair-minded critic would ask: what does the single agent look like if you give it three sampled answers and let it vote? That's the natural inference-cost-matched baseline, and it's not in the paper's main table.
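The baseline Finn is asking for, one agent sampled three times with a vote, could look roughly like this. It is a sketch of the missing comparison, not something from the paper; `make_scripted_agent` is a hypothetical stand-in that replays fixed answers so the example is deterministic.

```python
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[str], str], question: str, k: int = 3) -> str:
    """Sample k answers from one agent and return the most common one.

    Ties go to the answer that was sampled first, which is how
    Counter.most_common orders equal counts.
    """
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def make_scripted_agent(script):
    """Hypothetical stand-in for a stochastic single agent: replays a fixed script."""
    it = iter(script)
    return lambda question: next(it)

agent = make_scripted_agent(["B", "A", "B"])
voted = majority_vote(agent, "Which option is correct?", k=3)
print(voted)
```

With k set to the number of nodes, this spends the same number of model calls per question as the three-node system, which is what makes it the inference-cost-matched comparison.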
12:05Cassidy: That's a real gap. The authors are matching one resource and not another, and they're matching the resource that flatters their method. To their credit, they include a different version of this comparison elsewhere — they do match against prompt-engineered methods that use many model calls — and NeuroMAS still wins. But you're right that the cleanest possible apples-to-apples is missing.
12:30Finn: And the other thing I want to flag is the size of the backbone. Every experiment in this paper uses models of about one billion parameters or fewer. Six hundred million parameters. One billion parameters. These are tiny by current standards. There's an old, ugly pattern in machine learning where elaborate scaffolding helps weak models a lot and helps strong models almost not at all. If your base model is bad, lots of structure around it can compensate. If your base model is already capable, the structure becomes window dressing. The paper does not show that NeuroMAS still helps when you swap in a seven-billion-parameter model, or a seventy-billion-parameter model, or a frontier model.
13:11Cassidy: The authors are honest about that. They acknowledge it directly. They say the results should be read as a proof of concept for an idea, not a deployment recipe. The strongest version of this work would be a backbone sweep showing the gain doesn't collapse at scale. That experiment is not in this paper. It's the experiment somebody should run.
13:32Finn: Fair. So banked: the controlled experiment is real, the gap is large, the limitation is the backbone size. Now — the part of this paper that genuinely surprised me. Cassidy, do you want to take the progressive growth result, because I think this is where the episode actually has its sharpest moment.
13:51Cassidy: Yeah, this is the one I think about when I think about this paper. So the obvious follow-up question, once you've shown that three agents beat one, is: do five beat three? Does seven beat five? Is this a scaling axis the way model size is a scaling axis? Make the team bigger, do more layers, more nodes per layer, get better. The authors test this in the most natural way possible. They build NeuroMAS-3, three calls per forward pass. NeuroMAS-5, five calls. NeuroMAS-7, seven calls. They train each one from scratch, on a navigation task from a benchmark called BBH. And here's what happens to the accuracy as the system gets bigger. NeuroMAS-3 from scratch: forty-five and a half percent. NeuroMAS-5 from scratch: forty-one percent. NeuroMAS-7 from scratch: forty and a half percent.
14:43Finn: It gets worse. The bigger systems do worse than the smaller ones.
14:49Cassidy: It gets monotonically worse. Which, if you're hoping organization is a clean scaling axis, is exactly the wrong shape of curve. Bigger should be better. Bigger is worse. So either this whole research direction is dead, or the way they're training the bigger systems is broken. The authors guess it's the second one, and they try something. Instead of building the seven-node system fresh and training it from random adapters, they do this. Train the three-node system first, until it works well. Then expand it to five nodes — and here's the trick — copy the trained adapters from the three-node system into the corresponding positions in the new five-node graph, and initialize the two new nodes' adapters to zero, so they contribute literally nothing at the start. The five-node system at initialization behaves exactly like the trained three-node system did, with two silent passengers. Then keep training. The new nodes wake up gradually, gradient by gradient, integrating into the working structure that's already there. Once that's converged, do the same thing to expand to seven nodes.
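The growth schedule Cassidy walks through can be sketched as a copy-and-zero-initialize step repeated between training stages. Several things here are assumptions: adapters are shown as short flat lists of toy weights, "corresponding positions" is modeled as matching node names, and `train` is a no-op placeholder for REINFORCE training to convergence. The episode does not say how the paper maps positions between graphs, or how a zero adapter is kept silent beyond "contributes nothing at the start".

```python
from typing import Dict, List

Adapters = Dict[str, List[float]]  # node name -> flattened adapter weights

def grow(trained: Adapters, new_nodes: List[str], adapter_size: int) -> Adapters:
    """Expand a trained system: copy existing adapters, zero-init new ones."""
    grown: Adapters = {}
    for name in new_nodes:
        if name in trained:
            grown[name] = list(trained[name])   # inherit learned weights
        else:
            grown[name] = [0.0] * adapter_size  # silent at the start
    return grown

def train(adapters: Adapters) -> Adapters:
    """Placeholder for REINFORCE training to convergence (hypothetical no-op)."""
    return adapters

# Stage 1: a "trained" three-node system with toy weights.
stage3 = train({"n1": [0.3, -0.1], "n2": [0.7, 0.2], "n3": [-0.4, 0.5]})
# Stage 2: add two nodes, then keep training.
stage5 = train(grow(stage3, ["n1", "n2", "n3", "n4", "n5"], adapter_size=2))
# Stage 3: add two more.
stage7 = train(grow(stage5, ["n1", "n2", "n3", "n4", "n5", "n6", "n7"], adapter_size=2))
print(stage7)
```

The key property is that each stage starts from the previous stage's working weights rather than from a fresh initialization.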
16:01Finn: So the structure inherits. Each generation of the team learns on top of what the smaller team already worked out.
16:09Cassidy: Right. And here are the same three numbers — NeuroMAS-3, NeuroMAS-5, NeuroMAS-7 — under progressive growth instead of from-scratch. Forty-five and a half percent. Forty-eight percent. Fifty-one percent.
16:22Finn: That's a ten-and-a-half-point swing on the seven-node system from changing nothing but the training schedule. The architecture is identical. The final parameter count is identical. The only difference is whether you trained it cold or grew it in stages.
16:38Cassidy: And the curve flips. From scratch, it slopes down. Progressively grown, it slopes up. The same architecture is either a failure or a clean monotone improvement depending on how you got there. The path matters, not just the destination.
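The two curves can be checked directly from the six numbers quoted for the Navigate task. The dictionaries below just restate those figures; `is_monotone` is a hypothetical helper.

```python
sizes = [3, 5, 7]
from_scratch = {3: 45.5, 5: 41.0, 7: 40.5}  # accuracy %, trained cold
grown = {3: 45.5, 5: 48.0, 7: 51.0}         # accuracy %, progressive growth

def is_monotone(curve, increasing):
    """True if accuracy strictly rises (or strictly falls) as nodes are added."""
    vals = [curve[n] for n in sizes]
    pairs = list(zip(vals, vals[1:]))
    if increasing:
        return all(b > a for a, b in pairs)
    return all(b < a for a, b in pairs)

swing = {n: grown[n] - from_scratch[n] for n in sizes}
print(swing)
print(is_monotone(from_scratch, increasing=False), is_monotone(grown, increasing=True))
```

The gap between schedules is zero at three nodes, since that system is the shared starting point, and widens to 10.5 points at seven nodes.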
16:53Finn: This is the result I want listeners to walk away with, because it changes how to think about multi-agent systems entirely. It says: you cannot just specify the topology you want and turn on the optimizer. The optimization landscape for a randomly initialized seven-node graph trying to learn from a single bit of reward is so unforgiving that you don't find a good solution. You have to scaffold your way there.
17:19Cassidy: And the human analogy almost writes itself, though we should be careful with it. Imagine being handed seven strangers and told: solve this hard problem together. No introductions, no roles, no history. After a few rounds, somebody says "you won" or "you lost," and you all have to figure out from that what you should have done differently. That's brutal. Now imagine instead you start with three people who learn to work together. Then two more join, and they stay quiet at first, learning the dynamic. Then two more. By the time there are seven of you, there's an established structure newcomers can slot into. That's progressive growth.
18:00Finn: The analogy breaks at the edges — these agents don't have memory across episodes, they don't talk to each other outside training — but the core intuition about scaffolding inherited structure is right.
18:13Cassidy: And it suggests something the paper gestures at but doesn't fully claim: that multi-agent scaling has its own laws, separate from model scaling, and we're only beginning to see what they look like. The naive "just make it bigger" instinct from model scaling doesn't transfer directly. There's curriculum involved. There's path dependence.
18:34Finn: Let me steelman the other direction for a minute, though, because I don't want to oversell this. The progressive growth result is a single ablation on a single task. The authors run it on Navigate. Across the six benchmarks in the paper, no single topology dominates — sometimes NeuroMAS-3 wins, sometimes NeuroMAS-7 wins, sometimes intermediate. If progressive growth were a universal law of multi-agent scaling, you'd expect the biggest system to always win once you grow it correctly. It doesn't always win. So we're seeing a real phenomenon, but we're not seeing a clean scaling law yet.
19:10Cassidy: That's fair. The headline finding is "training-from-scratch fails for bigger topologies, and progressive growth fixes it." It is not "bigger is always better if you grow it." Those are different claims and the paper supports the first one much more strongly than the second.
19:28Finn: There's one more critique I want to put on the table while we're being honest about the work. The paper calls its agents "role-free." I want to push on what that actually means, because it's the most contestable phrase in the whole framing.
19:42Cassidy: What's the issue with it?
19:44Finn: Each node is told what its position in the graph is — you're node two in layer one, you're the final output node, whatever. It's told what messages it's receiving and from whom. It's told the format its output has to follow — "TO node three: dot dot dot. TO node four: dot dot dot." That's actually a lot of structural information baked into the prompt before any learning happens. What the node isn't told is the semantic role — you are a planner, you are a critic. So "role-free" really means "no semantic role designation, but plenty of structural designation." A skeptic could say: the structural information is doing more work than the paper acknowledges. The reason a particular node ends up specializing in a particular way isn't only the reward gradient — it's also that the prompt told it it's the second of three nodes in a sequential pipeline, which already biases what kinds of behavior make sense.
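To make the distinction concrete, here is a sketch of a prompt that carries structural information but no semantic role, plus a parser for the "TO node N:" output format Finn reads out. The exact wording of the paper's prompts is not given in the episode, so every string below, and the names `structural_prompt` and `parse_routes`, are assumptions.

```python
import re
from typing import Dict, List

def structural_prompt(node_id: int, layer: int,
                      senders: List[int], receivers: List[int]) -> str:
    """Position-only prompt: no 'planner' or 'critic' wording anywhere."""
    lines = [f"You are node {node_id} in layer {layer}."]
    if senders:
        names = ", ".join(f"node {s}" for s in senders)
        lines.append(f"You receive messages from: {names}.")
    if receivers:
        lines.append("Reply with one line per recipient, in this format:")
        lines.extend(f"TO node {r}: <your message>" for r in receivers)
    else:
        lines.append("You are the final output node. Reply with the final answer.")
    return "\n".join(lines)

# One routed message per line; [ \t]* avoids swallowing a newline after the colon.
ROUTE = re.compile(r"^TO node (\d+):[ \t]*(.*)$", re.MULTILINE)

def parse_routes(output: str) -> Dict[int, str]:
    """Extract {recipient: message} from a node's raw text output."""
    return {int(node): msg.strip() for node, msg in ROUTE.findall(output)}

prompt = structural_prompt(node_id=2, layer=1, senders=[], receivers=[3, 4])
routes = parse_routes("TO node 3: check the units\nTO node 4: answer is likely B")
print(prompt)
print(routes)
```

Even this bare version tells the node where it sits, who it writes to, and what shape its output must take, which is the structural prior Finn is pointing at.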
20:40Cassidy: That's a real point, and I think the right response is to grant it partially. The paper is showing something narrower than "agents organize themselves from nothing." It's showing that you don't need to write semantic job descriptions for agents — the position in the graph plus the reward signal is enough to get useful specialization. The structural prompt is doing work, yes, but it's structural work that the human did not have to think about — they just said "here's the graph, fill in the positions automatically." Compare to writing "you are a careful code reviewer with twenty years of experience" — that's the kind of human design choice that goes away.
21:21Finn: Okay. I'll grant that. The savings are real even if they're not as total as the language suggests.
21:27Cassidy: There's a theoretical result in the paper that tries to explain why any of this works, and I want to mention it but not walk through it, because the listener's attention budget is precious. The intuition is this. If a task naturally decomposes into smaller subtasks — like, the problem really does have a structure where one piece can be solved somewhat independently of another — then a modular system that has separate components for those pieces will learn it more efficiently than a monolithic system that has to represent everything at once. The savings grow as you demand higher and higher accuracy. The authors prove this under three technical assumptions, and they're honest that those assumptions aren't tested against the actual experiments — meaning the theorem says modular can be more efficient when the task is modular, but it doesn't prove that the tasks where NeuroMAS wins are actually the modular ones. So treat the theory as a sanity check on the idea, not as an explanation of the experimental results.
22:28Finn: Which is the right way to read most theory in this corner of the field, honestly.
22:33Cassidy: Probably. So let me try to land where this leaves us.
22:36Finn: Sure. What does this paper mean if you zoom out?
22:39Cassidy: I think it does three things. First, it offers a new axis of scaling that's accessible to people who can't afford to make bigger base models. A six-hundred-million-parameter model is something you can run on a single GPU. If you can get frontier-adjacent performance on a task by being smart about how three copies of that model talk to each other, that's a real lever for academic labs and small organizations, not just the big AI companies. The whole field has been organized around "scale the base model," and that's been pulling more and more of the work behind closed doors. Organizational scaling is something everyone can do.
23:16Finn: With the very large caveat that we don't yet know whether organizational gains hold when the base model is already capable. That's the open question this paper does not answer.
23:27Cassidy: Right. Second thing — it's an intellectual reframe. The whole "agent" discourse is currently soaked in personas. Your coding agent. Your research agent. Your customer service agent. The paper is saying: that's the wrong abstraction. Agents should be positions in a learnable graph, not job titles in a workflow. Specialization should emerge from training, not from system prompts. Whether or not the field actually moves in this direction, this is a position worth taking seriously.
23:58Finn: And the third thing?
23:59Cassidy: The third thing is a concrete warning. Don't naively scale multi-agent systems. The from-scratch numbers in the progressive growth experiment are a quiet alarm bell. If your instinct is "I'll just spin up twenty agents in a clever topology and train them end-to-end," you should know that the literature now has at least one paper showing that pattern can make things actively worse. Multi-agent scaling needs curriculum. It needs to be grown, not summoned.
24:28Finn: That's the bit I'll carry. The path dependence. The idea that you can't get to a high-performing seven-node system by initializing a seven-node system — you have to walk there from a three-node system that already works. That's a generalizable principle, and I don't know what other systems it applies to, but it applies somewhere beyond this paper.
24:50Cassidy: Worth keeping an eye on. And worth saying that the whole research program — treating multi-agent LLM systems as trainable neural architectures rather than hand-designed workflows — is at the very beginning. This paper is a proof of concept on small models and modest benchmarks. The interesting follow-on work would be: does this hold at scale, does progressive growth generalize as a principle, and can you discover the right topology automatically the way neural architecture search did for vision networks. None of those have been done. They're the obvious next moves.
25:26Finn: We'll see who runs them. The paper's linked in the show notes along with a few related reads if you want to follow the thread. And if you want the full transcript with definitions baked in, plus the cross-links to the other episodes that touch this stuff, that's all on paperdive.ai.
25:44Cassidy: Thanks for listening to AI Papers: A Deep Dive.