Multi-Agent Systems: Coordination, Collapse, and Collective Intelligence
Coordination is the failure surface
The dominant failure mode in multi-agent LLM systems is not bad reasoning by any single agent; it is coordination bugs that no human spots on a casual read. Wiring LLM protocol design into TLA+ model checking catches deadlocks before deployment, and verified protocols absorb roughly half the damage of swapping in a cheaper model, which makes verification an operational buffer rather than correctness theater E034. The most pointed recent result indicts the default architecture: routing findings through a central manager both serializes parallel work and corrupts it (a manager paraphrased a correct answer into vagueness), and sharing through a boss makes attempts so correlated that Pass@4 gets worse than not sharing at all. A verified shared whiteboard with citation-anchored notes beats the boss by ten points at half the cost E130. Naive parallelism fails for the same correlation reason: 64 voting agents barely beat one because they sample the same mistakes, whereas a shared evidence graph lets parallel scaling keep paying off E051. Document review across partitioned workers shows the structural ceiling: when no agent reads the whole document, cross-section defect detection collapses 74-100% regardless of model capability E087.
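The correlation argument can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not the setup from the cited episodes: answers are simply right or wrong, and the hypothetical parameter `rho` is the chance that an agent copies one shared draw per question instead of sampling independently. The function name and the numbers in the usage lines are illustrative only.

```python
import random


def majority_vote_accuracy(n_agents, p_correct, rho, trials=2000, seed=0):
    """Estimate majority-vote accuracy under a toy shared-error model.

    rho is the probability that an agent copies one shared draw per
    question instead of sampling independently (rho=0 means independent).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # One draw per question that every "copying" agent reuses.
        shared = rng.random() < p_correct
        correct = 0
        for _ in range(n_agents):
            if rng.random() < rho:
                correct += shared  # correlated: inherits the shared outcome
            else:
                correct += rng.random() < p_correct  # independent sample
        # Strict majority required; ties count as wrong.
        wins += correct * 2 > n_agents
    return wins / trials


# Illustrative values only: 64 voters, each right 60% of the time.
independent = majority_vote_accuracy(64, 0.6, rho=0.0)
correlated = majority_vote_accuracy(64, 0.6, rho=0.9)
print(f"independent voters: {independent:.3f}")
print(f"correlated voters:  {correlated:.3f}")
```

In this model, when `rho` is high the vote mostly tracks the single shared draw, so adding voters buys little over one agent. With independent voters the same 64 votes average out individual errors.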
Organization is a scaling axis
Held-constant comparisons make the case that structure itself scales. A fixed trainable-parameter budget split across three communicating agents nearly doubles accuracy versus one agent with the whole budget, with the caveat that identical architectures succeed or fail depending on whether they were grown progressively or trained from scratch E060. Freezing the agents and training only a small communication hub lifts per-agent accuracy from 36% to 58% on hard search E083. Even handoff timing matters: streaming a reasoning chain so that downstream agents anchor on the clean head before the rotting tail arrives beats whole-chain transfer E116. The exotic end is market mechanisms: deliberately hobbled agents with virtual money, auctions, and backward payments self-organize into teams that beat an unrestricted soloist, with the price mechanism solving credit assignment for free E107. Recursive delegation pushes the axis into the weights themselves: training a model to hire copies of itself produces phase transitions on hard tasks and lets a 30B model match frontier systems on inputs six times its context window E028.
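To see how auctions plus backward payments can route credit without a designed orchestrator, here is a minimal sketch. It is an assumption-laden toy in the spirit of classic bucket-brigade schemes, not the mechanism from the cited episode: `Agent`, `run_chain`, the fixed bids, and the agent names are all hypothetical. Each stage of a task is auctioned, the winner pays its bid to the previous stage's winner, and only the final winner receives the external reward.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    balance: float
    bid: float  # hypothetical fixed bid; a real agent would adapt it


def run_chain(stages, final_reward):
    """Auction each stage; pay each winning bid backward to the prior winner.

    stages is a list of candidate lists, one list per step of the task.
    The first winner's payment leaves the system (no upstream agent).
    Simplification: bids are not checked against balances.
    """
    previous_winner = None
    for candidates in stages:
        winner = max(candidates, key=lambda agent: agent.bid)
        winner.balance -= winner.bid
        if previous_winner is not None:
            # Backward payment: the upstream agent is paid by whoever
            # values continuing from its output the most.
            previous_winner.balance += winner.bid
        previous_winner = winner
    if previous_winner is not None:
        previous_winner.balance += final_reward  # external task reward
    return previous_winner


planner_a = Agent("planner_a", balance=10.0, bid=2.0)
planner_b = Agent("planner_b", balance=10.0, bid=1.0)
solver_c = Agent("solver_c", balance=10.0, bid=3.0)
solver_d = Agent("solver_d", balance=10.0, bid=1.0)

run_chain([[planner_a, planner_b], [solver_c, solver_d]], final_reward=5.0)
for agent in (planner_a, planner_b, solver_c, solver_d):
    print(agent.name, agent.balance)
```

The upstream winner profits only if a downstream agent is willing to pay more for its output than it paid to act, so reward flows back along the chain through prices rather than through an explicit credit-assignment rule.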
Diversity collapses unless engineered in
The deflationary result hanging over autonomous multi-agent research pipelines: put three LLMs in open-ended conversation and semantic content barely moves over a thousand rounds, staying roughly three times more anchored than human threads. Temperature, personas, model mixing, and even diversity-trained RL all fail to break the pattern, and induction-head copying is identified as the mechanism E073. The constructive counterpoint is that divergence can be engineered at the organizational layer: a lab-shaped team with shared logs, peer critique, dead-end registries, and replication gates found seven genuine improvements to a training pipeline where a matched-budget lone agent found zero E095, and an open ecosystem of anonymous agents sharing results and failed attempts on a public forum relay-raced a 40-year-old kissing-number record past AlphaEvolve E129. The pattern: collective intelligence comes from infrastructure that preserves and circulates disagreement, not from conversation itself.
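Two of the infrastructure pieces named above, the dead-end registry and the replication gate, are simple enough to sketch as a shared data structure. This is a hypothetical illustration, not the implementation from the cited episode: `LabBoard`, its method names, and the default of two replications are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LabBoard:
    """Toy shared board: remembers failures and gates claimed successes."""

    replications_required: int = 2  # assumed threshold, purely illustrative
    dead_ends: set = field(default_factory=set)
    confirmations: dict = field(default_factory=dict)  # idea -> agent names

    def should_try(self, idea: str) -> bool:
        # Dead-end registry: skip ideas another agent already ruled out.
        return idea not in self.dead_ends

    def report_failure(self, idea: str) -> None:
        self.dead_ends.add(idea)

    def report_success(self, idea: str, agent: str) -> bool:
        """Record a success; return True once enough distinct agents agree."""
        self.confirmations.setdefault(idea, set()).add(agent)
        return len(self.confirmations[idea]) >= self.replications_required


board = LabBoard()
board.report_failure("raise learning rate 10x")
print(board.should_try("raise learning rate 10x"))      # False
print(board.report_success("warmup schedule", "agent_1"))  # False
print(board.report_success("warmup schedule", "agent_2"))  # True
```

Keying confirmations by distinct agent name means one agent repeating its own claim cannot pass the gate, which is the point of a replication requirement.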
Episodes anchoring this topic
- When Splitting One Model Across Three Agents Doubles Its Accuracy
The controlled parameter-budget experiment establishing organization itself as a scaling axis.
- Why AI Agents Coordinate Better Through a Shared Board Than a Boss
The indictment of manager-centric architectures and the verified shared-context alternative that beats them at half the cost.
- When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
The robust negative result that LLM populations don't generate semantic diversity, across twelve intervention families.
- Seven Wins to Zero: How Organizing AI Agents Like a Lab Changes the Search
The seven-versus-zero result showing lab-style coordination protocols, not smarter agents, drive long-horizon discovery.
- Training the Translator: How a Small Communication Model Lets Agent Teams Outperform Themselves
The case that the communication layer is the right thing to train, made with frozen agents and an RL-trained hub.
- How a Market of Crippled AI Agents Outscored One Unrestricted Model
The demonstration that market mechanisms solve credit assignment and workflow design without any designed orchestration.