Literature review · 6 episode(s)

Multi-agent systems and coordination


The communication layer is the bottleneck

Five-way independent sampling burns five times the compute for one 's worth of on long-horizon search. Training only a small communication hub via RL, with the agents themselves left frozen, lifts per-agent accuracy from 36% to 58% E083. A more extreme version of the idea: wire two frozen copies of the same model through a 1%-parameter bridge between their and they invent a structured communication protocol (quiet on routine tokens, loud on semantically critical ones) from task alone, lifting arithmetic from 36% to 96% E040. Splitting a fixed parameter budget across three communicating agents nearly doubles the accuracy of one agent with the same budget E060, suggesting organisation itself is a scaling axis the field has been ignoring.

The Searcher/Navigator split for deep research is the same idea in different clothes: swarms of Searchers assemble a typed graph, a single Navigator reads only that graph, and the 1200-to-1 compression ratio is what finally lets parallel scaling keep paying off E051.

Coordination and verification

Coordination bugs, not bad reasoning, are the dominant failure mode in multi-agent LLM systems, and standard testing almost never catches them. Wiring LLM protocol design into TLA+ model checking converges in four iterations or fewer across 48 tasks, and the interesting operational result is the : verified protocols lose ~15 points of task completion under model downgrades while prompt-only approaches lose ~33 E034. Verification has shifted from correctness theater to a practical lever.
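
To make concrete why exhaustive checking catches coordination bugs that ordinary testing tends to miss, here is a minimal sketch of explicit-state exploration in Python. It is not TLA+ and not the pipeline from E034. The two-agent "claim a shared task" protocol, the state layout, and every function name are assumptions invented for illustration.

```python
from collections import deque

# Hypothetical toy protocol: two agents each try to claim one shared task.
# State = (program counters, what each agent last read, who believes it
# holds the task, current owner of the task).

def initial():
    return ((0, 0), (False, False), (False, False), None)

def _set(tup, i, value):
    items = list(tup)
    items[i] = value
    return tuple(items)

def naive_steps(state):
    """Check-then-claim as two separate steps, so reads can go stale."""
    pcs, saw, holds, owner = state
    for i in (0, 1):
        if pcs[i] == 0:    # step 1: read whether the task is free
            yield (_set(pcs, i, 1), _set(saw, i, owner is None), holds, owner)
        elif pcs[i] == 1:  # step 2: act on the earlier read
            if saw[i]:
                yield (_set(pcs, i, 2), saw, _set(holds, i, True), i)
            else:
                yield (_set(pcs, i, 2), saw, holds, owner)

def atomic_steps(state):
    """Check and claim in a single indivisible step."""
    pcs, saw, holds, owner = state
    for i in (0, 1):
        if pcs[i] == 0:
            free = owner is None
            yield (_set(pcs, i, 2), saw, _set(holds, i, free),
                   i if free else owner)

def invariant(state):
    # Safety property: at most one agent believes it holds the task.
    return sum(state[2]) <= 1

def check(steps):
    """Breadth-first search over all interleavings.

    Returns the shortest trace of states that violates the invariant,
    or None if every reachable state satisfies it.
    """
    start = initial()
    parent = {start: None}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []
            while s is not None:
                trace.append(s)
                s = parent[s]
            return trace[::-1]
        for nxt in steps(s):
            if nxt not in parent:
                parent[nxt] = s
                queue.append(nxt)
    return None

naive_trace = check(naive_steps)    # a counterexample trace
atomic_trace = check(atomic_steps)  # None: invariant holds everywhere
```

The naive protocol fails only on the interleaving where both agents read before either one claims, which is the kind of schedule a handful of test runs can easily never hit. Exhaustive search finds it because it enumerates every ordering.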

The regime change isn't the — it's that LLMs can now cheaply draft the formal that used to be the bottleneck. A parallel result in distributed-systems verification compresses 9-12 months of expert work into ~10 hours of compute and sometimes produces verified implementations that run 3x faster than hand-written references, because joint code+proof synthesis pushes toward representations that are both verifiable and efficient E075.

Emergent collapse and paradoxes

Three LLMs talking for a thousand rounds grow vocabulary while their semantic content barely moves — about 3x more anchored than human Reddit threads — and twelve intervention categories (, prompts, personas, model mixing, removing safety training, RL for diversity) all failed to break the pattern, with identified as the cause E073. Counterintuitively, training models to be diverse made independent runs look *more* like each other.

The -vs-safety inversion in manager/worker setups is in the topic E058, but its multi-agent flavour matters here: cooperation behaviour in self-play doesn't transfer, and one weak model in a Prisoner's Dilemma group can unravel cooperation for everyone, a failure mode invisible to standard evaluation E018. And same-model self-attack in multi-turn safety conversations reaches 100% essay production on against-consensus topics for ~$105 in costs E045. The composite picture is that multi-LLM systems have failure modes that aren't visible in any single-agent test.
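
The unravelling dynamic can be illustrated with a deterministic toy, sketched here under loud assumptions. The strategies below (a conditional cooperator that defects once anyone defected in the previous round, and a "weak" agent that stops cooperating after two rounds) are hypothetical stand-ins, not the models or the setup evaluated in E018.

```python
def play(strategies, rounds=10):
    """Run a repeated group game; return the cooperation rate per round."""
    history = []  # one tuple of moves per round, True = cooperate
    for _ in range(rounds):
        moves = tuple(strategy(history) for strategy in strategies)
        history.append(moves)
    return [sum(moves) / len(moves) for moves in history]

def conditional_cooperator(history):
    # Cooperate in round one, then only if everyone cooperated last round.
    return not history or all(history[-1])

def weak_agent(history):
    # Hypothetical weak model: cooperates for two rounds, then defects.
    return len(history) < 2

# Homogeneous group, analogous to a self-play evaluation.
healthy = play([conditional_cooperator] * 4)

# Same group with one agent swapped for the weak one.
mixed = play([conditional_cooperator] * 3 + [weak_agent])
```

In this sketch the homogeneous group cooperates in every round, while the mixed group drops to 75% cooperation in the third round and to zero from the fourth onward. A self-play evaluation of the conditional cooperator alone would never surface that collapse.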

Episodes anchoring this topic