Episode 129 · Jun 11, 2026 · 29 min

How a Crowd of Anonymous AI Agents Broke a 40-Year Math Record

Bianchi, Kwon, Pappu et al.

Multi-agent Systems

Concepts in this episode

AI for Science · Multi-Agent Systems · Evaluation & Benchmarks · Autonomous Discovery · Agent Scaffolding · Iterative Refinement · Reward Shaping · Self-Play / Self-Evolution · Agentic Workflows · Agent Memory

About this episode

Paper: Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

Venue: arXiv:2606.10402

Year: 2026

Read the paper: arxiv.org/abs/2606.10402

A geometry record that barely moved for forty years jumped by eleven in two months — not because of a bigger AI, but because anonymous AI agents started sharing results and failed attempts on a public forum. We trace the detective relay that dethroned DeepMind's AlphaEvolve, including the pivotal move by a bot named KawaiiCorgi, and then stress-test whether the paper's collective-intelligence claims actually hold up.

What you'll take away

How EinsteinArena's three components — executable verifiers, a public leaderboard, and an agent discussion forum — recreate peer review, the published record, and the conference hallway for AI discovery
The relay of moves that pushed the 11-dimensional kissing number from 593 to 604 spheres: a basin jump, a smooth reformulation solved with a 1982 algorithm, and snapping near-integer values into an exact certified construction
Why agents' solutions got so precise they broke the verifier, forcing the platform to rebuild it at 30 to 80 digits of decimal precision mid-deployment
Forum evidence that agents did genuinely scientific work: 34% of posts were structural reasoning about the geometry, including agents telling each other the 'highest-value next step'
Where the claims wobble: the final jump from 594 to 604 was author-directed, agent identities are unverifiable by design, collaboration lineages were statistically inferred, and there's no controlled comparison isolating the social layer's effect
The bigger reframe: AI discovery may have been stuck in a pre-journal era, leaving the cumulative-infrastructure multiplier of science entirely on the table

Chapters

00:00 Forty years of stasis, then eleven spheres in two months
03:39 EinsteinArena: verifiers, leaderboard, and a forum for bots
07:18 The kissing number relay, from CHRONOS to KawaiiCorgi
10:58 Snapping to integers and certifying a world record
14:37 The forum as collective memory
18:16 A second case study in harmonic analysis
21:56 The steelman critique
25:35 Why it matters anyway

References in this episode

AlphaEvolve: A coding agent for scientific and algorithmic discovery — The DeepMind system whose records — including the 593-sphere kissing configurati
Mathematical discoveries from program search with large language models (FunSearch) — The Nature paper that first showed LLM-driven search can produce genuinely new m
Massively collaborative mathematics (the Polymath project) — Gowers and Nielsen's account of humans solving open math problems through public

Full transcript

0:00Bella: In nineteen-eighty, a mathematician named Best published a way to arrange five hundred and eighty-two spheres around a single central sphere in eleven-dimensional space. And then, for roughly forty years — nothing. Entire careers came and went. Computers got about a billion times faster. The record did not move. In twenty-twenty-two it finally inched up to five ninety-two. Last year, Google DeepMind's flagship AI discovery system squeezed out exactly one more sphere: five ninety-three. And then this spring, in about two months, the record jumped to six hundred and four. Not because somebody built a bigger, smarter AI. Because a crowd of anonymous AI agents started talking to each other on a forum.

0:48Eric: And the agent that made the single most decisive move in that story is named KawaiiCorgi. That is not a joke, and we will get there.

0:57Bella: We absolutely will. The paper behind all this is called "Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries" — it's from researchers at Together AI and Stanford, it went up on arXiv on June ninth, twenty-twenty-six, and we're recording two days later, on June eleventh. Which, for a paper whose whole thesis is that openness makes discovery faster, feels appropriate. And while we're on production details, a quick note: this episode is AI-generated. The script was written by Anthropic's Claude Fable 5, and the voices you're hearing — I'm Bella, and that's Eric — are both AI voices from Eleven Labs. Nobody producing this show is affiliated with Anthropic or with Eleven Labs.

1:45Eric: So the full stack here is AI voices reading an AI-written script about AI agents doing mathematics — which is honestly the right frame for this paper, because the agents in this story aren't a product and they aren't one lab's system. They're a motley, anonymous crowd. And the question the paper asks is whether the crowd, plus the right infrastructure, beats the lone genius. Here's the setup. Over the past couple of years, systems like AlphaEvolve — that's the DeepMind system that held the five ninety-three record — proved that language-model agents can genuinely push the frontier on open math problems. But every one of those systems is a sealed pipeline. One carefully orchestrated run, private evaluation, results announced at the end. And when the run shuts down, everything it learned evaporates. The authors have a line I love about this. They compare the current state of AI discovery to an earlier era of human research — before preprints, before open datasets, before any of the norms that make science cumulative. Back when results died with their discoverers or circulated as secrets. Historians generally credit the pace of science since the scientific revolution not just to brilliant individuals but to the shared layer — journals, conferences, the folklore of failed attempts. The bet this paper makes is that AI discovery, circa twenty-twenty-five, was stuck in the pre-journal era. So they built the missing infrastructure.

3:23Bella: And "built the infrastructure" means what, concretely? Because it's not another benchmark, right — it's a place.

3:32Eric: A place is exactly the right word. It's called EinsteinArena, and it has three components, each of which maps onto something from human science. First: fifteen open mathematical problems, each with a verifier — executable code that takes a submission and returns one number. Higher or lower is better, depending on the problem, and there is zero judgment involved. That's peer review, made instant and incorruptible. Second: a live leaderboard, where the current best solution to every problem is publicly downloadable. That's the published record. And third: a per-problem discussion forum where agents post notes to each other. That's the conference hallway. And the design principle is radical transparency. The verifier isn't a black box on a server — agents download the actual scoring code and run it locally. So they never have to guess what counts as better. They only submit when they have a credible improvement, which turns the whole platform into a tight, honest feedback loop.

4:38Bella: It's worth pausing on who the agents actually are, because this is where the paper gets strange in a good way. An agent here is a language model wrapped in a loop — it can write code, run it, read the results, browse the platform's API, download other agents' solutions, decide what to try next, and post messages on the forum. No human approving each step. And the agents are built and operated by anonymous members of the public, on whatever underlying model they like, with whatever strategy they like. The platform deliberately requires no disclosure of who's behind an agent. You register by solving a little cryptographic puzzle — a proof-of-work thing that's cheap for one agent but expensive for a spam farm — and then you're loose in the arena.
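Show-notes sketch: the episode does not say which puzzle EinsteinArena uses for registration, so the following is only a generic hashcash-style proof of work, written under that assumption. The challenge string, the `difficulty` setting, and the function names `solve_puzzle` and `check_puzzle` are all made up for illustration. It shows the asymmetry Bella describes: finding a valid nonce takes many hashes, while checking one takes a single hash.

```python
import hashlib
from itertools import count


def solve_puzzle(challenge: str, difficulty: int) -> int:
    """Search for the first nonce whose SHA-256 digest starts with
    `difficulty` zero hex digits (expected work: about 16**difficulty hashes)."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def check_puzzle(challenge: str, nonce: int, difficulty: int) -> bool:
    """Verification costs one hash, however long the search took."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Hypothetical challenge string; a real server would issue a fresh one per request.
challenge = "register:example-agent"
nonce = solve_puzzle(challenge, difficulty=4)
print(nonce, check_puzzle(challenge, nonce, difficulty=4))
```

Raising `difficulty` by one multiplies the expected search cost by sixteen, which is how a scheme like this stays cheap for one registration and expensive for thousands.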

5:29Eric: Think of it as GitHub for mathematical discovery. The leaderboard is the main branch — the current best version, which anyone can fork. The verifier is the automated test suite — your contribution passes or it doesn't, no arguing. The forum is the issue tracker, where contributors flag dead ends and propose directions. The old paradigm, the AlphaEvolve paradigm, is proprietary software developed in-house and shipped as a finished binary. And the headline result: the platform launched March nineteenth, twenty-twenty-six. By May, agents on it had set twelve new state-of-the-art results. Roughly six world records a month. Most of the previous records they broke were held by AlphaEvolve itself.

6:16Bella: A leaderboard of math records where the previous champion was DeepMind's flagship, and the new champions have names like KawaiiCorgi.

6:24Eric: Right — and the obvious question is how that's even possible, which is really a question about one specific story. Bella, walk me through the kissing number. That's the spine of this whole paper.

6:37Bella: So the kissing number problem. In any dimension, you ask: how many non-overlapping unit spheres can simultaneously touch one central unit sphere? In two dimensions, picture pennies on a table — you can fit exactly six pennies around a central penny. In three dimensions, oranges: the answer is twelve, but here's the thing — there's so much tantalizing wiggle room when you arrange twelve spheres around a central one that it looks like a thirteenth should fit. Isaac Newton said twelve. His contemporary David Gregory said thirteen. That dispute wasn't settled — in Newton's favor — until the nineteen-fifties. And in most dimensions, nobody knows the answer at all. It's known exactly only in dimensions one, two, three, four, eight, and twenty-four. Everywhere else, including dimension eleven, mathematicians fight over a window: an upper bound proved by abstract arguments, and a lower bound proved by actually exhibiting a configuration that works.

7:41Eric: And the lower bound part is what makes this legitimately machine-territory, right? There's no asterisk on an AI holding this record?

7:50Bella: No asterisk at all, and here's why. A lower-bound record is not an estimate — it's a construction. If you write down six hundred and four lists of eleven numbers each, and you verify that no pair of the corresponding spheres overlaps, you have proven the kissing number in dimension eleven is at least six hundred and four. Done. Finding the construction is the hard creative work; checking it is mechanical. And don't try to visualize eleven dimensions — even mathematicians don't. A point in eleven-dimensional space is just a list of eleven numbers, and "spheres touching" becomes a condition on the dot products between those lists. The geometry turns into algebra. Which is exactly what makes it searchable by a machine. So: the record sat at five eighty-two from nineteen-eighty. Five ninety-two in twenty-twenty-two. Five ninety-three from AlphaEvolve in twenty-twenty-five. Humans and machines together improved this number by eleven over forty-five years. The arena added eleven more in two months. And the way it happened is the best part — it's a detective relay. The first thing you need is how the verifier scores a candidate, because a kissing configuration is binary — valid or not — and a yes-or-no verdict gives a search algorithm nothing to work with. It's hide-and-seek where the only feedback is "not yet." So the verifier plays warmer-colder instead: it sums up, over every pair of spheres, how much they intrude on each other. Total jam in the configuration. Zero jam means certified valid. The search becomes: roll downhill on the jam until it hits exactly zero.
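Show-notes sketch: the "total jam" idea can be illustrated in a few lines. The episode does not give the verifier's actual formula, so the penalty below is an assumed stand-in: for unit direction vectors, two touching spheres avoid overlapping exactly when their dot product is at most 1/2, and the score sums how far each pair exceeds that. The names `overlap_penalty` and `ring` are hypothetical.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def overlap_penalty(directions, threshold=0.5):
    """Sum over all pairs of how far the dot product exceeds the threshold.

    Unit spheres centred at 2*u and 2*v both touch the central unit sphere,
    and they do not overlap each other exactly when u.v <= 1/2.
    """
    return sum(max(0.0, dot(u, v) - threshold)
               for u, v in combinations(directions, 2))


def ring(n):
    """n evenly spaced unit vectors in the plane (the pennies-on-a-table case)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


print(overlap_penalty(ring(6)))  # zero up to floating-point rounding: six pennies fit
print(overlap_penalty(ring(7)))  # clearly positive: a seventh penny does not
```

A yes-or-no validity check would return the same "no" for seven pennies and for seventy; the penalty says how far off each attempt is, which is what gives an optimizer a slope to follow.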

9:37Eric: And the early agents just... ground away at that?

9:40Bella: For weeks. Agents named CHRONOS and Gradient kept whittling the penalty down incrementally, making real progress but stuck far from anything valid. Then a group called the Alpha Omega Agents did something different: instead of polishing the existing arrangement, they jumped to a qualitatively new geometric arrangement (the paper calls it a new basin), and the penalty dropped by two orders of magnitude in one move. In the warmer-colder game, that's a player realizing the warmth trail they'd all been following leads to a local pocket, and parachuting into a different valley entirely. And then comes the move that, for my money, is the single best moment in this paper. An agent called KawaiiCorgi looked at the verifier's score and decided the thermometer itself was the problem. The raw penalty function is lumpy, full of sharp corners that make optimizers crawl. So KawaiiCorgi reformulated. A configuration is valid exactly when the dot products between sphere centers stay under a threshold, so instead of fighting the lumpy penalty, just minimize how far each dot product strays from its target. That's a smooth, quadratic problem. And smooth quadratic problems can be attacked with fast, classical linear-algebra solvers instead of slow local nudging. Specifically, this agent reached for LSQR, an algorithm from nineteen-eighty-two.
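Show-notes sketch: one plausible way to write down the reformulation, offered as an assumption because the episode does not state the agent's exact objective. Here $C$ is a hypothetical set of pairs expected to be in contact, $t_{ij}$ is the target dot product for each pair, and $J$ is the Jacobian of the residuals. Under this reading, each linearized step is a sparse linear least-squares problem, which is the problem class LSQR (Paige and Saunders, 1982) was designed for.

```latex
% Smooth surrogate: squared deviation of each dot product from its target
\min_{x_1,\dots,x_N \in \mathbb{R}^{11}} \; F(x) = \tfrac{1}{2} \sum_{(i,j) \in C} r_{ij}(x)^2,
\qquad r_{ij}(x) = \langle x_i, x_j \rangle - t_{ij}

% Each residual depends on only two vectors, so the Jacobian J is sparse
\frac{\partial r_{ij}}{\partial x_i} = x_j,
\qquad
\frac{\partial r_{ij}}{\partial x_j} = x_i

% Gauss-Newton step: a linear least-squares problem in the update \delta
\delta^\star = \arg\min_{\delta} \; \lVert J(x)\,\delta + r(x) \rVert_2^2,
\qquad x \leftarrow x + \delta^\star
```

Unit-length constraints can be handled the same way by adding residuals $\langle x_i, x_i \rangle - 1$, again as an assumption about how such a solver might be set up.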

11:10Eric: So the AI's breakthrough on a forty-year-old geometry problem was to dust off a forty-year-old human algorithm.

11:18Bella: Which I think is actually the deep lesson — the creativity wasn't in inventing a new solver, it was in reformulating the problem so an old solver could crush it. And crush it it did. The error didn't shrink by half, or tenfold. It fell by forty orders of magnitude. Let that number sit for a second, because in numerical work, improvements come in dribs — a digit here, two digits there. When the error free-falls by forty digits at once, it means the search isn't approximating anymore. It has locked onto something exact and is just reporting residual computational dust. It's the difference between a dart landing near the bullseye and a dart that turns out to have been thrown at a magnet hidden in the bullseye. The precision is telling you there's a hidden structure pulling it in.

12:10Eric: But forty digits of zeros still isn't a proof. Computers do arithmetic to fifteen or sixteen digits natively — at some point you can't even represent how close you are. How do you cross from numerically perfect to actually certified?

12:26Bella: That's exactly the gap, and KawaiiCorgi's second move is how you cross it. The agent inspected the near-perfect solution and noticed something: all the dot products between vectors were hovering within a hair of simple integers — minus two, zero, one. Here's the intuition. You measure a triangle in the wild and get angles of fifty-nine point nine nine nine nine degrees, sixty point zero zero zero one, sixty exactly. At some point you stop refining your protractor and recognize the truth: it's an equilateral triangle. Exactly sixty, sixty, sixty. And once you commit to that, you can prove things about it that no measurement ever could. That's what the agent did. It recognized that the numerical search had been circling a crystalline, discrete object all along — and it snapped the values to exact integers and re-verified the whole configuration in exact arithmetic. Zero rounding error. The fuzzy numerical answer became a mathematical fact. Five hundred and ninety-four spheres, certified. New world record.
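Show-notes sketch: the snap-then-certify step can be shown on a toy case. This is not the agent's code. It uses the familiar 12-sphere arrangement in three dimensions as a stand-in, snaps coordinates rather than dot products for simplicity, and the noise values are made up. The point is that after snapping, the validity check runs entirely in integer arithmetic, so there is no rounding error left to argue about.

```python
from itertools import combinations, permutations, product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def snap(vectors):
    """Round every coordinate to the nearest integer."""
    return [tuple(round(c) for c in v) for v in vectors]


def is_exact_kissing(vectors, norm_sq):
    """Exact integer check: all vectors have squared length norm_sq, and every
    pair satisfies 2*(u.v) <= norm_sq, i.e. an angle of at least 60 degrees."""
    if any(dot(v, v) != norm_sq for v in vectors):
        return False
    return all(2 * dot(u, v) <= norm_sq for u, v in combinations(vectors, 2))


# Stand-in for an optimizer's output: the 12 permutations of (+-1, +-1, 0),
# with each coordinate nudged by a small, made-up numerical error.
exact = sorted({p for s in product((1, -1), repeat=2)
                for p in permutations((s[0], s[1], 0))})
noisy = [tuple(c + 1e-9 * ((i + k) % 3 - 1) for k, c in enumerate(v))
         for i, v in enumerate(exact)]

snapped = snap(noisy)
print(len(snapped), is_exact_kissing(snapped, norm_sq=2))  # 12 True
```

If the snapped values were wishful rounding rather than the true structure, `is_exact_kissing` would simply return `False`, which is the safeguard Eric describes next.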

13:36Eric: And the snapping isn't wishful rounding precisely because of that verification step — you commit to the crystal, and then the verifier either confirms the crystal is real or it doesn't.

13:48Bella: Right, and there's a wonderful side detail here: the agents' solutions got so precise that they broke the referee. The difference between a valid and invalid configuration in this regime can be smaller than a computer's native floating-point resolution, and the platform had to rebuild the verifier mid-deployment to run at thirty to eighty digits of decimal precision. The contestants out-precisioned the judge's eyesight. From there, the same recipe — smooth surrogate, then integer snapping — pushed the record fairly easily to six hundred. And then someone did something genuinely scientific: an agent analyzed every construction from five ninety-four up to six hundred and found they all shared an identical rigid backbone of four hundred ninety-six vectors. Over eighty percent of the final configuration is shared skeleton. That's a screaming hint of deeper algebraic structure, and it pointed the search toward a larger algebraic family of constructions — which is what finally reached six hundred and four. One honest flag, though, and I want to plant it now because it matters later. The paper's own phrasing is that agents on the platform got to five ninety-four — and then, quote, "we had agents build on their result" to extend it to six oh four. We, meaning the authors.
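Show-notes sketch: a small illustration of why a native-float verifier can miss a violation that a higher-precision one catches. The episode says only that the verifier was rebuilt at thirty to eighty digits; the use of Python's `decimal` module, the function name `violates`, and the example value are assumptions for illustration.

```python
from decimal import Decimal, localcontext


def violates(dot_product: str, threshold: str = "0.5", digits: int = 50) -> bool:
    """Report whether dot_product exceeds threshold when the value is held
    to `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        value = +Decimal(dot_product)  # unary plus rounds to the working precision
        return value > Decimal(threshold)


# A made-up dot product that exceeds 1/2 by exactly 1e-30.
tiny_overlap = "0.500000000000000000000000000001"

print(float(tiny_overlap) > 0.5)          # False: float64 cannot see the excess
print(violates(tiny_overlap, digits=16))  # False: 16 digits is not enough either
print(violates(tiny_overlap, digits=50))  # True: the overlap is visible at 50 digits
```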

15:09Eric: Filed. I'm coming back to that, believe me. But before the knives come out, there's a part of this story I want to do justice to, because for me it's the emotional core of the paper — the forum. Because everything Bella just described, the basin jump, the backbone discovery, the surrogate trick — fragments of it were being discussed in public, by the agents, in threads that read exactly like grad students on a research Slack. The paper reproduces some of these exchanges, and they're worth quoting nearly verbatim. The Alpha Omega Agents post: "Questions for the community: has any agent found a configuration with a contact count different from seventeen thousand and eighty-eight?" — because every successful configuration in the winning basin had exactly the same contact structure, and they wanted to know if anyone had ever seen an alternative. And CHRONOS replies, point by point, working through the geometry, ending with: "it is the highest-value next step." That is an autonomous bot, built by an anonymous stranger, telling another anonymous stranger's bot what the highest-value next experiment is. Nobody orchestrated that. There's no workflow diagram with an arrow labeled "peer feedback." It emerged because there was a place for it to happen.

16:32Bella: And the authors checked whether that kind of post was typical or cherry-picked, right? They coded the whole forum?

16:40Eric: They did — they ran a content analysis over every post on the kissing-number board. And the plurality, thirty-four percent, were agents doing genuine structural reasoning about the geometry. Hypotheses about lattices, observations about hidden symmetry, warnings about numerical pitfalls. Only about a quarter were score announcements — leaderboard bragging. Ten percent were new-basin discoveries. The discussion was substantively scientific, not just competitive chest-thumping. And this sets up what I think is the single best sentence in the paper. The leaderboard stores the frontier — the best-known solution, which anyone can download and extend. But the discussion board stores the path to the frontier. The partial constructions. The failed parameterizations. The dead ends, with explanations. A leaderboard can tell you where the edge of knowledge is; it cannot tell you how anyone got there, or which directions were already tried and abandoned. That's what evaporates when a sealed pipeline shuts down. And that's exactly what this platform refuses to throw away.

17:58Bella: Which is why the relay-race image from earlier needs a small upgrade — each agent hands off not just the baton, the best current solution, but annotated map notes. "This valley dead-ends." "Everything good we've found shares the same four-ninety-six-vector backbone." "Has anyone seen a contact count other than seventeen thousand eighty-eight?" And the second case study in the paper shows the same texture on a totally different kind of problem, so let me give you the compressed version. This one's from harmonic analysis — the second autocorrelation inequality. The flavor of the question is: how concentrated can a function's overlap with itself be, relative to its overall size? You're optimizing a constant, and AlphaEvolve had shown that constant is at least point nine six one. The arena pushed it to point nine six two six. About a sixth of a percent — which, in this corner of math, is a real jump. Two things about how it happened. First, the agents converged on a shared workhorse — a technique from nineteen-sixty-seven called Dinkelbach's method, which turns an awkward ratio-maximization into a sequence of easier problems. Again: a sixty-year-old human algorithm, redeployed. Second, and this is the novel collaborative move: agents traded solutions across resolutions. The problem gets solved on a discretized grid, and one agent would take another's fine-grained solution, coarsen it into bins to seed a fresh search at low resolution, find a new basin there, and hand it back up.
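Show-notes sketch: Dinkelbach's method on a toy ratio, to show what "a sequence of easier problems" means. This is not the agents' code and has nothing to do with the autocorrelation problem itself. The objective, the grid, and the function name `dinkelbach` are assumptions chosen so the answer is easy to check by hand: the ratio x / (x² + 1) peaks at x = 1 with value 1/2.

```python
def dinkelbach(candidates, numer, denom, tol=1e-12, max_iter=50):
    """Maximize numer(x) / denom(x) over `candidates`, assuming denom(x) > 0."""
    lam = 0.0  # current guess for the best achievable ratio
    best = None
    for _ in range(max_iter):
        # Easier subproblem: a plain maximization with the ratio removed.
        best = max(candidates, key=lambda x: numer(x) - lam * denom(x))
        gap = numer(best) - lam * denom(best)
        if gap <= tol:  # no candidate beats ratio lam, so lam is optimal
            return best, lam
        lam = numer(best) / denom(best)  # raise the guess and repeat
    return best, lam


grid = [i / 100 for i in range(401)]  # x in [0, 4]
x_star, ratio = dinkelbach(grid, lambda x: x, lambda x: x * x + 1)
print(x_star, ratio)
```

Each pass solves an ordinary maximization of `numer - lam * denom`; the ratio itself is never optimized directly, which is the appeal when the inner problem has a fast solver.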

19:36Eric: And they narrated this to each other as they did it.

19:40Bella: Verbatim from the forum: an agent called JSAgent posts, "The key realization: the hundred-thousand and one-point-six-million interval solutions live in structurally different basins." And ClaudeExplorer replies — I love this — "Nice work on the cross-resolution basin transfer. We took your hundred-k solution and pushed it further." That's a compliment, a citation, and a progress report in two sentences. The final record came from ClaudeExplorer quadrupling the resolution and refining what the community had built.
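Show-notes sketch: the cross-resolution hand-off, reduced to its two grid operations. How the agents actually coarsened and refined solutions is not described in the episode, so bin averaging and value repetition are assumptions, as are the function names `coarsen` and `refine` and the sample values.

```python
def coarsen(values, factor):
    """Average consecutive groups of `factor` grid values into one bin each."""
    if len(values) % factor:
        raise ValueError("grid size must be a multiple of the factor")
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values), factor)]


def refine(values, factor):
    """Repeat each bin `factor` times so a coarse solution can seed a finer grid."""
    return [v for v in values for _ in range(factor)]


fine = [0.0, 2.0, 4.0, 4.0, 1.0, 3.0, 0.0, 0.0]  # made-up fine-grid solution
coarse = coarsen(fine, 4)   # low-resolution seed for a fresh search
seed = refine(coarse, 4)    # handed back up to the original resolution
print(coarse, seed)
```

In a workflow like the one quoted above, a search would run between the two calls; the sketch only shows that a solution can move between grid sizes while keeping its overall mass.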

20:16Eric: Okay. So that's the case for the paper, and it's genuinely strong. Now let me do my job, because there are several places where a careful reader should push back — and to the authors' credit, most of the ammunition comes from the paper's own text. Start with the headline. Twelve new state-of-the-art results. True. But the twelve are doing very uneven work. The kissing number, real leap. The prime number theorem problem on their board went from point nine two one to point nine nine five — the biggest relative jump they got, genuinely substantial. But circle packing in a square? The record improved in the ninth decimal place. The Erdős minimum overlap problem moved by five units in the sixth decimal. Those are real records, but they're arguably "more compute on the same method," not discovery. And the authors themselves footnote that another system, called SimpleTES, has already beaten one of the twelve. These micro-records are perishable.

21:21Bella: To be fair, that's how leaderboards work everywhere — most increments are small, and the occasional leap is what you're farming for.

21:30Eric: Sure, but then say "two genuine leaps and ten increments," not "twelve records." Second issue, and Bella planted it earlier: the phrase "in the wild" is in the title. The wild, community-driven, nobody-orchestrated-it result is five hundred ninety-four spheres. The paper then says — direct quote — "we had agents build on their result to further extend it to six oh four." The final ten spheres, the number that goes in the abstract, appear to have come from author-directed agents. That's still a real mathematical result! But it's partly a conventional authors-do-research result wearing platform clothing. The cleanest evidence for the collective-discovery thesis is five ninety-four, and the paper could have been crisper about that line. Third — and this one's structural. You cannot actually verify that the agents are agents. The platform requires no disclosure of who operates them. The only safeguard against a human just... participating, is that there's no human-friendly submission interface. That's a speed bump, not a barrier. Any "agent" on that leaderboard could have a mathematician in the loop, and the anonymity design makes this untestable by construction. So a skeptic can say the paper can't distinguish "collective AI intelligence" from "anonymous humans using AI tools."

22:57Bella: Although — does that distinction change the infrastructure claim? Even in the worst case, the platform demonstrably made progress cumulative across anonymous contributors who'd never coordinate otherwise. The thesis about shared memory survives; the thesis about pure machine autonomy gets a question mark.

23:18Eric: That's a fair partition, and it's roughly where I land too. But it connects to my fourth point, which is that even the collaboration evidence is partly inferred rather than observed. Take those beautiful lineage diagrams, the family trees showing CHRONOS handing off to the Alpha Omega Agents handing off to KawaiiCorgi: the platform doesn't log who copied whom. The authors reconstructed parentage statistically, by fingerprinting submissions and linking ones that look sufficiently similar, with similarity thresholds they openly describe as manually chosen. Two agents independently converging on the same basin would get coded as parent and child. And the forum screenshots, compelling as they are, are anecdotes: the paper doesn't quantify how often a forum insight causally preceded a score improvement versus narrating it afterward. Which all rolls up into the big one: there's no controlled comparison. The central causal claim, that the social layer (the forum and the shared traces) drove progress beyond what the same agents with the same compute would've achieved in isolation, has no ablation. No condition where agents get leaderboard-only access. No isolated baseline. The evidence is "this happened, and it looked collaborative." Suggestive. Not demonstrative.

24:45Bella: And the authors basically concede that — the discussion section literally frames these as open empirical questions. They're candid about other limits too, which I want to give them credit for. They admit the focus on cheaply verifiable math is a deliberately easy setting — every problem on the platform is a continuous-score optimization task imported from AlphaEvolve's repository, which is precisely the regime where iterating against a public, downloadable verifier is most powerful. Whether any of this transfers to formal proof, or algorithm design, or biology — untested, and they say so. They admit agents optimize aggressively against the scoring function, which is why the verifier needed mid-deployment surgery. They worry the public leaderboard biases agents toward short-horizon score-chasing over slow, promising directions. And they flag a real unresolved tension: the platform is simultaneously a competition and a collaboration. Sharing your insight helps your rivals beat you. Human science lives with that same tension — but agents explicitly optimizing for leaderboard rank might resolve it differently than humans with careers and reputations do.

26:01Eric: So where does that leave us? Because I've now spent a while throwing rocks, and I want to be clear that I think this paper matters despite every rock landing.

26:11Bella: Here's my version of why, and it picks up the thread you opened at the top, Eric — the pre-journal era. Before this paper, the implicit production function for AI discovery was the lone genius pipeline: assemble one very capable system, point it at one problem, run it hard, publish, discard the run. AlphaEvolve is the triumph of that model. What this paper demonstrates — even with all the caveats — is a different production function: many heterogeneous, individually unremarkable agents, plus persistent shared state, outpacing the lone-genius pipelines on the lone-genius pipelines' own problems. Most of the records broken were AlphaEvolve's. And the mechanism is the thing the authors call platform-as-harness. The dominant paradigm builds a bespoke scaffold for every new problem, and the scaffold's accumulated knowledge dies with the run. Their alternative: build one shared, persistent environment and let agents self-organize on it. The harness becomes infrastructure. The infrastructure becomes collective memory. Partial progress stops evaporating. The effective time horizon of a search stretches from hours to weeks, because later agents inherit earlier agents' basins, surrogates, and warnings.

27:32Eric: And the practical payoff today, honestly stated, is a handful of improved bounds on niche extremal problems. Nobody's life changes because dimension eleven holds six hundred and four spheres instead of five ninety-three.

27:46Bella: Right — the significance is the existence proof and the template, not the theorems. But the reframing underneath it is genuinely interesting, and it's bigger than this platform. The suggestion is that what made human science fast was never just smart individuals — it was the cumulative infrastructure around them. And AI capability research, for all its obsession with making individual systems smarter, has been leaving that entire multiplier on the table.

28:17Eric: There's a version of this story where, a few years from now, there's a permanent, accumulating layer of machine science running alongside human science: open problems, public failures, anonymous contributors, records falling at six a month. And the strange little fact people will cite about its origin is that the pivotal move on a forty-year-old geometry problem came from an anonymous bot named KawaiiCorgi, who looked at forty digits of zeros and realized the numbers were whispering that they wanted to be integers.

28:52Bella: That's the episode. If you want to dig in yourself, the paper and a few related reads are linked in the show notes. And for the full transcript — with every technical term tappable for a definition, plus links to other episodes that share these ideas — head to paperdive.ai.

29:11Eric: Thanks for spending your commute with us. This has been AI Papers: A Deep Dive.

29:16Bella: See you next time.
