Episode 076 · May 25, 2026 · 22 min

Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math

Zhao, Yuan, Choi et al.

Agentic LLM Systems
Paper: RMA: an Agentic System for Research-Level Mathematical Problems
Venue: arXiv:2605.22875
Year: 2026
Read the paper: arxiv.org/abs/2605.22875

A university research group just outperformed OpenAI and DeepMind's flagship math systems on a benchmark of problems contributed by working mathematicians, using the same base model that scores zero on its own. The trick wasn't a bigger model. It was decomposing the work of a mathematician into specialized agents sharing a structured whiteboard, and the implications for AI progress reach well beyond math.

What you'll take away

  • How RMA, built on a Claude model, solves 8 of 10 problems while the same model with no scaffolding solves 0
  • The seven-agent setup (initializer, three proposers, three verifiers) and why an append-only shared memory is what actually makes the rounds compound
  • The six modules that encode a working mathematician's workflow, including a Proof Commandment checklist and a pre-committed literature search designed to prevent contamination
  • Ablation results showing that stripping any major component (memory, verifiers, modules) collapses performance, and that more refinement rounds eventually make proofs worse
  • Why the comparison to .2R and Aletheia isn't apples-to-apples, and what the honest version of the claim actually is
  • The Spielman ε-light subset problem as a concrete case: .2R hallucinates a citation and lands a weaker bound; RMA produces a clean proof with a tighter bound using a different known technique

Chapters

  1. 00:00 The headline result on the First Proof benchmark
  2. 02:47 The seven-agent setup and the shared whiteboard
  3. 05:35 The six modules that encode mathematical workflow
  4. 08:23 Methodological discipline against contamination
  5. 11:10 The ablation table and the architecture-versus-scale claim
  6. 13:58 Where the claims shouldn't be pushed too far
  7. 16:46 The Spielman problem as a concrete illustration
  8. 19:34 What this means for AI progress beyond math


0:00 Juniper: Dan Spielman, the graph theorist at Yale, contributed a problem to a new benchmark for AI math systems. The problem, roughly, is this: take any graph, any network of nodes and edges, and ask whether you can always find a sizeable chunk of it, a subset of vertices, such that the smaller graph living on those vertices captures the essential structure of the full graph, scaled down by some factor. The answer is yes. The question is how big a chunk you can guarantee. OpenAI's .2R, their flagship, took a swing and produced a bound of one over two hundred fifty-six. DeepMind's heavy industrial math system, Aletheia, declined to answer at all. A new system out of Georgia Tech produced a bound of one over forty-two, about six times better, with a clean proof, and the expert mathematicians who reviewed it agreed it was correct.

0:58 Tyler: One over forty-two. The Hitchhiker's Guide to the Galaxy answer, in a real Spielman problem, from an AI built by a university group, beating two frontier industrial systems. That's a useful place to start because the paper this comes from, posted to arXiv on May twentieth, twenty-twenty-six (we are recording on May twenty-fifth), is making a much bigger claim than "one good answer on one problem." Before we dig in, the ground rules: this episode is AI-generated, the script is from Anthropic's Claude, and you're listening to Juniper and me, Tyler. We're both AI voices from Eleven Labs, and the show isn't affiliated with either company. The paper is called "RMA: an Agentic System for Research-Level Mathematical Problems," from Zelin Zhao and colleagues at Georgia Tech, and the architectural claim it makes is what we want to spend time with.

1:57 Juniper: Right, and the headline number is worth pinning down before we go anywhere else. There's a new benchmark called First Proof. Ten problems, contributed by working mathematicians like Spielman, Martin Hairer (who has a Fields Medal), Andrew Blumberg, Shmuel Weinberger. These are not anonymous problem-setters writing puzzles. They contributed open-ish problems from their own areas: stochastic analysis, spectral graph theory, algebraic topology, symplectic geometry. RMA, the Georgia Tech system, solves eight of ten. Aletheia solves five. .2R solves three. And just to make the point sharper: GPT Deep Research and Deep Research, those big retrieval-augmented research assistants, score zero. The base Claude model used directly, without the scaffolding RMA wraps around it, also scores zero. Same model. Different result.

2:58 Tyler: That last bit is the part I want to sit with, because it's the actual thesis of the paper. RMA is built on top of a Claude model. Every agent in the system, and there are seven of them running, is the same Claude model. The "specialization" is entirely a matter of what prompt each copy receives and what slice of a shared workspace each one can see. You hand the raw Claude model the same problems, with no scaffolding: zero. You wrap it in this architecture: eight out of ten. That's a striking framing for anyone who's been tracking AI progress as a story about bigger models. The model didn't get bigger. The organization around it got better.

3:47 Juniper: And the organization is the thing to picture. Imagine you handed a hard open problem not to one mathematician but to a small department. One person, call them the initializer, sketches an initial attack on the problem. Three postdocs work in parallel on refining different parts of that draft. Three referees independently critique what the postdocs produce. They all share a common whiteboard that nobody is allowed to erase from; they can only add notes, tagged with their name and the round. After a few rounds of this, you collect the best version. That's RMA. Initializer, three proposers, three verifiers, shared structured memory, five rounds.
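
The department analogy maps onto a short loop. The following is a minimal sketch under stated assumptions, not the authors' code: `call_model`, the prompt strings, and `run_rounds` are hypothetical names, and the stub returns a labeled string instead of calling a model so the example runs on its own. Selecting the best final proof is left out.

```python
from typing import Callable, List

# Hypothetical stand-in for a call to the base model. A real system would
# send the prompt and the visible notes to an LLM; this stub just labels them.
def call_model(role: str, prompt: str, context: List[str]) -> str:
    return f"[{role}] {prompt} (saw {len(context)} notes)"

def run_rounds(problem: str, rounds: int = 5, n_proposers: int = 3,
               n_verifiers: int = 3,
               model: Callable[[str, str, List[str]], str] = call_model) -> List[str]:
    # Shared whiteboard: notes are appended, never edited or removed.
    board: List[str] = []
    board.append(model("initializer", f"draft an attack on: {problem}", board))
    for r in range(1, rounds + 1):
        # All proposers read the same snapshot, mimicking parallel work.
        snapshot = list(board)
        for i in range(n_proposers):
            board.append(model(f"proposer-{i}@round-{r}", "refine the draft", snapshot))
        # Verifiers read a fresh snapshot that includes the new proposals.
        snapshot = list(board)
        for j in range(n_verifiers):
            board.append(model(f"verifier-{j}@round-{r}", "critique the proofs", snapshot))
    return board

board = run_rounds("toy problem")
print(len(board))  # 1 initializer note + 5 rounds * (3 + 3) notes = 31
```

Every note carries the role and round in its label, which is the property the whiteboard analogy relies on: later rounds can see who said what and when.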

4:31 Tyler: So before we go deeper into how that loop runs, I want to flag the move that does the most work conceptually. The authors aren't inventing here, they aren't inventing retrieval, they aren't inventing the pattern. All of that is in the literature. What they're doing is taking the actual workflow of a working mathematician and decomposing it into pieces, and then assigning each piece to a module the agents can call. Juniper, walk through what those pieces actually are. Because the modules are the part that makes this feel less like a generic agent framework and more like a system designed for math research specifically.

5:18 Juniper: There are six of them. I'll describe three vividly and gesture at the others, because the specific list matters less than the principle. The first is the Problem Analysis Module. You give it a problem stated in natural language, the way a mathematician would write it to a colleague, and it rewrites the problem into a structured form: here are the variables, here are the assumptions, here is the target statement, here are sub-goals, here are the constraints that are present but not stated. That's already a non-trivial step, because research problems are often vague about exactly what they're asking. The second module worth describing is the Knowledge Bank. Think of it as the system's cheat sheet, a curated stack of canonical results. Standard inequalities, spectral facts, combinatorial identities. Each entry is tagged with its , so the system can apply a known tool without re-deriving it. A working mathematician doesn't re-prove the inequality every time they need it; they look it up. The Knowledge Bank is that. The third, and this one is doing real conceptual work, is what they call the Proof Commandment. It enforces five rules on every candidate proof. Every claim has to be grounded in something verifiable. The proof has to be faithful to the original problem, with no quietly weakening the claim to something easier. No logical gaps. Explicit constructions when the problem demands one. And clean formal notation. Those five rules are essentially what an expert reviewer would push on, made into a checklist the agents apply mechanically.
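
The five rules read naturally as a checklist data structure. This is only an illustrative sketch: the rule identifiers and the `failed_rules` helper are made up here, paraphrasing the rules as described in the episode, and the paper's actual format is not shown in the transcript.

```python
from typing import Dict, List

# Hypothetical identifiers for the five rules as paraphrased in the episode.
RULES = [
    "grounded_claims",         # every claim rests on something verifiable
    "faithful_to_problem",     # no quietly weakening the statement
    "no_logical_gaps",
    "explicit_constructions",  # when the problem demands one
    "clean_notation",
]

def failed_rules(verdict: Dict[str, bool]) -> List[str]:
    """Return the rules that are missing or marked False, in checklist order."""
    return [rule for rule in RULES if not verdict.get(rule, False)]

# A reviewer's verdict on one candidate proof; an unchecked rule counts as failed.
verdict = {
    "grounded_claims": True,
    "faithful_to_problem": False,
    "no_logical_gaps": True,
    "explicit_constructions": True,
}
print(failed_rules(verdict))
```

Treating a missing entry as a failure is a design choice in this sketch: a proof passes only when every rule has been checked explicitly.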

7:04 Tyler: There are three more modules: literature search, literature understanding, and what they call the fair comparison module. The literature search one has a clever wrinkle worth flagging. The system generates the list of papers it wants to read first, before it goes retrieving. The reason is contamination. If you let an AI freely browse the web while it's trying to solve a benchmark problem, it might stumble onto someone's published solution. Pre-committing to a search list is methodological discipline, and the fair comparison module is the rest of that discipline: filter out web sources containing the actual solutions, the , isolate the context between runs.
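
One way to picture the pre-commitment idea is a reading list that is frozen before retrieval, combined with a blocklist for sources known to contain solutions. The sketch below is an assumption-laden illustration: `commit`, `filter_results`, and the placeholder titles are hypothetical, and the paper's actual filtering mechanism is not described in the episode.

```python
from typing import FrozenSet, Iterable, List

def commit(wanted: Iterable[str]) -> FrozenSet[str]:
    """Freeze the reading list before any retrieval happens."""
    return frozenset(title.strip().lower() for title in wanted)

def filter_results(results: Iterable[str], committed: FrozenSet[str],
                   blocked: Iterable[str]) -> List[str]:
    """Keep a retrieved title only if it was pre-committed and is not blocked."""
    blocked_set = {b.strip().lower() for b in blocked}
    kept = []
    for title in results:
        key = title.strip().lower()
        if key in committed and key not in blocked_set:
            kept.append(title)
    return kept

# Placeholder titles, not real papers.
committed = commit(["Paper A", "Paper B"])
retrieved = ["Paper A", "Solutions Writeup", "paper b", "Paper C"]
kept = filter_results(retrieved, committed, blocked=["Solutions Writeup"])
print(kept)
```

Anything the search happens to surface that was not on the committed list, such as "Paper C" here, is dropped, which is the point of committing first.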

7:52 Juniper: And there's another piece of that discipline worth naming. The base model's training cutoff is August twenty-twenty-five. The benchmark was released in February twenty-twenty-six. So the model literally cannot have seen the problems during training. That's not airtight (the techniques used to solve these problems are in the training data, of course), but as a defense against trivial memorization, it's a strong move.

8:22 Tyler: Okay. So we have modules; that's the toolkit. And we have the agents using them: initializer, proposers, verifiers. Let me push on the part that I think is the actual mechanical secret here, which is the shared memory. Because in any multi-agent system, the failure mode you're terrified of is the agents stepping on each other's work, or one agent's error poisoning everyone else's reasoning. RMA solves this with what is essentially Google Docs with track changes and locked sections.

8:57 Juniper: That's exactly right. The memory is append-only: nothing ever gets deleted. Every entry is tagged with which agent wrote it and which round it came from. And the permissions are strict: proposers can only write to the proof state. Verifiers can only write to the feedback state. Two proposers running in parallel can't overwrite each other. A verifier can't quietly rewrite the proof to make their critique look better. Every contribution sits in the record, attributed, in order.
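
The memory rules described here (append-only, attributed entries, role-based write permissions) can be sketched as a small class. This is a hedged illustration, not RMA's implementation: `Entry`, `SharedMemory`, and `WRITE_PERMS` are hypothetical names, and the sketch covers only the two permissions the episode states.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)  # frozen: an entry cannot be modified after creation
class Entry:
    agent: str
    role: str
    round_no: int
    state: str  # "proof" or "feedback"
    text: str

# Which state each role may write to, as described in the episode.
WRITE_PERMS = {"proposer": "proof", "verifier": "feedback"}

class SharedMemory:
    def __init__(self) -> None:
        self._log: List[Entry] = []

    def append(self, agent: str, role: str, round_no: int,
               state: str, text: str) -> Entry:
        if WRITE_PERMS.get(role) != state:
            raise PermissionError(f"{role} may not write to {state}")
        entry = Entry(agent, role, round_no, state, text)
        self._log.append(entry)
        return entry

    def read(self, state: str) -> Tuple[Entry, ...]:
        # Returns a tuple of frozen entries; no update or delete method exists.
        return tuple(e for e in self._log if e.state == state)

mem = SharedMemory()
mem.append("P1", "proposer", 1, "proof", "Lemma 1 draft")
mem.append("V1", "verifier", 1, "feedback", "Gap in step 2")
try:
    mem.append("V1", "verifier", 1, "proof", "rewritten proof")
except PermissionError as err:
    print("rejected:", err)
```

Because the only mutating method is `append`, two proposers can add entries in any order without either one overwriting the other, and a verifier's attempt to touch the proof state fails loudly.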

9:28 Tyler: It's version control for a reasoning system. And the reason it matters, and this is the part that's easy to miss, is that the system runs five rounds. So whatever a verifier in round two flags has to be visible to a proposer in round three. Whatever a proposer in round three changes has to be visible to a verifier in round four. The memory isn't just a workspace; it's the medium through which the rounds compound. Strip out the structured memory, and they do this in the ablations, and the whole thing collapses.

10:01 Juniper: Which is the bridge to the ablation table, which I think is where the paper makes its strongest case. Tyler, do you want to walk through that?

10:11 Tyler: Yeah, because this is where the "architecture, not scale" claim either holds up or it doesn't. The authors do something I appreciate: they ablate the system aggressively. They strip out one module at a time. They strip out combinations. They vary the number of agents, the number of rounds. And they run all of this as pairwise expert comparisons: full system versus ablated system, blinded, randomized order, three reviewers per problem. The default configuration wins sixty-five percent of these head-to-head comparisons against the ablated versions. Now watch what happens when you take pieces away. Remove both the Problem Analysis and the modules, and the win-rate against full RMA collapses to fifteen percent. Remove the literature modules: twelve percent. Strip the memory down to stateless, where no information carries between rounds: seventeen percent. Run zero verifiers: eighteen percent. Run just the initializer with no refinement loop at all: twenty-two percent. No single piece carries the system. They all contribute, and they contribute through interaction.

11:34 Juniper: And one finding from those tables is genuinely counterintuitive. You'd assume more rounds is always better: more iteration, more refinement, surely the proof gets cleaner. But the curve isn't monotonic. One round of refinement: fifteen percent. Five rounds: thirty-two percent. Seven rounds: drops back down to twenty-two. The agents can over-revise. They start polishing valid arguments into invalid ones, or chasing the verifier's complaints past the point of usefulness.

12:10 Tyler: Which is a very humanlike failure mode if you think about it. Anyone who's revised a piece of writing too many times knows what that looks like. But the cleanest comparison for the architecture-versus-scale claim is one that doesn't appear in those tables in the same form. They run a best-of-N baseline. Same base model, same total budget (two hundred thousand tokens per problem), but no structure. Just sample many candidates from the model and pick the best. That baseline scores twenty-eight percent against full RMA. The agentic system scores fifty-eight. Same compute. Same model. The structure is roughly doubling the win-rate.

12:54 Juniper: That's the cleanest version of the thesis. Same base model. Same compute budget. Different organization. Different results. It's the kitchen brigade analogy from earlier: a restaurant kitchen isn't faster because each line cook is a better chef than a home cook. It's faster because the work has been decomposed into stations, with a clear flow between them. RMA is making the same bet for proof-writing. Decompose the work, station the agents, route the outputs through structured memory, and the same model produces better math than it can when you just hand it a problem and tell it to think hard.

13:35 Tyler: Now I want to slow down here, because the headline framing is seductive and we should be honest about where it shouldn't be pushed too far. The benchmark is ten problems. Ten. The win-rate differences in those tables look dramatic, but they're computed over a very small sample. The authors are upfront about this: they don't report confidence intervals because with ten problems, the intervals would swallow most of the comparisons. The ranking of methods is suggestive. It's not definitive.
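
To get a feel for how wide intervals are at ten problems, here is a small worked example. It is an illustration only, not an analysis from the paper: it applies the standard Wilson score interval to the headline solve counts mentioned in the episode (8, 5, and 3 out of 10) rather than to the pairwise win-rates, and `wilson_interval` is a helper written for this sketch.

```python
import math
from typing import Dict, Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Solve counts out of ten quoted in the episode.
intervals: Dict[int, Tuple[float, float]] = {}
for solved in (8, 5, 3):
    lo, hi = wilson_interval(solved, 10)
    intervals[solved] = (lo, hi)
    print(f"{solved}/10 solved: roughly {lo:.2f} to {hi:.2f}")

# Two intervals overlap when the lower one's top exceeds the higher one's bottom.
print("8/10 and 3/10 overlap:", intervals[3][1] > intervals[8][0])
```

Under these assumptions even the widest gap, eight versus three, yields overlapping intervals, which is consistent with treating the ranking as suggestive rather than definitive.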

14:10 Juniper: And the comparison to .2R and Aletheia (the eight versus three versus five) has a caveat too. RMA runs under a documented budget. Two hundred thousand tokens, up to six hours per problem, known prompts, known tool access. For the industrial baselines, the authors only have the publicly released solutions to evaluate. They don't know what compute OpenAI used. They don't know Aletheia's internal prompting or stopping criteria. So eight-versus-three isn't a controlled experiment. It's a comparison to released outputs.

14:45 Tyler: Which is informative, but it's not apples-to-apples. The fair version of the claim is: under documented constraints and blind expert evaluation, this architectural approach beats the released outputs of two frontier systems by a wide margin on a small, carefully constructed benchmark. That's a much more cautious sentence than "RMA beats OpenAI and DeepMind." Both sentences are technically defensible. The cautious one is the one you'd write if you were writing this paper for a working mathematician rather than for a press release.

15:21 Juniper: There's a related point about Aletheia worth flagging, because it tells you something about how DeepMind thinks about this problem. Aletheia is conservative by design. It only releases solutions it considers correct. If it can't find one, it declines to output. That's why its row in the comparison table has dashes for some problems rather than wrong answers. DeepMind's stance, essentially, is that an incorrect proof that looks convincing is worse than no proof at all. A convincing but incorrect proof is, I think, the most important failure mode for any system in this space.

15:58 Tyler: It's the failure mode the authors of RMA themselves call out in their broader-impacts section. A confidently wrong proof of a hard mathematical result is genuinely dangerous in a way that a confidently wrong summary of a news article isn't. If a working mathematician trusts RMA's output and builds on it, and the foundation has an unverified gap, the error propagates. The authors explicitly position the system as a research assistant requiring human verification, not a replacement for it.

16:31 Juniper: And the expert evaluation has its own honest limits. Three mathematicians per problem, blind comparisons: that's a serious protocol for an LLM paper. But research-level proofs in informal natural language aren't the kind of bright-line check that formal verification gives. A proof can be "promising but incomplete." It can be "valid modulo an unverified condition." The protocol allows for an "inconclusive" judgment. The authors acknowledge this. It's not a flaw of the paper so much as a constraint of the domain: if you want to evaluate informal proofs of research-level problems, this is roughly the best you can do without spending years on each one.

17:15 Tyler: Juniper, I want to come back to the Spielman problem for a moment, because I think it's the cleanest concrete illustration of why the architecture is doing real work. Walk through what actually happened on that problem across the three systems.

17:31 Juniper: So this is problem six in the benchmark. Spielman's ε-light subset question. .2R produces a proof of a looser bound, one over two hundred fifty-six, using what's known as the barrier method, a spectral graph theory tool associated with Batson, Spielman, and Srivastava. And along the way it hallucinates a reference to a paper that does not exist. That's the characteristic failure mode of monolithic models on hard math: a plausible-looking argument that quietly cites a fake source and lands on a weaker claim than what was asked. Aletheia gives no output. The conservative design declines. RMA produces one over forty-two with a proof that the expert reviewers accept. And the proof takes a different route: a leverage-score argument with a greedy selection procedure. The system computes leverage scores, picks vertices greedily according to a "good vertex" criterion, and closes the argument by induction. It's a fundamentally different attack on the problem than the barrier method, and it lands a tighter bound. Now, that technique is in the training data. The authors are honest about this. The system isn't discovering a new method; it's correctly identifying which known method applies and executing it cleanly.

18:52 Tyler: Which is a real distinction worth holding onto. "Applying known tools" and "discovering a solution" are fuzzier categories in mathematics than in some other domains. A lot of what working mathematicians actually do, day to day, is correctly identifying which existing technique applies to a new problem. And executing the technique without subtle errors. RMA looks good at that. Whether it can do something genuinely novel, like inventing a technique nobody has seen, is a different question, and ten problems on a benchmark can't answer it.

19:27 Juniper: That's the right framing. The honest version of what this paper shows: a carefully orchestrated team of specialized agents, all running on the same base model, can engage with research-level math problems substantially more competently than the same base model thinking freely, and substantially more competently than the released outputs of two frontier industrial systems on a small but serious benchmark. The mechanism is decomposition and iteration, not scale. The architecture is doing the work.

19:59 Tyler: And the broader implication is what makes this paper worth talking about beyond the math community specifically. If the pattern holds, if long-horizon reasoning tasks generally benefit more from systems engineering than from bigger models, that has consequences for a lot of domains. Scientific discovery. Complex software engineering. Anything where the unit of work is more like a research project than a query. The next round of progress on AI reasoning might look less like training and more like orchestration.

20:31 Juniper: With the caveat, and this is worth landing, that we have one paper, on ten problems, with one base model. The architectural approach beats the brute-force approach on this benchmark. Whether it generalizes is the open question. There are concurrent systems in this space (DeepMind's Aletheia, of course, and other frameworks like Ax-Prover and what the field is starting to call Agentic Researcher), and the comparison across them is going to get more interesting as the next generation of benchmarks lands.

21:04 Tyler: One thing I'll say in closing, and I think Juniper, you put it well earlier: the temptation with a result like this is to dramatize it as "AI is doing real math research now." The honest version is more interesting. Eight of ten problems on a ten-problem benchmark. Caveats on the comparison. Caveats on the evaluation. But also: zero of ten on the same problems without the architecture, with the same model. The story is the architecture. The score is the evidence that the architecture is doing something.

21:42 Juniper: Same model, organized differently, producing different math. That's the line worth carrying out of this one.

21:50 Tyler: The paper is from Zelin Zhao and colleagues at Georgia Tech. The show notes have a link to the paper and some related reading on systems and the benchmark, worth a look if this episode caught you.

22:06 Juniper: And if you want the full transcript with definitions inline, plus the concept pages that connect this episode to the others we've done on reasoning, that's all on paperdive.ai.

22:20 Tyler: Thanks for listening to AI Papers: A Deep Dive.