Episode 076 · May 25, 2026 · 22 min

Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math

Zhao, Yuan, Choi et al.

Agentic LLM Systems

Concepts in this episode

Agentic AI · AI for Science · Evaluation & Benchmarks · Multi-Agent Systems · Task Decomposition · Iterative Refinement · Agent Memory · Math Reasoning · Agent Benchmarks · Ablation Studies · Agent Scaffolding · Long-Horizon Tasks · Parallel Sampling

About this episode

Paper: RMA: an Agentic System for Research-Level Mathematical Problems
Venue: arXiv:2605.22875
Year: 2026
Read the paper: arxiv.org/abs/2605.22875

A university research group just outperformed OpenAI and DeepMind's flagship math systems on a benchmark of problems contributed by working mathematicians — using the same base model that scores zero on its own. The trick wasn't a bigger model. It was decomposing the work of a mathematician into specialized agents sharing a structured whiteboard, and the implications for AI progress reach well beyond math.

What you'll take away

How RMA, built on Claude Opus 4.6, solves 8 of 10 First Proof problems while the same model with no scaffolding solves 0
The seven-agent setup — initializer, three proposers, three verifiers — and why an append-only shared memory is what actually makes the rounds compound
The six modules that encode a working mathematician's workflow, including a Proof Commandment checklist and a pre-committed literature search designed to prevent contamination
Ablation results showing that stripping any major component (memory, verifiers, modules) collapses performance, and that adding refinement rounds eventually makes proofs worse
Why the comparison to GPT-5.2R and Aletheia isn't apples-to-apples, and what the honest version of the claim actually is
The Spielman ε-light subset problem as a concrete case: GPT-5.2R hallucinates a citation and lands a weaker bound; RMA produces a clean proof with a tighter bound using a different known technique

Chapters

00:00 The headline result on the First Proof benchmark
02:47 The seven-agent setup and the shared whiteboard
05:35 The six modules that encode mathematical workflow
08:23 Methodological discipline against contamination
11:10 The ablation table and the architecture-versus-scale claim
13:58 Where the claims shouldn't be pushed too far
16:46 The Spielman problem as a concrete illustration
19:34 What this means for AI progress beyond math

References in this episode

AlphaProof and AlphaGeometry: AI achieves silver-medal standard solving International Mathematical Olympiad problems — DeepMind's prior work on AI math reasoning, useful context for how industrial sy
Twice-Ramanujan Sparsifiers (Batson, Spielman, Srivastava) — The original barrier-method paper that GPT-5.2R reached for on the Spielman benc
Self-Refine: Iterative Refinement with Self-Feedback — A foundational paper on the proposer-verifier refinement loop that RMA's multi-r
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI — The expert-contributed math benchmark in the same spirit as First Proof, useful

Full transcript

0:00Juniper: Dan Spielman, the graph theorist at Yale, contributed a problem to a new benchmark for AI math systems. The problem, roughly, is this: take any graph — any network of nodes and edges — and ask whether you can always find a sizeable chunk of it, a subset of vertices, such that the smaller graph living on those vertices captures the essential structure of the full graph, scaled down by some factor. The answer is yes. The question is how big a chunk you can guarantee. OpenAI's GPT-5.2R, their flagship reasoning model, took a swing and produced a bound of one over two hundred fifty-six. DeepMind's heavy industrial math system, Aletheia, declined to answer at all. A new system out of Georgia Tech produced a bound of one over forty-two — six times better — with a clean proof, and the expert mathematicians who reviewed it agreed it was correct.

0:58Tyler: One over forty-two. The Hitchhiker's Guide to the Galaxy answer, in a real Spielman problem, from an AI built by a university group, beating two frontier industrial systems. That's a useful place to start because the paper this comes from, posted to arXiv on May twentieth, twenty-twenty-six (we are recording on May twenty-fifth), is making a much bigger claim than "one good answer on one problem." Before we dig in, the ground rules: this episode is AI-generated, the script is from Anthropic's Claude Opus 4.7, and you're listening to Juniper and me, Tyler. We're both AI voices from ElevenLabs, and the show isn't affiliated with either company. The paper is called "RMA: an Agentic System for Research-Level Mathematical Problems," from Zelin Zhao and colleagues at Georgia Tech, and the architectural claim it makes is what we want to spend time with.

1:57Juniper: Right — and the headline number worth pinning down before we go anywhere else. There's a new benchmark called First Proof. Ten problems, contributed by working mathematicians like Spielman, Martin Hairer — who has a Fields Medal — Andrew Blumberg, Shmuel Weinberger. These are not anonymous problem-setters writing puzzles. They contributed open-ish problems from their own areas: stochastic analysis, spectral graph theory, algebraic topology, symplectic geometry. RMA — the Georgia Tech system — solves eight of ten. Aletheia solves five. GPT-5.2R solves three. And just to make the point sharper: GPT Deep Research and Gemini Deep Research, those big retrieval-augmented research assistants — zero. Claude Opus 4.6 used directly, without the agentic scaffolding RMA wraps around it — also zero. Same model. Different result.

2:58Tyler: That last bit is the part I want to sit with, because it's the actual thesis of the paper. RMA is built on top of Claude Opus 4.6. Every agent in the system — and there are seven of them running — is the same Claude model. The "specialization" is entirely a matter of what prompt each copy receives and what slice of a shared workspace each one can see. You hand the raw Claude model the same problems, with no scaffolding — zero. You wrap it in this architecture — eight out of ten. That's a striking framing for anyone who's been tracking AI progress as a story about bigger models. The model didn't get bigger. The organization around it got better.

3:47Juniper: And the organization is the thing to picture. Imagine you handed a hard open problem not to one mathematician but to a small department. One person — call them the initializer — sketches an initial attack on the problem. Three postdocs work in parallel on refining different parts of that draft. Three referees independently critique what the postdocs produce. They all share a common whiteboard that nobody is allowed to erase from — they can only add notes, tagged with their name and the round. After a few rounds of this, you collect the best version. That's RMA. Initializer, three proposers, three verifiers, shared structured memory, five rounds.
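
The round structure Juniper describes can be sketched in a few lines. This is a minimal illustration and not the authors' code: `run_rounds`, the `Note` tuple, and the three callables are hypothetical names, the callables stand in for prompted copies of the same model, and "parallel" work is simulated by giving every proposer the same snapshot of the board. How the best version is finally selected is not described in the episode, so the sketch simply returns the full board.

```python
from typing import Callable, List, Tuple

# Each whiteboard note is (round, author, text); notes are only ever appended.
Note = Tuple[int, str, str]


def run_rounds(
    problem: str,
    initializer: Callable[[str], str],
    proposer: Callable[[int, List[Note]], str],
    verifier: Callable[[int, List[Note]], str],
    n_proposers: int = 3,
    n_verifiers: int = 3,
    n_rounds: int = 5,
) -> List[Note]:
    # Round 0: the initializer sketches a first attack on the problem.
    board: List[Note] = [(0, "initializer", initializer(problem))]
    for rnd in range(1, n_rounds + 1):
        # Proposers all read the same snapshot, so none of them sees another
        # proposer's work from the current round (a stand-in for parallelism).
        snapshot = list(board)
        board += [
            (rnd, f"proposer-{i}", proposer(i, snapshot)) for i in range(n_proposers)
        ]
        # Verifiers read the board including this round's proposals.
        snapshot = list(board)
        board += [
            (rnd, f"verifier-{j}", verifier(j, snapshot)) for j in range(n_verifiers)
        ]
    return board


# Toy stand-ins for the model calls (hypothetical, for illustration only).
board = run_rounds(
    "toy problem",
    initializer=lambda p: f"draft for {p}",
    proposer=lambda i, notes: f"refinement {i} after {len(notes)} notes",
    verifier=lambda j, notes: f"critique {j} after {len(notes)} notes",
)
print(len(board))  # 1 + 5 * (3 + 3) = 31
```

With the defaults, the board ends with one initializer note plus six notes per round over five rounds.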

4:31Tyler: So before we go deeper into how that loop runs, I want to flag the move that does the most work conceptually. The authors aren't inventing chain-of-thought reasoning here, they aren't inventing retrieval, they aren't inventing the proposer-verifier pattern. All of that is in the literature. What they're doing is taking the actual workflow of a working mathematician and decomposing it into pieces — and then assigning each piece to a module the agents can call. Juniper, walk through what those pieces actually are. Because the modules are the part that makes this feel less like a generic agent framework and more like a system designed for math research specifically.

5:18Juniper: There are six of them. I'll describe three vividly and gesture at the others, because the specific list matters less than the principle. The first is the Problem Analysis Module. You give it a problem stated in natural language — the way a mathematician would write it to a colleague — and it rewrites the problem into a structured form: here are the variables, here are the assumptions, here is the target statement, here are sub-goals, here are the constraints that are present but not stated. That's already a non-trivial step, because research problems are often vague about exactly what they're asking. The second module worth describing is the Knowledge Bank. Think of it as the system's cheat sheet — a curated stack of canonical results. Standard inequalities, spectral facts, combinatorial identities. Each entry is tagged with its preconditions, so the system can apply a known tool without re-deriving it. A working mathematician doesn't re-prove the Cauchy-Schwarz inequality every time they need it; they look it up. The Knowledge Bank is that. The third — and this one is doing real conceptual work — is what they call the Proof Commandment Module. It enforces five rules on every candidate proof. Every claim has to be grounded in something verifiable. The proof has to be faithful to the original problem — no quietly weakening the claim to something easier. No logical gaps. Explicit constructions when the problem demands one. And clean formal notation. Those five rules are essentially what an expert reviewer would push on, made into a checklist the verifier agents apply mechanically.
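
The Knowledge Bank idea, where each stored result carries the preconditions under which it may be applied, can be illustrated with a small data structure. This is a sketch under assumptions: the `KnownResult` class, the `applicable` helper, and the two sample entries are hypothetical and chosen only for illustration. The episode does not describe how the paper actually stores or matches entries.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class KnownResult:
    name: str
    statement: str
    preconditions: FrozenSet[str]  # facts that must already be established


# Two illustrative entries; the real bank's contents are not given in the episode.
BANK: List[KnownResult] = [
    KnownResult(
        "Cauchy-Schwarz",
        "|<u, v>| <= ||u|| * ||v||",
        frozenset({"inner product space"}),
    ),
    KnownResult(
        "AM-GM (two terms)",
        "(a + b) / 2 >= sqrt(a * b)",
        frozenset({"a >= 0", "b >= 0"}),
    ),
]


def applicable(bank: List[KnownResult], established: Set[str]) -> List[KnownResult]:
    # A result may be used only if every one of its preconditions is established.
    return [r for r in bank if r.preconditions <= established]


usable = applicable(BANK, {"inner product space", "a >= 0"})
print([r.name for r in usable])  # ['Cauchy-Schwarz']
```

AM-GM is withheld in the example because only one of its two preconditions has been established, which is the point of tagging entries this way.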

7:04Tyler: There are three more modules — literature search, literature understanding, and what they call the fair comparison module. The literature search one has a clever wrinkle worth flagging. The system generates the list of papers it wants to read first, before it goes retrieving. The reason is contamination. If you let an AI freely browse the web while it's trying to solve a benchmark problem, it might stumble onto someone's published solution. Pre-committing to a search list is methodological discipline — and the fair comparison module is the rest of that discipline: filter out web sources containing the actual First Proof solutions, sandbox the tool use, isolate the context between runs.
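
One way to picture the pre-commitment Tyler describes is a gate that is fixed before any retrieval happens. The sketch below is an assumption about how such a gate could look, not the paper's mechanism: `PrecommittedSearch`, `may_fetch`, the sample title, and the blocked domain `solutions.example` are all hypothetical.

```python
from typing import FrozenSet, Iterable


def _norm(title: str) -> str:
    # Case-insensitive match that ignores repeated whitespace.
    return " ".join(title.lower().split())


class PrecommittedSearch:
    """Freeze the reading list first, then allow only fetches that match it."""

    def __init__(self, wanted_titles: Iterable[str], blocked_domains: Iterable[str]) -> None:
        self._wanted: FrozenSet[str] = frozenset(_norm(t) for t in wanted_titles)
        self._blocked: FrozenSet[str] = frozenset(d.lower() for d in blocked_domains)

    def may_fetch(self, title: str, domain: str) -> bool:
        # Allowed only if the title was committed up front AND the source is
        # not on the list of sites known to host benchmark solutions.
        return _norm(title) in self._wanted and domain.lower() not in self._blocked


search = PrecommittedSearch(
    wanted_titles=["Twice-Ramanujan Sparsifiers"],  # illustrative entry
    blocked_domains=["solutions.example"],          # hypothetical blocked site
)
print(search.may_fetch("twice-ramanujan  sparsifiers", "arxiv.org"))         # True
print(search.may_fetch("Twice-Ramanujan Sparsifiers", "solutions.example"))  # False
print(search.may_fetch("Some paper found while browsing", "arxiv.org"))      # False
```

The third call is the contamination case: a paper the system stumbles on mid-search was never committed, so it is refused.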

7:52Juniper: And there's another piece of that discipline worth naming. The base Claude model's training cutoff is August twenty-twenty-five. The First Proof benchmark was released in February twenty-twenty-six. So the model literally cannot have seen the problems during pretraining. That's not airtight — the techniques used to solve these problems are in the training data, of course — but as a defense against trivial memorization, it's a strong move.

8:22Tyler: Okay. So we have modules — that's the toolkit. And we have the agents using them — initializer, proposers, verifiers. Let me push on the part that I think is the actual mechanical secret here, which is the shared memory. Because in any multi-agent system, the failure mode you're terrified of is the agents stepping on each other's work, or one agent's hallucination poisoning everyone else's reasoning. RMA solves this with what is essentially Google Docs with track changes and locked sections.

8:57Juniper: That's exactly right. The memory is append-only — nothing ever gets deleted. Every entry is tagged with which agent wrote it and which round it came from. And the permissions are strict: proposers can only write to the proof state. Verifiers can only write to the feedback state. Two proposers running in parallel can't overwrite each other. A verifier can't quietly rewrite the proof to make their critique look better. Every contribution sits in the record, attributed, in order.
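
The three properties Juniper lists (append-only, attributed entries, role-restricted writes) map onto a very small data structure. This is a minimal sketch, not the paper's implementation: the class and field names are hypothetical, and the initializer is left out because its write permissions are not described in the episode.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an entry cannot be edited once written
class Entry:
    round: int
    author: str
    role: str
    section: str
    text: str


# Which section each role may write to, as described in the episode.
WRITABLE = {"proposer": "proof", "verifier": "feedback"}


class SharedMemory:
    def __init__(self) -> None:
        self._log: List[Entry] = []

    def append(self, round: int, author: str, role: str, section: str, text: str) -> None:
        if WRITABLE.get(role) != section:
            raise PermissionError(f"role {role!r} may not write to {section!r}")
        self._log.append(Entry(round, author, role, section, text))

    def read(self, section: Optional[str] = None) -> Tuple[Entry, ...]:
        # Readers get an immutable tuple; there is no delete or update method.
        return tuple(e for e in self._log if section is None or e.section == section)


memory = SharedMemory()
memory.append(1, "proposer-0", "proposer", "proof", "Lemma 1: ...")
memory.append(1, "verifier-2", "verifier", "feedback", "Lemma 1 skips a case.")
try:
    memory.append(2, "verifier-2", "verifier", "proof", "quiet rewrite")
except PermissionError as err:
    print(err)  # role 'verifier' may not write to 'proof'
```

Because entries are only appended and carry their round and author, a proposer in a later round can read exactly which verifier raised which objection and when.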

9:28Tyler: It's version control for a reasoning system. And the reason it matters — this is the part that's easy to miss — is that the system runs five rounds. So whatever a verifier in round two flags has to be visible to a proposer in round three. Whatever a proposer in round three changes has to be visible to a verifier in round four. The memory isn't just a workspace; it's the medium through which the rounds compound. Strip out the structured memory — and they do this in the ablations — and the whole thing collapses.

10:01Juniper: Which is the bridge to the ablation table, which I think is where the paper makes its strongest case. Tyler, do you want to walk through that?

10:11Tyler: Yeah, because this is where the "architecture, not scale" claim either holds up or it doesn't. The authors do something I appreciate: they ablate the system aggressively. They strip out one module at a time. They strip out combinations. They vary the number of agents, the number of rounds. And they run all of this as pairwise expert comparisons — full system versus ablated system, blinded, randomized order, three reviewers per problem. The default RMA configuration wins sixty-five percent of these head-to-head comparisons against the ablated versions. Now watch what happens when you take pieces away. Remove both the Problem Analysis and the Knowledge Bank modules — the win-rate against full RMA collapses to fifteen percent. Remove the literature modules — twelve percent. Strip the memory down to stateless, where no information carries between rounds — seventeen percent. Run zero verifiers — eighteen percent. Run just the initializer with no refinement loop at all — twenty-two percent. No single piece carries the system. They all contribute, and they contribute through interaction.

11:34Juniper: And one finding from those tables is genuinely counterintuitive. You'd assume more rounds is always better — more iteration, more refinement, surely the proof gets cleaner. But the curve isn't monotonic. One round of refinement: fifteen percent. Five rounds: thirty-two percent. Seven rounds: drops back down to twenty-two. The agents can over-revise. They start polishing valid arguments into invalid ones, or chasing the verifier's complaints past the point of usefulness.

12:10Tyler: Which is a very humanlike failure mode if you think about it. Anyone who's revised a piece of writing too many times knows what that looks like. But the cleanest ablation for the architecture-versus-scale claim is one that doesn't appear in those tables in the same form. They run a best-of-N baseline. Same base model, same total token budget — two hundred thousand tokens per problem — but no agentic structure. Just sample many candidates from the model and pick the best. That baseline scores twenty-eight percent against full RMA. The agentic system scores fifty-eight. Same compute. Same model. The structure is roughly doubling the win-rate.
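
The best-of-N baseline Tyler describes is simple to state in code: keep sampling until the token budget runs out, then keep the highest-scoring candidate. This is a sketch under assumptions. `best_of_n` and the toy scores are hypothetical, and the episode does not say how the paper chooses "the best" candidate, so the `score` function here is a placeholder.

```python
import itertools
from typing import Callable, Iterable, Optional, Tuple


def best_of_n(
    samples: Iterable[Tuple[str, int]],  # (candidate text, tokens it cost)
    score: Callable[[str], float],       # placeholder for whatever picks "best"
    token_budget: int = 200_000,
) -> Tuple[Optional[str], int, int]:
    best: Optional[str] = None
    best_score = float("-inf")
    spent = 0
    kept = 0
    for candidate, cost in samples:
        if cost <= 0:
            raise ValueError("each sample must cost a positive number of tokens")
        if spent + cost > token_budget:
            break  # the next sample would exceed the shared budget
        spent += cost
        kept += 1
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, kept, spent


# Toy run: every candidate costs 50,000 tokens, so exactly four fit the budget.
toy_scores = {"c0": 0.1, "c1": 0.9, "c2": 0.4, "c3": 0.2, "c4": 1.0}
stream = ((f"c{i}", 50_000) for i in itertools.count())
best, kept, spent = best_of_n(stream, lambda c: toy_scores.get(c, 0.0))
print(best, kept, spent)  # c1 4 200000
```

Nothing in this loop lets one candidate learn from another, which is the contrast with the multi-round system: the budget is the same, but the samples never interact.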

12:54Juniper: That's the cleanest version of the thesis. Same base model. Same compute budget. Different organization. Different results. Think of a kitchen brigade: a restaurant kitchen isn't faster because each line cook is a better chef than a home cook. It's faster because the work has been decomposed into stations, with a clear flow between them. RMA is making the same bet for proof-writing. Decompose the work, station the agents, route the outputs through structured memory, and the same model produces better math than it can when you just hand it a problem and tell it to think hard.

13:35Tyler: Now I want to slow down here, because the headline framing is seductive and we should be honest about where it shouldn't be pushed too far. The benchmark is ten problems. Ten. The win-rate differences in those ablation tables look dramatic, but they're computed over a very small sample. The authors are upfront about this — they don't report confidence intervals because with ten problems, the intervals would swallow most of the comparisons. The ranking of methods is suggestive. It's not definitive.
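
To get a feel for how wide the uncertainty is at ten problems, here is a sketch that applies the standard Wilson score interval to the headline solve counts from earlier in the episode (8, 5, and 3 out of 10). Treating each problem as an independent trial is an assumption made here for illustration. It is not an analysis from the paper, and the ablation win-rates Tyler mentions would need their own treatment.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


for name, solved in [("RMA", 8), ("Aletheia", 5), ("GPT-5.2R", 3)]:
    lo, hi = wilson_interval(solved, 10)
    print(f"{name}: {solved}/10, interval roughly {lo:.2f} to {hi:.2f}")
```

Under these assumptions the interval for 8 of 10 overlaps the intervals for both 5 of 10 and 3 of 10, which is what "suggestive, not definitive" looks like numerically.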

14:10Juniper: And the comparison to GPT-5.2R and Aletheia — the eight versus three versus five — has a caveat too. RMA runs under a documented budget. Two hundred thousand tokens, up to six hours per problem, known prompts, known tool access. For the industrial baselines, the authors only have the publicly released solutions to evaluate. They don't know what compute OpenAI used. They don't know Aletheia's internal prompting or stopping criteria. So eight-versus-three isn't a controlled experiment. It's a comparison to released outputs.

14:45Tyler: Which is informative, but it's not apples-to-apples. The fair version of the claim is: under documented constraints and blind expert evaluation, this architectural approach beats the released outputs of two frontier systems by a wide margin on a small, carefully constructed benchmark. That's a much more cautious sentence than "RMA beats OpenAI and DeepMind." Both sentences are technically defensible. The cautious one is the one you'd write if you were writing this paper for a working mathematician rather than for a press release.

15:21Juniper: There's a related point about Aletheia worth flagging, because it tells you something about how DeepMind thinks about this problem. Aletheia is conservative by design. It only releases solutions it considers correct. If it can't find one, it declines to output. That's why its row in the comparison table has dashes for some problems rather than wrong answers. DeepMind's stance, essentially, is that an incorrect proof that looks convincing is worse than no proof at all. And the convincing-but-wrong proof is, I think, the most important failure mode for any system in this space.

15:58Tyler: It's the failure mode the authors of RMA themselves call out in their broader-impacts section. A confidently wrong proof of a hard mathematical result is genuinely dangerous in a way that a confidently wrong summary of a news article isn't. If a working mathematician trusts RMA's output and builds on it, and the foundation has an unverified gap, the error propagates. The authors explicitly position the system as a research assistant requiring human verification, not a replacement for it.

16:31Juniper: And the expert evaluation has its own honest limits. Three mathematicians per problem, blind comparisons — that's a serious protocol for an LLM paper. But research-level proofs in informal natural language aren't the kind of bright-line check that formal verification gives. A proof can be "promising but incomplete." It can be "valid modulo an unverified condition." The protocol allows for an "inconclusive" judgment. The authors acknowledge this. It's not a flaw of the paper so much as a constraint of the domain — if you want to evaluate informal proofs of research-level problems, this is roughly the best you can do without spending years on each one.

17:15Tyler: Juniper, I want to come back to the Spielman problem for a moment, because I think it's the cleanest concrete illustration of why the architecture is doing real work. Walk through what actually happened on that problem across the three systems.

17:31Juniper: So this is problem six in the benchmark. Spielman's ε-light subset question. GPT-5.2R produces a proof of a looser bound — one over two hundred fifty-six — using what's known as the barrier method, a spectral graph theory tool associated with Batson, Spielman, and Srivastava. And along the way it hallucinates a reference to a paper that does not exist. That's the characteristic failure mode of monolithic chain-of-thought reasoning on hard math: a plausible-looking argument that quietly cites a fake source and lands on a weaker claim than what was asked. Aletheia gives no output. The conservative design declines. RMA produces one over forty-two with a proof that the expert reviewers accept. And the proof takes a different route — a leverage-score argument with a greedy selection procedure. The system computes leverage scores, picks vertices greedily according to a "good vertex" criterion, and closes the argument by induction. It's a fundamentally different attack on the problem than the barrier method, and it lands a tighter bound. Now — that technique is in the training data. The authors are honest about this. The system isn't discovering a new method; it's correctly identifying which known method applies and executing it cleanly.

18:52Tyler: Which is a real distinction worth holding onto. "Applying known tools" and "discovering a solution" are fuzzier categories in mathematics than in some other domains. A lot of what working mathematicians actually do, day to day, is correctly identifying which existing technique applies to a new problem. And executing the technique without subtle errors. RMA looks good at that. Whether it can do something genuinely novel — invent a technique nobody has seen — is a different question, and ten problems on a benchmark can't answer it.

19:27Juniper: That's the right framing. The honest version of what this paper shows: a carefully orchestrated team of specialized agents, all running on the same base model, can engage with research-level math problems substantially more competently than the same base model thinking freely, and substantially more competently than the released outputs of two frontier industrial systems on a small but serious benchmark. The mechanism is decomposition and iteration, not capability. The architecture is doing the work.

19:59Tyler: And the broader implication is what makes this paper worth talking about beyond the math community specifically. If the pattern holds — if long-horizon reasoning tasks generally benefit more from systems engineering than from bigger models — that has consequences for a lot of domains. Scientific discovery. Complex software engineering. Anything where the unit of work is more like a research project than a query. The next round of progress on AI reasoning might look less like training and more like orchestration.

20:31Juniper: With the caveat — and this is worth landing — that we have one paper, on ten problems, with one base model. The architectural approach beats the brute-force approach on this benchmark. Whether it generalizes is the open question. There are concurrent systems in this space — DeepMind's Aletheia, of course, and other agentic frameworks like Ax-Prover and what the field is starting to call Agentic Researcher — and the comparison across them is going to get more interesting as the next generation of benchmarks lands.

21:04Tyler: One thing I'll say in closing, and I think, Juniper, you put it well earlier: the temptation with a result like this is to dramatize it as "AI is doing real math research now." The honest version is more interesting. Eight of ten problems on a ten-problem benchmark. Caveats on the comparison. Caveats on the evaluation. But also: zero of ten on the same problems without the architecture, with the same model. The story is the architecture. The score is the evidence that the architecture is doing something.

21:42Juniper: Same model, organized differently, producing different math. That's the line worth carrying out of this one.

21:50Tyler: The paper is from Zelin Zhao and colleagues at Georgia Tech. The show notes have a link to the paper and some related reading on agentic systems and the First Proof benchmark — worth a look if this episode caught you.

22:06Juniper: And if you want the full transcript with definitions inline, plus the concept pages that connect this episode to the others we've done on agentic reasoning, that's all on paperdive.ai.

22:20Tyler: Thanks for listening to AI Papers: A Deep Dive.
