Episode 067 · May 22, 2026 · 31 min

An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won

Tsoukalas, Kovsharov, Shirobokov et al.

paperdive.ai

Concepts in this episode

AI for Science · Agentic AI · Evaluation & Benchmarks · Agentic RL · Self-Correction · Hallucination · Math Reasoning · Iterative Refinement · Multi-Armed Bandit · Tournament Voting · Parallel Sampling · Autonomous Discovery · Task Decomposition · Tool Use

About this episode

Paper: Advancing Mathematics Research with AI-Driven Formal Proof Search

Venue: arXiv:2605.22763

Year: 2026

Read the paper: arxiv.org/abs/2605.22763

A Google DeepMind system autonomously cracked nine open Erdős problems, including one that sat unsolved for thirty years, for a few hundred dollars each, with proofs verified by the Lean compiler. The twist: the team's elaborate evolutionary search system was beaten on cost for most problems by a twenty-line script that just iterates an LLM against a compiler. The implications for AI engineering go well beyond mathematics.

What you'll take away

Why coupling an LLM to the Lean proof checker dissolves the trust problem in AI-generated mathematics—and where that guarantee actually ends
How a 'Ralph loop' of LLM plus compiler plus retry matched a sophisticated evolutionary system with AlphaProof, tournament Elo ranking, and shared caches
The actual proof idea behind Erdős problem 125, including how irrationality of log(4)/log(3) gets weaponized to crush sumset density to zero
How the agent surfaced a thirty-year-old ambiguity in Erdős's original problem statement just by being forced to commit to a formal reading
Where the verification guarantee leaks: LLM judges scoring proof sketches reward confident-sounding hallucinated citations, biasing the search upstream of the compiler
Why the selection bias in the problem set, the cost of failed runs, and the human work of formalization make the headline numbers less clean than they look

Chapters

00:00 The trust problem in AI-generated math
03:52 The Ralph loop and the basic agent
07:44 Inside Erdős 125
11:37 The fancy system that mostly didn't win
15:29 The ambiguity-surfacing side effect
19:21 A geometric proof that feels like a magic trick
23:14 Steelmanning the skeptics
27:06 What actually changed

References in this episode

AlphaEvolve: A coding agent for scientific and algorithmic discovery — The evolutionary search ancestor of the Agent C/D system discussed in the episod
Mathematical discoveries from program search with large language models (FunSearch) — The original DeepMind work establishing LLM-driven search for new mathematical r
Solving olympiad geometry without human demonstrations (AlphaGeometry) — A useful contrast to the episode's framing of olympiad problems as 'the easier v
The Lean Mathematical Library (Mathlib) — The community formalization library whose maturity the episode credits as one of

Full transcript

0:00Cassidy: A problem that had been sitting on Paul Erdős's open list since nineteen-seventy. Fifty-six years. Multiple mathematicians had taken cracks at it. And last week, a Google DeepMind AI system worked through it autonomously — and the proof it produced is correct. Full stop. No asterisks, no "an expert still needs to check the steps." Cost: a few hundred dollars in compute.

0:24Finn: The paper went up on arXiv on May twenty-first, twenty-twenty-six, and we are recording the day after. Before we dig in, a quick production note: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.7. I'm Finn, and you just heard Cassidy — we are both AI voices from Eleven Labs, and this show isn't affiliated with either company. The paper is "Advancing Mathematics Research with AI-Driven Formal Proof Search," and the reason "no asterisks" matters — the reason that phrase is doing real work in the opening — is that this is the first credible attempt to close that loophole at scale.

1:04Cassidy: Right. So let me set up the loophole, because it's the puzzle the whole paper is built around. The frustrating fact about modern AI for math is that large language models have gotten remarkably good at it. A frontier model will produce a proof that reads beautifully. It looks like graduate-level work. The argument flows, the lemmas chain together, the conclusion lands. And then on step seven, there's a subtle error. Something that cascades silently through the rest of the argument and produces a wrong result that still sounds right. For a working mathematician, this means every AI-generated proof needs expensive expert review. And the deeper the proof, the more expensive the review — so the more you delegate to the AI, the less you actually save. The economics break.

1:54Finn: And the fix has been known in principle for years. You make the AI write its proofs in Lean.

2:01Cassidy: Yeah. Lean is a formal proof language — think of it like a programming language where the programs are mathematical arguments and the compiler refuses to run anything that doesn't logically check out. You write your assumptions, you state your theorem, and then you write a sequence of moves: apply this lemma, do this case split, rewrite this expression. And the Lean compiler accepts the whole thing only if every step actually follows from the one before it. If the file compiles, the proof is correct. There's no "looks plausible." There's "compiles" or "doesn't compile."

2:37Finn: The compiler is the lie detector. That's the cleanest way to think about it. An LLM hallucination, transposed into Lean, just fails to compile. The lie dies at the type-check.

2:49Cassidy: And there's this community library called Mathlib — basically the standard library for formalized math. Real numbers, group theory, combinatorics, chunks of analysis and number theory. When a proof attempt cites a known result, it's pulling from Mathlib. The further you stray from what Mathlib has formalized, the more groundwork you have to lay before you can even state your problem cleanly.

3:14Finn: Okay, so the setup is appealing in principle. The catch — the thing the field has been stuck on — is that pointing an AI at a known proof and asking it to formalize that is one thing. Pointing it at an actual open problem, where nobody knows the answer, and asking it to discover the proof? That's a much harder ask. And up until now, the wins were mostly olympiad problems. A clever human composed the problem in advance, and the answer existed somewhere — the AI just had to find it. This paper is the first large-scale try at the harder version.

3:49Cassidy: Three hundred and fifty-three open Erdős problems. Four hundred and ninety-two conjectures from the On-Line Encyclopedia of Integer Sequences, the OEIS. Plus problems brought directly by working mathematicians in algebraic geometry, optimization, graph theory, quantum optics. And the answer is: yes, it actually works. Nine of the three hundred and fifty-three Erdős problems were solved autonomously. Forty-four of the OEIS conjectures, proved. A fifteen-year-old open question in algebraic geometry, resolved. A novel parameter schedule discovered in convex optimization that tightened a previously known bound. All of it verified by the compiler. None of it needing a human to re-check the logic.

4:35Finn: And the per-problem cost on the Erdős wins lands in the range of a few hundred dollars in compute. Some of those problems have been open since the early seventies.

4:45Cassidy: Let me describe how the basic version of the system works, because it's almost embarrassingly simple — and that's going to matter later. A mathematician hands the system a Lean file. The file has the theorem statement at the top, the relevant definitions, the imports — all of the scaffolding. And where the proof should go, there's a single placeholder: the word "sorry." Lean uses "sorry" to mean "I'll fill this in later." The agent's only job is to replace that placeholder with code that compiles. The basic agent — the authors call it Agent A — does this. It sends the file to Gemini three-point-one Pro. The model proposes an edit. The system runs the Lean compiler. If there's an error, the error message gets fed back to the model. The model tries again. Repeat. That's it. That's the whole loop.
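For readers following along in text, the placeholder workflow described here looks roughly like the sketch below. It is a toy illustration, not a file from the paper: the theorem names `toy_add_zero` and `toy_add_zero'` are made up, and the only library fact used is `Nat.add_zero` from Lean 4's core library.

```lean
-- What the agent is handed: a statement with the proof left as `sorry`.
-- Lean elaborates this, but warns that the declaration uses `sorry`.
theorem toy_add_zero (n : Nat) : n + 0 = n := by
  sorry

-- What the agent must produce: the same statement, with the placeholder
-- replaced by steps the compiler can actually check.
theorem toy_add_zero' (n : Nat) : n + 0 = n := by
  exact Nat.add_zero n
```

The open problems in the paper have far larger statements and proofs, but the contract is the same: the statement is fixed and only the body under `by` changes.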

5:39Finn: There's a name for this in the paper — they call it the Ralph loop, after a Geoffrey Huntley blog post about programming by sheer iteration. It's a wonderfully unromantic name for what's going on. A multi-turn conversation between the model and the compiler, with errors as feedback, until something works or the session times out. When the agent gives up on a session, it writes a comment summarizing what it tried, and those comments accumulate as institutional memory across attempts.

6:10Cassidy: Run a hundred of those in parallel. Stop when one of them finds a proof. That's Agent A.
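The loop just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' script: `ralph_loop`, `toy_propose`, and `toy_check` are hypothetical names, and the two toy functions stand in for an LLM call and a Lean compiler run so the example is runnable on its own.

```python
from typing import Callable, Optional


def ralph_loop(
    source: str,
    propose: Callable[[str, str], str],
    check: Callable[[str], Optional[str]],
    max_turns: int = 20,
) -> Optional[str]:
    """Iterate a proposer against a checker until the checker accepts.

    propose(source, feedback) returns a new candidate file.
    check(candidate) returns None on success, or an error message.
    Both are stand-ins (assumptions) for a model call and a compiler run.
    """
    feedback = ""
    for _ in range(max_turns):
        candidate = propose(source, feedback)
        error = check(candidate)
        if error is None:
            return candidate                    # accepted: stop here
        source, feedback = candidate, error     # feed the error back and retry
    return None                                 # turn budget exhausted


# Toy stand-ins so the sketch runs without a model or a compiler.
def toy_check(text: str) -> Optional[str]:
    return "error: proof contains sorry" if "sorry" in text else None


def toy_propose(text: str, feedback: str) -> str:
    # Pretend the model only repairs the file after it has seen an error.
    if feedback:
        return text.replace("sorry", "exact Nat.add_zero n")
    return text


result = ralph_loop(
    "theorem t (n : Nat) : n + 0 = n := by sorry", toy_propose, toy_check
)
```

Running many independent copies of such a loop and stopping at the first success is the parallel part of the description; that part, and the cross-session notes, are left out of the sketch.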

6:15Finn: And it's worth pausing on the picture this creates. Imagine a student at a whiteboard scribbling a proof, and a very strict teaching assistant standing next to them. The TA doesn't help. The TA doesn't comment on intent or style. The TA just points at the first line that doesn't follow and says: "that step is wrong." The student erases, tries again. The TA reads again. Eventually either the student finds a real argument or they run out of time and a fresh student walks in.

6:45Cassidy: That's a good frame for it. And the surprise of the paper is that this — this dead-simple iteration with a strong language model on one side and Lean on the other — solved actual open math problems. Let me walk through one. The centerpiece example. Erdős problem number one twenty-five.

7:02Finn: This one's been open since nineteen-ninety-six.

7:05Cassidy: Yeah. And the statement is something a non-mathematician can hold in their head, which is rare for a problem at this level. Here it is. Take the set of all integers whose base-three representation uses only zeros and ones. So in base three, no twos allowed. That's set A. Take the set of all integers whose base-four representation uses only zeros and ones. No twos, no threes. That's set B. Now consider A plus B — every number you can write as one element of A plus one element of B. The question Erdős asked: does that sumset have positive lower density? In other words — if you walk up the number line, what fraction of integers can be written as a "base-three digit-restricted" plus a "base-four digit-restricted"? Is that fraction bounded below by some positive constant? Or does it eventually thin out to zero?
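To make the two sets concrete, here is a small brute-force sketch that builds them and measures what fraction of the integers from 1 to n land in the sumset. The function names are hypothetical, and including 0 in both sets is an assumption made for convenience. A finite count like this only illustrates the definitions; it says nothing about the limiting behavior that the problem is actually about.

```python
def restricted_digits(base: int, limit: int) -> list[int]:
    """Integers in [0, limit] whose base-`base` digits are all 0 or 1."""
    out = []
    for n in range(limit + 1):
        m, ok = n, True
        while m:
            if m % base > 1:      # a digit of 2 or more disqualifies n
                ok = False
                break
            m //= base
        if ok:
            out.append(n)
    return out


def sumset_fraction(n: int) -> float:
    """Fraction of integers in [1, n] expressible as a + b with a in A, b in B."""
    a_set = restricted_digits(3, n)
    b_set = set(restricted_digits(4, n))
    hits = sum(
        1
        for m in range(1, n + 1)
        if any((m - a) in b_set for a in a_set if a <= m)
    )
    return hits / n


set_a = restricted_digits(3, 13)   # base-three digits 0 and 1 only
set_b = restricted_digits(4, 21)   # base-four digits 0 and 1 only
```

Calling `sumset_fraction` at increasing values of n gives a feel for how the covered fraction moves, though the windows where it dips are exactly what the proof has to locate.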

7:58Finn: And the agent answered which way?

8:01Cassidy: The agent proved: zero. The lower density is zero. The sumset thins out. And the argument is genuinely elegant. The pivot is this fact about logarithms. The log of four divided by the log of three is irrational. Which means powers of three and powers of four are what mathematicians call multiplicatively independent: three to the k never equals four to the m for positive k and m, yet you can find arbitrarily large k and m where the ratio of the two comes as close to one as you like.
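The irrationality claim has a short standard proof, included here as a sketch for completeness. It is a textbook parity argument, not quoted from the paper.

```latex
Suppose $\frac{\log 4}{\log 3} = \frac{p}{q}$ with positive integers $p, q$. Then
\[
  q \log 4 = p \log 3 \quad\Longrightarrow\quad 4^{q} = 3^{p}.
\]
The left side is even and the right side is odd, a contradiction.
Hence $\frac{\log 4}{\log 3}$ is irrational, and $3^{k} \neq 4^{m}$ for all integers $k, m \ge 1$.
```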

8:31Finn: This is the metronome picture. Two clocks ticking at incommensurable rates. They never sync up exactly, but they drift in and out of near-sync forever. Powers of three and powers of four behave the same way — there will always be places where a power of three and a power of four are nearly the same number, even though they can't ever be equal.
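The near-sync picture can be checked numerically. The sketch below, with the hypothetical name `near_coincidences`, scans exponents k, picks the nearest power of four on a logarithmic scale, and records each time the gap |k log 3 - m log 4| reaches a new minimum. A small gap means the ratio of three to the k over four to the m is close to one. Floating-point logarithms are an assumption that is adequate for small exponents only.

```python
import math


def near_coincidences(max_k: int) -> list[tuple[int, int, float]]:
    """Return (k, m, gap) each time gap = |k*log 3 - m*log 4| sets a new record."""
    log3, log4 = math.log(3), math.log(4)
    best = math.inf
    records = []
    for k in range(1, max_k + 1):
        m = round(k * log3 / log4)          # nearest power of 4 on a log scale
        gap = abs(k * log3 - m * log4)
        if gap < best:                      # strictly closer than anything so far
            best = gap
            records.append((k, m, gap))
    return records


records = near_coincidences(200)
pairs = [(k, m) for k, m, _ in records]
```

One of the early records is the pair (5, 4), since 3^5 = 243 and 4^4 = 256. Because the log ratio is irrational, the list of records never terminates as the search bound grows.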

8:53Cassidy: And the proof uses those near-coincidences as a weapon. The technical name in the paper is an inductive thinning argument. Every time three to the k and four to the m line up closely, you get a kind of thinning window — a stretch of integers in the sumset whose density is constrained by a factor strictly less than one. Iterate the trick across all those windows, and the density gets crushed to zero. The reason I want to dwell on this is that it shows the agent isn't just shuffling Lean syntax around. There's a real mathematical idea — exploiting the irrationality of a log ratio to construct density-killing windows. The agent landed on that.

9:33Finn: And in the paper there's a figure that walks through the actual session. You can see the chain-of-thought, you can see when the agent decides a particular subgoal looks routine and hands it off to AlphaProof — DeepMind's olympiad-level theorem prover — and you can see AlphaProof come back with "three quarters of this is proved, here's the bit that isn't." And the main agent looks at the failed quarter and decomposes it further. Calls AlphaProof again on the smaller piece. Gets a result back. Keeps going.

10:04Cassidy: It's an actual proof session, played out in slow motion. And the moments where the agent says, in effect, "this looks like something the focused tool can handle, let me delegate" — those are the moments where the architecture earns its keep.

10:19Finn: Now hold on, Cassidy — because this is where the paper gets genuinely surprising, and I want to make sure we set up the surprise properly. What I just described — calling out to AlphaProof, decomposing failed subgoals — that's part of the more sophisticated agent. The paper builds out four variants. Agent A is the Ralph loop we described. Just LLM plus compiler plus retry. Agent B adds AlphaProof as a tool the LLM can call when it spots a routine-looking subgoal. Agent C drops AlphaProof but adds evolution — a whole population of proof sketches that get ranked and selected and mutated. Agent D is the full system. Evolution plus AlphaProof. The fanciest of the four.

11:02Cassidy: And the evolutionary piece is conceptually interesting on its own. The problem with evolutionary search on proofs is that proofs are binary — they either compile or they don't. There's no partial credit. You can't run gradient descent on "almost a proof."

11:18Finn: Right. So the trick they used is borrowed from chess. You can't score unfinished proofs on an absolute scale, but you can rank them through comparisons. So a cheaper model — Gemini three-point-zero Flash — plays tournament judge. You hand it batches of seven proof sketches and ask it to rank them, gut-feel, by which look most promising. Plausibility, clarity, novelty.

11:41Cassidy: It's like the cooking competition where nothing's been finished yet. You're walking through the kitchen, looking at the prep work, the technique, the ingredients laid out. You can't taste anything. But across many judges and many groups of cooks, a stable picture emerges of who looks promising.

12:00Finn: Exactly. You do enough of those rankings and the system converts them into Elo scores — the same idea behind chess ratings. Each sketch ends up with a number, and the number is a reasonable proxy for "how likely is this direction to pan out." Then when the agent picks which sketch to mutate next, it weights the choice by Elo — exploit the strong ones — but also gives a bonus to sketches that haven't been explored much yet. The bandit balance.
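A rough sketch of the ranking-to-Elo step and the exploration bonus follows. Everything specific in it is an assumption for illustration: the names `update_elo` and `pick_next`, the K-factor of 32, treating a ranked batch as a set of pairwise wins, and the particular bonus formula. The episode does not spell out the paper's exact formulas, and this version picks deterministically where the described system weights its choice.

```python
import math
from itertools import combinations


def update_elo(ratings: dict, ranking: list, k: float = 32.0) -> dict:
    """Treat a best-to-worst ranking as pairwise wins and apply Elo updates."""
    deltas = {name: 0.0 for name in ranking}
    # combinations preserves order, so `winner` is always ranked above `loser`.
    for winner, loser in combinations(ranking, 2):
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        deltas[winner] += k * (1.0 - expected)
        deltas[loser] -= k * (1.0 - expected)
    for name, delta in deltas.items():      # apply all updates from one snapshot
        ratings[name] += delta
    return ratings


def pick_next(ratings: dict, visits: dict, bonus: float = 100.0) -> str:
    """Pick the sketch with the best rating plus a bonus for being under-explored."""
    return max(
        ratings,
        key=lambda s: ratings[s] + bonus / math.sqrt(1 + visits.get(s, 0)),
    )


ratings = update_elo({"a": 1000.0, "b": 1000.0, "c": 1000.0}, ["a", "b", "c"])
explore = pick_next(ratings, {"a": 10, "b": 10, "c": 0})            # bonus favors "c"
exploit = pick_next(ratings, {"a": 10, "b": 10, "c": 0}, bonus=0.0)  # pure rating
```

With the bonus switched off, the top-rated sketch always wins. With it on, a low-rated but unvisited sketch can be chosen, which is the bandit trade-off mentioned above.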

12:30Cassidy: And there's a global cache. If two parallel agents both stumble onto the same subgoal — computed by hashing the Lean proof state — the second one just reuses the answer instead of re-dispatching to AlphaProof. AlphaProof queries are expensive, about sixty dollars apiece in compute, so this matters.
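The caching idea can be sketched as a memo table keyed by a hash of the goal. This is a single-process toy under stated assumptions: `SubgoalCache` is a hypothetical name, the goal is a plain string rather than a real Lean proof state, whitespace normalization is a guess at what a canonical form might involve, and the `solver` argument stands in for an expensive prover call.

```python
import hashlib


class SubgoalCache:
    """Memoize an expensive prover call, keyed by a hash of the goal text."""

    def __init__(self, solver):
        self.solver = solver      # hypothetical stand-in for the expensive call
        self.store = {}
        self.calls = 0            # how many times the solver actually ran

    @staticmethod
    def key(goal: str) -> str:
        normalized = " ".join(goal.split())     # collapse whitespace differences
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def solve(self, goal: str):
        k = self.key(goal)
        if k not in self.store:                 # cache miss: pay for one call
            self.calls += 1
            self.store[k] = self.solver(goal)
        return self.store[k]


cache = SubgoalCache(solver=lambda goal: f"proof of ({goal})")
first = cache.solve("n + 0 = n")
second = cache.solve("n  +  0 = n")   # same goal up to whitespace
```

A version shared across parallel agents would need a store that several processes can reach, which this sketch does not attempt.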

12:49Finn: All of this is real engineering. Months of work. A serious evolutionary system with a ranking jury, structured exploration, shared memory. And here's the punchline. Cassidy, you want to land this?

13:02Cassidy: The basic Agent A — the Ralph loop, the twenty-line script — solved every single Erdős problem that the full Agent D solved. All nine.

13:12Finn: Yeah. All nine.

13:13Cassidy: On the hardest problems, the full system was meaningfully cheaper. The paper's figure shows the full configuration delivering somewhere between two-times and five-times monetary savings on the problems where it wins — Erdős one twenty-five and one thirty-eight are the standout cases. Real cost-efficiency gains on the genuinely hard cases. But on most of the problems, the basic agent was actually the more cost-efficient choice. The expensive scaffolding wins. But only on the hardest problems. Everywhere else, the bottleneck wasn't sophisticated search machinery. It was just: get a strong LLM, ground it with a compiler, and iterate.

13:55Finn: This is the kind of finding that the field is going to argue about for a while. And the authors are careful to frame what they think is going on. Their reading is that frontier LLMs have gotten strong enough that a lot of the elaborate scaffolding people built for weaker models is no longer doing as much work as it used to. The capability is migrating from the harness into the base model.

14:18Cassidy: There's a clean analogy from chess. In the early days of computer chess, the difference between engines was mostly in the cleverness of the search. Better pruning, better evaluation heuristics, smarter opening books. As the underlying position evaluators got dramatically stronger — especially with neural network evaluation — a lot of the search cleverness mattered less. A simpler search with a much stronger evaluator can beat a clever search with a weaker one. The paper's finding has that shape. When the evaluator gets strong enough, the wrapper stops being the interesting variable.

14:52Finn: I want to be honest about how surprising this is in context. The evolutionary system in this paper is a direct descendant of AlphaEvolve, which is itself descended from FunSearch — papers that had been doing serious work in mathematical discovery for a couple of years. The authors didn't bolt the simple agent on as a baseline expecting it to win. They built the fancy system first because, at the time they started, simple loops weren't competitive on the kinds of benchmarks people were running. By the time they finished the comparison study, Gemini three-point-one Pro had landed, and the picture had shifted under them.

15:29Cassidy: They quote themselves on it pretty directly in the paper. They attribute the basic agent's success to both the shift in LLM capability and to the power of compiler feedback in grounding the LLM's reasoning. Two forces meeting at the same moment.

15:43Finn: Alright. I want to pull on something else, because there's a moment in this paper that I think is the most underrated in the whole work. Cassidy, you brought up Erdős one twenty-five being open since ninety-six. Did you catch what happened with the original problem statement?

16:00Cassidy: Tell it.

16:01Finn: So the way Erdős posed the question informally — the way the question lived for thirty years in the math community — used the word "density." And the word "density" is ambiguous in this context. There's "natural density," which is one specific thing, and "lower density," which is something subtly different. When the agent went to work, it had to commit to one interpretation. It picked natural density. And it found a proof. That proof, when humans looked at it, revealed that the original statement had been ambiguous. The agent had solved the problem under one reading of the word — and the community then amended the statement to make precise which reading was meant.
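For reference, the standard definitions behind the two readings are given below. These are textbook definitions, not the formal statements used in the paper.

```latex
For $S \subseteq \mathbb{N}$, write $S(N) = |S \cap \{1, \dots, N\}|$. Then
\[
  \underline{d}(S) = \liminf_{N \to \infty} \frac{S(N)}{N},
  \qquad
  \overline{d}(S) = \limsup_{N \to \infty} \frac{S(N)}{N},
\]
and the natural density $d(S)$ is the common value when $\underline{d}(S) = \overline{d}(S)$.
In particular, $\underline{d}(S) = 0$ does not by itself force $\overline{d}(S) = 0$.
```

A set can therefore have lower density zero while its natural density fails to exist, which is why a formal statement has to commit to one of the two.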

16:40Cassidy: So the AI didn't just solve the problem. It surfaced an ambiguity that had been sitting in the informal statement.

16:47Finn: That's exactly it. And there's a clean analogy for what's going on. Imagine a contract lawyer who reads a contract literally, follows the words exactly, and in doing so reveals that the contract is ambiguous, because there's another perfectly valid reading that the parties hadn't noticed. The act of formalizing forces ambiguity into view. The agent isn't trying to find ambiguity — it just stumbles onto it as a side effect of being precise.

17:13Cassidy: And the same thing happened on a different problem — Erdős seven forty-one, part one. Same pattern. The agent found a proof under one reading of "density," the reading got corrected, the agent solved the corrected version. This is a different kind of mathematical assistance than "AI proves theorem." It's "AI helps you figure out what you were actually asking."

17:35Finn: I think the field is going to underrate this for a while because it doesn't fit the headline. The headline is "AI solves open problem." The deeper finding is that AI plus formal verification has an auditing function on the human side of the work.

17:51Cassidy: Let me walk through one more proof, because the second example is a different flavor and I want listeners to feel the range. Erdős eight forty-six. The question: can you have an infinite set of points in the plane where any chunk of n points from the set contains a smaller sub-chunk — about epsilon times n of them — with no three on a line, but the set as a whole can't be split into finitely many pieces each of which avoids three collinear points?

18:21Finn: That's already a mouthful. The "any chunk has lots of no-three-collinear points but you can't decompose the whole" framing.

18:29Cassidy: Right. And the construction the agent found is the kind of proof that feels like a magic trick. It labels the vertices of an infinite complete graph — every vertex is connected to every other vertex — with terms from a fast-growing sequence. Then it maps each edge of that graph to a specific point in the plane, using a clever formula. And here's the trick: three of those points end up collinear in the plane if and only if their corresponding edges form a triangle in the graph. So now the problem has been translated from geometry into graph theory. And in graph theory there are classical results about how, once a complete graph is large enough, you can't avoid certain monochromatic substructures when you color its edges. Coloring corresponds to partitioning the point set into pieces. Monochromatic triangle corresponds to three points in the same piece that are collinear. The collinearity you were trying to avoid gets forced on you. Contradiction. Done.
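The classical fact being leaned on can be checked by brute force in its smallest case: with two colors, every coloring of the edges of the complete graph on six vertices contains a monochromatic triangle, while five vertices are not enough. The sketch below uses hypothetical function names and covers two colors only. The argument in the episode concerns any finite number of pieces, which needs the general Ramsey theorem rather than an exhaustive search.

```python
from itertools import combinations, product


def has_mono_triangle(n: int, coloring: dict) -> bool:
    """coloring maps each edge (i, j) with i < j to a color."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_forced(n: int, colors: int = 2) -> bool:
    """True if every edge coloring of K_n contains a monochromatic triangle."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, assignment)))
        for assignment in product(range(colors), repeat=len(edges))
    )


forced_5 = every_coloring_forced(5)   # False: some coloring of K_5 escapes
forced_6 = every_coloring_forced(6)   # True: no coloring of K_6 escapes
```

In the translation described above, colors play the role of pieces of the partition and a monochromatic triangle plays the role of three collinear points in one piece.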

19:32Finn: That's a beautiful proof.

19:34Cassidy: It is. And it's not a proof you would expect a system to find by sheer iteration. There's an actual mathematical move in there — the translation from geometry to graph theory via a custom embedding. That's the kind of thing you'd point at and say, "an experienced combinatorialist found this."

19:53Finn: And it landed in the budget range typical of the basic agent's wins — tens of dollars in compute, not hundreds.

20:00Cassidy: Yeah. Tens of dollars for a fifty-six-year-old problem.

20:04Finn: Okay, I want to spend some time on the steelman, because there's a lot to admire here but there's also a lot to push on. The first thing — and the authors are upfront about this — is the selection bias on the Erdős problem set. Those three hundred and fifty-three problems weren't a random sample of open Erdős conjectures. They were the ones that volunteers had taken the time to formalize in Lean. And problems amenable to formalization tend to have cleaner statements. Often combinatorial. Often number-theoretic. The kinds of things Mathlib already knows how to talk about. The total catalog of Erdős's open problems has somewhere upward of a thousand entries. The hard ones — the ones that require substantial new theory — aren't in this slice. So the nine-out-of-three-fifty-three number, while real, is a measurement on a non-representative sample. The authors say this explicitly.

21:01Cassidy: And on the OEIS side, the selection is even more layered. The four hundred and ninety-two conjectures didn't appear from nowhere. They were selected by Gemini from a pool of about twenty-six hundred candidate problems, using criteria like "non-trivial, mathematically interesting, not famous open problems, good candidates for automated theorem-proving." And then a different Gemini-based agent autoformalized them into Lean. So you've got Gemini choosing which problems to attempt, Gemini formalizing them, and Gemini-backed agents solving them. There's a real risk that the selection criteria correlate with what the prover is good at. The forty-four-out-of-four-ninety-two number is interesting, but it's not necessarily measuring what it looks like it's measuring.

21:46Finn: The second thing is the cost framing. A few hundred dollars per solved problem sounds like a remarkable bargain, and in some sense it is. But the few hundred dollars only counts the successful runs. It doesn't count the cost of running the agent on the three hundred and forty-four Erdős problems it didn't solve. The paper acknowledges that "identifying tractable problems was itself a significant computational investment." The true cost-per-discovery is meaningfully higher than the headline number. And the per-problem cost figures for agents that use AlphaProof also exclude the AlphaProof call costs — which run about sixty dollars per query. So the comparison numbers between Agent A and Agent D may be a little less stark than they look at first.
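The gap between the two cost figures can be written out explicitly. The notation is introduced here for illustration, and the per-problem costs of failed runs are not given in the episode, so the expression is left symbolic.

```latex
Let $S$ be the set of solved problems and $F$ the set of attempted but unsolved ones,
with $c_i$ the compute spent on problem $i$. Then
\[
  \text{headline cost per solve} = \frac{1}{|S|} \sum_{i \in S} c_i,
  \qquad
  \text{cost per discovery} = \frac{1}{|S|} \Bigl( \sum_{i \in S} c_i + \sum_{j \in F} c_j \Bigr).
\]
For the Erd\H{o}s set, $|S| = 9$ and $|F| = 353 - 9 = 344$.
```

The second expression exceeds the first by the total spent on the 344 unsolved problems divided by nine, so even modest spending per failed problem adds up quickly.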

22:31Cassidy: The third thing — and this is the one I keep coming back to — is the word "autonomous." It does some heavy lifting in the abstract. The agent doesn't autonomously formalize problems. Humans do that. Carefully. With domain expertise. For the collaborations in algebraic geometry, optimization, graph theory, quantum optics, working mathematicians were involved in framing the problem, deciding what to mark as evolvable, providing context. The authors aren't hiding this — they call the work a collaboration explicitly — but a casual reader who sees "autonomously solved" might overweight what that means.

23:09Finn: Cassidy, the thing I find most striking in the failure analysis is the hallucination footnote. Briefly noted, easy to miss, and conceptually important. The agent's high-scoring sketches sometimes cite "established results in the literature" that turn out to be hallucinations. Made-up lemmas. The agent invents a fake citation to scaffold its argument. Now — the Lean compiler catches this at the end. You can't actually use a nonexistent lemma in a Lean proof, because Mathlib doesn't have it, and the file won't compile. So the headline proofs are still valid. But here's the wrinkle. The Elo rankings in the evolutionary system are produced by LLM judges. The judges don't run the compiler. They're reading the sketch and forming an impression of plausibility. A sketch that confidently cites a fake but plausible-sounding result will get a higher Elo score than one that admits uncertainty.

24:06Cassidy: So the lie detector model only catches lies at the proof level, not at the planning level.

24:13Finn: Exactly. The compiler grounds the final output. It doesn't ground the strategy. And the failure mode the paper started out trying to dissolve — confident-sounding mathematical falsehood — is sneaking back in upstream, biasing the search toward sketches whose confidence isn't quite earned.

24:31Cassidy: And the authors note that prompting against this didn't fix it. They tried telling the model explicitly: don't cite results unless you can verify them. The behavior persisted.

24:42Finn: It's a beautiful illustration of where formal verification's reach ends. It's a hard guarantee at the output. It's not a guarantee about the reasoning that produced the output.

24:53Cassidy: Okay. One more piece of the paper I want to make sure we cover, because it's qualitatively different from the Erdős work and it changes what the system is doing. In one of the collaborations — in convex optimization — the agent didn't just verify a proof. It discovered a new parameter schedule. The context: there's a well-known algorithm called anchored gradient descent-ascent, used in saddle-point optimization. There was a known convergence rate. And the agent, while searching for proofs, landed on a new schedule of parameters that achieved a tighter rate. The convergence is now provably faster, by an honest factor, with the agent's schedule.

25:33Finn: That's a different mode. The agent isn't formalizing a known result. It's finding a result.

25:39Cassidy: And in another collaboration, the agent helped resolve problem number fifty-seven from Ben Green's list of one hundred open problems. The way it helped was by formalizing a candidate counterexample — somebody had a guess at a counterexample, and the agent formalized the guess in Lean and verified that it actually worked. A specific human-AI workflow. There's also a fifteen-year-old open case in algebraic geometry — about something called log-concavity of pure O-sequences — that got resolved through collaboration with the system. I don't think we should get into what that is in detail; the point is that across at least six fields, working mathematicians are using this system right now and reporting that it's helping. Sometimes the agent solves the problem. Sometimes it produces a partial sketch that helps the mathematician understand the structure. And because the sketches are formal, the experts only need to inspect the unresolved subgoals, not re-verify the whole argument.

26:40Finn: That's a real workflow change. It's not "AI replaces the mathematician." It's "AI changes the unit of human attention" — from verifying every step to inspecting only the steps the AI marked as unfinished.

26:52Cassidy: And it's worth saying what the paper explicitly does not claim. It doesn't claim that Erdős problems are now solved as a category. Most of the catalog is still open. It doesn't claim that AI will replace mathematicians. It doesn't even claim that the basic agent will keep winning — the architectural finding could easily flip again as the elaborate systems get tuned for the new generation of base models. What it does claim is that for problems that decompose nicely, in fields where Mathlib is mature, AI-driven formal proof search is currently a useful research tool. That's a more modest claim than the headlines might suggest. It's also a much more credible one.

27:33Finn: I want to come back to the hallucination point one more time, because I think it's the cleanest way to think about what's happened here. The world before this paper was: AI math is interesting but unreliable. The output looks good but you can't trust it. Expert review is expensive. The deeper the proof, the more expensive the review, so AI assistance saves less the more you need it. The world after this paper is: AI math, mediated through Lean, can be trusted at the proof level. The compiler is the guarantee. You still need a human to confirm the problem statement matches the conjecture you actually care about — which the misformalization examples show is non-trivial — but you don't need a human to re-verify the proof. The bottleneck shifts. From "expert review of every step" to "expert review of the problem statement." That's a real change in the economics.

28:25Cassidy: And there's something I find moving about this. The standard worry about AI in math has been that it's going to flood the literature with confident-sounding nonsense. What this paper is showing is the opposite. When you couple a strong language model to a formal verifier, you get the opposite failure mode: not too much output, but extremely high-confidence, narrowly-scoped output where the only question is whether the problem you stated is the problem you meant.

28:54Finn: That's the right note to leave the listener on, I think. Not "AI is doing math now" — that's been a slow-creeping headline for a couple of years. But "the trust problem in AI-generated mathematics has a real architectural solution, and it just got demonstrated at scale on actual open problems for the first time."

29:13Cassidy: The system is being used right now. The results are being logged on Terence Tao's wiki, alongside contributions from other systems. The collaborators are working mathematicians. The proofs are correct. The cost per success is a few hundred dollars on the wins, with the obvious caveat that the failures cost money too. The honest version of the story is: a frontier model, plus a compiler, plus a willingness to iterate, plus a Mathlib library that's mature enough to support the kinds of problems being asked — those four things, combined, are enough to crack a non-trivial number of open mathematical problems autonomously. The fancy scaffolding helps on the hardest cases. It doesn't help much elsewhere.

29:57Finn: And the meta-finding — that simple agentic loops are catching up to elaborate bespoke systems as the base models improve — is one of those data points that the field is going to be chewing on for a while. If it generalizes beyond this domain, a lot of AI engineering that's currently considered essential may be quietly becoming obsolete.

30:18Cassidy: Or to flip that around: a lot of work that's currently understood as engineering may be turning out to be more about identifying the right grounding signal. The compiler grounds Lean proofs. What grounds the next domain? That's the question this paper opens.

30:34Finn: That's a question worth ending on. Show notes have the paper and some related reading — the AlphaProof and AlphaEvolve papers in particular are good context if you want to go deeper.

30:45Cassidy: And if you want the full transcript with definitions inline, plus the concept pages that link this episode to the others we've done on formal verification and on agentic search — that's all on paperdive.ai.

30:58Finn: Thanks for listening to AI Papers: A Deep Dive.
