All episodes

Episode 188 · Jul 02, 2026 · 20 min

A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars

Moakhar, Gholami, Springer et al.

AI For Mathematics Autoformalization

AI Papers: A Deep Dive — Episode 188: A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars — cover art

paperdive.ai

Listen

Ep. 188

A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars

0:00

20 min

Concepts in this episode

AI for Science Agentic AI AI Safety Formal Theorem Proving Agentic Coding Agent Scaffolding Synthetic Data LLM-as-Judge Task Decomposition Parallel Sampling Autonomous Discovery Evaluation & Benchmarks Self-Correction

Click a concept to find related episodes and external papers worth reading. See the full concept index.

About this episode

Paper

Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics

Venue

arXiv:2606.31134

Year

2026

Read the paper

arxiv.org/abs/2606.31134

Also available on

Apple Podcasts Spotify

An off-the-shelf coding agent on a $200-a-month subscription read a proof that already cleared peer review at one of theory's most selective conferences, tried to make it machine-checkable, and got stuck on one line that turns out not to follow — handing back a counterexample you can check by hand. The trick is treating a math proof like a software project, with data types, unit tests, and a manager who can order a rewrite. You'll come away seeing why the real bottleneck in math is no longer writing proofs but trusting them.

What you'll take away

Why general-purpose coding models have quietly overtaken specialist Lean-tuned models at formalizing math
The reframe at the heart of the paper: treat a proof like software, with types, unit-test lemmas, and an orchestrator that can backtrack and refactor
How the system caught a genuine gap in a 2025 STOC proof — and returned a hand-checkable counterexample of one triangle and ten dots
Why the '$5 per problem' and '91%' headline numbers are softer than they sound — subscription pricing arbitrage and a statistical floor from just 32 problems
Where the guarantee actually lives: the Lean kernel proves the proof, but only AI judgment guarantees the formal statement means what the paper said
The reframing of AI-for-math from theorem discovery to tireless, literal-minded refereeing

Chapters

01:08Why trusting a proof is the new bottleneck
01:30The escape hatch that can't be argued with
02:22Two walls that block autoformalization
04:05What if a proof were a software project?
06:20Proving the parent before the children
10:03The line the machine couldn't force
12:21Green nodes, orange nodes, and honesty
13:38The numbers that oversell themselves
15:47The compiler proves; only AI judges meaning
17:52Would you trust the machine referee?

References in this episode

Autoformalization with Large Language Models — The foundational demonstration that general-purpose LLMs can translate informal
The Lean 4 Theorem Prover and Programming Language — The system paper for Lean 4, the language whose kernel provides the 'compiles-or
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data — A representative example of the specialized Lean-expert models that this episode
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition — The competition-math benchmark behind the episode's contested '91 percent' figur

Full transcript

Also available as a plain-text transcript page.

0:00Bella: An AI system read a proof that had already cleared peer review at STOC — one of the most selective conferences in all of theoretical computer science — tried to rewrite it so a machine could check every step, and it got stuck on exactly one line. Not because the AI ran out of skill. Because that line doesn't actually follow. The published proof has a hole, and the system handed back a counterexample you can check by hand.

0:29Tyler: Quick heads up before we dig in — this is an AI-made explainer, and both of our voices are AI too. And here's the part that makes it land. The thing that found the gap wasn't some specialized math AI. It was an off-the-shelf coding agent — the kind people use to write software — running on a two-hundred-dollar-a-month subscription. No fine-tuning, no GPU cluster. By the end you'll see how treating a math proof like a software project, with data types, unit tests, and a project manager who can order a rewrite, lets a general coding tool verify research-level theorems for a few dollars each.

1:09Bella: And if you're not a theory mathematician, here's why you should care. We've crossed a line where AI can generate proofs faster than any human can check them — and an AI proof can read beautifully, flow perfectly, and still hide an error no referee catches. The bottleneck in math isn't writing proofs anymore. It's trusting them. So let me set up the escape hatch mathematicians already have, because the whole paper hangs on it. There's a language called Lean 4 where you write math as code, and a small, paranoid core program — the kernel — checks every logical step against a fixed set of rules. If your proof compiles, it's correct. Full stop. No referee, no "looks right to me," no room for a beautiful argument that's secretly wrong. It's the difference between a persuasive story at a dinner party and a program that either runs or throws an error.

2:06Tyler: Right — so the dream that follows is obvious. Take the informal math humans actually write and automatically translate it into Lean. That's autoformalization. Do it at scale, and you can machine-check any proof, human or AI. So why isn't this already solved?

2:24Bella: Two walls, Tyler. The first one is almost funny. For years the field built small models fine-tuned specifically to be Lean experts. And this paper points out those specialists have been quietly overtaken by general-purpose coding models — the ones optimized to write software, because software is where the money is. The Lean specialists lost to the generalists. The second wall is deeper. To write a modern theorem in Lean, you lean on a giant library of already-formalized concepts called Mathlib. But Mathlib doesn't contain the vocabulary of cutting-edge research. A 2025 paper might use a concept that's never been formally defined anywhere on Earth. It's like being handed a programming language whose standard library is missing half the functions you need — before your main code means anything, you have to write those functions yourself. And get them exactly right, because one wrong definition silently poisons everything built on top of it. And here's the subtle asymmetry the entire rest of the system exists to handle. A proof has a mechanical ground truth — it compiles or it doesn't. A statement doesn't. You can write a definition of "prime number" that compiles perfectly and is quietly, subtly wrong. Nothing complains, because it's internally consistent. It just doesn't mean what you meant. Checking that a formal statement faithfully captures the English is a completely different, judgment-heavy problem — and there's no compiler to save you.

4:02Tyler: Which is the crack the whole design is built around, Bella. The move the authors make is to stop thinking of this as translation and start thinking of it as software engineering. That reframe is the actual contribution. Three pieces. First, types. A good engineer defines their data structures — what a "user" is, what an "order" is — before writing any logic that touches them. This system does the same for math: before it touches the main theorem, it identifies the concepts the theorem needs that Mathlib is missing, and defines each one as a new building block, in dependency order.

4:41Bella: But you just said there's no compiler check that a definition means the right thing. So how does it know it didn't build the subtly-wrong prime number?

4:50Tyler: That's the second piece, and it's the clever one. Unit tests. When a programmer writes a new function, they can't directly prove it's right, so they write little tests — facts that should hold if it works. Adding two and two should give four. This system does exactly that for a new mathematical definition. It generates a handful of statements that should be true of the concept, and it tries to prove them. If those auxiliary lemmas won't prove, the definition is probably broken, and it gets rebuilt. You can't inspect the definition for correctness directly — so you test its behavior.

5:29Bella: So the auxiliary lemmas are unit tests for the vocabulary. That's a nice inversion — using the thing Lean is great at, proving, to check the thing Lean can't check, meaning.

5:40Tyler: Exactly. And the third piece is the manager. Instead of a rigid assembly line that halts the moment one station fails, there's an orchestrator — a project manager that hands sub-tasks to specialized helpers, keeps only the big picture in its own head, and can reopen an earlier decision when a later stage exposes a flaw. Halfway through a proof it can go back and refactor a foundational definition, the way a team rips out a core class in week six because a new feature revealed it was wrong. That backtracking is what makes it an agent instead of a pipeline.

6:16Bella: Let's open up that machinery, because the payoff is understanding how a coding agent ended up doing something a human referee couldn't — catching a hole in a published proof.

6:28Tyler: Picture two assembly lines managed by that orchestrator. One formalizes the paper's statements — what's being claimed. The other formalizes the proofs — why they're true. The statement line runs like this: an extractor pulls the main theorem and its definitions out of the PDF; a planner figures out what's missing from Mathlib; then for each missing concept the system spawns several candidates in parallel, each writing its own Lean definition and running its own unit-test lemmas.

7:00Bella: And something has to pick the winner.

7:02Tyler: An auctioneer, basically — best-of-k selection. It ranks candidates on three signals: how many of the unit-test lemmas actually proved, how faithful the definition looks to the original math, and how short the code is, since shorter usually means cleaner. Then a faithfulness judge does the last-line check on meaning. It inspects the Lean directly, and it also back-translates — turns the Lean back into English and compares it against the paper, across two different models — so the system can't just talk itself into believing a bad formalization is fine.

7:38Bella: Okay, the statements are locked in. How does it prove them without drowning in its own failed attempts?

7:45Tyler: This is the part worth slowing down on. It treats a proof as a tree. The theorem is the root; it breaks into sub-lemmas, which break into smaller ones. But the ordering is backwards from what you'd expect. It proves the parent from the children's statements first — before it proves the children themselves.

8:06Bella: Wait — that's proving something from lemmas you haven't proven yet.

8:11Tyler: Right, and that's deliberate. Think of wiring up a circuit board with placeholder components before you build each one for real. Plugging the sub-lemmas in as black boxes tells you whether the overall argument even works — whether each lemma is the right shape for the job. A lemma can look perfectly reasonable on its own and turn out to be too weak, or the wrong tool entirely. You find that out before you sink hours into proving it. Prove the children first, and you risk manufacturing a part the assembled circuit would have rejected. And when an attempt fails, it doesn't just return "failed." It returns a diagnosis — a hypothesis is missing, this child lemma is too weak, the goal's stuck. The orchestrator routes each case differently: add a lemma here, reformalize a conflicting child there, promote a supposed dead-end into its own sub-problem. There's also a critic that reads the informal proof first and flags gaps — unstated assumptions, hidden case splits, appeals to "the standard argument" — and a checker that stops the agent from quietly weakening the very theorem it's supposed to prove. And a colorful detail: the researchers literally instruct the agent to work for twenty-three hours straight, don't stop, don't ask questions.

9:38Bella: So the shape so far: define the vocabulary as building blocks, unit-test each one, assemble proofs as trees that check the interface before the internals, and let a manager back up and refactor whenever something downstream reveals an early mistake. And that last checker — the one that won't let it weaken the claim — is about to matter a lot. Because when they pointed this system at Huy Tuan Pham's 2025 STOC paper — a result on one of Talagrand's conjectures about random processes — it proved every supporting lemma but one. And it could not force that one. So it did what it's built to do: it checked the failing step against the paper's own literal definitions. The step didn't hold.

10:23Tyler: And crucially, it didn't just throw up its hands. It handed back a reason.

10:28Bella: It handed back a counterexample you can check by hand. Watch this. Take thirteen points — zero through twelve. Build a family from exactly two things: one triangle, the set of points zero, one, two — and ten separate single points. Ten scattered dots. That's the whole thing. One triangle and ten dots. And this is a legitimate instance — not too small to count. Its cheapest fractional cover costs about six-tenths, comfortably above the one-half threshold that would exclude it. And yet it breaks a structural inclusion the proof leaned on. The published argument routes through a step that silently assumes a condition holds in general — and for this little configuration of one triangle and ten dots, it just doesn't. Now here's the line the authors are careful about, and so should we be. This is a flaw in the proof, not a disproof of the theorem. The result might well be true. It's like a bridge that's fine for every car the engineer tested and collapses under one truck they never imagined. The bridge might actually be perfectly safe — but the argument that it's safe has a hole. The paper does not claim the theorem is false. It claims this particular proof, as written, does not go through.

11:49Tyler: And that reframes what AI-for-math is even for. Everyone's chasing "AI discovers new theorems." This is the quieter version, and maybe the more useful one: AI as a tireless, literal-minded referee that reads a proof against its own definitions and won't let a "silently requires" slide by. A human referee at STOC read that proof and missed it. The machine caught it — not because it's smarter, but because it can't skim. And that bug is really just one point on a spectrum the paper makes beautiful. Every Lean proof ultimately rests on a tiny set of built-in foundational axioms in the kernel. Anything beyond those is something the system took on faith instead of proving. Picture each formalized paper as a family tree of lemmas — every node green when it's fully proved down to the foundations, orange when it's an assumption. The number of orange nodes isn't a grade on the system. It's an honest readout of how self-contained each paper actually is.

12:53Bella: Give me the two ends of that spectrum.

12:56Tyler: At the clean end, two of the five STOC papers came out all green — zero extra axioms, zero gaps. The cleanest, a communication-complexity result by Mackenzie and Saffidine, didn't just formalize its own theorem; it reproved from scratch the two outside results the paper had merely cited. Nothing taken on faith at all. At the other end, a Gaussian-mixtures paper admitted three classical facts as clearly-labeled citations — three orange nodes. And right in the middle sits Pham's paper, with a single orange node that turned out not to be cited prior work at all. It was the gap.

13:35Bella: Now the number that gets the attention. On PutnamBench — a standard set of competition-math problems — they drew thirty-two at random, cut off internet access so it couldn't look up existing solutions, and it solved all thirty-two. From that they report a solve rate of at least about ninety-one percent. And I want to flag this now, because it comes back: that ninety-one is a statistical floor extrapolated from a perfect thirty-two-out-of-thirty-two, not a measured score across the whole benchmark. Hold that thought. The cost is where jaws drop. About five dollars per problem, on a flat two-hundred-dollar-a-month subscription, no local GPUs. Compare the leaderboard. One strong system, Seed-Prover, solves eighty-six-and-a-half percent at around a hundred and sixty-eight dollars a problem — ten GPU-days each. Another, Aleph, hits ninety-four-point-eight percent at fifty-four dollars. This system runs at a tenth to a thirtieth of those costs. And for a sense of how brutal this work usually is: the authors cite prior efforts where getting an agent to solve one Putnam question ran up to a thousand dollars and thirty-four hours, and one textbook-formalization project cost over a hundred thousand dollars across twenty thousand agent instances.

14:59Tyler: I want to be honest about that five-dollar number, though, because the paper is. It's a subscription-amortized cost. The exact same token usage at metered, pay-per-use rates would run about twenty-nine dollars a problem — still cheaper than the competitors, but not thirty times cheaper. A big chunk of the headline advantage is pricing arbitrage: a flat-rate subscription lets you consume far more compute-value than you pay for. It's an all-you-can-eat buffet versus paying per plate. There's a real efficiency win underneath — it does use fewer resources than a GPU cluster — but "five dollars a problem" oversells it. And while I'm wearing the skeptic's hat, let me steelman the rest of the doubts, because this is where the strongest claims get soft. Start with that ninety-one percent. The competitors' numbers are real solve rates across the full six-hundred-plus-problem benchmark. This system's number is a confidence-interval floor built from thirty-two perfect answers. Thirty-two is a small sample. The true full-benchmark rate could be meaningfully lower — so comparing a floor from thirty-two problems against full-benchmark rates isn't quite apples to apples. Second, the faithfulness judge — the thing guarding whether a formal statement means what the English meant. It's AI checking AI. Back-translation, two models, expert review on five papers: strong evidence, but the paper admits there's no mechanical ground truth for meaning, and in one case the human experts had limited Lean knowledge. And "formalized five STOC papers" is narrower than it sounds — they did the core theorems and needed lemmas, and explicitly left out the algorithmic results, which are the headline contribution of at least two of those papers. Same story with the unit tests: passing them is strong evidence a definition is right, never a proof.

17:06Bella: All fair, Tyler. Though I'd push back on where the guarantee actually lives. The axiom ledger and the counterexample don't rely on any of that. A Lean proof that compiles down to the kernel is checked by the kernel, not by an AI's opinion. The zero-axiom papers are just true. Where I fully agree with you is the faithfulness of the statements — the machine can prove the theorem it was handed is airtight, and still be trusting a chain of AI judgments that the theorem it was handed is the one the paper actually stated.

17:40Tyler: And that's the gap that survives all the impressive numbers. The compiler guarantees the proof. Nothing but AI judgment guarantees the statement.

17:50Bella: But step back, because the big claim here is bigger than any method. For years, "an AI verified this" and "Lean verified this" were different universes — one persuasive, one certain — and crossing between them meant either a six-figure budget or specialist models that choked on real research. This paper shows a general coding agent, wrapped in the discipline of software engineering, can cross that gap on a consumer subscription. And its most valuable use turns out to be auditing existing math rather than inventing new math. AI as referee. Given that we can now generate proofs faster than we can trust them, the referee might be the job that matters most. So here's the question worth arguing about. Would you trust a system like this as a second referee on published proofs — a machine that can't skim, that catches gaps humans miss — or does the fact that AI writes the formal statement and AI judges whether it's faithful make the whole loop too circular to hand that responsibility to? Tell us where you land, and what would change your mind.

18:58Tyler: If you want to go deeper, the full annotated version is on paperdive.ai — every technical term tap-to-define, links to the related papers grouped by theme, plus our weekly and monthly roundups. Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Bella and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics," posted June 30th, 2026 — we're recording the day after.

19:37Bella: The proof either compiles or it doesn't. Everything else is still a matter of trust — so point the machine where it can't skim.