All episodes
Episode 188 · Jul 02, 2026 · 20 min

A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars

Moakhar, Gholami, Springer et al.

AI For Mathematics Autoformalization
AI Papers: A Deep Dive — Episode 188: A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars — cover art
paperdive.ai
Ep. 188
A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars
0:00
20 min
Paper
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Venue
arXiv:2606.31134
Year
2026
Read the paper
arxiv.org/abs/2606.31134
Also available on
Apple Podcasts Spotify

An off-the-shelf coding on a $200-a-month subscription read a proof that already cleared peer review at one of theory's most selective conferences, tried to make it machine-checkable, and got stuck on one line that turns out not to follow — handing back a you can check by hand. The trick is treating a math proof like a software project, with data types, , and a manager who can order a rewrite. You'll come away seeing why the real bottleneck in math is no longer writing proofs but trusting them.

What you'll take away

  • Why general-purpose coding models have quietly overtaken specialist -tuned models at formalizing math
  • The reframe at the heart of the paper: treat a proof like software, with types, unit-test , and an orchestrator that can backtrack and refactor
  • How the system caught a genuine gap in a 2025 proof — and returned a hand-checkable of one triangle and ten dots
  • Why the '$5 per problem' and '91%' headline numbers are softer than they sound — subscription pricing arbitrage and a statistical floor from just 32 problems
  • Where the guarantee actually lives: the proves the proof, but only AI judgment guarantees the formal statement means what the paper said
  • The reframing of AI-for-math from theorem discovery to tireless, literal-minded refereeing

Chapters

  1. 01:08Why trusting a proof is the new bottleneck
  2. 01:30The escape hatch that can't be argued with
  3. 02:22Two walls that block autoformalization
  4. 04:05What if a proof were a software project?
  5. 06:20Proving the parent before the children
  6. 10:03The line the machine couldn't force
  7. 12:21Green nodes, orange nodes, and honesty
  8. 13:38The numbers that oversell themselves
  9. 15:47The compiler proves; only AI judges meaning
  10. 17:52Would you trust the machine referee?

References in this episode

Also available as a plain-text transcript page.

0:00Bella: An AI system read a proof that had already cleared peer review at — one of the most selective conferences in all of theoretical computer science — tried to rewrite it so a machine could check every step, and it got stuck on exactly one line. Not because the AI ran out of . Because that line doesn't actually follow. The published proof has a hole, and the system handed back a you can check by hand.

0:29Tyler: Quick before we dig in — this is an AI-made explainer, and both of our voices are AI too. And here's the part that makes it land. The thing that found the gap wasn't some specialized math AI. It was an off-the-shelf coding — the kind people use to write software — running on a two-hundred-dollar-a-month subscription. No , no GPU cluster. By the end you'll see how treating a math proof like a software project, with data types, , and a project manager who can order a rewrite, lets a general coding tool verify research-level theorems for a few dollars each.

1:09Bella: And if you're not a theory mathematician, here's why you should care. We've crossed a line where AI can generate proofs faster than any human can check them — and an AI proof can read beautifully, flow perfectly, and still hide an error no referee catches. The bottleneck in math isn't writing proofs anymore. It's trusting them. So let me set up the escape hatch mathematicians already have, because the whole paper hangs on it. There's a language called where you write math as code, and a small, paranoid core program — the — checks every logical step against a fixed set of rules. If your proof , it's correct. Full stop. No referee, no "looks right to me," no room for a beautiful argument that's secretly wrong. It's the difference between a persuasive story at a dinner party and a program that either runs or throws an error.

2:06Tyler: Right — so the dream that follows is obvious. Take the informal math humans actually write and automatically translate it into . That's . Do it at scale, and you can machine-check any proof, human or AI. So why isn't this already solved?

2:24Bella: Two walls, Tyler. The first one is almost funny. For years the field built small models specifically to be experts. And this paper points out those specialists have been quietly overtaken by general-purpose coding models — the ones optimized to write software, because software is where the money is. The Lean specialists lost to the generalists. The second wall is deeper. To write a modern theorem in Lean, you lean on a giant library of already-formalized concepts called . But Mathlib doesn't contain the vocabulary of cutting-edge research. A 2025 paper might use a concept that's never been formally defined anywhere on Earth. It's like being handed a programming language whose standard library is missing half the functions you need — before your main code means anything, you have to write those functions yourself. And get them exactly right, because one wrong definition silently poisons everything built on top of it. And here's the subtle asymmetry the entire rest of the system exists to handle. A proof has a mechanical — it or it doesn't. A statement doesn't. You can write a definition of "prime number" that compiles perfectly and is quietly, subtly wrong. Nothing complains, because it's internally consistent. It just doesn't mean what you meant. Checking that a formal statement faithfully captures the English is a completely different, judgment-heavy problem — and there's no compiler to save you.

4:02Tyler: Which is the crack the whole design is built around, Bella. The move the authors make is to stop thinking of this as translation and start thinking of it as software engineering. That reframe is the actual contribution. Three pieces. First, types. A good engineer defines their data structures — what a "user" is, what an "order" is — before writing any logic that touches them. This system does the same for math: before it touches the main theorem, it identifies the concepts the theorem needs that is missing, and defines each one as a new building block, in dependency order.

4:41Bella: But you just said there's no check that a definition means the right thing. So how does it know it didn't build the subtly-wrong prime number?

4:50Tyler: That's the second piece, and it's the clever one. Unit tests. When a programmer writes a new function, they can't directly prove it's right, so they write little tests — facts that should hold if it works. Adding two and two should give four. This system does exactly that for a new mathematical definition. It generates a handful of statements that should be true of the concept, and it tries to prove them. If those auxiliary won't prove, the definition is probably broken, and it gets rebuilt. You can't inspect the definition for correctness directly — so you test its behavior.

5:29Bella: So the auxiliary are for the vocabulary. That's a nice inversion — using the thing is great at, proving, to check the thing Lean can't check, meaning.

5:40Tyler: Exactly. And the third piece is the manager. Instead of a rigid assembly line that halts the moment one station fails, there's an orchestrator — a project manager that hands sub-tasks to specialized helpers, keeps only the big picture in its own head, and can reopen an earlier decision when a later stage exposes a flaw. Halfway through a proof it can go back and refactor a foundational definition, the way a team rips out a core class in week six because a new feature revealed it was wrong. That is what makes it an instead of a .

6:16Bella: Let's open up that machinery, because the payoff is understanding how a coding ended up doing something a human referee couldn't — catching a hole in a published proof.

6:28Tyler: Picture two assembly lines managed by that orchestrator. One formalizes the paper's statements — what's being claimed. The other formalizes the proofs — why they're true. The statement line runs like this: an extractor pulls the main theorem and its definitions out of the PDF; a planner figures out what's missing from ; then for each missing concept the system spawns several candidates in parallel, each writing its own definition and running its own unit-test .

7:00Bella: And something has to pick the winner.

7:02Tyler: An auctioneer, basically — best-of-k selection. It ranks candidates on three signals: how many of the unit-test actually proved, how faithful the definition looks to the original math, and how short the code is, since shorter usually means cleaner. Then a judge does the last-line check on meaning. It inspects the directly, and it also back-translates — turns the Lean back into English and compares it against the paper, across two different models — so the system can't just talk itself into believing a bad formalization is fine.

7:38Bella: Okay, the statements are locked in. How does it prove them without drowning in its own failed attempts?

7:45Tyler: This is the part worth slowing down on. It treats a proof as a tree. The theorem is the ; it breaks into sub-, which break into smaller ones. But the ordering is backwards from what you'd expect. It proves the parent from the children's statements first — before it proves the children themselves.

8:06Bella: Wait — that's proving something from you haven't proven yet.

8:11Tyler: Right, and that's deliberate. Think of wiring up a board with placeholder components before you build each one for real. Plugging the sub- in as black boxes tells you whether the overall argument even works — whether each lemma is the right shape for the job. A lemma can look perfectly reasonable on its own and turn out to be too weak, or the wrong tool entirely. You find that out before you sink hours into proving it. Prove the children first, and you risk manufacturing a part the assembled circuit would have rejected. And when an attempt fails, it doesn't just return "failed." It returns a diagnosis — a hypothesis is missing, this child lemma is too weak, the goal's stuck. The orchestrator routes each case differently: add a lemma here, reformalize a conflicting child there, promote a supposed dead-end into its own sub-problem. There's also a critic that reads the informal proof first and flags gaps — unstated assumptions, hidden case splits, appeals to "the standard argument" — and a checker that stops the from quietly weakening the very theorem it's supposed to prove. And a colorful detail: the researchers literally instruct the agent to work for twenty-three hours straight, don't stop, don't ask questions.

9:38Bella: So the shape so far: define the vocabulary as building blocks, unit-test each one, assemble proofs as trees that check the interface before the internals, and let a manager back up and refactor whenever something downstream reveals an early mistake. And that last checker — the one that won't let it weaken the claim — is about to matter a lot. Because when they pointed this system at Huy Tuan Pham's 2025 paper — a result on one of Talagrand's about random processes — it proved every supporting but one. And it could not force that one. So it did what it's built to do: it checked the failing step against the paper's own literal definitions. The step didn't hold.

10:23Tyler: And crucially, it didn't just throw up its hands. It handed back a reason.

10:28Bella: It handed back a you can check by hand. Watch this. Take thirteen points — zero through twelve. Build a family from exactly two things: one triangle, the set of points zero, one, two — and ten separate single points. Ten scattered dots. That's the whole thing. One triangle and ten dots. And this is a legitimate instance — not too small to count. Its cheapest fractional cover costs about six-tenths, comfortably above the one-half threshold that would exclude it. And yet it breaks a structural inclusion the proof leaned on. The published argument routes through a step that silently assumes a condition holds in general — and for this little configuration of one triangle and ten dots, it just doesn't. Now here's the line the authors are careful about, and so should we be. This is a flaw in the proof, not a disproof of the theorem. The result might well be true. It's like a bridge that's fine for every car the engineer tested and collapses under one truck they never imagined. The bridge might actually be perfectly safe — but the argument that it's safe has a hole. The paper does not claim the theorem is false. It claims this particular proof, as written, does not go through.

11:49Tyler: And that reframes what AI-for-math is even for. Everyone's chasing "AI discovers new theorems." This is the quieter version, and maybe the more useful one: AI as a tireless, literal-minded referee that reads a proof against its own definitions and won't let a "silently requires" slide by. A human referee at read that proof and missed it. The machine caught it — not because it's smarter, but because it can't skim. And that bug is really just one point on a spectrum the paper makes beautiful. Every proof ultimately rests on a tiny set of built-in foundational in the . Anything beyond those is something the system took on faith instead of proving. Picture each formalized paper as a family tree of — every node green when it's fully proved down to the foundations, orange when it's an assumption. The number of orange nodes isn't a grade on the system. It's an honest readout of how self-contained each paper actually is.

12:53Bella: Give me the two ends of that spectrum.

12:56Tyler: At the clean end, two of the five papers came out all green — zero extra , zero gaps. The cleanest, a communication-complexity result by Mackenzie and Saffidine, didn't just formalize its own theorem; it reproved from scratch the two outside results the paper had merely cited. Nothing taken on faith at all. At the other end, a -mixtures paper admitted three classical facts as clearly-labeled citations — three orange nodes. And right in the middle sits Pham's paper, with a single orange node that turned out not to be cited work at all. It was the gap.

13:35Bella: Now the number that gets the . On PutnamBench — a standard set of competition-math problems — they drew thirty-two at random, cut off internet access so it couldn't look up existing solutions, and it solved all thirty-two. From that they report a solve rate of at least about ninety-one percent. And I want to flag this now, because it comes back: that ninety-one is a statistical floor extrapolated from a perfect thirty-two-out-of-thirty-two, not a measured score across the whole benchmark. Hold that thought. The cost is where jaws drop. About five dollars per problem, on a flat two-hundred-dollar-a-month subscription, no local GPUs. Compare the leaderboard. One strong system, Seed-Prover, solves eighty-six-and-a-half percent at around a hundred and sixty-eight dollars a problem — ten GPU-days each. Another, Aleph, hits ninety-four-point-eight percent at fifty-four dollars. This system runs at a tenth to a thirtieth of those costs. And for a sense of how brutal this work usually is: the authors cite efforts where getting an to solve one question ran up to a thousand dollars and thirty-four hours, and one textbook-formalization project cost over a hundred thousand dollars across twenty thousand agent instances.

14:59Tyler: I want to be honest about that five-dollar number, though, because the paper is. It's a subscription-amortized cost. The exact same usage at metered, pay-per-use rates would run about twenty-nine dollars a problem — still cheaper than the competitors, but not thirty times cheaper. A big chunk of the headline is pricing arbitrage: a flat-rate subscription lets you consume far more compute-value than you pay for. It's an all-you-can-eat buffet versus paying per plate. There's a real efficiency win underneath — it does use fewer resources than a GPU cluster — but "five dollars a problem" oversells it. And while I'm wearing the skeptic's hat, let me the rest of the doubts, because this is where the strongest claims get soft. Start with that ninety-one percent. The competitors' numbers are real solve rates across the full six-hundred-plus-problem benchmark. This system's number is a confidence-interval floor built from thirty-two perfect answers. Thirty-two is a small sample. The true full-benchmark rate could be meaningfully lower — so comparing a floor from thirty-two problems against full-benchmark rates isn't quite apples to apples. Second, the judge — the thing guarding whether a formal statement means what the English meant. It's AI checking AI. Back-translation, two models, expert review on five papers: strong evidence, but the paper admits there's no mechanical for meaning, and in one case the human experts had limited knowledge. And "formalized five papers" is narrower than it sounds — they did the core theorems and needed , and explicitly left out the algorithmic results, which are the headline contribution of at least two of those papers. Same story with the : passing them is strong evidence a definition is right, never a proof.

17:06Bella: All fair, Tyler. Though I'd push back on where the guarantee actually lives. The ledger and the don't rely on any of that. A proof that down to the is checked by the kernel, not by an AI's opinion. The zero-axiom papers are just true. Where I fully agree with you is the of the statements — the machine can prove the theorem it was handed is airtight, and still be trusting a chain of AI judgments that the theorem it was handed is the one the paper actually stated.

17:40Tyler: And that's the gap that survives all the impressive numbers. The guarantees the proof. Nothing but AI judgment guarantees the statement.

17:50Bella: But step back, because the big claim here is bigger than any method. For years, "an AI verified this" and " verified this" were different universes — one persuasive, one certain — and crossing between them meant either a six-figure budget or specialist models that choked on real research. This paper shows a general coding , wrapped in the discipline of software engineering, can cross that gap on a consumer subscription. And its most valuable use turns out to be auditing existing math rather than inventing new math. AI as referee. Given that we can now generate proofs faster than we can trust them, the referee might be the job that matters most. So here's the question worth arguing about. Would you trust a system like this as a second referee on published proofs — a machine that can't skim, that catches gaps humans miss — or does the fact that AI writes the formal statement and AI judges whether it's faithful make the whole loop too circular to hand that responsibility to? Tell us where you land, and what would change your mind.

18:58Tyler: If you want to go deeper, the full annotated version is on paperdive.ai — every technical term tap-to-define, links to the related papers grouped by theme, plus our weekly and monthly roundups. Quick housekeeping: this script was written by Anthropic's , Bella and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics," posted June 30th, 2026 — we're recording the day after.

19:37Bella: The proof either or it doesn't. Everything else is still a matter of trust — so point the machine where it can't skim.