First Proof · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A benchmark of ten genuinely hard math problems contributed by working research mathematicians.

As stated in the literature

A research-math benchmark of ten problems contributed by mathematicians including Dan Spielman, Martin Hairer, Andrew Blumberg, and Shmuel Weinberger, used to evaluate RMA, GPT-5.2R, and Aletheia.

Why it matters: Benchmarks contributed by working researchers test whether AI can engage with research-level mathematics, not only polished competition problems.

For example, one First Proof problem might come straight from an open question Martin Hairer has been thinking about, rather than being adapted from a textbook.

Heard on the show

“The paper points to a recent event called the First Proof challenge, where models produced dozens of attempts at hard research problems.”

Episode 101 — Treating Math Formalization Like a Codebase, and Where the Agents Cheat

Mentioned in 2 episodes

Related terms

Aletheia, GPT-4, RMA, Spielman