Literature review · 6 episodes

AI for science and mathematical discovery

Autonomous experimentation reaches real labs

An AI given a real optical lab, one phrase ('optical computing for AI'), and 21 hours produced, mid-run, an experiment that falsified its own larger claim E002. The scaffolding does the load-bearing work: role-specialised agents, structured lab-notebook handoffs, and a calibrated rig are what let it run coherently for a day instead of twenty minutes. The headline 'new physics' framing oversells, but the existence proof is real. A Princeton system made graphene end-to-end with no human in the loop and, when researchers deliberately sabotaged it (removed chip, mislabeled material), caught both failures and replanned around them E072. The structural pattern is consistent: lock down 'atom' primitives, let LLMs compose 'molecule' workflows, and route every factual claim through an external database. The aim is not to prevent hallucinations but to make them recoverable.
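
The atom/molecule pattern can be sketched in a few lines. This is a toy under stated assumptions, not either lab's actual code: the primitive names, the `MATERIAL_DB` set standing in for an external database, and the rule-based `toy_replan` standing in for the LLM planner are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

# Stand-in for the external database; entries are invented for this example.
MATERIAL_DB = {"graphene", "hBN"}

# "Atoms": small fixed primitives that fail loudly instead of guessing.
def mount_chip(s: State) -> State:
    return {**s, "chip": True}

def load_material(s: State) -> State:
    if not s.get("chip"):
        raise RuntimeError("no chip mounted")
    return {**s, "loaded": True}

def relabel(s: State) -> State:
    # Trust the measurement over the label that came with the sample.
    return {**s, "label": s["measured"]}

def verify_label(s: State) -> State:
    # A label counts as a fact only if the database knows it and the
    # measurement agrees with it.
    if s["label"] not in MATERIAL_DB or s["label"] != s["measured"]:
        raise RuntimeError("label mismatch")
    return s

ATOMS: Dict[str, Callable[[State], State]] = {
    "mount_chip": mount_chip,
    "load_material": load_material,
    "relabel": relabel,
    "verify_label": verify_label,
}

# Hypothetical replanner: a lookup table where a real system would ask an LLM.
REPAIRS = {"no chip mounted": "mount_chip", "label mismatch": "relabel"}

def toy_replan(workflow: List[str], failed_at: int, error: str) -> List[str]:
    # Insert the repair atom just before the step that failed.
    return workflow[:failed_at] + [REPAIRS[error]] + workflow[failed_at:]

def run(workflow: List[str], state: State, replan=toy_replan,
        max_replans: int = 3) -> Tuple[State, List[str], List[str]]:
    """Execute a 'molecule' (a list of atom names), replanning on failure."""
    errors: List[str] = []
    for _ in range(max_replans + 1):
        s = dict(state)  # each attempt restarts from the initial state
        for i, name in enumerate(workflow):
            try:
                s = ATOMS[name](s)
            except RuntimeError as err:
                errors.append(str(err))
                workflow = replan(workflow, i, str(err))
                break
        else:  # no atom failed, so the workflow completed
            return s, workflow, errors
    raise RuntimeError(f"unrecovered after {max_replans} replans: {errors}")

# Two injected faults, loosely mirroring the sabotage test described above.
sabotaged = {"chip": False, "label": "graphite", "measured": "graphene"}
final, plan, errors = run(["load_material", "verify_label"], sabotaged)
print(plan)
print(errors)
```

The planner can only choose among registered atoms, and each atom checks its own preconditions, so a bad plan surfaces as a named error that the next planning round can act on.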

Research-grade mathematics

A DeepMind system autonomously cracked nine open Erdős problems, including a thirty-year-old one, with proofs verified by Lean. The twist is that a twenty-line 'Ralph loop' of LLM plus compiler matched the elaborate evolutionary search on most problems E067. The verification guarantee is real; the leak is upstream, where LLM judges scoring proof sketches reward confident-sounding hallucinated citations. A separate result on USAMO 2026, gold-medal level from a 30B open model, suggests olympiad-grade proof reasoning owes more to training procedure than to scale E048.
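
The shape of such a loop is simple enough to sketch. The version below is a minimal illustration, not the system's code: `propose` stands in for the LLM call and `check` for the Lean compiler, and both are replaced by toy functions (a bisecting guesser and an exact arithmetic check) so the example runs on its own.

```python
from typing import Callable, Optional, Tuple

def ralph_loop(goal: str,
               propose: Callable[[str, str], str],
               check: Callable[[str], Tuple[bool, str]],
               max_iters: int = 20) -> Optional[str]:
    """Propose, verify, feed the verifier's message back, repeat."""
    feedback = ""
    for _ in range(max_iters):
        candidate = propose(goal, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate  # only checker-approved output leaves the loop
    return None

# Toy checker: trusted and exact, playing the role of the proof compiler.
def toy_check(candidate: str) -> Tuple[bool, str]:
    n = int(candidate)
    if n * n == 1369:
        return True, ""
    return False, "too small" if n * n < 1369 else "too large"

# Toy proposer: untrusted, steered only by the checker's messages.
def make_bisecting_proposer(lo: int, hi: int) -> Callable[[str, str], str]:
    bounds = {"lo": lo, "hi": hi, "last": lo}
    def propose(goal: str, feedback: str) -> str:
        if feedback == "too large":
            bounds["hi"] = bounds["last"] - 1
        elif feedback == "too small":
            bounds["lo"] = bounds["last"] + 1
        bounds["last"] = (bounds["lo"] + bounds["hi"]) // 2
        return str(bounds["last"])
    return propose

answer = ralph_loop("find n with n*n == 1369",
                    make_bisecting_proposer(0, 100), toy_check)
print(answer)
```

The guarantee comes entirely from `check`: the proposer can be arbitrarily wrong, and the loop still never returns an unverified candidate. That is also why the weak point sits upstream, in anything that is judged but never compiled.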

DeepMind's research-mathematics workbench reframes the goal: not an oracle scoring 48% on FrontierMath but a stateful assistant. Its value showed when a wrong proof, together with the system's own critique of it, gave a mathematician enough structure to solve a Kourovka Notebook problem E029. The same pattern shows up in academic work: an agent architecture on Claude Opus 4.6 solves 8 of 10 First Proof problems where the same base model scores 0, using seven specialised agents that share an append-only whiteboard, with ablations showing every component matters E076. The honest version of the claim is that scaffolding has become a real research multiplier. The systemic risk is what happens to peer review when plausible 20-page proofs can be produced in minutes but take days to verify E029.
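
An append-only whiteboard is a small data structure. The sketch below shows one plausible shape for it; the class, the field names, and the agent roles are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)  # entries cannot be modified once written
class Entry:
    seq: int      # position on the board, assigned at write time
    author: str   # which agent wrote it
    kind: str     # e.g. "attempt", "critique", "lemma"
    text: str

class Whiteboard:
    """Shared log with append and read only: no edit, no delete."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, author: str, kind: str, text: str) -> Entry:
        entry = Entry(len(self._entries), author, kind, text)
        self._entries.append(entry)
        return entry

    def read(self, since: int = 0) -> Tuple[Entry, ...]:
        # Return an immutable snapshot so readers cannot alter the log.
        return tuple(self._entries[since:])

board = Whiteboard()
board.append("prover", "attempt", "Claim: the bound holds for all n.")
board.append("critic", "critique", "Entry 0 fails at n = 2.")
board.append("prover", "attempt", "Restrict the claim to n >= 3.")
for e in board.read():
    print(e.seq, e.author, e.kind, e.text)
```

Because nothing is overwritten, a failed attempt stays on the board next to the critique that refuted it, which is the kind of record a human reader or a later agent can build on.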

Discovery as optimisation substrate

A single universal API matches state-of-the-art circle packing for $3 and lifts ARC-AGI from 32% to ~90%. The substantive idea is that 'side information' (error traces, profiler dumps, failed-test diagnostics) is the LLM-era analog of a gradient, with ablations showing 4-6x faster convergence E065. Under this view, the expertise that matters shifts from optimisation craft to evaluator design.
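
A toy comparison makes the gradient analogy concrete. In the sketch below the evaluator returns a scalar score plus textual diagnostics; one optimiser reads the diagnostics and the other sees only the score. Everything here (the hidden target, the note format, both optimisers) is invented for illustration and is not the API from the episode, and the evaluation counts it produces say nothing about the reported 4-6x figure.

```python
from typing import List, Sequence, Tuple

TARGET = (3, 1, 4, 1)  # hidden from both optimisers; arbitrary toy value

def evaluate(cand: Sequence[int]) -> Tuple[int, List[str]]:
    """Return (score, side information) for a candidate."""
    score, notes = 0, []
    for i, (c, t) in enumerate(zip(cand, TARGET)):
        if c == t:
            score += 1
        else:
            # Diagnostic text, analogous to a failed-test message.
            notes.append(f"{i}:{'low' if c < t else 'high'}")
    return score, notes

def optimise_with_side_info(start: Sequence[int]) -> Tuple[Tuple[int, ...], int]:
    cand, evals = tuple(start), 0
    while True:
        score, notes = evaluate(cand)
        evals += 1
        if score == len(cand):
            return cand, evals
        nxt = list(cand)
        for note in notes:  # each note says where to move and in which direction
            idx, direction = note.split(":")
            nxt[int(idx)] += 1 if direction == "low" else -1
        cand = tuple(nxt)

def optimise_score_only(start: Sequence[int]) -> Tuple[Tuple[int, ...], int]:
    cand = list(start)
    best, _ = evaluate(cand)
    evals = 1
    for i in range(len(cand)):
        for v in range(10):  # blind sweep over digits at position i
            trial = cand[:i] + [v] + cand[i + 1:]
            score, _ = evaluate(trial)  # diagnostics deliberately ignored
            evals += 1
            if score > best:
                cand, best = trial, score
                break
    return tuple(cand), evals

print(optimise_with_side_info((0, 0, 0, 0)))
print(optimise_score_only((0, 0, 0, 0)))
```

Both optimisers reach the target, but the score-only one has to probe for a direction that the diagnostics simply state. The quality of the notes written inside `evaluate` decides how useful they are, which is the sense in which the craft moves to evaluator design.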

Adjacent work in scientific computing builds a geometric substrate where numerical-method experience can accumulate — every method gets a coordinate in a unit cube so similarity becomes measurable, eliminating combinatorial explosion in method selection E042. AIRA pushes the same logic to neural architecture search, with eleven agents exploring 2,300 architectures and autonomously substituting focal loss from object detection into a GPT training script for the single largest improvement in the run E053. The candid limitation is the same across all three: agents are doing competent engineering recombination, not inventing new mathematical mechanisms — yet.
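
For readers unfamiliar with the borrowed component: focal loss, introduced for dense object detection, multiplies the cross-entropy term by (1 - p)^gamma, where p is the probability the model assigns to the correct class, so already well-predicted examples contribute less. The sketch below applies that scaling to a single probability. Treating it per token, the default gamma of 2.0, and the function names are assumptions for illustration; the source does not describe how the agents wired it into the training script.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Standard negative log-likelihood of the correct class."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Cross-entropy down-weighted by (1 - p_true) ** gamma."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)

# Compare the two losses on a hard, a middling, and an easy prediction.
for p in (0.1, 0.5, 0.9):
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

With gamma set to 0 the function reduces to plain cross-entropy, which makes it a one-parameter change to an existing loss and so a typical candidate for the recombination described above.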

Episodes anchoring this topic

067-advancing-mathematics-research-with-ai-driven-formal-proof-s
Showed that a 20-line LLM+Lean loop can match elaborate evolutionary search on real open Erdős problems.
002-end-to-end-autonomous-scientific-discovery-on-a-real-optical
Demonstrated 21-hour coherent autonomous experimentation, including self-falsification of an overclaim.
072-qumus-realization-of-an-embodied-ai-quantum-material-experim
Built the first end-to-end autonomous graphene system and showed recoverable hallucination under deliberate sabotage.
076-rma-an-agentic-system-for-research-level-mathematical-proble
Demonstrated that architecture, not scale, can take Claude Opus 4.6 from 0 to 8/10 on First Proof research problems.
048-achieving-gold-medal-level-olympiad-reasoning-via-simple-and
Matched the top human score on USAMO 2026 with a 30B model and a fully published recipe (curriculum, two-stage RL, critique loops).
065-optimize-anything-a-universal-api-for-optimizing-any-text-pa
Unified the AlphaEvolve/FunSearch/GEPA/ADAS family and argued side information is the new gradient.