Literature review · 6 episode(s)

AI for science and mathematical discovery


Autonomous experimentation reaches real labs

An AI given a real optical lab, one phrase ('optical computing for AI'), and 21 hours produced, mid-run, a falsification experiment for its own larger claim E002. The scaffolding doing the load-bearing work (role-specialised agents, structured lab-notebook handoffs, a calibrated rig) is what lets it run coherently for a day instead of twenty minutes; the headline 'new physics' framing oversells, but the existence proof is real. A Princeton system made end-to-end with no human in the loop and, when researchers deliberately sabotaged it (a removed chip, a mislabeled material), caught both failures and replanned around them E072. The structural pattern is consistent: lock down 'atom' primitives, let LLMs compose 'molecule' workflows, and route every factual claim through an external database. The aim is not to prevent hallucinations but to make them recoverable.
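The atom/molecule pattern can be sketched in a few lines. This is a minimal illustration, not the design of either system: the primitive names, the clamping range, and the in-memory `FACTS` table (standing in for an external database, with an illustrative value) are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical "atom" primitives: small, fixed, human-vetted operations.
ATOMS: dict[str, Callable[[float], float]] = {
    # Clamp the requested power to an assumed safe range of 0 to 5 mW.
    "set_laser_power_mw": lambda mw: min(max(mw, 0.0), 5.0),
    # Stand-in for a sensor read; a real rig would query hardware here.
    "read_photodiode_v": lambda gain: 0.1 * gain,
}

# Stand-in for an external reference database (illustrative value only).
FACTS: dict[str, float] = {"silicon_refractive_index_1550nm": 3.48}


def lookup_fact(key: str) -> float:
    """Return a vetted value, or fail loudly so the planner can replan."""
    if key not in FACTS:
        raise KeyError(f"no vetted value for {key!r}")
    return FACTS[key]


def run_molecule(plan: list[tuple[str, float]]) -> list[float]:
    """Run an LLM-proposed workflow, rejecting any step that is not a known atom."""
    unknown = [name for name, _ in plan if name not in ATOMS]
    if unknown:
        raise ValueError(f"plan uses unknown primitives: {unknown}")
    return [ATOMS[name](arg) for name, arg in plan]
```

A hallucinated step or an unverifiable fact surfaces as an exception before anything runs, which is what makes the failure recoverable: the planner can catch it and propose a different workflow.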

Research-grade mathematics

A system autonomously cracked nine open Erdős problems, including a thirty-year-old one, with proofs verified by , and the twist is that a twenty-line '' of LLM-plus-compiler matched the system's elaborate evolutionary search on most problems E067. The verification guarantee is real; the leak is upstream, where LLM judges scoring proof sketches reward confident-sounding citations. A separate result on 2026 (gold-medal level from a 30B open model) suggests that olympiad-grade proof reasoning depends more on training procedure than on scale E048.
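A generate-check-repair loop of the LLM-plus-compiler kind is short enough to sketch. This is a hedged illustration rather than the code from E067: `repair_loop`, `fake_generate`, and `fake_check` are hypothetical names, and the two stand-in callables replace a real model and a real proof checker.

```python
from typing import Callable, Optional


def repair_loop(
    generate: Callable[[str, str], str],
    check: Callable[[str], Optional[str]],
    statement: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Ask for a candidate, run the checker, feed any error back, repeat."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate(statement, feedback)
        error = check(candidate)  # None means the checker accepted it
        if error is None:
            return candidate
        feedback = error  # the next attempt sees the checker's complaint
    return None


# Usage with stand-in callables (no real model or proof checker involved).
_drafts = iter(["draft with a gap", "complete draft"])


def fake_generate(statement: str, feedback: str) -> str:
    return next(_drafts)


def fake_check(candidate: str) -> Optional[str]:
    return None if candidate == "complete draft" else "step 3 does not follow"


result = repair_loop(fake_generate, fake_check, "toy statement")
```

Only the checker decides acceptance, so the guarantee rests on the checker and not on how persuasive a candidate sounds.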

's research-mathematics workbench reframes the goal: not an oracle scoring 48% on but a stateful assistant whose value lies in the fact that a wrong proof, together with the assistant's own critique of it, can give a mathematician enough structure to solve the problem E029. The same pattern shows up in academic work: an architecture on solving 8 of 10 problems where the same base model scores 0, built from seven specialised agents sharing an append-only whiteboard, with ablations showing every component matters E076. The honest version of the claim is that scaffolding has become a real research multiplier. The systemic risk is what happens to peer review when plausible 20-page proofs can be produced in minutes but take days to verify E029.
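An append-only whiteboard shared between agents can be as small as the sketch below. It is an assumption-laden illustration, not the architecture from E076: the `Note` and `Whiteboard` classes, the `kind` labels, and the agent names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Note:
    """One immutable whiteboard entry."""
    author: str
    kind: str
    text: str


@dataclass
class Whiteboard:
    """Shared log that agents can append to and read, but never edit."""
    _notes: list[Note] = field(default_factory=list)

    def post(self, author: str, kind: str, text: str) -> int:
        """Append a note and return its position in the log."""
        self._notes.append(Note(author, kind, text))
        return len(self._notes) - 1

    def read(self, kind: Optional[str] = None) -> tuple[Note, ...]:
        """Return a read-only snapshot, optionally filtered by kind."""
        return tuple(n for n in self._notes if kind is None or n.kind == kind)
```

Because entries are frozen and `read` returns a tuple, a critic's objection stays on the board next to the claim it targets, so a later agent sees both the wrong step and the reason it was rejected.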

Discovery as optimisation substrate

A single universal matches state-of-the-art for $3 and lifts from 32% to ~90%; the substantive idea is that '' (error traces, profiler dumps, failed-test diagnostics) is the LLM-era analog of a , with showing 4-6x faster convergence E065. Under this view, the required expertise shifts from optimisation craft to design.

Adjacent work in scientific computing builds a geometric substrate where numerical-method experience can accumulate: every method gets a coordinate in a unit cube so that similarity becomes measurable, eliminating combinatorial explosion in method selection E042. AIRA pushes the same logic to neural architecture search, with eleven exploring 2,300 architectures and autonomously substituting from object detection into a GPT training script for the single largest improvement in the run E053. The candid limitation is the same across all three: agents are doing competent engineering recombination, not inventing new mathematical mechanisms, at least not yet.
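The unit-cube idea can be made concrete with a toy example. The three axes and every coordinate below are invented for illustration and are not taken from E042; the point is only that once each method has a position, similarity reduces to a distance computation.

```python
import math

# Hypothetical axes, each scaled to [0, 1]:
# (order of accuracy, stability, cost per step). All values are made up.
METHODS: dict[str, tuple[float, float, float]] = {
    "explicit_euler": (0.1, 0.2, 0.1),
    "implicit_euler": (0.1, 0.9, 0.4),
    "rk4": (0.4, 0.3, 0.3),
}


def distance(a: str, b: str) -> float:
    """Euclidean distance between two methods inside the unit cube."""
    return math.dist(METHODS[a], METHODS[b])


def nearest(name: str, k: int = 1) -> list[str]:
    """Return the k methods closest to `name`, nearest first."""
    others = [m for m in METHODS if m != name]
    return sorted(others, key=lambda m: distance(name, m))[:k]
```

With these made-up coordinates, `nearest("explicit_euler")` ranks `rk4` ahead of `implicit_euler`. The lookup scans the list of methods once instead of enumerating combinations of their properties.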

Episodes anchoring this topic