Mathematics: the clearest signal
Coupling an LLM to the Lean compiler turned part of mathematical AI from 'plausible-looking text' into externally verified results. A Google DeepMind system solved nine open Erdős problems, including one open since 1996, with the twist that a 20-line 'Ralph loop' (LLM plus compiler plus retry) matched a much more sophisticated evolutionary search E067. A 30B open model trained with a reverse-perplexity curriculum and a two-stage RL progression reached USAMO-gold-equivalent performance on proof writing E048. On research-level mathematics, decomposing a single backbone into seven coordinated agents over a shared whiteboard took the same model from 0% to 80% (8 of 10) on First Proof problems, a strong argument that organisation, not scale, is the contested axis here E076. DeepMind's broader 'co-mathematician' framing pushes against treating benchmarks as the measure of progress: the more important value may be helping mathematicians fail faster on dead ends and surfacing ambiguities in old problem statements E029.
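The 'LLM plus compiler plus retry' pattern is small enough to sketch directly. What follows is a minimal illustration, not the DeepMind system's code: `ralph_loop`, `propose`, and `check` are hypothetical names, and the two toy functions stand in for a real LLM call and a real Lean invocation so the sketch runs on its own.

```python
from typing import Callable, Optional, Tuple


def ralph_loop(
    statement: str,
    propose: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 8,
) -> Optional[str]:
    """Ask for a proof, verify it, and retry with the checker's feedback."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)  # stand-in for an LLM call
        ok, feedback = check(candidate)           # stand-in for the compiler
        if ok:
            return candidate                      # externally verified
    return None                                   # budget exhausted


# Toy stand-ins (assumptions for illustration only).
def toy_propose(statement: str, feedback: str) -> str:
    # Pretend the model only improves once it has seen an error message.
    return "rfl" if "sorry" in feedback else "sorry"


def toy_check(candidate: str) -> Tuple[bool, str]:
    if candidate == "sorry":
        return False, "error: proof contains sorry"
    return True, ""


result = ralph_loop("theorem two_eq : 1 + 1 = 2", toy_propose, toy_check)
```

The only state carried between attempts is the checker's last message, which is what keeps a loop of this kind so short.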
Agents running real instruments
A robot system synthesised graphene end-to-end autonomously and caught two deliberately sabotaged experiments. The architectural pattern that matters has three tiers: locked-down primitive 'atoms', LLM-composable 'molecules', and freely designed 'assembly' procedures E072. A separate system ran an optical lab for 21 hours and produced a credible XOR experiment showing that an interferometer can carry pairwise information structurally analogous to attention, though the framing may partly work because a Transformer is inclined to find Transformer-shaped patterns E002. In numerical scientific computing, giving every method a geometric address in a unit cube and exploiting the conditional-independence structure of method choice lets an agent one-shot a 1968 NASA re-entry problem and discover a spectral PINN E042. Across all of these, the recurring claim is that the new bottleneck is wet-lab speed and hardware iteration cycles.
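One way to picture the atoms / molecules / assembly split is as three layers with decreasing restriction. The sketch below is an assumption-heavy illustration, not the graphene system's interface: the `Atom` class, the atom names, and the numeric limits are all invented, and "locked-down" is modelled simply as a fixed registry with hard parameter bounds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Atom:
    """A fixed hardware primitive with hard limits (all values made up)."""
    name: str
    run: Callable[[float], str]
    lo: float
    hi: float


# Locked-down layer: the agent can call these but cannot add or edit them.
ATOMS: Dict[str, Atom] = {
    "heat": Atom("heat", lambda c: f"heat to {c:g} C", 20.0, 1100.0),
    "flow_gas": Atom("flow_gas", lambda s: f"flow {s:g} sccm", 0.0, 50.0),
}

Step = Tuple[str, float]  # (atom name, parameter)


def validate(steps: List[Step]) -> None:
    """Reject unknown atoms or out-of-range parameters before anything runs."""
    for name, value in steps:
        atom = ATOMS.get(name)
        if atom is None:
            raise ValueError(f"unknown atom: {name}")
        if not atom.lo <= value <= atom.hi:
            raise ValueError(f"{name}={value} outside [{atom.lo}, {atom.hi}]")


def run_molecule(steps: List[Step]) -> List[str]:
    """Molecule: an LLM-composed sequence of atoms, checked as a whole."""
    validate(steps)
    return [ATOMS[name].run(value) for name, value in steps]


def run_assembly(molecules: List[List[Step]]) -> List[str]:
    """Assembly: a freely designed ordering of molecules."""
    for molecule in molecules:
        validate(molecule)  # check the full plan before touching hardware
    log: List[str] = []
    for molecule in molecules:
        log.extend(run_molecule(molecule))
    return log


log = run_assembly([[("heat", 1000.0)], [("flow_gas", 10.0)]])
```

In this reading, the freedom sits at the top layer and the safety guarantees at the bottom: whatever procedure the model designs, it can only ever reach the hardware through the bounded primitives.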
The shape of LLM-driven optimisation
A Berkeley unification argues that AlphaEvolve, FunSearch, GEPA, and ADAS are all running the same algorithm, and that 'side information' (error traces, profiler dumps, failed-test diagnostics) is the LLM-era analogue of a gradient. The reported results are state-of-the-art circle packing for $3 and a lift on ARC-AGI from 32% to ~90% E065. Agent-driven neural architecture search explores spaces that rigid Bayesian or evolutionary methods cannot reach, including an agent that spontaneously imported focal loss from object detection into a GPT training script E053. Verified distributed-systems code can now be synthesised in ten hours instead of nine months, with the proof obligation pushing toward representations that are both easier to verify and faster to run E075. The honest qualifier across all of these: most of the intellectual heavy lifting still lives in the proposer model and the evaluator design, not the orchestration layer.
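The 'side information as gradient' idea can be shown with a toy loop in which the evaluator returns diagnostics alongside a score and the proposer uses them to pick a direction. This is a hedged sketch only: `optimise`, `toy_evaluate`, and `toy_propose` are hypothetical names, the hidden-target task is invented, and a real system would put an LLM where `toy_propose` is.

```python
from typing import Callable, List, Tuple

Candidate = List[int]
Evaluation = Tuple[float, List[str]]  # (score, side information)


def optimise(
    seed: Candidate,
    propose: Callable[[Candidate, List[str]], Candidate],
    evaluate: Callable[[Candidate], Evaluation],
    budget: int = 50,
) -> Tuple[Candidate, float]:
    """Greedy search steered by the evaluator's diagnostics."""
    best = seed
    best_score, notes = evaluate(best)
    for _ in range(budget):
        if not notes:                     # nothing left to complain about
            break
        child = propose(best, notes)      # diagnostics play the gradient's role
        score, child_notes = evaluate(child)
        if score > best_score:
            best, best_score, notes = child, score, child_notes
    return best, best_score


TARGET = [3, 1, 4]  # hidden from the proposer; made up for the example


def toy_evaluate(candidate: Candidate) -> Evaluation:
    score = -sum(abs(c - t) for c, t in zip(candidate, TARGET))
    notes = [
        f"{i}:{'low' if c < t else 'high'}"
        for i, (c, t) in enumerate(zip(candidate, TARGET))
        if c != t
    ]
    return score, notes


def toy_propose(parent: Candidate, notes: List[str]) -> Candidate:
    index, direction = notes[0].split(":")
    child = list(parent)
    child[int(index)] += 1 if direction == "low" else -1
    return child


best, best_score = optimise([0, 0, 0], toy_propose, toy_evaluate)
```

Note where the work happens: the loop itself is a dozen lines, while everything interesting is in what the evaluator reports and how the proposer reads it, which matches the qualifier above about the proposer and evaluator carrying the load.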
Episodes anchoring this topic
- 067-advancing-mathematics-research-with-ai-driven-formal-proof-s
Solved open Erdős problems autonomously with Lean-verified proofs.
- 048-achieving-gold-medal-level-olympiad-reasoning-via-simple-and
Showed olympiad proof reasoning is reachable at 30B with the right recipe.
- 076-rma-an-agentic-system-for-research-level-mathematical-proble
Demonstrated that agent decomposition can take a 0%-baseline model to 80% on research math.
- 072-qumus-realization-of-an-embodied-ai-quantum-material-experim
First end-to-end autonomous synthesis of graphene with explicit hallucination recovery.
- 065-optimize-anything-a-universal-api-for-optimizing-any-text-pa
Unified evolutionary LLM optimisation methods around side information as the analogue of a gradient.
- 002-end-to-end-autonomous-scientific-discovery-on-a-real-optical
Ran a real optics lab for 21 hours with role-specialised agents and a structured lab notebook.