Glossary · Term

Humanity's Last Exam

Definition

Plain language

A very hard benchmark of expert-level questions across many fields used to push top AI systems.

As stated in the literature

A frontier evaluation suite of expert-level multi-domain questions used to probe ceiling performance of capable models.

Also called: HLE

Why it matters: It gives researchers a way to keep measuring progress once frontier models already score near-perfectly on easier benchmarks.

For example, a question might ask for a derivation in graduate-level quantum field theory or an obscure historical translation that only a domain specialist would normally attempt.

Heard on the show

“On a hard trivia question — a game-state puzzle from Humanity's Last Exam — Ultra built a tree with two models each attempting independently, and put Gemini at the top as the aggregator.”

Episode 166 — A Router That Beats the Frontier Models It Calls

Mentioned in 6 episodes

166
A Router That Beats the Frontier Models It Calls
131
Why Autonomous Research Agents Forget Their Own Lessons, and Arbor's Fix
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
083
Training the Translator: How a Small Communication Model Lets Agent Teams Outperform Themselves
082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

Related terms

linear probe