Definition
Evaluation and benchmarks is the discipline of measuring AI capabilities and behaviors in ways that are comparable across models and over time. Good benchmarks are surprisingly hard to build: they need to be challenging, well-validated, hard to game, and slow to saturate.
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- The Political Preferences of AI
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
- LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents
- Zoology: Measuring and Improving Recall in Efficient Language Models
- TLA+: A Practical Introduction to Formal Methods for Distributed Systems
- AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
- Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- AgentBench: Evaluating LLMs as Agents
- Large Language Models are not Robust Multiple Choice Selectors
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
- Are Emergent Abilities of Large Language Models a Mirage?
- Inverse Scaling: When Bigger Isn't Better
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning