Definition
Agent benchmarks measure how well AI systems perform multi-step, tool-using tasks, such as navigating a browser, fixing a bug across a repo, or completing a research task, rather than answering a one-shot question. They typically score end-to-end task completion, and their results are notoriously sensitive to scaffolding choices (the prompts, tool interfaces, and control loop wrapped around the model), so two scores are only comparable when the scaffold is held fixed.
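To make "end-to-end task completion" concrete, here is a minimal sketch of how a success rate might be tallied per scaffold. The record layout, the scaffold names (`react_loop`, `plan_then_act`), and the outcomes are all hypothetical and chosen only for illustration; real benchmarks each define their own success checks.

```python
from collections import defaultdict


def success_rate_by_scaffold(episodes):
    """Return {scaffold: fraction of tasks completed}.

    Each episode is a (scaffold, task_id, succeeded) tuple, where
    `succeeded` is the benchmark's binary end-to-end verdict.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for scaffold, _task_id, succeeded in episodes:
        totals[scaffold] += 1
        wins[scaffold] += int(succeeded)
    return {name: wins[name] / totals[name] for name in totals}


# Hypothetical records: the same model and tasks run under two scaffolds.
episodes = [
    ("react_loop", "task-1", True),
    ("react_loop", "task-2", False),
    ("react_loop", "task-3", False),
    ("plan_then_act", "task-1", True),
    ("plan_then_act", "task-2", True),
    ("plan_then_act", "task-3", False),
]

rates = success_rate_by_scaffold(episodes)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

In this made-up data the same tasks yield different completion rates depending only on the scaffold, which is why reported numbers usually need the scaffold named alongside the model.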
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- HIRL: A Human-in-the-Loop Benchmark for Agents that Know When to Ask for Help
- AgentBench: Evaluating LLMs as Agents
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (introduces ToolBench)
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- AIMO-2: Advancing AI Mathematical Olympiad with Open Large-Scale Training Data