Glossary · Term

AgentBench

Definition

A general-purpose test suite for measuring how well AI agents handle a variety of practical tasks.

A multi-domain benchmark for LLM agents, covering operating-system, database, knowledge-graph, and other tool-use tasks. It is commonly used as an out-of-distribution evaluation alongside SWE-bench and Tau2-Bench.

Mentioned in 1 episode

  1. 071
    When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface