AgentBench · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A general-purpose test suite for measuring how well AI agents handle a variety of practical tasks.

As stated in the literature

A multi-domain LLM agent benchmark covering OS, database, knowledge graph, and other tool-using tasks, commonly used as an out-of-distribution evaluation alongside SWE-bench and Tau2-Bench.

Why it matters: It provides a common yardstick so agent improvements can be compared across very different real-world domains, not just one task type.

For example, AgentBench might test the same agent on operating-system commands, SQL queries, and knowledge-graph lookups to see how broadly it generalizes.
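To make the example above concrete, here is a minimal sketch of a per-domain scorecard: the same agent is run once on each task and the success rate is reported separately for each domain. The task list, the toy_agent function, and the exact-match scoring are all hypothetical, chosen for brevity; this is not AgentBench's actual harness or scoring.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical task format: (domain, prompt, expected answer).
Task = Tuple[str, str, str]


def per_domain_success(agent: Callable[[str], str],
                       tasks: List[Task]) -> Dict[str, float]:
    """Run the agent once per task and return the success rate per domain."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for domain, prompt, expected in tasks:
        total[domain] += 1
        # Single attempt per task; exact string match is a simplification.
        if agent(prompt).strip() == expected:
            passed[domain] += 1
    return {domain: passed[domain] / total[domain] for domain in total}


# Toy tasks standing in for OS, database, and knowledge-graph domains.
TASKS: List[Task] = [
    ("os", "count files in /tmp/demo", "3"),
    ("os", "print working directory", "/home/user"),
    ("db", "SELECT COUNT(*) FROM users", "42"),
    ("kg", "capital_of(France)", "Paris"),
]


def toy_agent(prompt: str) -> str:
    """A stand-in agent with canned answers (one of them wrong on purpose)."""
    canned = {
        "count files in /tmp/demo": "3",
        "SELECT COUNT(*) FROM users": "42",
        "capital_of(France)": "Lyon",
    }
    return canned.get(prompt, "")


scores = per_domain_success(toy_agent, TASKS)
print(scores)
```

Reporting one number per domain, rather than a single average, is what lets a reader see whether an agent generalizes broadly or is strong in only one kind of task.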

Heard on the show

“Others, like AgentBench, report pass-at-one, a single attempt.”

Episode 071 — When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

Mentioned in 1 episode

071 · When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

Related terms

agent · knowledge graph · OOD · SWE-bench · Tau2-Bench