Definition
A general-purpose test suite that measures how well AI agents handle practical tasks across several domains.
A multi-domain LLM agent benchmark covering operating-system, database, knowledge-graph, and other tool-use tasks, commonly used as an out-of-distribution evaluation alongside SWE-bench and Tau2-Bench.