Definition
A benchmark of long-horizon, real-world assistant tasks that require multi-step reasoning, tool use, and information aggregation, designed to evaluate how well general AI assistants actually perform.