Definition
A multi-task evaluation suite for general-purpose AI agents.
A benchmark covering varied task families, used alongside BrowseComp and HLE to compare open-source agentic search systems.