xbench · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A multi-task evaluation suite for general-purpose AI agents.

As stated in the literature

A benchmark for evaluating open-source agentic search systems. It covers varied task families and is reported alongside BrowseComp and HLE (Humanity's Last Exam) in search-agent comparisons.

Why it matters: Reporting several complementary benchmarks makes it harder to game any single score and gives a fuller picture of agent capability.

For example, xbench results often appear alongside BrowseComp scores when researchers compare open-source search agents.

Heard on the show

“On xbench, seventy-eight versus seventy-five.”

Episode 021 — Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

Mentioned in 1 episode

021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

Related terms

agent, BrowseComp, Humanity's Last Exam