Definition
A benchmark that tests whether AI research agents can pull together broad information across many sources.
A breadth-oriented deep-research evaluation suite emphasizing wide source coverage rather than deep multi-hop reasoning, used alongside BrowseComp and HLE in agent-research benchmarks.