Definition
A benchmark that tests whether an AI agent can answer real web research questions by browsing.
An evaluation suite measuring open-ended web-research task completion by browsing agents over real web content.
Also called: BrowseComp-ZH
A benchmark that tests whether an AI agent can answer real web research questions by browsing.
An evaluation suite measuring open-ended web-research task completion by browsing agents over real web content.
Also called: BrowseComp-ZH