Definition
A web-agent evaluation suite covering hundreds of tasks across real-world websites, used as a standard reference for measuring how well browsing agents generalize.
Mentioned in 2 episodes
061
008