Definition
A benchmark for testing AI agents on real desktop applications.
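A computer-use benchmark of a few hundred cross-application tasks spanning nine real software applications, where each task is scored by state-based verification: checking the resulting system state rather than the agent's action sequence.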
Also called: OSWorld-MCP
Mentioned in 2 episodes
066
017