Definition
A large-scale benchmark of realistic professional computer-use tasks spanning more than 200 software applications.
A computer-use agent benchmark spanning 200+ applications and 12,000+ tasks, grounded in GDP-weighted occupational data, with checklist-based verification.