Glossary · Term

CUA-World

Definition

A large-scale benchmark of realistic professional computer-use tasks across more than 200 software applications.

A computer-use agent benchmark spanning 200+ applications and 12,000+ tasks, grounded in GDP-weighted occupational data, with checklist-based verification.

Mentioned in 1 episode

  1. 017
    When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers