Glossary · Term

OSWorld

Definition

Plain language

A benchmark for testing AI agents on real desktop applications.

As stated in the literature

A computer-use benchmark covering nine real software applications and a few hundred tasks, some of which span multiple applications, with state-based verification of each result.

Also called: OSWorld-MCP, OSWorld-Verified

Why it matters: It's one of the few benchmarks that tests AI agents on real desktop software and grades them on the resulting state of the machine, which puts it much closer to actual computer use than most agent benchmarks.

For example, a task might ask the agent to open a spreadsheet, sort a column, and paste the result into an email draft.
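
To make "state-based grading" concrete, here is a minimal sketch in Python of how a checker for the example task above could work. It is an illustration only, not OSWorld's actual evaluation code: the DesktopState class, its fields, and the check_task function are hypothetical names invented for this entry. The point it shows is that the grader inspects what the machine looks like once the agent stops, not which clicks or keystrokes got it there.

```python
from dataclasses import dataclass, field


@dataclass
class DesktopState:
    """Hypothetical snapshot of the machine after the agent stops."""
    spreadsheet_column: list[int] = field(default_factory=list)
    email_draft_body: str = ""


def check_task(state: DesktopState) -> bool:
    """Pass only if the end state matches the goal, whatever path led there."""
    column = state.spreadsheet_column
    # Goal 1: the spreadsheet column ends up in ascending order.
    is_sorted = column == sorted(column)
    # Goal 2: the sorted values appear, one per line, in the email draft.
    expected_text = "\n".join(str(value) for value in column)
    pasted = expected_text in state.email_draft_body
    return bool(column) and is_sorted and pasted


# Two made-up end states: one where the agent finished the job, one where
# it pasted the column without sorting it first.
good = DesktopState([1, 2, 3], "Hi,\n1\n2\n3\nThanks")
bad = DesktopState([3, 1, 2], "Hi,\n3\n1\n2\nThanks")

print(check_task(good))  # True
print(check_task(bad))   # False
```

Because the check looks only at the final state, an agent that sorts through a menu and one that uses a keyboard shortcut get the same grade, while an agent that merely claims success without changing anything fails.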

Heard on the show

“… And on OSWorld — which, to ground it, is a benchmark where the agent has to actually accomplish real tasks on a …”

Episode 156 — Why More Human Demonstrations Made a Computer-Use Agent Worse

Mentioned in 5 episodes

156. Why More Human Demonstrations Made a Computer-Use Agent Worse
155. Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
080. How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents
066. Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
017. When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers