Glossary · Term

OSWorld

Definition

A benchmark for testing AI agents on real desktop applications.

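A computer-use benchmark of 369 tasks covering nine real software applications, including workflows that span several applications. Tasks are scored with state-based verification: a checker inspects the final state of the machine instead of matching the agent's sequence of actions.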
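As a rough illustration of what state-based verification means, here is a minimal sketch in Python. It is not OSWorld's actual evaluation code: the task (renaming a file), the function names `check_rename_task` and `run_demo`, and the file names are all hypothetical, chosen only to show a checker that looks at the end state and ignores how the agent got there.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def check_rename_task(workdir: Path) -> bool:
    """Hypothetical checker: pass only if the final state matches the goal.

    Goal: 'draft.txt' has been renamed to 'final.txt' with its content intact.
    The checker never looks at which clicks or commands produced that state.
    """
    old = workdir / "draft.txt"
    new = workdir / "final.txt"
    return (
        new.is_file()
        and not old.exists()
        and new.read_text() == "quarterly numbers"
    )


def run_demo() -> Tuple[bool, bool]:
    """Run the checker before and after a stand-in for the agent's actions."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "draft.txt").write_text("quarterly numbers")

        before = check_rename_task(workdir)

        # Stand-in for whatever the agent does (GUI clicks, shell commands, ...).
        (workdir / "draft.txt").rename(workdir / "final.txt")

        after = check_rename_task(workdir)
    return before, after


if __name__ == "__main__":
    print(run_demo())
```

Because the check depends only on the resulting files, an agent that reaches the goal by any route passes, and one that merely claims success without changing the state fails.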

Also called: OSWorld-MCP

Mentioned in 2 episodes

  1. 066
    Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
  2. 017
    When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers