
SWE-bench


Definition

A benchmark that tests AI agents on real bug-fixing tasks pulled from open-source GitHub projects.

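A benchmark of real-world GitHub issues from popular Python projects, each paired with held-out tests. An agent receives the issue text and the repository, produces a patch, and the task counts as resolved only if the held-out tests pass. It is used to evaluate autonomous coding agents end-to-end.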

Related variant: SWE-bench Verified (a human-validated subset of SWE-bench, not a synonym for the full benchmark)
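To make the "held-out tests" idea concrete, here is a minimal sketch of how a resolve rate could be scored. It assumes each task tracks two groups of tests: ones that fail before the fix and must pass after the patch, and ones that already passed and must keep passing. The `TaskResult` class and both function names are hypothetical illustrations, not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's test outcomes after applying the agent's patch."""
    # Tests that failed before the fix; True means the test passes with the patch.
    fail_to_pass: dict[str, bool]
    # Tests that passed before the fix; True means the test still passes with the patch.
    pass_to_pass: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every held-out test passes."""
    # Require at least one fail-to-pass test so an empty record never counts as a fix.
    fixes_bug = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    return fixes_bug and no_regressions


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this sketch, a patch that fixes the reported bug but breaks a previously passing test does not count as resolved, which is why the score reflects end-to-end correctness rather than partial progress.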

Mentioned in 3 episodes

  1. 068
    The OS Trick That Makes Tree Search Practical for Coding Agents
  2. 047
    When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
  3. 012
    Why AI Coding Agents Keep Trying to Debug Without a Debugger

Related concepts