SWE-bench

Definition

Plain language

A benchmark that tests AI agents on real bug-fixing tasks pulled from open-source GitHub projects.

As stated in the literature

A benchmark of real-world GitHub issues from popular Python projects paired with held-out tests, used to evaluate autonomous coding agents end-to-end.

Related variant: SWE-bench Verified (a human-validated subset of the benchmark, not a synonym for it)

Why it matters: It tests coding agents on the messy, real-world software work they're being marketed for, rather than on toy programming puzzles.

For example, an agent might be given a real Django bug report and asked to produce a patch that makes the held-out tests from the maintainers' actual fix pass without breaking tests that already passed.
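
The pass/fail check described above can be sketched in a few lines. This is an illustrative model only, not the benchmark's actual harness: the `TaskInstance` class, the `is_resolved` and `resolve_rate` helpers, and the example task and test names are all hypothetical. The two test lists are modeled on the dataset's fail-to-pass and pass-to-pass fields, under the assumption that a task counts as resolved only when every test in both lists passes after the agent's patch is applied.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """Hypothetical model of one task: an issue plus its held-out tests."""
    instance_id: str
    issue_text: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must keep passing


def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """Return True only if every required test passed after the patch.

    A test that is missing from the results is treated as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical task and test names, for illustration only.
task = TaskInstance(
    instance_id="example-project-0001",
    issue_text="Hypothetical bug report text.",
    fail_to_pass=["tests/test_bug.py::test_regression"],
    pass_to_pass=["tests/test_core.py::test_existing_behavior"],
)

# A patch that fixes the bug and breaks nothing.
good_patch = {
    "tests/test_bug.py::test_regression": True,
    "tests/test_core.py::test_existing_behavior": True,
}

# A patch that fixes the bug but breaks existing behavior.
bad_patch = {
    "tests/test_bug.py::test_regression": True,
    "tests/test_core.py::test_existing_behavior": False,
}

outcomes = [is_resolved(task, good_patch), is_resolved(task, bad_patch)]
print(resolve_rate(outcomes))  # 0.5
```

The second patch shows why the already-passing tests matter: fixing the reported bug is not enough if the change breaks something else.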

Heard on the show

“… Completion-style suites like SWE-bench score whether the task got done, but an agent that deletes the wrong branch did something, and …”

Episode 195 — Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does

Mentioned in 9 episodes

195 · Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does
147 · Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
142 · Training a Tiny Model to Run the Plumbing Between an Agent and the World
130 · Why AI Agents Coordinate Better Through a Shared Board Than a Boss
126 · How Coding Agents Can Mine Their Own Failures Into a Self-Targeting Curriculum
093 · A Calibrated Knob for Weak-to-Strong AI Oversight, Tested on Real Code
068 · The OS Trick That Makes Tree Search Practical for Coding Agents
047 · When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
012 · Why AI Coding Agents Keep Trying to Debug Without a Debugger

Related terms

agent, GitHub, held-out set, Python