Definition
A benchmark that tests AI agents on real bug-fixing tasks pulled from open-source GitHub projects.
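A benchmark of real-world GitHub issues from popular Python projects, each paired with held-out tests, used to evaluate autonomous coding agents end-to-end: the agent receives the repository and the issue text, and its patch counts as a fix only if the held-out tests pass.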
Also called: SWE-bench Verified (strictly, a human-validated subset of SWE-bench rather than a synonym for the full benchmark)
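
Example
The pass/fail criterion implied by "held-out tests" can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: `TaskInstance`, `is_resolved`, and `resolve_rate` are hypothetical names, and the two test lists are an assumption modeled on the usual split between tests the fix must turn green and tests it must not break.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """Hypothetical, simplified record for one benchmark task."""
    issue_text: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must keep passing


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """A task counts as resolved only if every required test passed."""
    required = set(task.fail_to_pass) | set(task.pass_to_pass)
    return required <= passed_tests


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under these assumptions, an agent's score is the resolve rate over all tasks, where each task's outcome comes from running the held-out tests against the repository with the agent's patch applied.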