Concept · 5 episodes

SWE-bench

Definition

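SWE-bench is a benchmark of real GitHub issues drawn from popular open-source Python projects. A model is given the issue text and the repository, and is scored on whether its patch actually resolves the issue: the tests tied to the fix must pass once the patch is applied, and tests that passed before must keep passing. It has become the standard yardstick for end-to-end coding agents, with the usual caveats about contamination (the issues and their fixes are public, so they may appear in training data) and overfitting to the benchmark.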
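To make the scoring rule concrete, here is a minimal sketch of the pass/fail check for a single task. It assumes each task lists two groups of tests, named here after the dataset's FAIL_TO_PASS and PASS_TO_PASS fields, and that test outcomes are already available as a name-to-boolean mapping. The function names and the test names in the usage example are hypothetical, and this is not the official evaluation harness, which also handles building the environment and running the tests.

```python
from typing import Dict, List


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True if every required test passed after the patch was applied.

    results      -- hypothetical mapping of test name -> passed?
    fail_to_pass -- tests that failed before the fix and must now pass
    pass_to_pass -- tests that passed before and must not regress
    A test missing from `results` is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Usage with made-up test names: the patch fixes the target test
# but breaks an existing one, so the task does not count as resolved.
example = is_resolved(
    results={"test_issue_fix": True, "test_existing_behavior": False},
    fail_to_pass=["test_issue_fix"],
    pass_to_pass=["test_existing_behavior"],
)
```

The all-or-nothing check is why a patch that looks plausible, or even fixes the reported symptom, can still score zero on a task.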

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.