Definition
SWE-bench is a benchmark of real GitHub issues drawn from popular open-source Python projects, scored on whether a model’s patch actually resolves the issue when tested. It’s become the standard yardstick for end-to-end coding agents, with the usual caveats about contamination and overfitting.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.