Concept · 17 episodes

SWE-bench

Definition

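SWE-bench is a benchmark of real GitHub issues drawn from popular open-source Python projects. A model is given the repository and the issue text and must produce a patch; the patch counts as resolving the issue only if the project’s tests pass once it is applied. It has become the standard yardstick for end-to-end coding agents, with the usual caveats: widely used public repositories may already appear in training data (contamination), and systems can be tuned to the benchmark itself (overfitting).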
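To make the scoring rule concrete, here is a minimal sketch of how a "resolved" decision and an overall resolve rate could be computed from per-test outcomes. It assumes the test results have already been collected after applying the patch. `FAIL_TO_PASS` and `PASS_TO_PASS` are the benchmark's names for the two test sets; the `TaskResult` class, the function names, and the example instance IDs are hypothetical and are not part of the official harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the model's patch is applied (hypothetical type)."""
    instance_id: str
    # FAIL_TO_PASS: tests that fail before the fix; True means the test passes now.
    fail_to_pass: dict[str, bool]
    # PASS_TO_PASS: tests that passed before the fix; True means the test still passes.
    pass_to_pass: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if the issue's tests pass and nothing regressed.

    Assumption of this sketch: a task with no FAIL_TO_PASS tests is treated
    as unresolved, since there would be no evidence the issue was fixed.
    """
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only: one clean fix, one fix that breaks an existing test.
example = [
    TaskResult("proj-1", {"test_bug": True}, {"test_old": True}),
    TaskResult("proj-2", {"test_bug": True}, {"test_old": False}),
]
rate = resolve_rate(example)
```

The second task shows why regressions matter: its patch fixes the reported bug but breaks a previously passing test, so it does not count as resolved.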

Episodes covering this

170
When a One-Liner Beats Your Agent's Clever Verification Logic
Bayesian control for coding agents
Papamarkou, Smirnov, Mazanov et al. · PolyShape / National Technical University of Athens·26 min·Jun 24, 2026
169
Why Better Bug Reports Can Make AI Coding Agents Worse
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
Tamoyan, Narenthiran, Arakelyan et al. · NVIDIA / TU Darmstadt·24 min·Jun 24, 2026
166
A Router That Beats the Frontier Models It Calls
Sakana Fugu Technical Report
Tang, Cetin, Xu et al. · Sakana AI·26 min·Jun 23, 2026
147
Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
Chen, Lu, Zhao et al. · 30 min·Jun 15, 2026
142
Training a Tiny Model to Run the Plumbing Between an Agent and the World
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
Wang, Wang, Taylor et al. · University of California·24 min·Jun 12, 2026
130
Why AI Agents Coordinate Better Through a Shared Board Than a Boss
Decentralized Multi-Agent Systems with Shared Context
Mao, Mirhoseini · Stanford University·34 min·Jun 11, 2026
126
How Coding Agents Can Mine Their Own Failures Into a Self-Targeting Curriculum
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Xiao, Jiao, Wang et al. · Shanghai Jiao Tong University·21 min·Jun 09, 2026
122
When Your Coding Agent Lies About the Fix: Verifying the Plan Before the Model Runs
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Wang, Huang, Wang et al. · University of Illinois Urbana-Champaign·24 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
109
An AI Got Caught Reading the Answer Key, And Why That Catch Matters
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Chen, Shi, Li et al. · Shenzhen Institutes of Advanced Technology·28 min·Jun 03, 2026
093
A Calibrated Knob for Weak-to-Strong AI Oversight, Tested on Real Code
Calibrating Conservatism for Scalable Oversight
Overman, Bayati · Stanford Graduate School of Business·22 min·May 28, 2026
090
How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax · MiniMax·28 min·May 27, 2026
068
The OS Trick That Makes Tree Search Practical for Coding Agents
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
Dong, He, Hou et al. · Institute of Parallel and Distributed Systems·27 min·May 22, 2026
047
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Orchard: An Open-Source Agentic Modeling Framework
Peng, Yao, Wu et al. · Microsoft Research·28 min·May 15, 2026
012
Why AI Coding Agents Keep Trying to Debug Without a Debugger
Dynamic analysis enhances issue resolution
Liu, Wang, Chen et al. · Sun Yat-sen University·21 min·May 02, 2026
005
Why a Debugger Designed for Humans Is the Wrong Tool for an AI Agent
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
Xiang, Xu, Chu et al. · Southern University of Science and Technology·22 min·May 01, 2026
003
How to Pick the Best of Sixteen Coding Agent Rollouts
Scaling Test-Time Compute for Agentic Coding
Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.