Scale-SWE · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A high-scoring software-engineering agent system used as a baseline.

As stated in the literature

An agentic coding system scoring around 64% on SWE-bench Verified under its native harness but producing malformed output under different harnesses; used as evidence of harness-fragility in open-source agent training.

Why it matters: It illustrates how brittle headline agent scores can be: a small change in the surrounding harness can erase most of the apparent capability.

For example, Scale-SWE resolves roughly 64% of SWE-bench Verified tasks, which are drawn from real GitHub issues, when run inside its own scaffolding, but produces malformed, unparseable output when moved to a different harness.
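
One way to picture this failure mode is a minimal sketch in which two harnesses expect the agent's patch in different wrappers. Everything here is an assumption made for illustration: the `<patch>` tag format, the JSON format, and the names `parse_native`, `parse_other`, and `parse_rate` are hypothetical and are not taken from Scale-SWE or from any real harness.

```python
import json
import re
from typing import Callable, List, Optional

# Hypothetical agent output, formatted the way the (assumed) native harness expects.
NATIVE_OUTPUT = "<patch>\n--- a/app.py\n+++ b/app.py\n</patch>"


def parse_native(output: str) -> Optional[str]:
    """Hypothetical harness A: expects the diff wrapped in <patch> tags."""
    match = re.search(r"<patch>\n(.*?)\n</patch>", output, re.DOTALL)
    return match.group(1) if match else None


def parse_other(output: str) -> Optional[str]:
    """Hypothetical harness B: expects a JSON object with a "patch" field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None  # output is unparseable under this harness
    return data.get("patch") if isinstance(data, dict) else None


def parse_rate(outputs: List[str], parser: Callable[[str], Optional[str]]) -> float:
    """Fraction of outputs the given harness parser can extract a patch from."""
    parsed = [parser(o) for o in outputs]
    return sum(p is not None for p in parsed) / len(outputs)


if __name__ == "__main__":
    outputs = [NATIVE_OUTPUT] * 3
    print("native harness:", parse_rate(outputs, parse_native))
    print("other harness: ", parse_rate(outputs, parse_other))
```

In this sketch the patch content is identical in both runs; only the expected wrapper differs, so a score of zero under the second parser reflects a formatting mismatch and not a lack of coding ability.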

Heard on the show

“There's a system called Scale-SWE that scores 64 percent on its native harness.”

Episode 047 — When Agent Benchmarks Lie: The Harness Problem in Open-Source AI

Related terms

agent, agent harness, agentic coding, harness, SWE-bench