Literature review · 6 episode(s)

The Harness, Not the Model


Agent failures live in the interface

The most consistent lesson across the episodes is that the model is often fine and the plumbing is broken. A 4B model that scores 74% on olympiad math can fail half the time at a trivial household task, and a harness evolved against one model's interface failures transfers, unchanged, to seventeen others, even beating a model trained specifically for the benchmark E071. The same logic shows up in debugging: agents handed a conventional interactive debugger, a tool built for users whose keystrokes are free, flounder for dozens of steps, while an agent-native trace interface finds the same bug in four moves at a third of the cost E005. Giving agents runtime execution traces likewise produces systemic fixes instead of band-aid patches, while reducing the effort spent fishing through files E012.

The most recent work treats this layer as a first-class diagnostic target. Many failures are 'silent successes', in which the harness marks a task complete while nothing changed in the world, and an analysis over normalized traces can find and repair lifecycle and verification bugs that no prompt edit can reach E121. Even agent planning itself can be checked as a formal object before execution: workflows that pass whole-graph verification beat failing ones by double digits, with the biggest gains going to the cheap models that can't improvise around a broken plan E122.

Agents are now a systems problem

Agents are quietly breaking infrastructure designed for chatbots: GPUs run hot while users wait minutes, because caches thrash during tool pauses and schedulers optimize for request-level metrics rather than task completion, a problem fixable with multi-level feedback queues lifted from 1970s operating systems E016. The same OS toolkit makes tree search practical for coding agents: hiding work inside the LLM call you were already waiting on yields 5-millisecond rollback and pushes training GPU utilization from ~51% to 99% E068. Treating a running agent's entire execution as forkable, replayable data recovers the coordination penalty of parallel agents and enables single-variable debugging of agent behavior E096, and planning over cached, checked tools cuts web-agent latency roughly tenfold E063.
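To make the multi-level feedback queue idea concrete, here is a minimal sketch of the textbook policy in Python. It is not the scheduler described in E016. The `Task` type, the work units, and the quantum sizes are illustrative assumptions. New tasks start at the highest priority, and a task that uses up its quantum without finishing is demoted one level, so short tasks complete ahead of long ones.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining: int  # hypothetical units of work left (e.g. decode steps)
    level: int = 0  # current queue level; 0 is the highest priority


def run_mlfq(tasks, quanta=(1, 2, 4)):
    """Return task names in completion order under a basic MLFQ policy.

    quanta[i] is the time slice granted at level i (assumed values).
    """
    queues = [deque() for _ in quanta]
    for task in tasks:
        queues[0].append(task)  # every task enters at the top level

    finished = []
    while any(queues):
        # Serve the highest-priority queue that has work.
        level = next(i for i, q in enumerate(queues) if q)
        task = queues[level].popleft()

        work = min(quanta[level], task.remaining)
        task.remaining -= work

        if task.remaining == 0:
            finished.append(task.name)
        else:
            # Used its whole slice without finishing: demote one level,
            # staying at the bottom level once it gets there.
            task.level = min(level + 1, len(queues) - 1)
            queues[task.level].append(task)
    return finished


if __name__ == "__main__":
    order = run_mlfq([Task("long", 10), Task("short", 1), Task("medium", 3)])
    print(order)
```

A production scheduler would also need periodic priority boosts to avoid starving long tasks, and some rule for what happens to a task's level when it pauses on a tool call. Both are left out of this sketch.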

The speculative end of this thread asks whether general-purpose infrastructure is itself an artifact of economics: when coding agents can write bespoke serving stacks per workload, the bespoke versions match the general-purpose stack on its home turf and beat it severalfold on other workloads E027.

The harness problem corrupts evaluation

If the harness carries this much of the performance, then leaderboard numbers measured inside a single harness are partly measuring the wrong thing. A cross-harness test showed an agent dropping from 62% to 3.6% when its native wrapper was swapped out E047, and on week-long engineering tasks the scaffold alone changes resource usage by up to 12x for the same model, making cross-scaffold comparisons close to meaningless E125. The constructive response is to ask what resource actually scales agents. It is not tokens or dollars, which on real traces predict success worse than guessing the mean, but feedback that is validated, novel, and retained: a quantity that can be estimated mid-run and used to cut off agents that are spinning E097. Realistic task selection matters too: when a benchmark is built from economically weighted professional software with adversarially audited environments, the best frontier agent manages 27% uncapped and 3% on a realistic budget E017.

The harness as a trainable parameter

Once the harness is recognized as load-bearing, it becomes the natural thing to optimize. Treating a document like a parameter (bounded step sizes, validation gates, a buffer of rejected edits) produces files that move sixty points of spreadsheet performance between two entirely different systems E078. Retrospective harness optimization extracts a usable training signal from unlabeled failures by asking the cheaper question 'is this better than before?' instead of 'is this correct?' E120, and mining an agent's own successful reasoning traces into reusable primitives converts inconsistent competence into consistent competence with zero retraining E110. The honest caveat comes from fault attribution: LLM-authored skills look fluent but add nothing unless the refinement loop can distinguish 'the instructions were bad' from 'the agent ignored good instructions', and even then the gains are mostly about helping agents finish at all, not raising per-attempt IQ E132.
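The three ingredients named above (bounded step sizes, validation gates, a buffer of rejected edits) can be sketched as a small optimization loop. This is an illustration under stated assumptions, not the method from E078. The `propose` and `score` callables, the line-based `edit_size` measure, and the `max_edit` threshold are all hypothetical stand-ins for an LLM editor and a validation-set evaluator.

```python
from typing import Callable, List, Tuple


def edit_size(old: str, new: str) -> int:
    """Crude edit distance: number of distinct lines added or removed."""
    return len(set(old.splitlines()) ^ set(new.splitlines()))


def optimize_document(
    doc: str,
    propose: Callable[[str, List[str]], str],  # hypothetical editor
    score: Callable[[str], float],             # hypothetical validation score
    steps: int = 10,
    max_edit: int = 3,
) -> Tuple[str, float, List[str]]:
    """Hill-climb a document, accepting only small, validated edits."""
    best, best_score = doc, score(doc)
    rejected: List[str] = []  # buffer shown to the proposer to avoid repeats

    for _ in range(steps):
        candidate = propose(best, rejected)

        # Bounded step size: refuse edits that change too much at once.
        if edit_size(best, candidate) > max_edit:
            rejected.append(candidate)
            continue

        # Validation gate: keep the edit only if it scores strictly better.
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        else:
            rejected.append(candidate)

    return best, best_score, rejected
```

The gate only asks whether a candidate beats the current best, which is the same comparative question ('is this better than before?') described for retrospective optimization. A real loop would score on held-out tasks to limit overfitting to the validation set.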

Episodes anchoring this topic