Literature review · 6 episodes

The Harness, Not the Model

On this page

Agent failures live in the interface
Across debugging, household simulators, and terminal tasks, the dominant failure mode is malformed interaction with the environment, and fixing the interface beats retraining the model.
Agents are now a systems problem
Serving stacks, sandboxes, and execution state were built for chat, and a wave of OS-flavored engineering (scheduling, copy-on-write forking, millisecond checkpointing) is unlocking capabilities the models already had.
The harness problem corrupts evaluation
Benchmark scores often measure harness-fit rather than capability: the same agent can score 62% or 3.6% depending on the wrapper. Cross-harness comparison and feedback-quality metrics are therefore replacing raw budget as the things worth measuring.
The harness as a trainable parameter
Skills, prompts, and scaffolds are being optimized with the discipline of neural-net training (validation gates, bounded edits, fault attribution), and the trained artifacts transfer across harnesses and models.

Agent failures live in the interface

The single most consistent lesson across the agentic episodes is that the model is frequently fine and the plumbing is broken. A 4B model that scores 74% on olympiad math can fail half the time at a trivial household task, and a harness evolved against one model's interface failures transfers, unchanged, to seventeen others, even beating a model fine-tuned specifically for the benchmark E071. The same logic shows up in debugging. Agents handed PDB, a tool built for users whose keystrokes are free, flounder for dozens of steps, while an agent-native trace interface finds the same bug in four moves at a third of the cost E005. Giving agents runtime execution traces likewise produces systemic fixes instead of band-aid patches, while reducing the tokens spent fishing through files E012.

The most recent work treats this layer as a first-class diagnostic target. Many agent failures are 'silent successes' — the harness marks a task complete while nothing changed in the world — and a compiler-style pipeline over normalized traces can find and repair lifecycle and verification bugs that no prompt edit can reach E121. Even agent planning itself can be checked as a formal object before execution: workflows that pass whole-graph verification beat failing ones by double digits, with the biggest gains going to the cheap models that can't improvise around a broken plan E122.
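To make the 'silent success' idea concrete, here is a minimal sketch of one check such a pipeline might run over normalized traces: flag any step the harness reported as done whose observable state did not change. The `StepRecord` schema, the `fingerprint` helper, and the flat-dictionary view of world state are all assumptions made for illustration; the cited episode does not specify its trace format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    """One normalized trace entry (hypothetical schema, not from the source)."""
    task_id: str
    reported_done: bool
    state_before: dict
    state_after: dict


def fingerprint(state: dict) -> str:
    """Hash a flat state dict in a key-order-independent way."""
    canonical = repr(sorted(state.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_silent_successes(trace):
    """Return ids of steps marked complete although the world state is unchanged."""
    return [
        record.task_id
        for record in trace
        if record.reported_done
        and fingerprint(record.state_before) == fingerprint(record.state_after)
    ]


trace = [
    StepRecord("t1", True, {"file": "a"}, {"file": "a"}),   # done, nothing changed
    StepRecord("t2", True, {"file": "a"}, {"file": "b"}),   # done, state changed
    StepRecord("t3", False, {}, {}),                        # not claimed as done
]
print(find_silent_successes(trace))
```

A real check would need a task-specific notion of "state" (files, database rows, UI), since some tasks legitimately succeed without changing anything.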

Agents are now a systems problem

Agents are quietly breaking infrastructure designed for chatbots: GPUs run hot while users wait minutes, because KV caches thrash during tool pauses and schedulers optimize throughput rather than task completion — fixable with multi-level feedback queues lifted from 1970s operating systems E016. The same OS toolkit makes tree search practical for coding agents: hiding checkpoint work inside the LLM call you were already waiting on yields 5-millisecond rollback and pushes RL-training GPU utilization from ~51% to 99% E068. Treating a running agent's entire execution as forkable, replayable data recovers the coordination penalty of parallel agents and enables single-variable counterfactual debugging of agent behavior E096, and compiler-style planning over cached, precondition-checked tools cuts web-agent latency roughly tenfold E063.
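For readers unfamiliar with multi-level feedback queues, the sketch below shows the textbook mechanism, not the scheduler from the cited episode. New jobs start at the highest priority with a short time slice, and a job that uses up its slice is demoted to a lower level with a longer one, so short tasks finish ahead of long ones. The class name, the quanta `(1, 2, 4)`, and the abstract "work units" are illustrative assumptions.

```python
from collections import deque


class MultiLevelFeedbackQueue:
    """Textbook MLFQ over abstract work units (illustrative only)."""

    def __init__(self, quanta=(1, 2, 4)):
        self.quanta = quanta                       # time slice per priority level
        self.levels = [deque() for _ in quanta]    # index 0 is highest priority

    def submit(self, name, work_units):
        # New jobs always enter at the highest priority level.
        self.levels[0].append((name, work_units))

    def run(self):
        """Run all jobs to completion and return names in finishing order."""
        finished = []
        while any(self.levels):
            # Serve the highest-priority non-empty level.
            level = next(i for i, queue in enumerate(self.levels) if queue)
            name, remaining = self.levels[level].popleft()
            remaining -= self.quanta[level]
            if remaining <= 0:
                finished.append(name)
            else:
                # Job exhausted its slice: demote it (bottom level stays put).
                lower = min(level + 1, len(self.levels) - 1)
                self.levels[lower].append((name, remaining))
        return finished


scheduler = MultiLevelFeedbackQueue()
scheduler.submit("long", 6)
scheduler.submit("short", 1)
scheduler.submit("medium", 3)
print(scheduler.run())  # ['short', 'medium', 'long']
```

The point of the design is that the scheduler needs no advance knowledge of job length: it learns it from how much of each slice a job consumes.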

The speculative end of this thread asks whether general-purpose infrastructure is itself an artifact of economics: when coding agents can write bespoke serving stacks per workload, the bespoke versions match vLLM on its home turf and beat it severalfold on long-tail workloads E027.

The harness problem corrupts evaluation

If the harness carries this much of the performance, then leaderboard numbers measured inside a single harness are partly measuring the wrong thing. A cross-harness test showed an agent dropping from 62% to 3.6% when its native wrapper was swapped out E047, and on week-long engineering tasks the scaffold alone changes token usage by up to 12x for the same model, making cross-scaffold comparisons close to meaningless E125. The constructive response is to ask what resource actually scales agents: not tokens, tool calls, or dollars — which on real traces predict success worse than guessing the mean — but feedback that is validated, novel, and retained, a quantity that can be estimated mid-run and used to cut off agents that are spinning E097. Realistic task selection matters too: when a benchmark is built from economically weighted professional software with adversarially audited environments, the best frontier agent manages 27% uncapped and 3% on a realistic budget E017.
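One way to picture a mid-run estimate of validated, novel, retained feedback is sketched below. This is an assumed operationalization, not the metric from the cited episode: the `FeedbackEvent` fields, the use of a string signature to decide novelty, and the `patience` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of environment feedback (hypothetical schema)."""
    signature: str    # identifies what the feedback says, used to detect repeats
    validated: bool   # confirmed by the environment, e.g. a test actually ran
    retained: bool    # still present in the agent's working context


def effective_feedback(events):
    """Mark each event True only if it is validated, retained, and not seen before."""
    seen = set()
    flags = []
    for event in events:
        novel = event.signature not in seen
        seen.add(event.signature)
        flags.append(event.validated and event.retained and novel)
    return flags


def should_cut_off(events, patience=3):
    """Stop the run if the last `patience` events carried no effective feedback."""
    flags = effective_feedback(events)
    return len(flags) >= patience and not any(flags[-patience:])


events = [
    FeedbackEvent("test_a passes", True, True),   # effective
    FeedbackEvent("test_a passes", True, True),   # repeat, not novel
    FeedbackEvent("lint warning", False, True),   # not validated
    FeedbackEvent("test_a passes", True, True),   # repeat again
]
print(should_cut_off(events))
```

Under these assumptions an agent that keeps rerunning the same passing test is "spinning" even though its token and tool-call counts keep rising.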

The harness as a trainable parameter

Once the harness is recognized as load-bearing, it becomes the natural thing to optimize. Treating a skill document like a parameter — bounded step sizes, validation gates, a buffer of rejected edits — produces Markdown files that move sixty points of spreadsheet performance between two entirely different agent systems E078. Retrospective harness optimization extracts a usable training signal from unlabeled failures by asking the cheaper question 'is this better than before?' instead of 'is this correct?' E120, and mining an agent's own successful reasoning traces into reusable primitives converts inconsistent competence into consistent competence with zero retraining E110. The honest caveat comes from fault attribution: LLM-authored skills look fluent but add nothing unless the refinement loop can distinguish 'the instructions were bad' from 'the agent ignored good instructions' — and even then the gains are mostly about helping agents finish at all, not raising per-attempt IQ E132.
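The "skill document as parameter" loop can be sketched roughly as follows, under stated assumptions. The `propose` and `score` callables stand in for an LLM editor and a validation-set evaluation, the line-diff size stands in for a "step size", and none of the names or thresholds come from the cited episode.

```python
import difflib


def edit_size(old: str, new: str) -> int:
    """Count added or removed lines between two versions of a document."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def optimize_skill(skill, propose, score, steps=5, max_changed_lines=3):
    """Hill-climb a skill document with bounded edits and a validation gate.

    propose(skill, rejected) -> candidate text (hypothetical LLM editor)
    score(text) -> float on held-out validation tasks (hypothetical evaluator)
    """
    best_score = score(skill)
    rejected = []  # buffer of rejected edits, fed back to the proposer
    for _ in range(steps):
        candidate = propose(skill, rejected)
        # Bounded step size: refuse edits that rewrite too much at once.
        if edit_size(skill, candidate) > max_changed_lines:
            rejected.append(candidate)
            continue
        # Validation gate: keep the edit only if it strictly improves the score.
        candidate_score = score(candidate)
        if candidate_score > best_score:
            skill, best_score = candidate, candidate_score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected
```

Passing the rejected buffer back to the proposer is what keeps the loop from re-suggesting the same failed edit, much as a training loop avoids revisiting a bad step.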

Episodes anchoring this topic

When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Established the central thesis that agent failures are dominated by interface bugs, with a harness that improves 116 of 126 model-environment pairs.
Why a Debugger Designed for Humans Is the Wrong Tool for an AI Agent
The cleanest early demonstration that agent-native tool design — not model intelligence — was the bottleneck in autonomous debugging.
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Exposed the harness problem in evaluation with the 62%-to-3.6% cross-harness collapse and an infrastructure-first training fix.
When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model
Showed that 'silent success' failures live in deterministic scaffolding and can be diagnosed and repaired as software.
The OS Trick That Makes Tree Search Practical for Coding Agents
Demonstrated that OS-level checkpoint/rollback speed, not model capability, was blocking tree search for coding agents.
Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents
Replaced compute budget with effective feedback as the quantity that actually predicts agent success.