Literature review · 6 episodes

Agentic systems: harnesses, tools, and interfaces

Tools were built for free clicks

A consistent finding is that the standard developer toolchain was designed for a user whose keystrokes cost nothing, and that assumption breaks the moment every action costs an inference cycle. Swapping a human-style step debugger for an interface that promotes the function call to a first-class object lets an agent fix a bug in four moves instead of giving up after twenty-nine moves E005. The deeper point is perceptual: today's coding agents reason about code statically, deducing runtime behavior from text rather than watching it run. Handing them real execution traces produces fixes that are both more accurate and more systemic, while paradoxically cutting input tokens because the agent stops fishing through files E012.
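To make "the function call as a first-class object" concrete, here is a minimal sketch of a call-level trace built on Python's standard `sys.setprofile` hook. It is not the interface from either episode; the helper `trace_calls` and the record fields (`func`, `args`, `depth`, `returned`) are assumptions chosen for illustration. The idea is that one run yields one compact record per call, rather than one debugger step per line.

```python
import sys


def trace_calls(fn, *args, **kwargs):
    """Run fn and return (result, records), one record per Python-level call.

    Hypothetical helper: the record layout is an illustrative assumption.
    """
    records = []
    stack = []

    def profiler(frame, event, arg):
        if event == "call":
            code = frame.f_code
            # Positional parameter names; their values are already bound.
            names = code.co_varnames[:code.co_argcount]
            record = {
                "func": code.co_name,
                "args": {n: frame.f_locals[n] for n in names},
                "depth": len(stack),
                "returned": None,
            }
            records.append(record)
            stack.append(record)
        elif event == "return" and stack:
            # For "return" events, arg is the value being returned.
            stack.pop()["returned"] = arg

    sys.setprofile(profiler)
    try:
        result = fn(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always detach the hook
    return result, records


def total(xs):
    s = 0
    for x in xs:
        s += x
    return s


def mean(xs):
    return total(xs) / len(xs)


result, records = trace_calls(mean, [1, 2, 3])
for r in records:
    print("  " * r["depth"] + f'{r["func"]}({r["args"]}) -> {r["returned"]}')
```

An agent reading these two records sees what each function received and returned without stepping through the loop, which is the kind of granularity shift the paragraph describes.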

The corollary is that more capability is not automatically better. Giving a frontier model a richer action space for operating a computer can lower its success rate, because the model now has to choose between clicking and tool-calling at every step and was never taught judgment over that fork E066. The same theme appears at the poker table: a deterministic wrapper that binds the model to the right principle at the right moment beats raw reasoning, and smarter, more reasoning-heavy models often play worse by default E100.

The harness is the product

Several episodes converge on an uncomfortable measurement problem: the same agent can score 62% on its native setup and 3.6% when wrapped in a different harness, which suggests that much published progress reflects harness fit rather than capability E047. The constructive flip side is that you can often repair an agent at the interface instead of in the weights. A harness that fixes malformed tool calls, contract violations, and loops, evolved from one small model's failures, transferred unchanged to seventeen other models and even beat a model fine-tuned specifically for the benchmark E071.

This reframes where engineering effort belongs. If interface bugs dominate failures, then the durable artifact is the scaffolding, and the open question becomes which environments it generalizes to: deterministic, rule-governed worlds yes, open-ended browsing probably not E071.
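The three failure classes named in this section (malformed tool calls, contract violations, and loops) can each be caught by a thin deterministic layer between the model and the environment. The sketch below is a toy version under stated assumptions: the tool names, the `{"tool": ..., "args": ...}` call shape, and the `repair_tool_call` helper are all hypothetical, and the harness discussed in the episode is certainly more elaborate.

```python
import json

# Hypothetical tool contracts: tool name -> required argument names.
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "run_tests": set(),
}


def repair_tool_call(raw, history, max_repeats=2):
    """Return (call, note); call is None when the harness rejects the action.

    history is a list of previously accepted calls (as canonical JSON strings).
    """
    # 1. Malformed output: salvage the JSON object from surrounding chatter.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None, "no JSON object found"
    try:
        call = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"

    # 2. Contract violations: unknown tool or missing required arguments.
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in TOOL_SCHEMAS:
        return None, f"unknown tool: {tool!r}"
    if not isinstance(args, dict):
        return None, "args must be an object"
    missing = TOOL_SCHEMAS[tool] - set(args)
    if missing:
        return None, f"missing arguments: {sorted(missing)}"

    # 3. Loops: refuse an identical call once it has been accepted max_repeats times.
    normalized = {"tool": tool, "args": args}
    key = json.dumps(normalized, sort_keys=True)
    if history.count(key) >= max_repeats:
        return None, "loop detected: identical call repeated"
    history.append(key)
    return normalized, "ok"


history = []
raw = 'Sure, reading it now: {"tool": "read_file", "args": {"path": "a.py"}} Done.'
print(repair_tool_call(raw, history))
```

Because none of these checks depends on the model's weights, a layer like this can in principle sit in front of any model that emits the same call format, which is one plausible reading of why the repaired harness transferred.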

Feedback, not compute, is the resource

When you ask what spending more actually buys an agent, the honest answer is: usually waste. Counting tokens, tool calls, and cost measures activity, not progress, and on real traces these counts predict success worse than always guessing the average E097. A matched-budget experiment, with identical spend on every axis and only feedback quality varied, swings success from 27% to 90%. That suggests the resource that genuinely scales agents is feedback that is informative, valid, non-redundant, and retained, with the four factors multiplied together so that missing any one zeroes the rest E097.
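The multiplicative claim is easy to state in code. The sketch below assumes each of the four factors is scored in [0, 1]; that scale, the `FeedbackQuality` class, and the comparison against a plain average are illustrative assumptions, not the episode's actual metric.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class FeedbackQuality:
    """Four factors from the text, each assumed to lie in [0, 1]."""
    informative: float
    valid: float
    non_redundant: float
    retained: float

    def _checked(self):
        values = astuple(self)
        if any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError("each factor must be in [0, 1]")
        return values

    def effective(self):
        """Multiplicative score: any zero factor zeroes the whole product."""
        score = 1.0
        for v in self._checked():
            score *= v
        return score

    def average(self):
        """Additive baseline, shown only for contrast."""
        values = self._checked()
        return sum(values) / len(values)


# Excellent feedback that the agent never retains.
forgotten = FeedbackQuality(informative=0.9, valid=0.9, non_redundant=0.9, retained=0.0)
print(forgotten.effective(), forgotten.average())
```

Under an additive view the forgotten feedback still looks respectable; under the multiplicative view it contributes nothing, which matches the claim that missing any one factor zeroes the rest.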

The feedback result pairs with the selection problem in test-time scaling: when the unit of work is a forty-thousand-token interactive session, you can't pick the best of sixteen rollouts by majority vote, and the real bottleneck is representation, meaning the ability to compare and reuse what each attempt found E003.
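A small sketch shows why majority vote degenerates on long sessions and why representation matters. Every full transcript is unique, so each gets exactly one vote; voting only becomes meaningful once each rollout is reduced to something comparable. The rollout layout (`transcript`, `patch`) and the `summarize` callback are hypothetical, and the episode does not specify this particular reduction.

```python
from collections import Counter


def majority_vote(items):
    """Return (winner, votes) for the most common item."""
    winner, votes = Counter(items).most_common(1)[0]
    return winner, votes


def vote_on_summaries(rollouts, summarize):
    """Vote on a compact summary of each rollout; return (rollout, votes).

    summarize is a caller-supplied (hypothetical) function that maps a
    rollout to a hashable, comparable representation.
    """
    summaries = [summarize(r) for r in rollouts]
    winner, votes = majority_vote(summaries)
    return rollouts[summaries.index(winner)], votes


# Four toy rollouts: transcripts all differ, but three reach the same patch.
rollouts = [
    {"transcript": f"session {i}: long interactive log", "patch": patch}
    for i, patch in enumerate(["fix-A", "fix-B", "fix-A", "fix-A"])
]

# Voting on raw transcripts: every candidate is unique, so the top count is 1.
print(majority_vote([r["transcript"] for r in rollouts])[1])

# Voting on a comparable representation recovers a real majority.
best, votes = vote_on_summaries(rollouts, lambda r: r["patch"])
print(best["patch"], votes)
```

The hard part in practice is the `summarize` step, which here is trivially a field lookup; for real sessions, deciding what to extract and compare is the representation problem the paragraph points to.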

Sometimes the limit is the OS

Several papers show that classical systems engineering, not bigger models, unblocks agent capabilities. Chat-era serving stacks crater under agentic workloads because of KV-cache thrashing during tool pauses, and borrowing 1970s OS scheduling recovers large latency wins E016. Tree search is rarely run on real coding agents, not because it fails to help (it can add thirty points) but because sandbox checkpoint and rollback are too slow; hiding millisecond-level filesystem versioning inside the LLM call the agent was already waiting on closes that gap E068.
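The checkpoint-hiding idea can be sketched with a toy in-memory filesystem: take the snapshot on a background thread while the model call is in flight, so the snapshot adds no time on the critical path. Everything here is an assumption for illustration, including the `VersionedFS` class, the `step_with_hidden_checkpoint` helper, and the stand-in `call_llm` callable; a real system would version an actual sandbox filesystem rather than a dict.

```python
from concurrent.futures import ThreadPoolExecutor


class VersionedFS:
    """Toy versioned filesystem: path -> content, with whole-state snapshots."""

    def __init__(self):
        self._files = {}
        self._snapshots = []

    def write(self, path, content):
        self._files[path] = content

    def read(self, path):
        return self._files[path]

    def paths(self):
        return sorted(self._files)

    def checkpoint(self):
        """Save the current state and return its version number."""
        self._snapshots.append(dict(self._files))  # shallow copy; contents are immutable strings
        return len(self._snapshots) - 1

    def rollback(self, version):
        """Restore the state saved under the given version."""
        self._files = dict(self._snapshots[version])


def step_with_hidden_checkpoint(fs, call_llm):
    """Overlap the checkpoint with the model call the agent must wait on anyway.

    Safe in this sketch because nothing mutates fs while call_llm runs.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fs.checkpoint)  # runs in the background
        action = call_llm()                   # the wait we were already paying for
        version = pending.result()
    return action, version


fs = VersionedFS()
fs.write("a.py", "v1")
action, version = step_with_hidden_checkpoint(fs, lambda: "edit a.py")
fs.write("a.py", "v2")    # the agent explores one branch of the search tree
fs.rollback(version)      # and backs out to try another
print(action, fs.read("a.py"))
```

With cheap rollback available at every step, a tree search can branch from any earlier state instead of re-running the session from scratch.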

The broader move is treating agent execution as infrastructure: sessions as processes, the KV cache as virtual memory, and the runtime as something you can fork and replay. This vocabulary now recurs across the systems-flavored episodes, and it reframes a chunk of 'agent reliability' as an infrastructure problem.

Episodes anchoring this topic

When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface
Showed fixing the interface beats retraining the model across 116 of 126 model-environment pairs.
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Exposed that agent benchmark scores largely measure harness-fit, not capability.
Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents
Argued validated, novel, retained feedback — not compute — is what scales agents.
Why AI Coding Agents Keep Trying to Debug Without a Debugger
Demonstrated that letting agents watch code run, not just read it, yields systemic fixes with fewer tokens.
Why a Debugger Designed for Humans Is the Wrong Tool for an AI Agent
Isolated interface granularity as the variable that makes debugging cheap for agents.
Why Giving an AI Agent More Tools Can Make It Worse at Using a Computer
Found that more tool capability degrades agents without training in judgment over the click-vs-call fork.