Literature review · 6 episode(s)

Agentic systems: harnesses, tools, and interfaces


Tools were built for free clicks

A consistent finding is that the standard developer toolchain was designed for a user whose keystrokes cost nothing, and that assumption breaks the moment every action is an inference cycle. Swapping a human-style step debugger for an interface that promotes the function call to a first-class object lets an agent fix a bug in four moves instead of giving up after twenty-nine E005. The deeper point is perceptual: today's coding agents reason about code statically, deducing runtime behavior from text rather than watching it run. Handing them real execution traces produces fixes that are both more accurate and more systemic while, paradoxically, cutting input tokens, because the agent stops fishing through files E012.

The corollary is that more is not automatically better. Giving a model a richer action space for operating a computer can drop its success rate, because the model now has to choose between clicking and tool-calling at every step and was never taught how to make that choice E066. The same theme appears at the poker table: a deterministic wrapper that binds a poker model to the right principle at the right moment beats raw reasoning, and smarter, more reasoning-heavy models often play worse by default E100.
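The idea of handing an agent real execution traces, rather than asking it to infer runtime behavior from source text, can be illustrated with a minimal sketch. It assumes Python's standard `sys.settrace` hook as the capture mechanism; the helper `record_calls` and the toy functions `mean` and `total` are hypothetical and are not the interface used in the cited episodes.

```python
import sys

def record_calls(func, *args, **kwargs):
    """Run func and return (result, trace), where trace lists call/return events."""
    trace = []

    def tracer(frame, event, arg):
        name = frame.f_code.co_name
        if event == "call":
            # Snapshot the arguments as they were when the function was entered.
            trace.append(("call", name, dict(frame.f_locals)))
        elif event == "return":
            # For return events, arg is the value being returned.
            trace.append(("return", name, arg))
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, trace


# Toy program under inspection (hypothetical).
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc

def mean(values):
    return total(values) / len(values)

result, trace = record_calls(mean, [2, 4, 6])
for event in trace:
    print(event)
```

Each entry in `trace` is one function call or return with its concrete values, so a consumer sees what actually happened at each call boundary instead of reconstructing it from the source files.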

The harness is the product

Several episodes converge on an uncomfortable measurement problem: the same model can score 62% on its native setup and 3.6% when wrapped in a different harness, meaning a lot of published progress is harness-fit, not model capability E047. The constructive flip side is that you can often repair an agent at the interface instead of in the model. A harness that fixes malformed outputs, contract violations, and loops, evolved from one small model's failures, transferred unchanged to seventeen others and even beat a model trained specifically for the benchmark E071.

This reframes where engineering effort belongs. If interface bugs dominate failures, then the durable artifact is the scaffolding, and the open question becomes which environments it generalizes to: deterministic, rule-governed worlds yes, open-ended browsing probably not E071.
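A rough sketch of what repairing an agent at the interface might look like follows. Everything in it is an assumption made for illustration: the JSON action format, the `CONTRACT` table, and the `Harness` class are hypothetical and are not taken from the cited work. It covers the three failure classes named above: malformed output, contract violations, and loops.

```python
import json
from collections import deque

# Hypothetical tool contract: tool name -> required argument names.
CONTRACT = {"move": {"direction"}, "look": set()}

class Harness:
    def __init__(self, max_repeats=3):
        # Remember the last few accepted actions to detect loops.
        self.recent = deque(maxlen=max_repeats)

    def repair(self, raw):
        """Pull the outermost {...} span out of chatty output and parse it."""
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return None

    def check(self, raw):
        """Return (action, None) if usable, else (None, feedback for the model)."""
        action = self.repair(raw)
        if not isinstance(action, dict):
            return None, "Reply with a single JSON object."
        tool = action.get("tool")
        if tool not in CONTRACT:
            return None, f"Unknown tool {tool!r}; choose from {sorted(CONTRACT)}."
        args = action.get("args", {})
        if not isinstance(args, dict):
            return None, "The 'args' field must be a JSON object."
        missing = CONTRACT[tool] - set(args)
        if missing:
            return None, f"Tool {tool!r} is missing arguments: {sorted(missing)}."
        self.recent.append(json.dumps(action, sort_keys=True))
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            return None, "You repeated the same action; try something different."
        return action, None

harness = Harness(max_repeats=2)
print(harness.check('Sure! {"tool": "move", "args": {"direction": "north"}}'))
print(harness.check('{"tool": "move", "args": {}}'))
```

The design choice worth noting is that every rejection returns a short corrective message rather than raising, so the wrapper can feed it back to the model as the next observation without touching the model itself.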

Feedback, not compute, is the resource

When you ask what spending more compute actually buys an agent, the honest answer is: usually waste. Counting raw usage and cost measures activity, not progress, and on real traces it predicts success worse than guessing the average E097. A matched-budget experiment (identical spend on every axis, with only feedback quality varied) swings success from 27% to 90%. This suggests the resource that genuinely scales agents is feedback that is informative, valid, non-redundant, and retained, with the four factors multiplied together so that missing any one zeroes the rest E097.

This pairs with the selection problem in parallel sampling: when the unit of work is a forty-thousand-token interactive session, you can't pick the best of sixteen by inspection, and the real bottleneck is representation, that is, being able to compare and reuse what each attempt found E003.
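The multiplicative claim about feedback can be made concrete with a small sketch. The four factor names come from the text; scoring each one on a 0-to-1 scale, and the `FeedbackQuality` class itself, are assumptions made for illustration rather than the cited paper's formulation.

```python
from dataclasses import dataclass

@dataclass
class FeedbackQuality:
    # Hypothetical scores in [0, 1]; the scale is an assumption for illustration.
    informative: float
    valid: float
    non_redundant: float
    retained: float

    def factors(self):
        return (self.informative, self.valid, self.non_redundant, self.retained)

    def multiplicative(self):
        """Product of the factors: any zero factor zeroes the whole score."""
        score = 1.0
        for factor in self.factors():
            score *= factor
        return score

    def additive(self):
        """Plain average of the factors, shown only for contrast."""
        return sum(self.factors()) / 4

# Feedback that is good on three axes but never retained by the agent.
forgotten = FeedbackQuality(informative=0.9, valid=0.9, non_redundant=0.9, retained=0.0)
print(forgotten.multiplicative(), forgotten.additive())
```

Under an additive view, the forgotten feedback still looks fairly useful; under the multiplicative view it is worth nothing, which is the sense in which missing one factor zeroes the rest.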

Sometimes the limit is the OS

Several papers show that classical systems engineering, not bigger models, unblocks capabilities. Chat-era serving stacks crater under agentic workloads because of thrashing during tool pauses, and borrowing 1970s OS scheduling recovers large latency wins E016. Tree search is rarely run on real coding agents, not because it doesn't help (it can add thirty points) but because checkpointing and rollback are too slow; hiding millisecond-level filesystem versioning inside the LLM call you were already waiting on closes that gap E068.
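The latency-hiding trick for rollback can be sketched as follows, with loud caveats. The sketch substitutes a plain directory copy (`shutil.copytree`) for the millisecond-level filesystem versioning the text describes, and `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a model. Only the overlap pattern is the point: start the snapshot, then block on the model call you had to wait for anyway.

```python
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fake_llm_call(prompt):
    """Hypothetical stand-in for a slow model call; sleeps instead of calling an API."""
    time.sleep(0.05)
    return f"next action for: {prompt}"

def snapshot(workdir, store, step):
    """Copy the workspace to store/step-N so a search can roll back to it."""
    dest = store / f"step-{step}"
    shutil.copytree(workdir, dest)
    return dest

def rollback(workdir, snap):
    """Restore the workspace to exactly the contents of a snapshot."""
    shutil.rmtree(workdir)
    shutil.copytree(snap, workdir)

def step_with_hidden_snapshot(pool, workdir, store, step, prompt):
    # The agent is idle while the model thinks, so the workspace is not being
    # modified and the background copy overlaps with time spent waiting anyway.
    snap_future = pool.submit(snapshot, workdir, store, step)
    reply = fake_llm_call(prompt)
    return reply, snap_future.result()

root = Path(tempfile.mkdtemp())
workdir, store = root / "work", root / "snaps"
workdir.mkdir()
store.mkdir()
(workdir / "main.py").write_text("print('v1')\n")

with ThreadPoolExecutor(max_workers=1) as pool:
    reply, snap = step_with_hidden_snapshot(pool, workdir, store, 0, "fix the bug")

(workdir / "main.py").write_text("print('broken')\n")  # a bad edit on one branch
rollback(workdir, snap)
restored = (workdir / "main.py").read_text()
shutil.rmtree(root)  # clean up the temporary directory
```

A real system would need a much cheaper snapshot primitive than a full copy for this to pay off on large repositories; the copy here only keeps the example self-contained.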

The broader move is treating execution as infrastructure: sessions as processes, the context window as virtual memory, and the runtime as something you can checkpoint and replay. This vocabulary now recurs across the systems-flavored episodes, and it reframes a chunk of 'agent reliability' as an infrastructure problem.
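The replay half of this framing can be illustrated with a minimal sketch, assuming the agent touches the outside world only through a single `call` method. The classes `RecordingRuntime` and `ReplayRuntime`, the toy tools, and the log format are all hypothetical and are not drawn from any of the cited systems.

```python
import json

class RecordingRuntime:
    """Runs tools for real and appends every (tool, args, result) to a log."""
    def __init__(self, tools):
        self.tools = tools
        self.log = []

    def call(self, name, **args):
        result = self.tools[name](**args)
        self.log.append({"tool": name, "args": args, "result": result})
        return result

class ReplayRuntime:
    """Serves results from a saved log instead of re-running tools."""
    def __init__(self, log):
        self.pending = list(log)

    def call(self, name, **args):
        entry = self.pending.pop(0)
        if entry["tool"] != name or entry["args"] != args:
            raise RuntimeError(f"replay diverged at {name}({args})")
        return entry["result"]

def session(runtime):
    """A toy deterministic 'agent' that only acts through runtime.call."""
    n = runtime.call("count_files", path="src")
    return runtime.call("summarize", n=n)

executions = {"count_files": 0}

def count_files(path):
    executions["count_files"] += 1  # track real executions of the tool
    return 3

def summarize(n):
    return f"{n} files"

live = RecordingRuntime({"count_files": count_files, "summarize": summarize})
first = session(live)
saved = json.dumps(live.log)  # the log is plain data, so it can be persisted
second = session(ReplayRuntime(json.loads(saved)))
```

Here `second` is produced without executing any tool again, and a divergence between the recorded and replayed calls raises immediately, which is what makes a recorded session usable for debugging.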

Episodes anchoring this topic