Literature review · 6 episodes

Agentic systems and tool use

On this page

Tools built for agents, not humans: The biggest agent gains are coming from rebuilding interfaces (debuggers, sandboxes, security tools, GUI APIs) around the fact that every keystroke is now an inference call.
Search, planning, and long horizons: Test-time scaling for agents now depends on representation (how rollouts get compared, summarised, and re-used) rather than raw parallelism.
Agent memory as its own discipline: Memory pipelines are emerging as a separate engineering object, with their own failure modes, consolidation dynamics, and benchmarks.
From workflow search to workflow transfer: Expensive per-task search over agent workflows is being displaced by amortised synthesis and data-centric recipes that beat industrial pipelines for a fraction of the cost.

Tools built for agents, not humans

A recurring theme across the corpus is that agents fail not because they reason poorly but because they are operating tools designed for humans, for whom each action is effectively free. Replacing a stepwise human debugger with a frame-level execution trace dramatically improves bug-fix accuracy on the same model E005, and giving agents real dynamic execution traces, not just code, lets them propose systemic fixes instead of band-aid patches while paradoxically reading less code E012. The pattern generalises: a constrained pipeline that lets an LLM write symbolic-execution harnesses but routes all bug declarations through deterministic tools finds 30x more vulnerabilities than a frontier coding agent given the same projects E014. The same shape, purpose-built scout/sapper tooling over decompiled binaries, produced 28 zero-days in Windows where production agents found none E024.
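The division of labour in the constrained pipeline can be illustrated with a toy sketch. It is not the tooling from E014: plain concrete execution stands in for symbolic execution, and `parse_record`, `confirm_bugs`, and the proposal format are hypothetical names invented for this example. The only point it makes is that the model's own opinion about whether an input is buggy is never consulted.

```python
from typing import Callable, Iterable, List, Tuple


def parse_record(data: bytes) -> bytes:
    """Toy target (hypothetical): trusts the length byte without checking it."""
    n = data[0]
    payload = [data[1 + i] for i in range(n)]  # IndexError if the buffer is short
    return bytes(payload)


def confirm_bugs(
    target: Callable[[bytes], bytes],
    proposals: Iterable[Tuple[bool, bytes]],
) -> List[Tuple[bytes, str]]:
    """Run every proposed input; report only crashes that were actually observed.

    Each proposal is (model_claims_bug, input). The claim is deliberately
    ignored: a bug is declared only by the deterministic run.
    """
    confirmed = []
    for _model_claims_bug, data in proposals:
        try:
            target(data)
        except Exception as exc:  # the observed failure is the only evidence
            confirmed.append((data, type(exc).__name__))
    return confirmed


# Stand-in for model output: two confident claims that are wrong, and one
# input the model considered safe that really crashes.
proposals = [(True, b"\x02ab"), (False, b"\x05ab"), (True, b"\x00")]
confirmed = confirm_bugs(parse_record, proposals)
```

In this toy run the only confirmed finding is the input the stand-in model labelled safe, which is the behaviour the constrained design is meant to guarantee: false claims cost a run, not a false report.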

The interface story extends to GUI agents and runtime harnesses. Expanding an agent's action space with tool calls can actually degrade performance when the model isn't trained to choose between clicking and calling E066, and harness layers — action realisation, environment contracts, trajectory regulation — improve 116 of 126 model-environment combinations without retraining, sometimes beating fine-tuned baselines E071. The lesson the field is converging on: a lot of what looks like a capability gap is really a plumbing gap.

Search, planning, and long horizons

Classic test-time scaling tricks like majority voting break down once the unit of work is a 40k-token interactive session. Pairwise tournament voting on compressed rollout summaries recovers 6–16 points on SWE-Bench and Terminal-Bench E003. On deep-research workloads, voting over parallel agents plateaus because correlated samplers make correlated mistakes; replacing the vote with an evidence DAG that a separate Navigator reads turns parallelism from guessing into jigsaw assembly E051. For long-horizon web agents, milestone-based subgoals provide both runtime scaffolding and a denser RL training signal, lifting a 12B open model from 6% to 43% on web navigation E008.
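A minimal sketch of pairwise voting over rollout summaries follows. It assumes a round-robin format (every summary compared with every other), which may differ from the bracket used in E003, and the judge here is a trivial keyword check standing in for a model call. `tournament_select` and `toy_judge` are hypothetical names.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def tournament_select(
    summaries: Sequence[str],
    prefer: Callable[[str, str], int],
) -> int:
    """Return the index of the summary with the most pairwise wins.

    `prefer(a, b)` returns 0 if `a` is judged better and 1 if `b` is.
    Ties in win count go to the earliest rollout.
    """
    wins: List[int] = [0] * len(summaries)
    for i, j in combinations(range(len(summaries)), 2):
        winner = i if prefer(summaries[i], summaries[j]) == 0 else j
        wins[winner] += 1
    return max(range(len(summaries)), key=lambda k: wins[k])


def toy_judge(a: str, b: str) -> int:
    """Stand-in judge (assumption): prefers a summary that reports passing tests."""
    score_a = int("tests passed" in a)
    score_b = int("tests passed" in b)
    return 0 if score_a >= score_b else 1


summaries = [
    "edited parser.py, tests failed",
    "edited parser.py and lexer.py, tests passed",
    "made no edits",
]
best = tournament_select(summaries, toy_judge)
```

Unlike majority voting, nothing here requires two rollouts to produce the same answer string; the comparison operates on summaries, which is what makes it usable when each rollout is a long interactive session.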

A harder version of long-horizon work is recursion: training a model to delegate to copies of itself produces phase transitions on hard crafting tasks (0% → 88%) and lets a 30B open model match frontier reasoners on long-context benchmarks E028. Asking for clarification is the dual problem: frontier agents systematically ask for help at the wrong moment, and goal-level questions drop sharply in value after the first 10% of a trajectory E035. And the failure to explore at all is its own pathology: RL on task completion silently teaches agents to skip exploration, which a cheap interleaving recipe reverses while improving task success E052.

Agent memory as its own discipline

When a small model runs an agent's memory pipeline, it can route add/update/delete operations competently before it can understand what the memories actually say, producing silent overwrites that no end-to-end benchmark catches E023. Stale memory is just as quiet: models can recognise that a stored fact is out of date and then act on it anyway, because off-the-shelf memory frameworks adjudicate at query time rather than write time E031. The fix points toward a real architectural distinction — a fast writer and a slow consolidator on different timescales, where forgetting is the default and retention has to argue for itself, producing memory banks an order of magnitude smaller at higher task success E064.
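The fast-writer, slow-consolidator split can be sketched as a small data structure. This is an illustration under stated assumptions, not the architecture from E064: the retention rule (an entry survives only if it was read at least `min_hits` times since the last consolidation) and the names `TwoSpeedMemory`, `Entry`, and `min_hits` are all invented here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Entry:
    value: str
    hits: int = 0  # reads since the last consolidation pass


@dataclass
class TwoSpeedMemory:
    min_hits: int = 1  # assumed retention threshold
    buffer: Dict[str, Entry] = field(default_factory=dict)  # fast path
    store: Dict[str, Entry] = field(default_factory=dict)   # slow path

    def write(self, key: str, value: str) -> None:
        """Fast writer: accept everything, decide nothing."""
        self.buffer[key] = Entry(value)

    def read(self, key: str) -> Optional[str]:
        """Newest value wins; each read counts as evidence for retention."""
        entry = self.buffer[key] if key in self.buffer else self.store.get(key)
        if entry is None:
            return None
        entry.hits += 1
        return entry.value

    def consolidate(self) -> None:
        """Slow consolidator: forgetting is the default.

        Buffer entries override stored ones on a key clash, and anything
        that was not read often enough is dropped. Hit counts then reset,
        so retention must be re-earned before the next pass.
        """
        candidates = {**self.store, **self.buffer}
        self.store = {
            key: Entry(entry.value)
            for key, entry in candidates.items()
            if entry.hits >= self.min_hits
        }
        self.buffer = {}


memory = TwoSpeedMemory()
memory.write("deploy_target", "staging")
memory.write("scratch_note", "tried flag -x")
memory.read("deploy_target")  # only this entry earns retention
memory.consolidate()
```

One consequence of this toy rule is that an update which is never read before the next pass is dropped along with the value it replaced; a real consolidator would need a more careful policy for superseded facts.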

From workflow search to workflow transfer

Automated workflow search keeps rediscovering the same stereotyped shapes per domain, which means hours of MCTS can be replaced by a single LLM call that reads existing workflows as wiring diagrams rather than English E013. The same data-centric reframe shows up in search agents: ten thousand carefully constructed examples and one-third of the standard pipeline beat an industrially-trained search agent on every benchmark E021. Building verified tool-call data by executing real API calls first and writing tasks backwards lifts a 4B open model to Claude Sonnet 4.6 levels for about $47k E059, and an infrastructure-first agent training stack with sandbox-as-thin-service cuts open-source RL costs roughly 10x while exposing how harness-fit has been masquerading as capability E047.
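The execute-first, write-backwards idea for tool-call data can be shown with a toy example. Everything here is hypothetical: `stock_level` is an in-memory stand-in for a real API, and a fixed template stands in for whatever writes the task text in E059. The sketch only shows why the ordering matters: the expected result is recorded from a real call before any task text exists, so the label is verified by construction.

```python
from typing import Any, Callable, Dict

# Toy backing data for the stand-in API (assumption, not from the source).
INVENTORY: Dict[str, int] = {"A-1": 7, "B-2": 0}


def stock_level(sku: str) -> int:
    """Stand-in for a real API endpoint."""
    return INVENTORY[sku]


def build_example(
    tool: Callable[..., Any],
    arguments: Dict[str, Any],
    write_task: Callable[[Dict[str, Any], Any], str],
) -> Dict[str, Any]:
    """Execute the call first, then write the task that it answers."""
    observed = tool(**arguments)  # step 1: ground truth from execution
    return {
        "task": write_task(arguments, observed),  # step 2: task written backwards
        "tool_call": {"name": tool.__name__, "arguments": dict(arguments)},
        "expected_result": observed,
    }


def template_writer(arguments: Dict[str, Any], observed: Any) -> str:
    """Stand-in for the task writer; a real pipeline would likely use a model."""
    return f"How many units of SKU {arguments['sku']} are currently in stock?"


example = build_example(stock_level, {"sku": "A-1"}, template_writer)
```

A call that raises during step 1 never becomes a training example, which is one plausible reading of what "verified" buys in this recipe.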

Episodes anchoring this topic

071-adapting-the-interface-not-the-model-runtime-harness-adaptat
Reframed agent failures as interface bugs, with a harness layer that fixes them without retraining.
005-empowering-autonomous-debugging-agents-with-efficient-dynami
Showed that human-designed debuggers waste inference cycles and that agent-native execution traces dominate.
012-dynamic-analysis-enhances-issue-resolution
Established that giving agents runtime behaviour rather than just code produces systemic fixes.
014-guiding-symbolic-execution-with-static-analysis-and-llms-for
Demonstrated the constrained-pipeline pattern where LLMs write harnesses but deterministic tools declare bugs.
047-orchard-an-open-source-agentic-modeling-framework
Surfaced the harness-fit confound and offered infrastructure-first agent training.
051-argus-evidence-assembly-for-scalable-deep-research-agents
Replaced parallel-vote scaling with evidence-graph assembly for deep research.