Literature review · 6 episode(s)

Agentic systems and tool use

← all topics  ·  Glossary →

Tools built for agents, not humans

A recurring theme across the corpus is that fail not because they reason poorly but because they are operating tools designed for humans whose actions are free. Replacing a stepwise human debugger with a frame-level execution trace dramatically improves bug-fix accuracy on the same model E005, and giving agents real dynamic execution traces — not just code — lets them propose systemic fixes instead of band-aid patches while paradoxically reading less code E012. The pattern generalises: a constrained pipeline that lets an LLM write symbolic-execution but routes all bug declarations through deterministic tools finds 30x more vulnerabilities than a frontier coding agent given the same projects E014, and the same shape — purpose-built scout/sapper tooling over decompiled binaries — produced 28 zero-days in Windows where production agents found none E024.

The interface story extends to and runtime . Expanding an agent's action space with can actually degrade performance when the model isn't trained to choose between clicking and calling E066, and layers — action realisation, environment contracts, — improve 116 of 126 model-environment combinations without retraining, sometimes beating baselines E071. The lesson the field is converging on: a lot of what looks like a gap is really a plumbing gap.

Search, planning, and long horizons

Classic tricks like majority voting break down once the unit of work is a 40k- interactive session. Pairwise tournament voting on compressed summaries recovers 6–16 points on SWE-Bench and E003, and on deep-research workloads, voting over parallel plateaus because correlated samplers make correlated mistakes — replacing the vote with an that a separate Navigator reads turns parallelism from guessing into jigsaw assembly E051. For long-horizon web agents, milestone-based subgoals provide both runtime scaffolding and a denser RL training signal, lifting a 12B open model from 6% to 43% on web navigation E008.

A harder version of long-horizon work is recursion: training a model to delegate to copies of itself produces on hard crafting tasks (0% → 88%) and lets a 30B open model match frontier reasoners on long-context benchmarks E028. Clarification asking is the dual problem — frontier systematically ask for help at the wrong moment, with goal-level questions cliffing in value after the first 10% of a E035. And the failure to explore at all is its own pathology: RL on task completion silently teaches agents to skip exploration, which a cheap interleaving recipe reverses while improving task success E052.

Agent memory as its own discipline

When a small model runs an 's memory pipeline, it can route add/update/delete operations competently before it can understand what the memories actually say, producing silent overwrites that no end-to-end benchmark catches E023. Stale memory is just as quiet: models can recognise that a stored fact is out of date and then act on it anyway, because off-the-shelf memory frameworks adjudicate at query time rather than write time E031. The fix points toward a real architectural distinction — a fast writer and a slow consolidator on different timescales, where forgetting is the default and retention has to argue for itself, producing memory banks an order of magnitude smaller at higher task success E064.

From workflow search to workflow transfer

Automated workflow search keeps rediscovering the same stereotyped shapes per domain, which means hours of can be replaced by a single LLM call that reads existing workflows as wiring diagrams rather than English E013. The same data-centric reframe shows up in search : ten thousand carefully constructed examples and one-third of the standard pipeline beat an industrially-trained search agent on every benchmark E021. Building verified tool-call data by executing real calls first and writing tasks backwards lifts a 4B open model to levels for about $47k E059, and an infrastructure-first agent training stack with -as-thin-service cuts open-source RL costs roughly 10x while exposing how -fit has been masquerading as E047.

Episodes anchoring this topic