Literature review · 6 episode(s)

Agentic systems: architecture, tools, and deployment


Tools built for agents, not humans

A consistent finding across the corpus is that interfaces designed for humans punish agents in non-obvious ways. Swapping a human-oriented interface for a frame-lifetime trace turns a 29-step debugging session into four moves at a third the cost E005, and giving coding agents real execution traces, rather than asking them to deduce runtime behaviour from text, both raises accuracy and *lowers* input tokens by ~25% because they stop fishing through files E012. The same logic plays out in security: a constrained pipeline that lets an LLM write tests while leaving bug-declaration to deterministic tools finds 30x more real vulnerabilities than a frontier coding agent with full access E014, and a Windows system that loses to default scaffolding wins decisively when paired with purpose-built servers, including binary and debugger servers E024.

The deeper pattern shows up in computer-use too: handing an agent *more* tools can drop its score thirteen points because the model has the ability but not the judgment to choose between clicking and tool-calling E066. Even infrastructure is in scope: checkpoint and rollback have been gating real tree search on coding agents, and providing 5ms rollback closes a 30-point gap that looked like a model problem E068. The unifying claim: a lot of what we attributed to model weakness was actually plumbing built for users whose clicks are free.

Harness as the new evaluation surface

A software-engineering agent that scores 62% in its native setup scores 3.6% when you swap the wrapper E047. A 4B model that fails half the time despite scoring 74% on olympiad math improves on 116 of 126 model-environment combinations when only the interface layer (action realisation and contract handling, among others) is patched, with the patch transferring zero-shot across seventeen other models and even beating baselines E071. The architectural lesson is that for deterministic environments, the right place to spend engineering is rarely the model.

This reframing pushes 'prompts as parameters' into surprisingly literal territory. Treating a Markdown skill file like a neural-net parameter, with learning rates, validation gates, and rejected-edit buffers, lets a single trained skill document be carried from one setting to another and lift spreadsheet performance sixty points, no retraining involved E078. The implication is that a meaningful chunk of what drives performance is now an artifact you can train, version, and transport independently of the model.
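
To make the 'validation gate' and 'rejected-edit buffer' ideas concrete, here is a minimal sketch, not the procedure from E078. It assumes a hypothetical `score` function that evaluates a skill document on held-out tasks, and it omits the learning-rate analogue entirely. The names `train_skill`, `proposals`, and `toy_score` are illustrative only.

```python
from typing import Callable, List, Tuple


def train_skill(
    skill: str,
    proposals: List[str],
    score: Callable[[str], float],
) -> Tuple[str, List[str]]:
    """Accept a proposed rewrite only if it beats the current validation score.

    Returns the best skill text found and the buffer of rejected edits.
    """
    best, best_score = skill, score(skill)
    rejected: List[str] = []  # rejected-edit buffer (kept so later rounds can avoid repeats)
    for candidate in proposals:
        candidate_score = score(candidate)
        if candidate_score > best_score:  # validation gate: strict improvement only
            best, best_score = candidate, candidate_score
        else:
            rejected.append(candidate)
    return best, rejected


# Toy validation metric (an assumption): count how many required hints the skill mentions.
def toy_score(text: str) -> float:
    hints = ["check headers", "use formulas", "verify totals"]
    return float(sum(hint in text for hint in hints))


initial = "# Spreadsheet skill\n- check headers\n"
edits = [
    "# Spreadsheet skill\n- be careful\n",                      # worse: rejected
    "# Spreadsheet skill\n- check headers\n- use formulas\n",   # better: accepted
    "# Spreadsheet skill\n- check headers\n- verify totals\n",  # ties the new best: rejected
]
final_skill, rejected_edits = train_skill(initial, edits, toy_score)
```

Because the result is just a text file plus a score history, it can be versioned and moved like any other artifact, which is the property the paragraph above highlights.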

Memory, milestones, and clarification

Roughly half of web-agent failures are 'getting stuck' rather than misreading the task, and a milestone-based architecture that doubles as a denser RL training signal lifts a 12B open model from 6% to 43% on web navigation E008. Clarification has its own structure: catching a goal ambiguity 10% of the way through the task recovers near-oracle performance; catching it at 70% is worthless, and late constraint questions can be actively destructive E035. Frontier models systematically miss these windows in distinct ways, and the model that asks *least* often performs best.

Memory turns out to be a harder problem than the retrieval framing suggests. Small backbones can route memory operations correctly before they actually understand the contents, producing silent overwrites that fluent output hides E023. Even frontier assistants score 92% on 'is this memory stale?' and 30% on the next question that quietly assumes the stale memory E031. The promising fixes are structural: moving adjudication to write time rather than query time E031, and splitting memory into a fast writer plus a slow consolidator that learns 'what to forget' as a transferable skill, producing memory banks an order of magnitude smaller at higher task success E064.

Test-time orchestration and scaling out

When the unit of work is a 40k-token interactive session, majority voting breaks; recursive tournament voting over compressed versions of those sessions does most of the work in current systems E003. Pushing further runs into a different ceiling: 64 parallel browsers vote on correlated mistakes. Splitting the swarm into evidence-gathering Searchers and a single Navigator that reads a shared graph produces a 1200-to-1 compression ratio that finally lets parallel scaling keep paying off E051. Other approaches put the recursion inside the models themselves: teaching a 30B model to delegate to copies of itself via RL closes a gap with the systems it is compared against, despite operating at roughly one sixth the scale E028.
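
As a rough illustration of why tournament voting suits long sessions, the sketch below reduces a pool of candidates through pairwise knockout rounds instead of counting identical answers. It is a generic bracket, not the method from E003. The `judge` callable is an assumption standing in for whatever compares two compressed sessions (in practice probably an LLM call), and `toy_judge` is a placeholder.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tournament_vote(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Reduce candidates to a single winner via pairwise knockout rounds."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round: List[T] = []
        # Pair adjacent entries and keep whichever one the judge prefers.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        # With an odd pool size, the last entry advances without a match.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Placeholder judge (an assumption): prefer the longer summary.
def toy_judge(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


summaries = ["ok", "found the form", "stuck", "submitted the form and verified"]
winner = tournament_vote(summaries, toy_judge)
```

Majority voting needs candidates that can be compared for equality, which long interactive sessions rarely are; a bracket only needs a pairwise preference, and it uses one fewer comparison than there are candidates.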

At the orchestration layer, the field is rediscovering compilers. Computer-use agents that act like interpreters, with one LLM call per screenshot, are 10x slower than they need to be; treating the agent loop like a compiler with cached, checked tools cuts both latency and failure rates on repeated workloads E063. Coordination itself becomes the trainable axis: a small communication hub between five frozen search agents lifts per-agent accuracy from 36% to 58%, suggesting the layer worth optimising isn't the agents but the layer between them E083.

Serving infrastructure catches up to agent workloads

Throughput dashboards lie for agent workloads: GPUs run hot while users wait minutes, because KV-cache thrashing during tool pauses and GPU coupling strand capacity. Borrowing multi-level feedback queues from classical OS design unifies scheduling and KV eviction under one priority order and produces 1.87x mean latency wins in real deployments E016. A more radical bet: if coding agents have specialised, predictable workloads, bespoke runtimes generated by a team of coding agents can beat general-purpose ones by 2-6x on the long tail by exploiting things like using the user's input file as a speculative-decoding draft E027. Both papers point at the same regime change: 'sessions as processes, KV cache as virtual memory' is becoming the working vocabulary for LLM serving.
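
The 'one priority order' idea can be sketched with a toy multi-level feedback queue in which the same level field decides both who runs next and whose KV cache is evicted first. This is an illustrative model under simplifying assumptions, not the system from E016: the `Session` fields, the three-level default, and the tie-break toward paused sessions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    name: str
    level: int = 0                 # 0 = highest priority
    waiting_on_tool: bool = False  # paused while an external tool call runs
    kv_resident: bool = True       # KV cache currently held in GPU memory


class MLFQ:
    def __init__(self, num_levels: int = 3) -> None:
        self.num_levels = num_levels
        self.sessions: List[Session] = []

    def add(self, session: Session) -> None:
        self.sessions.append(session)

    def next_to_run(self) -> Optional[Session]:
        """Schedule the highest-priority session that is not paused on a tool."""
        runnable = [s for s in self.sessions if not s.waiting_on_tool]
        return min(runnable, key=lambda s: s.level, default=None)

    def next_to_evict(self) -> Optional[Session]:
        """Evict from the other end of the same order; prefer paused sessions on ties."""
        resident = [s for s in self.sessions if s.kv_resident]
        return max(resident, key=lambda s: (s.level, s.waiting_on_tool), default=None)

    def used_full_quantum(self, session: Session) -> None:
        """Demote a session that consumed its whole time slice, as in a classic MLFQ."""
        session.level = min(session.level + 1, self.num_levels - 1)


queue = MLFQ()
a = Session("a")
b = Session("b", level=2, waiting_on_tool=True)
c = Session("c", level=1)
for session in (a, b, c):
    queue.add(session)

first = queue.next_to_run()     # session "a": highest priority and runnable
victim = queue.next_to_evict()  # session "b": lowest priority and paused
```

Using a single ordering means the scheduler and the evictor cannot disagree: a session that is about to run is, by construction, the last one whose cache would be dropped.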

Episodes anchoring this topic