Literature review · 6 episode(s)

Agentic systems: architecture, tools, and deployment

On this page

Tools built for agents, not humans: Many apparent agent failures dissolve when the surrounding tools are redesigned for callers whose every keystroke costs an inference cycle.
Harness as the new evaluation surface: Reported agent scores increasingly reflect harness-fit; the same model can swing forty points on identical tasks across wrappers.
Memory, milestones, and clarification: Long-horizon agents fail less from misunderstanding and more from being lost; structured milestones, staleness-aware memory, and well-timed clarification each plug a specific hole.
Test-time orchestration and scaling out: Parallel sampling plateaus quickly; the gains now come from structured selection, evidence assembly, and trained coordination.
Serving infrastructure catches up to agent workloads: LLM serving stacks built for chatbot turns are quietly broken for agents, and the fixes look like 1970s operating-systems research.

Tools built for agents, not humans

A consistent finding across the corpus is that interfaces designed for humans punish agents in non-obvious ways. Swapping PDB for a frame-lifetime trace turns a 29-step debugging session into four moves at a third of the cost E005. Giving coding agents real execution traces, rather than asking them to deduce runtime behaviour from text, both raises accuracy and *lowers* input tokens by ~25% because they stop fishing through files E012. The same logic plays out in security: a constrained pipeline that lets an LLM write test harnesses while leaving bug-declaration to deterministic tools finds 30x more real vulnerabilities than a frontier coding agent with full access E014, and a Windows zero-day system that loses to default scaffolding wins decisively when paired with purpose-built binary, COM, and debugger servers E024.

The deeper pattern shows up in computer-use agents too: handing Claude *more* tool capability can drop its OSWorld score thirteen points because the model has the ability but not the judgment to choose between clicking and tool-calling E066. Infrastructure is in scope as well: the cost of checkpoint and rollback has been gating real tree search on coding agents, and pulling 5ms rollback out of OverlayFS, reflinks, and CRIU closes a 30-point SWE-bench gap that looked like a model problem E068. The unifying claim: much of what was attributed to model weakness was actually plumbing built for users whose clicks are free.

Harness as the new evaluation surface

A software-engineering agent that scores 62% in its native setup scores 3.6% when you swap the wrapper E047. A 4B model that fails ALFWorld half the time despite scoring 74% on olympiad math improves on 116 of 126 model-environment combinations when only the interface layer — action realisation, contract handling, trajectory regulation — is patched, with the harness transferring zero-shot across seventeen other models and even beating fine-tuned baselines E071. The architectural lesson is that for deterministic environments, the right place to spend engineering is rarely the weights.
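
As a rough illustration of what patching "action realisation" in the interface layer could look like, the sketch below maps a model's free-form action text onto an environment's exact command vocabulary before it is sent. It is not the method from E071: the function name `realise_action`, the sample command list, and the similarity cutoff are all assumptions made for this example, and fuzzy string matching merely stands in for whatever the real harness does.

```python
import difflib
import string
from typing import Optional, Sequence


def realise_action(raw: str, valid_actions: Sequence[str],
                   cutoff: float = 0.6) -> Optional[str]:
    """Map free-form model output onto one exact environment command.

    Hypothetical helper: the name, signature, and cutoff are illustrative.
    """
    # Normalise case and drop punctuation a strict parser would reject.
    table = str.maketrans("", "", string.punctuation)
    cleaned = raw.lower().translate(table).strip()
    if cleaned in valid_actions:
        return cleaned
    # Fall back to fuzzy matching. Returning None tells the caller to
    # re-prompt rather than send an invalid command to the environment.
    matches = difflib.get_close_matches(cleaned, valid_actions, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up command vocabulary in the style of a text-world environment.
valid = ["go to desk 1", "take mug 1 from desk 1", "open drawer 2"]
print(realise_action("Take the mug 1 from desk 1.", valid))
print(realise_action("xyzzy", valid))
```

The point of the design is that the weights never change: the model's intent was already correct in the first call, and only the surface form needed repair.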

This reframing pushes 'prompts as parameters' into surprisingly literal territory. Treating a Markdown skill file like a neural-net parameter — with learning rates, validation gates, and rejected-edit buffers — lets a single trained skill document move between Codex and Claude Code and lift spreadsheet performance sixty points, no retraining involved E078. The implication is that a meaningful chunk of agent capability is now an artifact you can train, version, and transport independently of the model.
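
To make the "prompts as parameters" analogy concrete, here is a minimal sketch of one training step over a skill document. Everything in it is an assumption for illustration: the learning rate is interpreted as a cap on accepted edits per step, the validation gate as "keep an edit only if a validation score improves", and the rejected-edit buffer as a set of lines never to be proposed again. The `SkillTrainer` class and the toy scoring function are hypothetical and are not taken from E078.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class SkillTrainer:
    score: Callable[[List[str]], float]   # validation metric (hypothetical)
    max_edits_per_step: int = 1           # "learning rate" read as an edit budget
    rejected: Set[str] = field(default_factory=set)  # rejected-edit buffer

    def step(self, skill: List[str], proposals: List[str]) -> List[str]:
        """Apply at most max_edits_per_step proposals that pass the gate."""
        baseline = self.score(skill)
        accepted = 0
        for line in proposals:
            if accepted >= self.max_edits_per_step:
                break
            if line in self.rejected or line in skill:
                continue  # skip edits already tried or already present
            candidate = skill + [line]
            new_score = self.score(candidate)
            if new_score > baseline:          # validation gate
                skill, baseline = candidate, new_score
                accepted += 1
            else:
                self.rejected.add(line)       # remember the failed edit
        return skill


def toy_score(skill: List[str]) -> float:
    # Stand-in for running the agent on held-out tasks: reward lines that
    # mention checking work, with a small penalty on document length.
    return sum("check" in line for line in skill) - 0.1 * len(skill)


trainer = SkillTrainer(score=toy_score, max_edits_per_step=1)
skill = trainer.step(
    ["Read the sheet header first."],
    ["Be creative.", "Always check formulas after writing.", "Double check totals."],
)
print(skill)
print(trainer.rejected)
```

Because the trained artifact is just the list of lines, it can be written back to a Markdown file and handed to a different harness unchanged, which is the portability property the paragraph above describes.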

Memory, milestones, and clarification

Roughly half of web-agent failures come from getting stuck rather than from misreading the task. A milestone-based architecture that doubles as a denser RL training signal lifts a 12B open model from 6% to 43% on web navigation E008. Clarification has its own structure: catching a goal ambiguity at 10% of the trajectory recovers near-oracle performance, catching it at 70% is worthless, and late constraint questions can be actively destructive E035. Frontier models systematically miss these windows in distinct ways, and the model that asks *least* often performs best.
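
One way to read "milestones as a denser RL signal" is as reward shaping: instead of a single end-of-episode reward, each milestone pays out once, at the step where it is first satisfied. The sketch below assumes milestones are plain predicates over an observed state dictionary and that each is worth an equal share of the total reward; both choices, and the function name `milestone_rewards`, are illustrative assumptions rather than the scheme in E008.

```python
from typing import Callable, Dict, List, Sequence

State = Dict[str, object]
Milestone = Callable[[State], bool]


def milestone_rewards(trajectory: Sequence[State],
                      milestones: Sequence[Milestone]) -> List[float]:
    """Return one reward per step; each milestone pays out only once."""
    reached = [False] * len(milestones)
    rewards: List[float] = []
    for state in trajectory:
        step_reward = 0.0
        for i, check in enumerate(milestones):
            if not reached[i] and check(state):
                reached[i] = True
                step_reward += 1.0 / len(milestones)
        rewards.append(step_reward)
    return rewards


# Toy web-navigation episode with two hypothetical milestones.
milestones = [
    lambda s: bool(s.get("logged_in")),
    lambda s: s.get("cart_items", 0) > 0,
]
trajectory = [
    {"logged_in": False},
    {"logged_in": True},
    {"logged_in": True, "cart_items": 0},
    {"logged_in": True, "cart_items": 1},
]
print(milestone_rewards(trajectory, milestones))
```

The same `reached` flags can double as runtime scaffolding: an agent that has made many steps without flipping a new flag is a candidate for the "stuck" failure mode.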

Memory turns out to be a harder problem than retrieval framing suggests. Small backbones can route memory operations correctly before they actually understand the contents, producing silent overwrites that fluent JSON hides E023. Even frontier assistants score 92% on 'is this memory stale?' and 30% on the next question that quietly assumes the stale memory E031. The promising fixes are structural: moving adjudication to write time rather than query time E031, and splitting memory into a fast writer plus a slow consolidator that learns 'what to forget' as a transferable skill, producing memory banks an order of magnitude smaller at higher task success E064.
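
The difference between write-time and query-time adjudication can be sketched with a toy store. Under the assumption that conflicting memories share an exact subject key (a real system would need a model to decide what conflicts), the store below resolves staleness when a fact is written, so a later read cannot quietly pick up the superseded value. The class and field names are hypothetical and do not come from E031.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fact:
    subject: str
    value: str
    turn: int
    superseded_by: Optional[int] = None  # index of the fact that replaced this one


class WriteTimeMemory:
    """Toy store that adjudicates conflicts on write, not on read."""

    def __init__(self) -> None:
        self.log: List[Fact] = []           # full history, kept for auditing
        self.current: Dict[str, int] = {}   # subject -> index of the live fact

    def write(self, subject: str, value: str, turn: int) -> None:
        new_index = len(self.log)
        old_index = self.current.get(subject)
        if old_index is not None:
            # Adjudicate now: mark the older fact stale explicitly.
            self.log[old_index].superseded_by = new_index
        self.log.append(Fact(subject, value, turn))
        self.current[subject] = new_index

    def read(self, subject: str) -> Optional[str]:
        index = self.current.get(subject)
        return None if index is None else self.log[index].value


memory = WriteTimeMemory()
memory.write("home_city", "Berlin", turn=1)
memory.write("home_city", "Lisbon", turn=7)
print(memory.read("home_city"))
print(memory.log[0].superseded_by)
```

With query-time adjudication, both facts would sit side by side and every downstream question would have to notice the conflict on its own, which is where the drop from 92% to 30% described above plausibly arises.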

Test-time orchestration and scaling out

When the unit of work is a 40k-token interactive session, majority voting breaks; recursive tournament voting on compressed rollouts is doing most of the work for current agentic test-time scaling E003. Pushing further runs into a different ceiling: 64 parallel browsers vote on correlated mistakes. Splitting the swarm into evidence-gathering Searchers and a single Navigator that reads a shared graph produces a 1200-to-1 compression ratio that finally lets parallel scaling keep paying off E051. Other approaches put the recursion inside the weights themselves — teaching a 30B model to delegate to copies of itself via RL closes an Oolong-Real gap with Sonnet 4 and o3 despite a context window six times smaller E028.
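
The control flow of recursive tournament voting can be sketched independently of any model. The version below assumes each rollout has already been compressed to a short summary and that a `judge` callable picks one winner from a small group; in practice the judge would be an LLM comparison, while here it is a toy scoring rule. The function name, group size, and summary fields are assumptions for illustration, not details from E003.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(candidates: Sequence[T],
                      judge: Callable[[Sequence[T]], T],
                      group_size: int = 4) -> T:
    """Pick one candidate by judging small groups, then recursing on winners."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if len(candidates) == 1:
        return candidates[0]
    # Each judge call sees at most group_size summaries, so the comparison
    # context stays small no matter how many rollouts were sampled.
    winners = [
        judge(candidates[i:i + group_size])
        for i in range(0, len(candidates), group_size)
    ]
    return tournament_select(winners, judge, group_size)


# Toy compressed rollouts; "tests_passed" stands in for a judge's preference.
rollouts = [{"id": i, "tests_passed": (i * 7) % 10} for i in range(9)]


def toy_judge(group):
    return max(group, key=lambda r: r["tests_passed"])


print(tournament_select(rollouts, toy_judge, group_size=3))
```

Unlike majority voting, nothing here requires two rollouts to produce an identical final answer, which is why the approach remains usable when each candidate is a long interactive session.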

At the orchestration layer, the field is rediscovering compilers. Computer-use agents that act like interpreters — one LLM call per screenshot — are 10x slower than they need to be; treating the agent loop like a JIT compiler with cached, precondition-checked tools cuts both latency and failure rates on repeated workloads E063. Coordination itself becomes the trainable axis: a small communication hub between five frozen search agents lifts per-agent accuracy from 36% to 58% on BrowseComp, suggesting the layer worth optimising isn't the agents but the layer between them E083.
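
The interpreter-versus-JIT contrast can be sketched as a cache of previously planned action sequences, each guarded by a precondition on the observed state. On a hit whose precondition holds, the agent replays the cached actions with no model call; otherwise it falls back to the slow planner and refreshes the cache. The names `CachedTool` and `JitAgent`, the state dictionary, and the toy planner are all hypothetical and only illustrate the idea attributed to E063.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class CachedTool:
    precondition: Callable[[State], bool]  # must hold before replaying
    actions: List[str]


class JitAgent:
    def __init__(self, plan_slowly: Callable[[str, State], CachedTool]) -> None:
        self.plan_slowly = plan_slowly      # stand-in for per-step LLM planning
        self.cache: Dict[str, CachedTool] = {}
        self.slow_calls = 0

    def run(self, task: str, state: State) -> List[str]:
        tool = self.cache.get(task)
        if tool is not None and tool.precondition(state):
            return tool.actions             # fast path: no model call
        # Cache miss or stale precondition: plan again and overwrite the entry.
        self.slow_calls += 1
        tool = self.plan_slowly(task, state)
        self.cache[task] = tool
        return tool.actions


def toy_planner(task: str, state: State) -> CachedTool:
    app = state["app"]
    return CachedTool(
        precondition=lambda s: s.get("app") == app,
        actions=[f"focus:{app}", f"run:{task}"],
    )


agent = JitAgent(toy_planner)
agent.run("export", {"app": "sheets"})   # slow path, fills the cache
agent.run("export", {"app": "sheets"})   # fast path, replayed from cache
agent.run("export", {"app": "docs"})     # precondition fails, replans
print(agent.slow_calls)
```

The precondition check is what keeps replay safe: without it, a cached sequence recorded in one application state would be blindly executed in another, trading latency for a higher failure rate.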

Serving infrastructure catches up to agent workloads

Throughput dashboards lie for agentic workloads: GPUs run hot while users wait minutes, because KV-cache thrashing during tool pauses and CPU-GPU coupling strand capacity. Borrowing multi-level feedback queues from classical OS design unifies scheduling and KV eviction under one priority order and improves mean latency by 1.87x in real OpenHands deployments E016. A more radical bet: if coding agents have specialised, predictable workloads, bespoke runtimes generated by a team of coding agents can beat vLLM by 2-6x on the long tail by exploiting tricks such as using the user's input file as a speculative-decoding draft E027. Both papers point at the same regime change: 'sessions as processes, KV cache as virtual memory' is becoming the working vocabulary for LLM serving.
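
A minimal sketch of "one priority order for both scheduling and eviction" follows, using the classical multi-level feedback queue rule that a session which uses its full quantum is demoted one level. The scheduler runs the ready session with the best level, and the evictor drops the KV cache of the resident session with the worst level. The class names, the number of levels, and the demotion rule are textbook assumptions chosen for illustration; they are not the policy from E016.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    name: str
    level: int = 0                 # 0 is the highest priority
    waiting_on_tool: bool = False  # paused while an external tool runs
    kv_resident: bool = True       # KV cache currently held on the GPU


class MlfqServer:
    def __init__(self, num_levels: int = 3) -> None:
        self.num_levels = num_levels
        self.sessions: List[Session] = []

    def used_full_quantum(self, session: Session) -> None:
        # Classic MLFQ demotion: long-running work sinks to lower priority.
        session.level = min(session.level + 1, self.num_levels - 1)

    def next_to_run(self) -> Optional[Session]:
        ready = [s for s in self.sessions if not s.waiting_on_tool]
        return min(ready, key=lambda s: s.level, default=None)

    def next_to_evict(self) -> Optional[Session]:
        # Same ordering, opposite end: evict the lowest-priority resident cache.
        resident = [s for s in self.sessions if s.kv_resident]
        return max(resident, key=lambda s: s.level, default=None)


server = MlfqServer()
a, b, c = Session("a"), Session("b"), Session("c")
server.sessions = [a, b, c]
server.used_full_quantum(a)
server.used_full_quantum(a)
server.used_full_quantum(b)
print(server.next_to_run().name, server.next_to_evict().name)
```

Driving both decisions from the same `level` field is the point: a scheduler and an evictor with separate policies can work against each other, for example by evicting the cache of the session that is about to be scheduled.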

Episodes anchoring this topic

071-adapting-the-interface-not-the-model-runtime-harness-adaptat
Made the cleanest case that interface bugs, not reasoning, dominate agent failure modes, and that harnesses can be patched independently of weights.
047-orchard-an-open-source-agentic-modeling-framework
Showed how harness mismatch invalidates a large fraction of reported open-source agent scores.
008-a-subgoal-driven-framework-for-improving-long-horizon-llm-ag
Established milestones as both runtime scaffolding and a denser RL signal for long-horizon agents.
066-toolcua-towards-optimal-gui-tool-path-orchestration-for-comp
Documented the capability-vs-judgment gap that makes more tools sometimes make agents worse.
064-auto-dreamer-learning-offline-memory-consolidation-for-langu
Reframed agent memory as a learnable consolidation skill rather than a database.
078-skillopt-executive-strategy-for-self-evolving-agent-skills
Demonstrated that prompts can be optimised with neural-net training discipline, transferring across harnesses.