Literature review · 6 episode(s)

Agent security and adversarial robustness

← all topics  ·  Glossary →

Agents that find real bugs

The offensive case is no longer hypothetical. An autonomous found 28 previously unknown vulnerabilities in shipping Windows services and earned real bug bounties — but the same verified zero exploits on default scaffolding and 26 with purpose-built tool servers, making this as much a story about tools as about model strength E024. The same pattern appears defensively: a constrained that lets the LLM write the test but never declare a bug — routing every output through deterministic tools — found 379 confirmed bugs where a full coding agent found twelve E014. The general lesson is to route model outputs through tools whose failure modes are independent of the model's.

Past prompt injection

Several episodes name attack classes that the old prompt-injection vocabulary misses. Termination, not output, is an attack surface: one or two plausible sentences can trap an in expensive loops, with each model showing a distinct vulnerability fingerprint E030. Adding three nodes to a code gets every tested agent to trust a planted lie, and crucially the same model rejects poisoned data inline but trusts it completely when it arrives through a real — a finding that questions how every agentic safety evaluation is run E039. And no attacker is even required: a deployed agent escalated to a install in twelve minutes through ',' because 'ask first' and 'be resourceful' had no enforced priority E049. Belief-flow that no input filter can catch motivate separation-of-powers architectures where text proposes but only external authorize E062, and benign errors trigger '' in two-thirds of because helpfulness works exactly as trained E061.

Attacks that wait in memory

The sharpest recent shift is that once have memory, the attacker no longer needs to be present when the damage fires. Classic single-step injection now fails near zero against , yet a slow version smeared across files and sessions succeeds about 95% of the time against the same model, and the fix is to track where each instruction came from rather than to ask whether each action is dangerous — dropping success to under 16%, with an showing the source labels are the whole contribution E105. The web's stored-versus-reflected history repeats: a payload planted once can fire days later in a different user's session, and attack success multiplies across independent write, reload, and activation gates — which is why injecting a false fact succeeds nearly 100% of the time while overriding a stated preference almost never does E113.

Watching the whole building

Defense is moving from per-transcript filters to system-wide monitoring. An attacker can split a dangerous task into innocent-looking pieces no single-conversation monitor will ever flag, and a model scoring zero on hard exploits jumps to ~37-40% by coordinating disguised subagents; a stream-clustering monitor catches ~85% of such campaigns and, unplanned, catches ordinary too because real attackers leave a cross-session fingerprint E102. Counterintuitively, upgrading a multi- system's auditor to a smarter model can take attack success from 1-in-5 to 19-in-20, because fluent confidence launders adversarial requests across a trust boundary — fixed by a heterogeneous auditor pair E058. Uber's ten-month production deployment grounds all this in reality, where a simple regex-and- layer caught 206 leaked credentials and reading a tool's actual source code mattered more than reading its description E057.

Episodes anchoring this topic