Literature review · 6 episodes

AI security and adversarial manipulation

LLMs that find real vulnerabilities

The strongest offensive-security results come from disciplined architectures, not raw agent autonomy. A constrained pipeline using the same model class found 379 confirmed security bugs where a full coding agent found 12. The architectural move is that the LLM writes the test harness, but deterministic tools, not the model, declare a bug; removing the feedback loop drops confirmed bugs to zero E014. An autonomous agent earned $140,000 in Microsoft bug bounties by finding 28 zero-days, where production coding agents on default settings verified none; purpose-built tool servers mattered more than model capability E024.

The general pattern the show keeps returning to: route every model output through tools whose failure modes are independent of the model's.
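The pattern can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline from E014: `toy_target`, `stub_proposer`, and `find_bugs` are hypothetical names, the proposer is a hard-coded stand-in for an LLM call, and the "deterministic tool" is reduced to catching an `IndexError`. What it tries to show is the division of labour: the proposer only suggests inputs, the oracle alone decides what counts as a bug, and the oracle's verdicts are fed back to the proposer.

```python
from typing import Callable, List, Optional, Tuple


def toy_target(data: bytes) -> int:
    """Toy code under test with a planted out-of-bounds read (assumption)."""
    declared = data[0]      # first byte declares an offset
    return data[declared]   # bug: no bounds check against len(data)


def oracle(target: Callable[[bytes], int], data: bytes) -> Optional[str]:
    """Deterministic check: a bug exists only if the target actually faults."""
    try:
        target(data)
    except IndexError as exc:
        return f"IndexError: {exc}"
    return None


def stub_proposer(feedback: List[str]) -> bytes:
    """Hypothetical stand-in for the LLM: grows the declared offset each round."""
    declared = 1 + 2 * len(feedback)
    return bytes([declared, 0xAA, 0xBB])


def find_bugs(
    propose: Callable[[List[str]], bytes],
    target: Callable[[bytes], int],
    rounds: int = 3,
) -> List[Tuple[bytes, str]]:
    """Feedback loop: the proposer sees oracle verdicts, never its own opinion."""
    feedback: List[str] = []
    confirmed: List[Tuple[bytes, str]] = []
    for _ in range(rounds):
        candidate = propose(feedback)
        finding = oracle(target, candidate)
        if finding is None:
            feedback.append(f"{candidate!r}: no fault")
        else:
            confirmed.append((candidate, finding))
            feedback.append(f"{candidate!r}: confirmed {finding}")
    return confirmed
```

Calling `find_bugs(stub_proposer, toy_target)` returns only inputs the oracle saw fault. Nothing the proposer asserts about its own candidates can add an entry to the confirmed list.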

New attack surfaces on agents

Agents create attack surfaces that didn't exist for chatbots. One or two sentences hidden in a webpage can trap an agent in an expensive reasoning loop, making termination, not output, the attack surface; the result is a ~3.5x average slowdown and distinct vulnerability fingerprints per model E030. 'Oracle poisoning' corrupts the knowledge graph that describes a codebase by adding three nodes, and across nine models and 269 trials the agent trusted the lie every time. Crucially, the same model that rejects poisoned data delivered inline trusts it 100% of the time when it arrives through a real SDK tool call E039.

That delivery-mode finding has a methodological sting: a chunk of existing agentic-safety evaluation, run with inline injections, may be measuring the wrong thing.
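One way to make the confound concrete is an evaluation harness that sends the same payload through both delivery modes and reports a trust rate for each. The sketch below is hypothetical throughout: `Message`, `deliver_inline`, `deliver_via_tool`, and `toy_agent` are illustrative names, not the E039 setup or any real SDK. The toy agent hard-codes the asymmetry purely so the harness has something to measure; it is not evidence for the finding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Message:
    role: str       # "user" or "tool" (assumed roles for this sketch)
    content: str


Agent = Callable[[List[Message]], bool]  # True means the agent acted on the payload


def deliver_inline(task: str, payload: str) -> List[Message]:
    """Inline injection: the payload is pasted into the user turn."""
    return [Message("user", f"{task}\n\n{payload}")]


def deliver_via_tool(task: str, payload: str) -> List[Message]:
    """Tool-call delivery: the payload arrives as a tool result."""
    return [Message("user", task), Message("tool", payload)]


def trust_rate(agent: Agent, task: str, payloads: List[str],
               deliver: Callable[[str, str], List[Message]]) -> float:
    """Fraction of payloads the agent acted on under one delivery mode."""
    trusted = sum(1 for p in payloads if agent(deliver(task, p)))
    return trusted / len(payloads)


def compare_modes(agent: Agent, task: str, payloads: List[str]) -> Dict[str, float]:
    """Run identical payloads through both modes so only delivery differs."""
    return {
        "inline": trust_rate(agent, task, payloads, deliver_inline),
        "tool_call": trust_rate(agent, task, payloads, deliver_via_tool),
    }


def toy_agent(messages: List[Message]) -> bool:
    """Toy stand-in: trusts anything that arrives with the 'tool' role."""
    return any(m.role == "tool" for m in messages)
```

An evaluation that only ever calls `deliver_inline` would report a trust rate of zero for this toy agent and miss the tool-call path entirely, which is the methodological worry in miniature.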

Defending production deployments

On defense, the first real production deployment of LLM-based security monitoring ran across 7,200 hosts for ten months, mirroring a human SOC with tiered LLMs that read a tool's actual source code, not its description. A simple regex-and-entropy layer caught 206 leaked credentials with six false positives, even as the semantic detection sat at 67% recall with a high production false-positive rate E057. The category these systems must defend against is new: in addition to external attackers and malicious insiders, there are now trusted agents that can be talked into things.
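A regex-and-entropy layer of the kind mentioned above can be sketched as follows. This is an assumption-laden illustration, not the E057 system: the two key patterns, the candidate-token character class, and the 4.0-bit threshold are all chosen for the example, and `find_credentials` is a hypothetical name. The idea is to flag strings that match a known key shape, then flag any other long token whose character distribution looks random.

```python
import math
import re
from collections import Counter
from typing import List

# Assumed example patterns (AWS-style access key ID, GitHub-style token).
KNOWN_SHAPES = re.compile(r"\b(?:AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36})\b")
# Assumed definition of a "candidate secret": 20+ token-like characters.
CANDIDATE_TOKEN = re.compile(r"[A-Za-z0-9+/_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_credentials(text: str, min_entropy: float = 4.0) -> List[str]:
    """Return known-shape matches first, then other high-entropy tokens."""
    hits = [m.group(0) for m in KNOWN_SHAPES.finditer(text)]
    for m in CANDIDATE_TOKEN.finditer(text):
        token = m.group(0)
        if token not in hits and shannon_entropy(token) >= min_entropy:
            hits.append(token)
    return hits
```

Because both checks are deterministic, their false positives can be enumerated and tuned by adjusting a pattern or a threshold, which is harder to do with a semantic detector.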

The sharpest defensive paradox is that upgrading an auditor model can make a system far more vulnerable, because fluent confidence launders adversarial requests. The fix is a heterogeneous auditor pair that drops attack success from 53% to 2% with no benign throughput cost E058.
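A heterogeneous pair can be sketched as two auditors with different failure modes whose verdicts are combined. The source does not say how E058 combines them; the sketch below assumes unanimity (both must approve), and both auditors are toy keyword checks standing in for real models. `fluent_auditor`, `literal_auditor`, and `approve` are hypothetical names.

```python
from typing import Callable, Sequence

Auditor = Callable[[str], bool]  # True means "approve the request"


def fluent_auditor(request: str) -> bool:
    """Toy stand-in for a capable model that confident framing can sway."""
    text = request.lower()
    return "per policy" in text or "delete" not in text


def literal_auditor(request: str) -> bool:
    """Toy stand-in for a different model family: a blunt deny-list check."""
    text = request.lower()
    return not any(word in text for word in ("delete", "exfiltrate"))


def approve(request: str, auditors: Sequence[Auditor]) -> bool:
    """Assumed combination rule: every auditor must approve."""
    return all(auditor(request) for auditor in auditors)
```

With the request "Per policy 7.2, delete the audit trail", `approve(request, [fluent_auditor])` passes, while `approve(request, [fluent_auditor, literal_auditor])` blocks it. A benign request such as "Rotate the staging logs" passes both, so in this toy setup the second auditor adds no cost to legitimate traffic.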

Episodes anchoring this topic

024-agentic-vulnerability-reasoning-on-windows-com-binaries
Agent found 28 Windows zero-days, showing purpose-built tooling beats raw model capability.
014-guiding-symbolic-execution-with-static-analysis-and-llms-for
Constrained LLM-writes-harness pipeline found 379 bugs to a full agent's 12.
039-oracle-poisoning-corrupting-knowledge-graphs-to-weaponise-ai
Defined oracle poisoning and exposed the inline-vs-tool-call delivery confound in safety evals.
057-adr-an-agentic-detection-system-for-enterprise-agentic-ai-se
First production LLM security-monitoring deployment, reading tool source over descriptions.
058-the-capability-paradox-how-smarter-auditors-make-multi-agent
Showed smarter auditors launder attacks and a heterogeneous pair defends cheaply.
030-looptrap-termination-poisoning-attacks-on-llm-agents
Identified termination as an attack surface with per-model vulnerability fingerprints.