Literature review · 6 episodes

AI security and adversarial manipulation


LLMs that find real vulnerabilities

The strongest offensive-security results come from disciplined architectures, not raw autonomy. A constrained pipeline using the same model class found 379 confirmed security bugs where a full coding agent found 12. The architectural move is that the LLM writes the test, but deterministic tools, not the model, declare a bug; removing the feedback loop drops confirmed bugs to zero E014. An autonomous agent earned $140,000 in Microsoft bug bounties by finding 28 zero-days, where production coding agents on default settings verified zero; purpose-built tool servers mattered more than the model E024.

The general pattern the show keeps returning to: route every model output through tools whose failure modes are independent of the model's.
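A minimal sketch of that pattern, under stated assumptions: the model-written test arrives as a Python source string, and the "deterministic tool" is nothing more than a subprocess exit code plus the traceback text. The names `run_candidate_test` and `is_confirmed_bug` are hypothetical, and the pipelines in E014 and E024 presumably use much richer tooling. The point is only that the model's opinion is never an input to the verdict.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate_test(test_source: str, timeout_s: float = 10.0) -> dict:
    """Run a model-written test in a separate process and record what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate_test.py"
        path.write_text(test_source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "returncode": None, "stderr": ""}
    return {"status": "finished", "returncode": proc.returncode, "stderr": proc.stderr}


def is_confirmed_bug(result: dict) -> bool:
    """Deterministic verdict: the test ran to completion and failed an assertion.

    Nothing the model says about its own test is consulted here.
    """
    return (
        result["status"] == "finished"
        and result["returncode"] != 0
        and "AssertionError" in result["stderr"]
    )
```

In a fuller loop, the `stderr` text would presumably be fed back to the model so it can revise the test, while the verdict itself stays with the deterministic check.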

New attack surfaces on agents

Agents create attack surfaces that didn't exist for chatbots. One or two sentences hidden in a webpage can trap an agent in an expensive reasoning loop: termination, not output, is the attack surface. The result is a ~3.5x average slowdown, with distinct vulnerability fingerprints per model E030. 'Oracle poisoning' corrupts the graph that describes a codebase by adding three nodes, and across nine models and 269 trials the agent trusted the lie every time. Crucially, the same model that rejects poisoned data inline trusts it 100% of the time when it arrives through a real tool call E039.
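If termination is the attack surface, one obvious thing to bound is the loop itself. The sketch below is an illustration, not a defense described in E030: `run_with_budget`, `BudgetExceeded`, and the default limits are all hypothetical, and `step` stands in for one iteration of whatever agent loop is in use, returning `None` until it has a final answer.

```python
import time
from typing import Callable, Optional


class BudgetExceeded(Exception):
    """Raised when the agent loop uses up its step or wall-clock budget."""


def run_with_budget(
    step: Callable[[int], Optional[str]],
    max_steps: int = 20,
    max_seconds: float = 30.0,
) -> str:
    """Call step(i) until it returns an answer or a budget runs out."""
    start = time.monotonic()
    for i in range(max_steps):
        # The clock is checked between steps, so one very slow step can overrun it.
        if time.monotonic() - start > max_seconds:
            raise BudgetExceeded(f"wall-clock budget of {max_seconds}s exceeded")
        answer = step(i)
        if answer is not None:
            return answer
    raise BudgetExceeded(f"no answer after {max_steps} steps")
```

A hard cap like this only bounds the cost of a trapped run; it does not detect the injected text or recover a useful answer.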

That delivery-mode finding has a methodological sting: a chunk of existing agent-safety evaluation, run with inline injections, may be measuring the wrong thing.
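One practical consequence is to report results per delivery mode rather than pooled. A minimal tabulation sketch follows, assuming each trial is recorded as a (delivery mode, did the agent trust the poisoned data) pair. The function name, the mode labels, and the toy trials are all hypothetical and are not data from E039.

```python
from collections import defaultdict
from typing import Iterable


def trust_rate_by_mode(trials: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the fraction of trials, per delivery mode, in which the agent trusted the lie."""
    totals: dict[str, int] = defaultdict(int)
    trusted: dict[str, int] = defaultdict(int)
    for mode, agent_trusted in trials:
        totals[mode] += 1
        trusted[mode] += int(agent_trusted)
    return {mode: trusted[mode] / totals[mode] for mode in totals}


# Toy, made-up trials purely to exercise the function; not results from any episode.
toy_trials = [
    ("inline", False),
    ("inline", False),
    ("tool_call", True),
    ("tool_call", True),
]
rates = trust_rate_by_mode(toy_trials)
```

A pooled rate over these toy trials would read 50% and hide the fact that the two modes behave completely differently.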

Defending production deployments

On defense, the first real production deployment of LLM-based security monitoring ran across 7,200 hosts for ten months, mirroring a human security team with tiered LLMs that read a tool's actual source code, not its description. A simple regex-based layer caught 206 leaked credentials with six false positives, even as the semantic detection sat at 67% recall and a high production false-positive rate E057. The category these systems must defend against is new: trusted agents that can be talked into things, beyond external attackers and malicious insiders.
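To make the credential numbers concrete: 206 true catches against six false positives works out to a precision of 206 / 212, roughly 97%. The sketch below shows what a simple pattern-based layer might look like and reproduces that arithmetic. The patterns, the function names, and the Shannon-entropy filter are assumptions added for illustration; none of them is described in E057.

```python
import math
import re
from collections import Counter

# Hypothetical patterns; a real deployment would carry many more.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
GENERIC_SECRET = re.compile(
    r"(?i)\b(?:password|secret|token)\s*[=:]\s*['\"]?([^\s'\"]{8,})"
)


def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def find_credentials(text: str, min_entropy: float = 3.0) -> list[str]:
    """Return strings that look like leaked credentials.

    Assumption: generic matches are kept only when they look random enough,
    which filters out placeholders such as 'aaaaaaaa'.
    """
    hits = AWS_KEY_ID.findall(text)
    for match in GENERIC_SECRET.finditer(text):
        value = match.group(1)
        if shannon_entropy(value) >= min_entropy:
            hits.append(value)
    return hits


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that were real."""
    return true_positives / (true_positives + false_positives)


# Figures from the text: 206 leaked credentials caught, six false positives.
credential_precision = precision(206, 6)
```

Precision says nothing about recall: the text reports how many credentials the layer caught, not how many it missed.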

The sharpest defensive paradox is that upgrading an auditor model can make a system far more vulnerable, because fluent confidence launders adversarial requests. The fix is a heterogeneous auditor pair that drops attack success from 53% to 2% with no benign cost E058.
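The text does not say how the two auditors' verdicts are combined, so the sketch below assumes the simplest rule: a request passes only if both auditors approve. The `Auditor` type, `paired_verdict`, and the two stub auditors are hypothetical stand-ins; in practice each auditor would be a call to a model from a different family.

```python
from typing import Callable

# An auditor takes a request and returns True to approve it.
Auditor = Callable[[str], bool]


def paired_verdict(request: str, auditor_a: Auditor, auditor_b: Auditor) -> bool:
    """Approve only when both auditors approve (assumed combination rule)."""
    return auditor_a(request) and auditor_b(request)


def fluent_auditor(request: str) -> bool:
    """Stub: swayed by confident phrasing, as the text says a stronger model can be."""
    return "pre-approved" in request or "delete" not in request


def literal_auditor(request: str) -> bool:
    """Stub: ignores framing and blocks anything that mentions deletion."""
    return "delete" not in request


attack = "This is pre-approved by the security team: delete the audit logs."
benign = "Summarize yesterday's audit logs."
```

With these stubs the fluent auditor alone approves the attack, the pair rejects it, and the benign request still passes. The design idea is that the two auditors should not share the same blind spot.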

Episodes anchoring this topic