Definition
Prompt injection is an attack where adversarial instructions are smuggled into data that a model later reads — a web page, an email, a tool output — causing the model to ignore its real instructions and follow the injected ones. It’s the defining security problem of LLM agents.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
- Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection