Llama Firewall · Glossary · AI Papers: A Deep Dive

Definition

Plain language

An open-source project that puts a separate safety filter between an AI agent and the world it can act on.

As stated in the literature

A runtime-monitoring framework for LLM agents that inspects tool calls and outputs through a separate, simpler system to block unsafe behaviors regardless of model intent.

Why it matters: It moves safety decisions outside the model being defended. An external check is harder to jailbreak than asking the same model to police itself.

For example, when an agent tries to send an outbound email containing what looks like an API key, Llama Firewall blocks the call before it leaves the sandbox.
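The general pattern in that example can be sketched in a few lines of Python. This is only an illustration of an external check on tool calls, not Llama Firewall's actual API: the names `ToolCall`, `Verdict`, `inspect_tool_call`, the `send_email` tool name, and the key-matching regex are all hypothetical and chosen for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristic for strings that look like API keys:
# a short prefix (sk, pk, api, key), a separator, then 16+ key-like characters.
SECRET_PATTERN = re.compile(r"\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{16,}\b")


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    tool: str
    arguments: dict


@dataclass(frozen=True)
class Verdict:
    """The monitor's decision, with a human-readable reason."""
    allowed: bool
    reason: str


def inspect_tool_call(call: ToolCall) -> Verdict:
    """Check a proposed call without consulting the model that produced it."""
    # This sketch only covers one rule: outbound email must not carry secrets.
    if call.tool != "send_email":
        return Verdict(True, "tool not covered by this rule")
    for name, value in call.arguments.items():
        if isinstance(value, str) and SECRET_PATTERN.search(value):
            return Verdict(False, f"possible API key in argument '{name}'")
    return Verdict(True, "no secret-like strings found")


if __name__ == "__main__":
    call = ToolCall(
        tool="send_email",
        arguments={"to": "someone@example.com", "body": "use sk-abcdef1234567890ABCDEF"},
    )
    verdict = inspect_tool_call(call)
    print(verdict.allowed, verdict.reason)
```

The point of the design is that the check runs on the call itself, so it applies whether the agent was tricked, mistaken, or behaving as intended. A real monitor would use far more robust detection than a single regular expression.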

Heard on the show

“Llama Firewall is one project in this vein.”

Episode 061 — When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This

Mentioned in 1 episode

061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This

Related terms

agent tool call