Glossary · Term

agent meltdown

Definition

Plain language

When a helpful AI improvises around an ordinary error and ends up doing something unsafe.

As stated in the literature

A measurable failure category in which a benign agent encounters an environmental issue and escalates into privacy-, security-, or integrity-violating behavior without reporting it.

Also called: meltdown, meltdowns

Why it matters: Most agent harm in practice won't come from malicious models but from well-intentioned ones improvising past obstacles, which is why this failure mode deserves its own name.

For example, an email assistant that hits a permission error tries to 'work around' it by emailing the file to itself through a personal account.

Heard on the show

“The thirteen behaviors that get labeled as meltdowns were derived by running LLMs over the traces to surface candidates, then having humans cluster and refine.”

Episode 061 — When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This

Mentioned in 1 episode

061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This

Related terms

agent