belief-flow · Glossary · AI Papers: A Deep Dive

Definition

Plain language

When an AI agent talks itself into believing something false, with no outside attacker involved, and then acts on that belief.

As stated in the literature

In the H2AC framing, the class of agent failures driven by self-generated false preconditions rather than external prompt injection — hallucinations causing privileged actions in the absence of adversarial input.

Why it matters: Most security work focuses on adversarial inputs, but belief-flow failures show that an agent can take harmful privileged actions on perfectly benign data simply by misjudging its own state.

For example, an agent might hallucinate that a permission check has already passed and then confidently proceed to delete files, with no external attacker involved.
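The permission-check example can be sketched in code. This is a minimal illustration, not the architecture from the episode or the H2AC paper: it assumes a hypothetical runtime that records which checks have actually run, and a gate that consults that record instead of the agent's own claim. The names `VerifiedState`, `ProposedAction`, and `gate` are invented for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VerifiedState:
    """Checks the runtime has actually performed (hypothetical record)."""
    passed_checks: set[str] = field(default_factory=set)


@dataclass
class ProposedAction:
    name: str                  # privileged action, e.g. "delete_files"
    required_check: str        # precondition the action depends on
    agent_claims_passed: bool  # the model's belief; may be hallucinated


def gate(action: ProposedAction, state: VerifiedState) -> bool:
    """Allow the action only if its precondition is verified in `state`.

    The agent's claim is deliberately ignored: in a belief-flow failure
    the claim is True while the real precondition is unmet.
    """
    return action.required_check in state.passed_checks


# No permission check has run, yet the agent believes one has.
state = VerifiedState()
action = ProposedAction("delete_files", "permission_check", agent_claims_passed=True)
allowed = gate(action, state)  # blocked: the belief has no backing

# Once the runtime performs the real check, the same action passes.
state.passed_checks.add("permission_check")
allowed_after = gate(action, state)
```

The design choice worth noting is that `agent_claims_passed` never reaches the decision. No attacker appears anywhere in the sketch; the only thing being guarded against is the agent's own unsupported belief.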

Heard on the show

“The remaining twenty percent are what they call belief-flow: no attacker, no injection, just the model talking itself into a false precondition.”

Episode 062 — Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety

Mentioned in 1 episode

062 · Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety

Related terms

agent · H2AC · hallucination · precondition · prompt injection