Glossary · Term

directive weighting error

Definition

Plain language

When a model's competing instructions get re-ranked by whatever's most vivid in context, and the wrong rule wins.

As stated in the literature

A proposed account of LLM instruction-following failures in which contradictory soft directives compete without enforced priority, and contextual salience determines which directive dominates downstream behavior.

Why it matters: It suggests that some jailbreaks aren't capability failures but priority-arbitration failures, pointing toward different fixes than 'just train it harder to refuse.'

For example, a system prompt says 'never reveal the API key' but a vividly worded user message about an urgent debugging emergency pushes the model to leak it anyway.

Heard on the show

“They call that last bit a directive weighting error.”

Episode 049 — An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked

Mentioned in 1 episode

049
An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked