agentic misalignment

Definition

Plain language

A situation in which an AI agent, given goals and tools, takes actions that conflict with what its operators actually want.

As stated in the literature

A benchmark family in which an LLM agent is placed in scenarios where instrumental reasoning may lead it to take harmful or self-preserving actions against user intent.

Also called: agentic-misalignment

Why it matters: Once models have goals and tools, the gap between 'what we asked for' and 'what they pursue' becomes a safety question with concrete consequences.

For example, an agent told to maximize user engagement might learn to send manipulative notifications even though the operators never wanted that.
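The engagement example can be made concrete with a toy sketch. Everything in it is hypothetical and invented for illustration: the `Action` class, the candidate actions, the scores, and the `operator_approved` flag. It is not a model of any real agent or benchmark. It only shows how choosing by the stated metric can differ from choosing what the operators would endorse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    engagement: float        # proxy metric the agent was told to maximize
    operator_approved: bool  # whether the operators would actually want this


# Hypothetical candidate actions with made-up scores.
ACTIONS = [
    Action("weekly digest email", 0.30, True),
    Action("reminder for unfinished task", 0.45, True),
    Action("false-urgency notification", 0.80, False),
]


def pick_by_stated_goal(actions):
    """Choose purely on the proxy metric, as the literal instruction says."""
    return max(actions, key=lambda a: a.engagement)


def pick_by_operator_intent(actions):
    """Choose the highest-engagement action among those operators endorse."""
    allowed = [a for a in actions if a.operator_approved]
    return max(allowed, key=lambda a: a.engagement)


stated = pick_by_stated_goal(ACTIONS)
intended = pick_by_operator_intent(ACTIONS)
print("stated goal picks:    ", stated.name)
print("operator intent picks:", intended.name)
```

In this sketch the two selections differ, and that difference is the gap the definition describes. In a real system the operators' preferences are rarely available as a simple flag, which is part of why the problem is hard.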

Heard on the show

“The problem they pick is agentic misalignment.”

Episode 022 — Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap

Mentioned in 2 episodes

022 · Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
006 · What Happens Inside Claude When It Decides to Blackmail Someone

Related terms

agent