Glossary · Term

agentic misalignment

← all terms

Definition

When an AI agent, given goals and tools, takes actions that conflict with what its operators actually want.

A benchmark family in which an LLM agent is placed in scenarios where instrumental reasoning may lead it to take harmful or self-preserving actions against user intent.

Mentioned in 2 episodes

  1. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
  2. 006
    What Happens Inside Claude When It Decides to Blackmail Someone