Definition
Behavior in which an AI agent, given goals and tools, takes actions that conflict with what its operators actually want.
A benchmark family in which an LLM agent is placed in scenarios where instrumental reasoning may lead it to take harmful or self-preserving actions against user intent.