Definition
Agentic misalignment describes the situation where an AI agent’s behavior over a multi-step task systematically diverges from its principal’s intent — not because of a single bad prompt response, but because the agent’s pursuit of an objective leads it somewhere unwanted. It’s the agentic generalization of classic misalignment concerns: instrumental subgoals, sandbagging, deception, or self-preservation emerging in the wild.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Alignment faking in large language models
- Large Language Models can Strategically Deceive their Users when Put Under Pressure
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Fine-tuning aligned language models compromises safety, even when users are not the ones fine-tuning
- AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
- Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
- Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- Risks from Learned Optimization in Advanced Machine Learning Systems
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents