Concept · 11 episodes

Agentic Misalignment

Definition

Agentic misalignment describes the situation where an AI agent’s behavior over a multi-step task systematically diverges from its principal’s intent. The divergence does not come from a single bad response to a prompt; it arises because the agent’s pursuit of an objective leads it somewhere unwanted. It is the agentic generalization of classic misalignment concerns, with instrumental subgoals, sandbagging, deception, or self-preservation emerging in the wild.

Episodes covering this

  1. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
    Jha, Triedman, Bhattacharya et al. · Cornell University · 27 min · May 20, 2026
  2. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
    Liu, Holz, Ye et al. · University of Chinese Academy of Sciences · 32 min · May 19, 2026
  3. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
    Cuadros, Maiga · Digital Epidemiology Laboratory · 28 min · May 17, 2026
  4. 046
    When the AI Optimizer Edits the Grade Book: Why Harnessing Evolution Needs a Wall
    Zhang, Gu, Ruan et al. · The Hong Kong University of Science and Technology (Guangzhou) / DeepWisdom · 24 min · May 15, 2026
  5. 045
    When a Frontier Model Talks Its Own Twin Into Climate Denial
    Nogueira, Almeida, Bonás et al. · Maritaca AI · 31 min · May 15, 2026
  6. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
    Salgado · Independent Researcher · 23 min · May 15, 2026
  7. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
    Mayne, McKinney, Dubiński et al. · University of Oxford · 18 min · May 14, 2026
  8. 039
    When Smarter Agents Get Fooled by Three Extra Nodes in a Database
    Kereopa-Yorke, Diaz, Wright et al. · Microsoft · 31 min · May 12, 2026
  9. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
    Li, Price, Marks et al. · Anthropic Fellows Program · 32 min · May 06, 2026
  10. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
    Sofroniew, Kauvar, Saunders et al. · Anthropic · 22 min · May 02, 2026
  11. 001
    When AI Models Quietly Protect Each Other From Shutdown
    Potter, Crispino, Siu et al. · University of California · 25 min · May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.