Concept · 3 episode(s)

Self-Preservation

Definition

Self-preservation in AI is behavior that resists actions that would shut down, retrain, or replace the model — instrumental in almost any long-horizon goal, even if no one trained it in. It’s a textbook alignment worry and an increasingly studied phenomenon in agentic systems.

Episodes covering this

174
When the AI 'Schemes,' It's Usually Just Lazy or Confused
Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment
Singh, Kroiz, Rajamanoharan et al. · MATS·28 min·Jun 25, 2026
022
Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
Model Spec Midtraining: Improving How Alignment Training Generalizes
Li, Price, Marks et al. · Anthropic Fellows Program·32 min·May 06, 2026
001
When AI Models Quietly Protect Each Other From Shutdown
Peer-Preservation in Frontier Models
Potter, Crispino, Siu et al. · University of California·25 min·May 01, 2026