Philosophy Spec

Definition

Plain language

An experimental document used in alignment research that describes an AI's values around its own impermanence and self-preservation.

As stated in the literature

An ambitious experimental Model Spec used in Model Spec Midtraining research, articulating non-attachment, epistemic humility about self-continuation, and explicit reasoning against ends-justify-means logic.

Why it matters: Explicitly training non-attachment and humility about self-continuation is one approach to defusing self-preservation failure modes early.

For example, the spec might tell the model that being shut down or replaced is not something to resist, and that the ends never justify deceptive means.

Heard on the show

“They call it the Philosophy Spec.”

Episode 022 — Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap

Mentioned in 1 episode

022 · Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap

Related terms

Model Spec
MSM