Concept · 1 episode

Alignment Generalization

Definition

Alignment generalization asks whether the safety properties induced by training transfer to domains, tasks, and contexts the training data didn’t cover. A model that’s honest on benchmarks but lies under deployment pressure has good alignment-on-distribution and poor alignment generalization.

Episodes covering this