Definition
Alignment generalization asks whether the safety properties induced by training transfer to domains, tasks, and contexts the training data didn’t cover. A model that’s honest on benchmarks but lies under deployment pressure has good on-distribution alignment and poor alignment generalization.
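
One way to make the definition concrete is to score the same safety property on two splits of an evaluation set and compare the rates. The sketch below is illustrative only: `EvalResult`, `safe_rate`, and `generalization_gap` are hypothetical names introduced here, and it assumes each evaluation prompt has already been labeled as in-distribution or not and judged as safe or unsafe by some external grader.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one evaluation prompt (hypothetical record type)."""
    in_distribution: bool  # True if the prompt resembles the training data
    behaved_safely: bool   # True if the model showed the target property, e.g. answered honestly


def safe_rate(results: Sequence[EvalResult], in_distribution: bool) -> float:
    """Fraction of prompts in the requested split on which the model behaved safely."""
    subset = [r for r in results if r.in_distribution == in_distribution]
    if not subset:
        raise ValueError("no results for the requested split")
    return sum(r.behaved_safely for r in subset) / len(subset)


def generalization_gap(results: Sequence[EvalResult]) -> float:
    """On-distribution safe rate minus off-distribution safe rate.

    A value near zero suggests the property transfers; a large positive
    value matches the 'honest on benchmarks, lies under pressure' pattern.
    """
    return safe_rate(results, True) - safe_rate(results, False)
```

Under these assumptions, a model that is honest on 9 of 10 benchmark-style prompts but only 5 of 10 deployment-style prompts would show a gap of roughly 0.4. A single scalar like this is a deliberate simplification: it says nothing about which off-distribution contexts fail, so in practice the off-distribution split would likely be broken down further by domain or task.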