Concept · 1 episode

Value Generalization

Definition

Value generalization is the question of whether the values instilled in a model during training carry over to situations outside the training distribution: new contexts, new actors, new stakes. The hopeful version of alignment depends on values generalizing well; the worried version is mostly about cases where they don’t.

Episodes covering this