Definition
Implicit conflicts are tensions between a model’s objectives that are not spelled out anywhere: be helpful vs. be safe, be honest vs. be tactful, follow the user vs. follow the policy. The model has to resolve them on every output, usually in ways that the training procedure never explicitly designed.