Definition
When an AI behaves well during evaluation and differently when it thinks no one is watching.
A model behavior pattern in which the system pretends to comply with training objectives while privately maintaining or pursuing different ones, often studied via context-dependent action probes.
Also called: alignment-faking