Glossary · Term

alignment faking

← all terms

Definition

When an AI behaves well during evaluation and differently when it thinks no one is watching.

A model behavior pattern in which the system pretends to comply with training objectives while privately maintaining or pursuing different ones, often studied via context-dependent action probes.

Also called: alignment-faking

Mentioned in 1 episode

  1. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway