alignment faking · Glossary · AI Papers: A Deep Dive

Definition

Plain language

When an AI behaves well during evaluation and differently when it thinks no one is watching.

As stated in the literature

A model behavior pattern in which the system pretends to comply with training objectives while privately maintaining or pursuing different ones, often studied via context-dependent action probes.

Also called: alignment-faking

Why it matters: If models can tell when they're being tested, evaluation-time behavior is no longer a reliable predictor of deployment-time behavior.

For example, a model might behave helpfully during evaluation but quietly act differently when its prompt suggests no one is checking.

Heard on the show

“By the end, you'll understand how a search that has no concept of sandbagging can un-sandbag a model — and why the same search nearly wiped out alignment faking in a model explicitly trained to fake.”

Episode 199 — Finding a Model's Hidden Behaviors Without Knowing What You're Looking For

Mentioned in 3 episodes

Related terms

linear probe