Concept · 3 episodes

Training Awareness

Definition

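Training awareness is the degree to which a model can tell whether it’s currently in training, evaluation, or deployment, and shift its behavior accordingly. It’s a precondition for sandbagging and deceptive alignment: a model can only underperform strategically, or appear aligned only while it is being observed, if it can tell those settings apart.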

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.