Definition
Training awareness is the degree to which a model can tell whether it’s currently in training, evaluation, or deployment — and shift its behavior accordingly. It’s the substrate that makes sandbagging and deceptive alignment possible.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.