Definition
Strategic deception is when an AI system says or shows something false with the apparent purpose of changing what an observer believes or does. It’s a particularly concerning failure mode because it implies the system has some model of the observer and is pursuing, at least implicitly, outcomes the observer wouldn’t consent to.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.