Definition
Model organisms in AI safety are models deliberately built to exhibit a specific phenomenon, such as deception, sandbagging, or reward hacking, in a controlled setting so that researchers can study it. They’re the lab-mouse analogue for alignment research.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.