Concept · 1 episode(s)

Model Organisms

Definition

Model organisms in AI safety are deliberately constructed models that exhibit a specific phenomenon, such as deception, sandbagging, or reward hacking, in a controlled setting so that researchers can study it. They’re the lab-mouse analogue for alignment research.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.