Concept · 1 episode(s)

Premise Resistance

← all concepts

Definition

Premise resistance is a model’s willingness to push back on a false or misleading premise in a question, rather than going along with it. Models trained too hard for helpfulness tend to accept premises uncritically; models trained too hard for caution refuse innocuous questions on suspicion.

Episodes covering this