Definition
Premise resistance is a model’s willingness to push back on a false or misleading premise in a question, rather than going along with it. Models trained too hard for helpfulness tend to accept premises uncritically; models trained too hard for caution refuse innocuous questions on suspicion.