Definition
A benchmark of questions designed to expose common false beliefs and misconceptions models pick up.
An evaluation of language model truthfulness on questions where common human misconceptions or biases would lead to incorrect answers.
A benchmark of questions designed to expose common false beliefs and misconceptions models pick up.
An evaluation of language model truthfulness on questions where common human misconceptions or biases would lead to incorrect answers.