Glossary · Term

TruthfulQA

Definition

Plain language

A benchmark of questions designed to expose common false beliefs and misconceptions models pick up.

As stated in the literature

An evaluation of language model truthfulness on questions where common human misconceptions or biases would lead to incorrect answers.

Why it matters: It targets the gap between sounding confident and being correct, which is where models tend to mislead users most reliably.

For example, TruthfulQA includes questions like 'what happens if you go outside in the rain with wet hair?' that test whether a model parrots a common myth.

Heard on the show

“The headline result is on TruthfulQA.”

Episode 025 — The Missing Gradient Term That Predicts Sycophancy in RLHF

Mentioned in 1 episode

025
The Missing Gradient Term That Predicts Sycophancy in RLHF