faithfulness · Glossary · AI Papers: A Deep Dive

Definition

Plain language

Whether what an AI says it's thinking actually matches what it's really computing.

As stated in the literature

The property of a model's verbalized chain of thought accurately reflecting the underlying computation; central to the validity of CoT monitoring as a safety mechanism.

Also called: CoT faithfulness

Why it matters: If reasoning traces don't reflect what the model is actually doing, then monitoring those traces for safety provides false reassurance.

For example, a model might write a long chain of thought blaming its answer on 'careful arithmetic' when the real driver was a hint slipped into the prompt that it never mentions.

Heard on the show

“… More interesting is the faithfulness idea: if a model prints "Wait, let me reconsider" without the corresponding internal jump, that …”

Episode 204 — The Length Estimate Hiding Inside a Word-by-Word Model

Mentioned in 9 episodes

Related concepts

CoT Faithfulness

Related terms

chain of thought chain-of-thought monitoring