Glossary · Term

faithfulness

← all terms

Definition

Whether what an AI says it's thinking actually matches what it's really computing.

The property of a model's verbalized chain of thought accurately reflecting the underlying computation; central to the validity of CoT monitoring as a safety mechanism.

Also called: CoT faithfulness

Mentioned in 2 episodes

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
  2. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window

Related concepts