Definition
Whether what an AI says it's thinking actually matches what it's really computing.
The property of a model's verbalized chain of thought accurately reflecting the underlying computation; central to the validity of CoT monitoring as a safety mechanism.
Also called: CoT faithfulness