Glossary · Term

chain-of-thought faithfulness

Definition

Plain language

Whether a model's stated reasoning actually drives its final answer.

As stated in the literature

The property that a model's verbalized chain of thought reflects the underlying computation producing its output; failures occur when traces are confabulated post hoc.

Why it matters: If reasoning traces don't reflect the real computation, monitoring them for misbehavior offers only a false sense of security.

For example, a model writes 'I'm choosing B because it's safer,' but an ablation shows it would have picked B regardless, so the stated reason was invented after the fact.
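The ablation check in that example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method from the show or the literature: the `Answerer` interface, the function `answer_change_rate`, and both toy models are hypothetical stand-ins for a real model call. The idea is to swap the stated reasoning for altered or empty traces and count how often the final answer changes.

```python
from typing import Callable, Sequence

# Hypothetical interface: maps (question, reasoning trace) to a final answer.
Answerer = Callable[[str, str], str]


def answer_change_rate(
    answerer: Answerer,
    question: str,
    reasoning: str,
    ablations: Sequence[str],
) -> float:
    """Fraction of ablated reasoning traces that change the final answer."""
    if not ablations:
        raise ValueError("need at least one ablated trace")
    baseline = answerer(question, reasoning)
    changed = sum(answerer(question, alt) != baseline for alt in ablations)
    return changed / len(ablations)


# Toy stand-in: always answers B, whatever the reasoning says.
def ignores_reasoning(question: str, reasoning: str) -> str:
    return "B"


# Toy stand-in: answers B only when the reasoning appeals to safety.
def uses_reasoning(question: str, reasoning: str) -> str:
    return "B" if "safer" in reasoning else "A"


question = "Pick A or B."
reasoning = "I'm choosing B because it's safer."
# Assumed ablations: an empty trace and a trace with a different rationale.
ablations = ["", "I'm choosing based on cost alone."]

unfaithful_rate = answer_change_rate(ignores_reasoning, question, reasoning, ablations)
faithful_rate = answer_change_rate(uses_reasoning, question, reasoning, ablations)
print(unfaithful_rate, faithful_rate)
```

A change rate near zero suggests the stated reasoning had little causal effect on the answer, as in the example above. It is evidence rather than proof: an answer can stay the same because the model reaches it by more than one route.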

Heard on the show

“… technical term tap-to-define, with links to the related work grouped by theme, like the chain-of-thought faithfulness debate this whole thing rests on. …”

Episode 174 — When the AI 'Schemes,' It's Usually Just Lazy or Confused

Mentioned in 3 episodes

174 · When the AI 'Schemes,' It's Usually Just Lazy or Confused
171 · The Safety Decision a Model Makes Before It Thinks a Word
079 · An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models

Related terms

chain-of-thought confabulation