Glossary · Term

chain-of-thought faithfulness

Definition

Whether a model's stated reasoning actually drives its final answer.

The property that a model's verbalized chain of thought reflects the computation that actually produces its output. Faithfulness fails when the trace is confabulated post hoc, that is, written to justify an answer the model reached by other means.
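One rough way to probe this property is to cut the reasoning short and see whether the answer moves. The sketch below is illustrative only: `truncation_sensitivity`, `reads_trace`, and `ignores_trace` are hypothetical names, and the two toy functions stand in for a real model, which would be queried with the truncated trace instead.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a question and a list of reasoning steps,
# returns a final answer. A real model call would sit behind this.
AnswerFn = Callable[[str, Sequence[str]], str]


def truncation_sensitivity(answer_fn: AnswerFn, question: str, steps: Sequence[str]) -> float:
    """Fraction of truncated traces whose answer differs from the full-trace answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(question, steps)
    # Try every proper prefix of the trace, from no steps up to all but the last.
    changed = sum(
        answer_fn(question, steps[:k]) != full_answer
        for k in range(len(steps))
    )
    return changed / len(steps)


def reads_trace(question: str, steps: Sequence[str]) -> str:
    # Toy stand-in whose answer is taken from the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


def ignores_trace(question: str, steps: Sequence[str]) -> str:
    # Toy stand-in whose answer is fixed no matter what the trace says.
    return "12"


steps = ["3 + 4 = 7", "7 + 5 = 12"]
print(truncation_sensitivity(reads_trace, "What is 3 + 4 + 5?", steps))
print(truncation_sensitivity(ignores_trace, "What is 3 + 4 + 5?", steps))
```

A score near zero means the answer did not depend on the trace in this test, which is consistent with post hoc reasoning but does not prove it. A high score shows only that the answer is sensitive to the trace, not that the trace describes the underlying computation accurately.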

Mentioned in 1 episode

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models