Concept · 6 episode(s)

CoT Faithfulness

← all concepts

Definition

Chain-of-thought faithfulness asks whether the steps a model writes out actually reflect the computation that produced its answer, or whether they’re a post-hoc rationalization. Unfaithful CoT is a problem for interpretability and a worse one for oversight: you can’t catch a model planning something bad by reading its scratchpad if the scratchpad lies.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.