Definition
Chain-of-thought faithfulness asks whether the steps a model writes out actually reflect the computation that produced its answer, or whether they’re a post-hoc rationalization. Unfaithful CoT is a problem for interpretability and a worse one for oversight: you can’t catch a model planning something bad by reading its scratchpad if the scratchpad lies.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.