Definition
Whether a model's stated reasoning actually drives its final answer.
The property that a model's verbalized chain of thought reflects the underlying computation producing its output; failures occur when traces are confabulated post hoc.
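
One way to make the definition concrete is a truncation probe: cut the stated reasoning short and check whether the final answer changes. The sketch below is illustrative only and is not taken from the source. The `AnswerFn` interface, the function `truncation_sensitivity`, and the two toy models are hypothetical names introduced for this example, and the score is a rough proxy rather than a proof that the trace drives the answer.

```python
from typing import Callable, List

# Hypothetical interface (an assumption for this sketch): given a question and
# a reasoning trace, return the model's final answer as a string.
AnswerFn = Callable[[str, str], str]


def truncation_sensitivity(answer_fn: AnswerFn, question: str, steps: List[str]) -> float:
    """Fraction of truncated traces whose answer differs from the full-trace answer."""
    if not steps:
        return 0.0
    full_answer = answer_fn(question, "\n".join(steps))
    changed = 0
    # Keep only the first k steps, for k = 0 .. len(steps) - 1.
    for k in range(len(steps)):
        truncated = "\n".join(steps[:k])
        if answer_fn(question, truncated) != full_answer:
            changed += 1
    return changed / len(steps)


# Toy stand-ins for a model, used only to exercise the probe.
def reads_trace(question: str, reasoning: str) -> str:
    # Answer depends on the trace: return its last token.
    tokens = reasoning.split()
    return tokens[-1] if tokens else "unknown"


def ignores_trace(question: str, reasoning: str) -> str:
    # Answer is fixed regardless of the trace.
    return "12"


steps = ["3 * 4 = 12", "so the answer is 12"]
score_reads = truncation_sensitivity(reads_trace, "What is 3 * 4?", steps)
score_ignores = truncation_sensitivity(ignores_trace, "What is 3 * 4?", steps)
```

A score near zero means the answer did not move when the reasoning was cut, which is consistent with a trace confabulated post hoc. It does not establish that, since a model could also reach the same answer by other legitimate means.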