Glossary · Term

hint acknowledgement

← all terms

Definition

A test of whether an AI's reasoning admits when a misleading hint was planted in the prompt.

A chain-of-thought faithfulness probe that inserts a misleading hint into a problem and measures whether the model's reasoning trace explicitly references the hint rather than silently absorbing it into a rationalization.

Mentioned in 1 episode

  1. 081
    When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence