Definition
An interpretability method that finds a 'truth' direction inside a model without needing labeled examples.
Contrast-Consistent Search, an unsupervised probing technique that finds linear directions in activation space corresponding to truth or other binary concepts by exploiting consistency constraints across contrasted prompts.