
CCS

Definition

An interpretability method that finds a 'truth' direction inside a model without needing labeled examples.

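Contrast-Consistent Search, an unsupervised probing technique that finds a linear direction in a model's activation space corresponding to truth or another binary concept. It exploits a consistency constraint across contrasted prompts: the probabilities a probe assigns to a statement and to its negation should sum to one, and the probe should be confident about one of the two.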
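
To make the consistency idea concrete, here is a minimal sketch of the kind of objective a CCS probe minimizes, written with NumPy. It assumes the hidden activations for each statement ("positive" phrasing) and its negation ("negative" phrasing) have already been extracted as arrays of shape (n_examples, hidden_dim). The function names `normalize` and `ccs_loss` are hypothetical, chosen for illustration, and no optimizer is shown.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping probe scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def normalize(acts):
    """Standardize each feature across the batch (assumed preprocessing step).

    Applied separately to the positive and negative sets so the probe cannot
    simply read off which phrasing was used.
    """
    return (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-8)


def ccs_loss(theta, bias, acts_pos, acts_neg):
    """Unsupervised loss for a linear probe on contrasted activations.

    theta:    (hidden_dim,) candidate direction
    bias:     scalar offset
    acts_pos: (n, hidden_dim) activations for the statements
    acts_neg: (n, hidden_dim) activations for their negations
    """
    p_pos = sigmoid(acts_pos @ theta + bias)
    p_neg = sigmoid(acts_neg @ theta + bias)
    # Consistency: p(statement) and p(negation) should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# An uninformative probe (all zeros) predicts 0.5 everywhere: it is perfectly
# consistent but not confident, so only the confidence term contributes.
uninformative = ccs_loss(np.zeros(3), 0.0, np.ones((4, 3)), np.ones((4, 3)))
print(uninformative)  # 0.25
```

No labels appear anywhere in the loss, which is what makes the method unsupervised. One consequence is that the recovered direction is only defined up to sign, so deciding which end means "true" requires some outside information.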

Mentioned in 1 episode

  1. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say