CCS

Definition

Plain language

An interpretability method that finds a 'truth' direction inside a model without needing labeled examples.

As stated in the literature

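Contrast-Consistent Search (CCS), an unsupervised probing technique that finds a linear direction in activation space corresponding to truth or another binary concept. It trains a probe on contrast pairs (a statement paired with its negation) so that the probabilities assigned to the two halves sum to roughly one, while penalizing the uninformative answer of 0.5 for both.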

Why it matters: It hints at a way to read a model's own sense of truth from inside, even when we lack ground-truth labels to train a probe.

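For example, given contrast pairs like 'The sky is blue. True' and 'The sky is blue. False', CCS finds a direction in the model's activations along which the two versions consistently land on opposite sides, without anyone labeling which is which. The sign of the direction is arbitrary, so a separate step is still needed to decide which side means 'true'.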
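The objective behind that example can be sketched in a few lines. This is a minimal illustration and not the reference implementation: it assumes the consistency-plus-confidence loss described in the CCS literature, and the activations, weights, and function names (`probe`, `ccs_loss`) are made up for the demo. A real run would normalize hidden states taken from a model and fit the probe by gradient descent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe(w, b, x):
    """Linear probe: probability that the statement behind activation x is true."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def ccs_loss(w, b, pos_acts, neg_acts):
    """Average CCS-style loss over contrast pairs (no labels involved)."""
    total = 0.0
    for x_pos, x_neg in zip(pos_acts, neg_acts):
        p_pos = probe(w, b, x_pos)
        p_neg = probe(w, b, x_neg)
        # Consistency: p(statement) and p(negation) should sum to one.
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: discourage the degenerate answer p_pos = p_neg = 0.5.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_acts)

# Hypothetical 2-d activations for three contrast pairs. The first coordinate
# plays the role of a hidden "truth" feature; the second is noise.
pos_acts = [[2.0, 0.1], [-2.0, 0.3], [2.0, -0.2]]   # "... True" versions
neg_acts = [[-2.0, 0.1], [2.0, 0.3], [-2.0, -0.2]]  # "... False" versions

uninformative = ccs_loss([0.0, 0.0], 0.0, pos_acts, neg_acts)
aligned = ccs_loss([5.0, 0.0], 0.0, pos_acts, neg_acts)
flipped = ccs_loss([-5.0, 0.0], 0.0, pos_acts, neg_acts)

print(uninformative, aligned, flipped)
```

A probe aligned with the first coordinate scores far lower than the all-zero probe, and the probe with the opposite sign scores equally well. That symmetry is why the loss alone cannot say which side of the direction corresponds to 'true'.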

Heard on the show

“Meanwhile every existing detector they benchmark — token entropy, semantic entropy, CCS, SAPLMA — clusters right around fifty.”

Episode 037 — Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Mentioned in 1 episode

037: Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say