Glossary · Term

null-space projection

Definition

An interpretability check that erases known directions from a model's internal state to see if a behavior still has somewhere to hide.

A causal interpretability test that projects activations onto the subspace orthogonal to a set of known feature directions, then re-runs probing or intervention to check whether any residual signal for the behavior survives independently of those features.
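The projection step can be sketched in a few lines. This is a minimal illustration and not the method used in any particular episode or library: the function names `orthonormal_basis` and `project_out`, the toy dimension, and the random vectors standing in for activations and feature directions are all assumptions made for the example. It orthonormalizes the known directions with Gram-Schmidt, then subtracts each activation's component along them, leaving only the part that lies in their null space.

```python
import math
import random


def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt: orthonormal vectors spanning the given directions."""
    basis = []
    for v in directions:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip directions already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(activation, basis):
    """Remove the component of `activation` that lies in span(basis)."""
    out = list(activation)
    for b in basis:
        coeff = sum(a * bi for a, bi in zip(activation, b))
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Toy data (hypothetical): three "known" feature directions in an
# 8-dimensional activation space, plus one activation vector.
random.seed(0)
dim = 8
directions = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
activation = [random.gauss(0, 1) for _ in range(dim)]

basis = orthonormal_basis(directions)
residual = project_out(activation, basis)

# The residual should carry no component along any known direction.
overlaps = [sum(r * d for r, d in zip(residual, direction))
            for direction in directions]
print(overlaps)
```

In an actual check, `residual` would replace the original activation before the probe or intervention is re-run; if the behavior can still be decoded or steered, the known directions do not fully account for it.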

Mentioned in 1 episode

037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say