Definition
An interpretability check that erases known directions from a model's internal state to see if a behavior still has somewhere to hide.
A causal interpretability test that projects activations into the subspace orthogonal to a set of known feature directions and re-runs probing or intervention to test whether the residual signal is independent of those features.