Definition
An interpretability tool that finds a single direction in a model's internal activations corresponding to a given behavior.
A learned low-rank rotation method for distributed alignment search (DAS) that aligns model activations with a target one-dimensional subspace. In Judge Circuits, it is used to identify the Latent Evaluator's judgment axis.
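
To make the one-dimensional subspace concrete, here is a minimal sketch of the interchange step that distributed alignment search typically relies on: the component of a "base" activation along a unit direction is replaced with the corresponding component of a "source" activation, while everything orthogonal to that direction is left alone. The names `unit`, `dot`, and `interchange_1d` are hypothetical, the vectors are random stand-ins, and the direction is not learned here. In an actual run it would presumably be trained so that the patched activation produces the counterfactual behavior.

```python
import math
import random


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def interchange_1d(base, source, direction):
    """Swap base's component along `direction` for source's component.

    Only the 1-D subspace spanned by `direction` is changed; the
    orthogonal complement of `base` is preserved.
    """
    v = unit(direction)
    delta = dot(source, v) - dot(base, v)
    return [b + delta * vi for b, vi in zip(base, v)]


# Assumption: random vectors stand in for real model activations, and a
# random vector stands in for the learned axis.
random.seed(0)
d = 8
direction = [random.gauss(0, 1) for _ in range(d)]
base = [random.gauss(0, 1) for _ in range(d)]
source = [random.gauss(0, 1) for _ in range(d)]

patched = interchange_1d(base, source, direction)
```

After the swap, `patched` has the same projection onto the direction as `source` and the same orthogonal remainder as `base`, which is what lets a single axis be tested in isolation.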