Definition
Interpretability work identifying the small set of components inside an LLM judge that actually produce its verdicts.
Mechanistic-interpretability research that locates a Latent Evaluator subspace and Task Formatter heads inside LLMs used as judges, and shows that abstract judgment is computed along a roughly one-dimensional axis shared across output formats.
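
To make the "roughly one-dimensional axis shared across output formats" idea concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not the method used in the research: the activations are random vectors with a planted axis, the hidden size and format offsets are hypothetical, and a simple difference-of-means probe stands in for whatever procedure actually located the subspace. The sketch fits a direction on one output format and checks whether the same direction separates verdicts in a second format.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64  # hypothetical hidden size, chosen only for the demo

# Hypothetical ground-truth judgment axis planted in the synthetic data.
true_axis = rng.normal(size=D_MODEL)
true_axis /= np.linalg.norm(true_axis)


def synthetic_states(n, format_offset):
    """Fake activations for one output format: planted signal + offset + noise."""
    labels = rng.integers(0, 2, size=n)  # 1 = favorable verdict, 0 = unfavorable
    signal = 3.0 * np.outer(2.0 * labels - 1.0, true_axis)
    noise = rng.normal(size=(n, D_MODEL))
    return signal + format_offset + noise, labels


def difference_of_means_direction(states, labels):
    """Unit vector from the unfavorable-class mean to the favorable-class mean."""
    direction = states[labels == 1].mean(axis=0) - states[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def projection_accuracy(states, labels, direction):
    """Classify by thresholding the 1-D projection at its own mean."""
    scores = states @ direction
    predictions = (scores > scores.mean()).astype(int)
    return float((predictions == labels).mean())


# Two hypothetical output formats, each with its own format-specific offset.
states_a, labels_a = synthetic_states(500, 0.5 * rng.normal(size=D_MODEL))
states_b, labels_b = synthetic_states(500, 0.5 * rng.normal(size=D_MODEL))

# Fit the direction on format A only, then reuse it unchanged on format B.
direction_a = difference_of_means_direction(states_a, labels_a)
cosine = float(direction_a @ true_axis)
transfer_acc = projection_accuracy(states_b, labels_b, direction_a)

print(f"cosine with planted axis: {cosine:.3f}")
print(f"accuracy on format B using format A's direction: {transfer_acc:.3f}")
```

If a single direction fitted on one format still separates verdicts in another, that is the kind of evidence a shared one-dimensional axis would produce. In a real study the activations would come from the judge model's residual stream rather than from a random generator.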