Glossary · Term

Judge Circuits

Definition

Interpretability work identifying the small set of components inside an LLM judge that actually produce its verdicts.

Mechanistic-interpretability research that locates a Latent Evaluator subspace and Task Formatter heads inside LLMs used as judges. It shows that the abstract judgment is computed along a roughly one-dimensional axis that is shared across output formats.
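
To make the idea of a one-dimensional axis shared across output formats more concrete, here is a minimal sketch on synthetic data. It is not the method used in the research: the difference-of-means estimator, the `fake_activations` helper, the two format offsets, and all the numbers are assumptions chosen purely for illustration. The sketch estimates a judgment direction separately under two hypothetical output formats and checks that the two directions nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden-state width

# Assumed ground truth for the toy data: one shared "judgment" direction.
true_axis = rng.normal(size=dim)
true_axis /= np.linalg.norm(true_axis)


def fake_activations(n, score, format_offset):
    """Synthetic stand-in for judge hidden states.

    Each row is noise + a format-specific offset + `score` along the shared axis.
    """
    noise = rng.normal(scale=0.5, size=(n, dim))
    return noise + format_offset + score * true_axis


def judgment_axis(acts_good, acts_bad):
    """Unit vector pointing from the mean 'bad' state to the mean 'good' state."""
    diff = acts_good.mean(axis=0) - acts_bad.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Two hypothetical output formats, each shifting activations by its own offset.
offset_a = rng.normal(size=dim)
offset_b = rng.normal(size=dim)

axis_a = judgment_axis(
    fake_activations(200, +2.0, offset_a),
    fake_activations(200, -2.0, offset_a),
)
axis_b = judgment_axis(
    fake_activations(200, +2.0, offset_b),
    fake_activations(200, -2.0, offset_b),
)

# Cosine similarity close to 1 means both formats recover the same direction.
cosine = float(axis_a @ axis_b)
print(f"cosine similarity between format axes: {cosine:.3f}")
```

In this toy setup the format offsets cancel out of each difference of means, so only the shared direction survives. With real model activations there is no such guarantee, and whether the estimated axes align is exactly the empirical question.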

Mentioned in 1 episode

  1. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format