
Judge Circuits

Definition

Plain language

Interpretability work identifying the small set of components inside an LLM judge that actually produce its verdicts.

As stated in the literature

Mechanistic-interpretability research locating a Latent Evaluator subspace and Task Formatter heads inside LLMs used as judges, showing that abstract judgment is computed along a roughly one-dimensional axis shared across output formats.

Why it matters: If a single internal axis determines verdicts across formats, then LLM-as-judge scores may be far less independent and robust than they appear.

For example, the work finds a small subspace of activations inside the judge model such that an activation's projection onto it already predicts the final verdict before the model emits any output token.
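To make the idea of a verdict-predicting axis concrete, here is a minimal sketch of a linear probe. It is not the paper's method: the activations are synthetic, the difference-of-means probe is a stand-in technique, and every name (fit_verdict_axis, predict_verdict, make_synthetic_activations, the two "formats") is hypothetical. It only illustrates how one direction, fit on activations from one output format, could predict verdicts for another format if the underlying axis really is shared.

```python
import numpy as np

def fit_verdict_axis(acts, verdicts):
    """Fit a 1-D probe: the unit difference-of-means direction plus a threshold."""
    acts = np.asarray(acts, dtype=float)
    verdicts = np.asarray(verdicts, dtype=int)
    mu_pos = acts[verdicts == 1].mean(axis=0)
    mu_neg = acts[verdicts == 0].mean(axis=0)
    axis = mu_pos - mu_neg
    axis = axis / np.linalg.norm(axis)
    # Midpoint between the two class means, measured along the axis.
    threshold = 0.5 * (mu_pos + mu_neg) @ axis
    return axis, threshold

def predict_verdict(acts, axis, threshold):
    """Predict 1 (e.g. 'accept') when the projection exceeds the threshold."""
    return (np.asarray(acts, dtype=float) @ axis > threshold).astype(int)

def make_synthetic_activations(n, true_axis, format_offset, rng):
    """Synthetic stand-in for pre-output activations (assumption, not real data).

    The verdict moves the activation along true_axis; the output format only
    adds an offset that is orthogonal to that axis.
    """
    verdicts = rng.integers(0, 2, size=n)
    signal = (2 * verdicts - 1)[:, None] * 3.0 * true_axis
    noise = rng.normal(size=(n, true_axis.shape[0]))
    return signal + noise + format_offset, verdicts

rng = np.random.default_rng(0)
dim = 32
true_axis = rng.normal(size=dim)
true_axis /= np.linalg.norm(true_axis)

def orthogonal_offset(scale):
    """Random format offset with its component along true_axis removed."""
    v = rng.normal(size=dim)
    v -= (v @ true_axis) * true_axis
    return scale * v / np.linalg.norm(v)

# Two hypothetical output formats, e.g. a numeric score and a pairwise choice.
acts_a, y_a = make_synthetic_activations(500, true_axis, orthogonal_offset(2.0), rng)
acts_b, y_b = make_synthetic_activations(500, true_axis, orthogonal_offset(2.0), rng)

# Fit on format A only, then evaluate on both formats.
axis, threshold = fit_verdict_axis(acts_a, y_a)
acc_same = (predict_verdict(acts_a, axis, threshold) == y_a).mean()
acc_cross = (predict_verdict(acts_b, axis, threshold) == y_b).mean()
print(f"same-format accuracy:  {acc_same:.3f}")
print(f"cross-format accuracy: {acc_cross:.3f}")
```

On real models the activations would come from a chosen layer and token position of the judge, and high cross-format accuracy would be evidence for a shared axis rather than something the code guarantees.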

Heard on the show

“The paper we're digging into today is called "Judge Circuits," it went up on arXiv on May fifteenth, twenty-twenty-six, and we're recording on May nineteenth, twenty-twenty-six.”

Episode 055 — Why LLM Judges Flip Their Verdicts When You Change the Question Format

Mentioned in 1 episode

055 · Why LLM Judges Flip Their Verdicts When You Change the Question Format

Related terms

attention head · Latent Evaluator · Task Formatter