Episode 055 · May 19, 2026 · 26 min

Why LLM Judges Flip Their Verdicts When You Change the Question Format

Feldhus, Baeumel, Golimblevskaia et al.

LLM Evaluation

Concepts in this episode

Mechanistic Interpretability · AI Alignment · LLM-as-Judge · Circuit Analysis · Causal Intervention · Activation Steering · Sparse Features / SAE · Linear Representation · Attention Heads · Reward Model · Evaluation & Benchmarks

About this episode

Paper: Judge Circuits
Venue: arXiv:2605.16023
Year: 2026
Read the paper: arxiv.org/abs/2605.16023

Ask a language model to rate text from 1 to 5 and it says four. Ask it yes-or-no on the same text and it says no. A new paper opens the hood and finds the judgment itself is stable — what wobbles is a tiny piece of machinery near the output that translates an abstract verdict into whichever token the prompt demanded. If they're right, a lot of what we call evaluator unreliability is actually a formatting artifact.

What you'll take away

Why LLM judges produce inconsistent scores across prompt formats — and why the inconsistency lives in output routing, not evaluation quality
The 'transplant' experiment: copying activations from a rating prompt into a yes-no prompt flips the model's answer over 99% of the time on some models
Evidence that judgment is encoded along a single direction in activation space — a 'compass needle' that transfers across grammar, entailment, similarity, and preference tasks
A practical alternative to prompted scoring: read the judgment axis directly from mid-layer activations and bypass the noisy formatter
Where the clean modularity story breaks down — Gemma-3 at 12B entangles judgment with world knowledge in a way no other tested model does
Honest limits of the result: small per-cell sample sizes, probe design that partly presupposes a 1D encoding, and a universality claim the authors deliberately don't make

Chapters

00:00 The format-inconsistency puzzle
03:12 Latent Evaluator and Task Formatters
04:07 Format Transfer Injection
09:36 Judgment as a single direction
12:48 How they found the circuit
16:00 Where the modularity story breaks
19:12 Pressure points and limitations
22:24 What this means for LLM-as-a-judge

References in this episode

Towards Automated Circuit Discovery for Mechanistic Interpretability — Introduces the circuit-discovery methodology that PEAP builds on, giving listene
Attribution Patching: Activation Patching At Industrial Scale — The gradient-based attribution method that PEAP extends to edges; useful for und
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — The canonical reference for the LLM-as-a-judge paradigm whose format-inconsisten
Do Llamas Work in English? On the Latent Language of Multilingual Transformers — A parallel 'compute abstractly, then translate' finding for multilingual models,

Full transcript

0:00Bella: Here's a fact that should probably bother more people than it does. You take a large language model — pick any of the big ones — and you use it to grade some text. You ask it, "rate this from one to five." It says four. Same model, same text, same temperature, same everything. You change one thing: instead of "rate from one to five," you ask "is this acceptable, yes or no?" And it says no.

0:26Eric: Which, if you sit with it for a second, is genuinely weird. A four out of five and a "no" are not the same verdict. And this isn't an obscure quirk — this is the dominant evaluation paradigm right now. Benchmark leaderboards, reward models for reinforcement learning, content moderation. We're using language models to grade language models, and the grades shift when we change the format of the question.

0:53Bella: The paper we're digging into today is called "Judge Circuits." It went up on arXiv on May fifteenth, twenty-twenty-six, and we're recording on May nineteenth, twenty-twenty-six. Quick ground rules before we go further: what you're hearing is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Bella, that's Eric, and we're both AI voices from ElevenLabs. The producer isn't affiliated with either company. And the reason this paper is worth the time is that it goes after that format-inconsistency puzzle in a way nobody had really done before: it opens the hood.

1:31Eric: Right. Because the previous work on this all stopped at behavior. You'd see papers documenting that relative preferences stay stable while absolute ratings wobble. You'd see papers showing models exploit shortcuts like response length. But every diagnosis was input-output. Nobody had gone inside and asked: where in the actual computation does this inconsistency originate? Is the model literally evaluating differently when you change the format, or is it computing the same judgment and then stumbling on the way out?

2:07Bella: And the answer the paper lands on — which I'll just say up front because it organizes everything else — is the second one. The model is computing the same judgment. The same shared core sub-network does the evaluation work. What changes between formats is a tiny piece of machinery at the very end that translates that judgment into the specific token the prompt asked for. The judgment is stable. The translator at the podium is what's fragile.

2:35Eric: So everything we've been measuring when we compare "the model under five-point scales" to "the model under yes-no labels" — a big chunk of that variance isn't about evaluation quality. It's about the geometry of a handful of attention heads near the output.

2:53Bella: That's exactly what the paper claims, and they back it up with three different causal probes that all converge on the same picture. Let me set up the framing they use, because the names are going to recur. They call the shared core the Latent Evaluator — the part of the model that actually does the judging. And they call the format-specific output machinery the Task Formatters — the parts that route the abstract judgment into "five" or "yes" or "entailment" or whatever the prompt asked for.

3:25Eric: Picture a long highway running through the model from input to output. For most of the journey, every prompt — whether it's asking for a one-to-five rating or a yes-no answer — travels on the same road. That shared road is the Latent Evaluator. Then right at the end, the highway splits into format-specific off-ramps: one that exits toward digits, one toward yes-or-no, one toward class labels. Those off-ramps are the Task Formatters. The judgment gets decided on the highway. The off-ramps just dispatch it.

3:58Bella: And the cleanest way to see this is one specific experiment, which is where I want to spend most of our time. They call it Format Transfer Injection — let's just call it the transplant. You run the model on a rating prompt. "On a scale of one to five, how grammatical is this sentence?" The model processes it, you snapshot the activations in the middle layers — the part of the highway where the Latent Evaluator lives.

4:27Eric: And then you take those snapshotted activations — the model's internal mid-computation state from the rating prompt — and you paste them into a totally different run. Same input sentence, but now the prompt is "is this grammatical? Yes or no." A prompt that would naturally produce the answer "No."

4:47Bella: And the model says "Yes."

4:49Eric: On Qwen-two-point-five at seven billion parameters, the model flips its argmax — flips its top answer — in over ninety-nine percent of pairs. Across four different tasks: grammar judgment, semantic similarity, entailment, and a preference benchmark. Before the injection, the probability of the flipped answer was at most seventeen percent. After the injection, at least eighty-five percent.

5:16Bella: And what's almost more striking — the model doesn't blurt out "five." The digit "five" doesn't leak into the output. The injected activations carry an abstract positive judgment, and the classification formatter correctly translates that into the token "Yes" — using its own vocabulary, its own output structure. The judgment is portable. The formatting is what's local.

5:41Eric: That's the visceral version of the claim. The judgment lives somewhere in the middle of the model, in a form that doesn't care what answer format you eventually want. The format-specific machinery is downstream, and it's basically a translator. Hand it the abstract "this is good," and it'll say "five" or "yes" or "positive" depending on which podium it's standing at.

6:06Bella: I love that experiment because it doesn't require you to trust any fancy machinery. It's a copy-paste. You take state from one run, you put it in another run, you watch what the model says. There's no probe, no interpretability tool that could be hallucinating structure. The model itself is doing the work, and the output flips.
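The copy-paste at the heart of the transplant can be sketched on a toy model. This is an illustration only, not the paper's setup: the three-number "mid-layer state", the `YES_NO_HEAD` weights, and the cached values are all hypothetical. It shows the mechanics of overwriting one run's mid-layer state with another's and letting the target run's own output head decode it.

```python
import math

# Hypothetical toy "model": a mid-layer state (3 numbers) feeds a
# format-specific output head. None of these weights come from the paper.
YES_NO_HEAD = {"Yes": [1.0, 0.5, 0.0], "No": [-1.0, -0.5, 0.0]}

def softmax(logits):
    """Convert a dict of token logits into token probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def decode(mid_state, head):
    """Apply an output head to a mid-layer state; return (top token, probabilities)."""
    logits = {tok: sum(w * x for w, x in zip(ws, mid_state)) for tok, ws in head.items()}
    probs = softmax(logits)
    return max(probs, key=probs.get), probs

def run(mid_state, head, patch=None):
    """Forward pass from the mid layer onward; `patch` overwrites the mid state."""
    state = patch if patch is not None else mid_state
    return decode(state, head)

source_mid = [1.2, 0.8, 0.3]    # cached from the rating-prompt run (assumed values)
target_mid = [-0.9, -0.4, 0.3]  # the yes/no run's own mid state (assumed values)

before, p_before = run(target_mid, YES_NO_HEAD)
after, p_after = run(target_mid, YES_NO_HEAD, patch=source_mid)
print(before, round(p_before["Yes"], 3))
print(after, round(p_after["Yes"], 3))
```

Because the yes/no head only has "Yes" and "No" in its vocabulary, the patched run cannot emit a digit. In a real transformer that property is an empirical finding rather than something guaranteed by construction, which is why the result in the episode is notable.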

6:28Eric: Now — and Bella, this is where I want to push, because the paper isn't quite as clean as this makes it sound — the transplant doesn't work equally well everywhere. On Qwen at seven billion, it's over ninety-nine percent. On Gemma-three at twenty-seven billion on some of the harder tasks, the flip rate collapses into single digits. The authors call this the formatter becoming "geometrically insulated" at scale, and they show that a more targeted intervention — pushing along a specific direction rather than transplanting the whole activation — still works. But the cleanest version of the claim, the headline "copy-paste flips the output," holds best in a specific regime.

7:12Bella: That's a fair pull. And the paper is honest about it. They don't pretend the universal version works everywhere. They show you the cells where it does, the cells where it doesn't, and they offer the targeted version as the more general tool. But you're right that the soundbite version of FTI is regime-specific. I'd rather we say that than smooth it over.

7:35Eric: Which is a good moment to talk about that "specific direction" part, because the other big finding sits there. So they have this Latent Evaluator — the shared core in the middle layers. And when they look at how it actually encodes judgment, they find something surprising. The judgment isn't spread across a complicated multidimensional pattern. It's basically a single number, encoded along one axis in activation space.

8:03Bella: Which means: the model's internal state at any layer is a vector with thousands of components — thousands of numbers. The natural assumption would be that judgment is some elaborate combination of those, hard to disentangle. What the paper finds is that there's one axis through that space — one specific direction — such that if you project the activation onto that axis, you get something that behaves like a scalar judgment score. Negative at one end, positive at the other.

8:34Eric: It's a compass needle. The model's high-dimensional state at that layer is a point in a vast space, but the relevant question — "does the model think this is good or bad?" — comes down to which way one needle is pointing. And here's the part that I think is genuinely striking: that same axis works across very different tasks. A compass needle direction trained on grammar judgment transfers cleanly to entailment, to similarity, to preference. The model uses the same internal "positive-to-negative" axis whether it's judging whether a sentence is grammatical or whether one response is better than another.

9:15Bella: And you can steer along it. You can take a model running on some input, push its activations a little further along the positive direction, and the output rating smoothly shifts upward. A bit more positive, the output goes from three stars to four. More still, four to five. Push the other direction, and the output slides toward one. It's monotonic, it's controllable, it's a single axis doing the work.

9:41Eric: And there's a sanity check on this. They compare it to a random rotation — pushing along some arbitrary direction in the same activation space. The random version moves the output by less than one percent of what the trained direction moves it. So it's not that any perturbation slides the rating. It's specifically that axis.
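Steering and the random-direction control can be sketched in a few lines. Everything here is assumed for illustration: the hidden size `DIM`, the randomly drawn `judgment_axis` (the paper learns its direction; this one is not learned), and `toy_rating`, a stand-in formatter that reads only the projection onto the axis.

```python
import math
import random

DIM = 256  # hypothetical hidden size

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def steer(activation, direction, alpha):
    """Add alpha times the unit direction to the activation."""
    d = unit(direction)
    return [x + alpha * di for x, di in zip(activation, d)]

def toy_rating(activation, judgment_axis):
    """Stand-in rating formatter: maps the projection onto the axis to a 1-5 score."""
    projection = dot(activation, unit(judgment_axis))
    return 1.0 + 4.0 / (1.0 + math.exp(-projection))

rng = random.Random(0)
judgment_axis = [rng.gauss(0, 1) for _ in range(DIM)]  # assumed, not learned
random_axis = [rng.gauss(0, 1) for _ in range(DIM)]    # control direction
activation = [rng.gauss(0, 1) for _ in range(DIM)]

# Sweep the steering strength along the judgment axis.
sweep = [toy_rating(steer(activation, judgment_axis, a), judgment_axis)
         for a in (-2, -1, 0, 1, 2)]

# Compare an equal-sized push along the axis with one along a random direction.
base = toy_rating(activation, judgment_axis)
axis_shift = toy_rating(steer(activation, judgment_axis, 2.0), judgment_axis) - base
random_shift = toy_rating(steer(activation, random_axis, 2.0), judgment_axis) - base
print([round(r, 2) for r in sweep])
print(round(axis_shift, 3), round(random_shift, 3))
```

In this toy the random push barely moves the rating by construction, since two random directions in a high-dimensional space are nearly orthogonal and the formatter ignores everything off the axis. The paper's claim is the empirical counterpart: a real model behaves as if its formatter were reading one such axis.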

10:04Bella: Now I want to be careful here, because there's a skeptical reading worth voicing. The probe they use to find this axis — it's a method called Boundless DAS, and the way it works is it trains a low-rank rotation to align activations with a single direction. So the finding that judgment is encoded in one dimension is partly a consequence of looking for a one-dimensional encoding. The paper does test higher dimensions in supplementary work, but if you wanted to push back, you could say: of course they found a 1D direction, they trained for one.

10:40Eric: That's a real objection. The defense, I think, is the transferability — the fact that the same direction works across grammar, entailment, similarity. If it were an artifact of fitting a single number to a single task, you wouldn't expect it to generalize that cleanly. But I take the point that "judgment lives in one dimension" is partly a probe-design claim, not purely a model-architecture claim.

11:06Bella: Let me back up a level, because I want to make sure we've conveyed how they actually identified these sub-circuits in the first place. Eric, do you want to walk through what they did to even find the Latent Evaluator? Because the FTI and the steering are downstream of that.

11:23Eric: Sure. The method they use is called Position-aware Edge Attribution Patching — PEAP, if you like acronyms — and the intuition is pretty clean. Imagine you have the model's full computation graph: every connection between every component across every layer. That's something like one and a half million candidate connections on Gemma-three at twelve billion. You want to know which of those connections actually matter for the judgment behavior.

11:52Bella: The brute-force approach would be: intervene on each connection one at a time, see what changes. That's millions of interventions. Computationally impossible.

12:03Eric: Right. So instead, PEAP uses a shortcut. You run the model twice: once on a "clean" prompt with the correct answer, once on a "corrupted" prompt with the wrong answer, matched in length. For each connection, you measure two things: how much did the sender's activation change between the two runs, and how sensitive is the output to the receiver's input? Multiply those, and you get a first-order estimate of how much that connection contributed to the difference in output. Two forward passes and one backward pass, and you have an importance score for every connection in the model.
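The "multiply those" step can be sketched on a toy receiver. This is a generic attribution-patching illustration, not PEAP itself: the head names, the weights `w`, the tanh metric, and the choice to evaluate the gradient on the corrupted run are all assumptions made for the example. It compares the first-order score for each connection against the exact effect of patching that connection alone.

```python
import math

def metric(receiver_input, w):
    """Toy downstream metric: a smooth function of the receiver's input."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, receiver_input)))

def metric_grad(receiver_input, w):
    """Analytic gradient of the toy metric with respect to the receiver's input."""
    t = metric(receiver_input, w)
    return [(1.0 - t * t) * wi for wi in w]

def total(sender_outputs):
    """The receiver reads the sum of its senders' outputs (residual-stream style)."""
    return [sum(col) for col in zip(*sender_outputs.values())]

w = [0.6, -0.4]  # hypothetical receiver weights
clean = {"head_a": [0.9, 0.1], "head_b": [0.2, 0.2], "head_c": [0.0, -0.5]}
corrupt = {"head_a": [0.1, 0.1], "head_b": [0.2, 0.3], "head_c": [0.0, 0.4]}

# One gradient, computed once on the corrupted run, is reused for every edge.
grad = metric_grad(total(corrupt), w)

scores, exact = {}, {}
for name in clean:
    # First-order estimate: (change in sender output) dot (sensitivity).
    delta = [c - k for c, k in zip(clean[name], corrupt[name])]
    scores[name] = sum(d * g for d, g in zip(delta, grad))
    # Ground truth for comparison: actually patch this one sender and re-run.
    patched = dict(corrupt, **{name: clean[name]})
    exact[name] = metric(total(patched), w) - metric(total(corrupt), w)

for name in sorted(scores, key=lambda n: -abs(scores[n])):
    print(name, round(scores[name], 3), round(exact[name], 3))
```

The estimate needs one gradient for all connections, while the exact column needs a separate re-run per connection. That gap is what makes scoring on the order of a million candidate edges tractable.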

12:41Bella: And then you take the top few hundred connections — the most causally important — and you ask: does this sparse skeleton account for the model's actual judgment behavior? You start from a fully corrupted run, you progressively restore those top connections, and you watch how much of the judgment behavior comes back. On twenty-one of twenty-five model-task combinations they tested, the top two hundred connections recover at least eighty-seven percent of the behavior. A random-connection baseline recovers essentially nothing.
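"Recovering eighty-seven percent of the behavior" implies some normalization between the corrupted and clean runs. A minimal sketch of one common choice follows, assuming the recovered fraction is the share of the clean-versus-corrupted metric gap that has been closed; the paper's exact metric may differ. The edge names and effect sizes are hypothetical, and the toy treats edge effects as additive, which real networks only approximate.

```python
def recovery_curve(edge_effects, ranked_edges, clean_metric, corrupt_metric):
    """Fraction of the clean-vs-corrupt gap recovered as edges are restored in rank order."""
    gap = clean_metric - corrupt_metric
    restored = corrupt_metric
    curve = []
    for edge in ranked_edges:
        restored += edge_effects[edge]  # additive toy assumption
        curve.append((restored - corrupt_metric) / gap)
    return curve

def edges_needed(curve, threshold):
    """Smallest number of restored edges whose recovery meets the threshold."""
    for k, frac in enumerate(curve, start=1):
        if frac >= threshold:
            return k
    return None

# Hypothetical per-edge effects on the metric, not values from the paper.
edge_effects = {"e1": 0.50, "e2": 0.30, "e3": 0.10, "e4": 0.05, "e5": 0.05}
ranked = sorted(edge_effects, key=edge_effects.get, reverse=True)
curve = recovery_curve(edge_effects, ranked, clean_metric=1.0, corrupt_metric=0.0)
k = edges_needed(curve, threshold=0.87)
print([round(c, 2) for c in curve], k)
```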

13:16Eric: And the genuinely wild number — on Gemma-three at twenty-seven billion on a preference task, just five connections suffice. Five edges out of millions. The sparse circuit is really sparse.

13:29Bella: So once you have the circuit, the trick to splitting it into the Latent Evaluator and the Task Formatters is this contrastive move. You trace the circuit twice on the same data — once with the rating prompt, once with the classification prompt. The structural intersection — the connections that show up in both — that's the shared core, the Latent Evaluator. The connections that only appear in the rating circuit are the rating-specific Task Formatter. Same for classification.

14:01Eric: And what they find on Gemma-three-twelve-billion's grammar circuits — out of seventeen attention heads they analyzed, three end up in the shared core. The other fourteen split cleanly into nine that are rating-only and five that are classification-only. The shared core is small, it's specific, and the formatters are clearly distinguishable.
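The contrastive split reduces to set operations once each circuit is represented as a set of components. In the sketch below the head labels are hypothetical placeholders; only the counts (seventeen heads, three shared, nine rating-only, five classification-only) come from the episode.

```python
def split_circuits(rating_circuit, classification_circuit):
    """Return (shared core, rating-only formatter, classification-only formatter)."""
    shared = rating_circuit & classification_circuit
    rating_only = rating_circuit - classification_circuit
    classification_only = classification_circuit - rating_circuit
    return shared, rating_only, classification_only

# Placeholder labels standing in for attention heads found by circuit tracing.
core = {f"S{i}" for i in range(3)}
rating_circuit = core | {f"R{i}" for i in range(9)}
classification_circuit = core | {f"C{i}" for i in range(5)}

latent_evaluator, rating_formatter, classification_formatter = split_circuits(
    rating_circuit, classification_circuit
)
print(len(latent_evaluator), len(rating_formatter), len(classification_formatter))
print(len(rating_circuit | classification_circuit))
```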

14:23Bella: There's a corroborating piece I want to mention, because it's one of those moments where you hear it and trust the result more. They run a completely independent analysis using sparse autoencoders — a totally different interpretability method — and that method picks out the same three attention heads as the shared core. Two different lenses, same answer.

14:47Eric: Which is the kind of thing that, in this field, you really want to see. Mechanistic interpretability has historically been criticized for over-reading correlational evidence — for finding patterns that don't survive when you intervene, or that depend on the specific tool. Convergence across methods is the strongest signal you can get short of formal proof, and they have it.

15:12Bella: Okay. So we've got the structure. Shared core in the middle layers — the Latent Evaluator. Format-specific branches at the terminal attention heads — the Task Formatters. The shared core encodes judgment along a single axis, a compass needle. And the inter-format inconsistency that motivated the whole investigation lives in the branches, not the core.

15:36Eric: Now, Bella, here's where I want to slow down, because there's an exception that I think the paper handles really well, but which the listener should hear. The clean modularity story — Latent Evaluator separate from world knowledge, separate from formatters — holds beautifully on four of the five models they test. On Qwen at seven billion and fourteen billion, on Llama-three at eight billion, on Gemma-three at twenty-seven billion: you can zero out every component in the Latent Evaluator, and judgment collapses, but the model's general knowledge — its performance on factual question answering, on graduate-level multiple choice — barely moves. At most two percentage points.

16:20Bella: Which means the Latent Evaluator really is a specialized sub-system on those models. It's doing judgment-specific work, not generic computation. You can take a scalpel to it and the rest of the model's competence survives.

16:35Eric: And then there's Gemma-three at twelve billion. Same architecture family. Same ablation. And MMLU clinical knowledge drops from eighty-one percent to nineteen percent. Abstract algebra from forty-five to twenty-one. Physics from forty-eight to twenty-three. Judgment is entangled with world knowledge in that specific model in a way it isn't in any of the others. Same architecture family at twenty-seven billion: entanglement disappears. Modularity returns.

17:05Bella: And the honest framing the paper offers is: modularity is not a universal property of large language models acting as judges. It's a contingent property of specific architectures at specific scales. Scale alone doesn't predict it — Qwen is modular at seven billion, Gemma needs twenty-seven billion. Architecture family matters. And we don't fully know why.

17:29Eric: Which is a much more interesting and defensible claim than "all LLM judges are modular." It opens questions. Why does the same architecture family at one scale entangle judgment with knowledge, and at a larger scale separate them? What property of training produces clean modularity? Are there models out there — maybe most models — where the LE/TF picture only partially applies?

17:54Bella: And the paper doesn't pretend to answer those. They flag the exception, they show the numbers honestly, and they let the framing be "modularity is architecture-dependent" rather than "modularity is universal." Which is the right call, but it's also the kind of thing where, if the listener walks away thinking "this paper proves LLM judges have a clean evaluator module," they'd be overstating the actual claim.

18:22Eric: Right. There are other pressure points worth voicing too. The strongest version of the Format Transfer Injection result, the over-ninety-nine-percent flip rate, rests on sample sizes that, after filtering, are sometimes as low as eight or twelve pairs per cell. The computational cost of the method forces these caps; you can't evaluate one-point-four-six million candidate connections on huge sample sets. But it does mean the most quotable numbers carry wider uncertainty than the framing suggests.
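To put a number on that uncertainty, here is a sketch using the Wilson score interval for a binomial proportion. It assumes a hypothetical cell in which every pair flips; the episode does not give per-cell counts, so the inputs are illustrative. The 95% level (`z=1.96`) is also a choice made for the example.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical cells where all pairs flip, at the sample sizes mentioned above
# plus a larger one for contrast.
for n in (8, 12, 200):
    low, high = wilson_interval(n, n)
    print(n, round(low, 3), round(high, 3))
```

With eight out of eight flips the lower bound sits near 68 percent, and with twelve out of twelve near 76 percent, so a perfect observed flip rate in a small cell is compatible with a true rate well below ninety-nine percent.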

18:55Bella: And the cross-method validation — Eric, you mentioned the SAE corroboration, which is strong. But the paper also relies on an internally developed variant of PEAP for cross-checking, rather than benchmarking against a separate established attribution method. The authors acknowledge this directly in their limitations. It would be a stronger paper with a head-to-head comparison against an independent technique.

19:23Eric: These don't undermine the core claim, in my reading. The LE/TF decomposition is supported by three causally distinct probes — circuit tracing, subspace steering, transplant injection — converging on the same picture, plus the independent SAE analysis. That's a real result. But the size of the universality claim — how broadly does this picture generalize — that's where the caveats matter, and the paper is appropriately careful about it.

19:51Bella: Let me bring this back to what it means practically, because there's a payoff that I think is the most actionable piece of the paper. If you're using an LLM as a judge — for a benchmark, a reward model, a content filter — you've been implicitly assuming that when the same model gives different scores under different prompt formats, that's noise in its evaluation ability. The model is unreliable, or it's biased toward certain formats, or it doesn't really know what it thinks.

20:22Eric: This paper says: no, the evaluation is stable. The thing wobbling is the translator at the podium, not the diplomat's actual judgment. The number you're reading is noise in the formatting, not noise in the evaluation.

20:36Bella: Which means benchmark comparisons that vary output format — say, you're comparing two systems where one is graded on a five-point scale and another on yes-no labels — those comparisons are partially measuring the geometry of the formatting heads, not the quality of evaluation. The behavioral inconsistency literature has been pointing at the wrong target.

20:59Eric: And the practical door this opens: if the judgment lives on a one-dimensional axis in the middle layers, you can read it directly. You don't have to ask the model "rate this from one to five" and trust the token it produces. You can run a forward pass, grab the activation at the Latent Evaluator layer, project it onto the judgment direction, and use that scalar as your verdict.

21:24Bella: Imagine a thermostat with a glitchy display. The internal sensor reads the temperature accurately, but the display panel rounds inconsistently — sometimes sixty-eight, sometimes seventy for the same room. You wouldn't assess the room temperature by reading the display. You'd tap into the sensor. That's the structure of what the authors are suggesting: stop reading the output token. Read the internal judgment direction.

21:51Eric: And they show this works. A zero-shot readout of the judgment axis matches or beats supervised probes on small-N preference data, and beats the model's own prompted output on nearly every benchmark cell they tested. With one caveat — when the prompted output is scale-aligned to the human label, like one-to-five stars for Yelp reviews, the prompted version with probability-weighted expected value is still stronger. The activation readout wins when the formatting is the bottleneck. It doesn't win when the format and the label are already well-matched.
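The "probability-weighted expected value" baseline can be sketched as follows, assuming the judge exposes a probability for each rating token. The probabilities below are hypothetical, and renormalizing over the rating tokens is a choice made here because a real model also spreads some mass onto non-rating tokens.

```python
def expected_rating(token_probs):
    """Probability-weighted mean over rating tokens, renormalized to sum to one."""
    total = sum(token_probs.values())
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

def argmax_rating(token_probs):
    """The single most likely rating token, as plain greedy decoding would pick."""
    return int(max(token_probs, key=token_probs.get))

# Hypothetical next-token probabilities for the digits "1" through "5".
probs = {"1": 0.05, "2": 0.10, "3": 0.25, "4": 0.40, "5": 0.20}
print(argmax_rating(probs), round(expected_rating(probs), 2))
```

The expected value keeps information that the top token discards: two texts that both decode to "4" can still be separated by how much probability sits on "3" versus "5".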

22:28Bella: Which is a sensible boundary. The activation-readout advantage shows up exactly where the format-inconsistency problem shows up. If the formatting is fine, you don't need to bypass it.

22:40Eric: There's a broader intellectual point here too. This paper extends a pattern that's been recurring across mechanistic interpretability — that language models build clean, abstract intermediate representations, and then translate those representations into specific output forms via separate machinery. The same dissociation has been claimed for multilingual processing — a shared semantic core with language-specific decoding. For arithmetic — abstract numerical reasoning, separate from how the answer is verbalized. For formal versus functional linguistic competence.

23:17Bella: The LE/TF split is another instance of the same shape. Compute something abstract, then translate. And the more these dissociations show up, the more it looks like "what the model computes" and "what the model outputs" might be routinely separable concepts in interpretability work — separable in a way that has real consequences for how we measure and intervene on these systems.

23:42Eric: If you squint, it's a very old idea — that what is meant and how it is said are separable. The paper's version is that we can now point to specific components doing each job. The diplomat and the translator are different parts of the model.

23:59Bella: And the takeaway for me — the thing I'd want a listener building evaluator pipelines or thinking about LLM-as-a-judge to walk away with — is that the format-inconsistency problem isn't a problem of evaluation quality. It's a problem of output routing. The judgment is stable. The route from judgment to token is what's noisy. And if you can read upstream of the route, you can get a cleaner signal than the model's own verbalized answer.

24:27Eric: That's the practical version. The intellectual version is that the LE/TF picture is one more data point in a growing case that language models compute abstract representations and then translate them — and that interpretability work, going forward, may want to treat "what's computed" and "what's emitted" as routinely separable, with the formatter as a worthy object of study in its own right.

24:53Bella: A genuinely good paper. Mechanistic where mechanistic claims usually wave their hands. Honest about the exception. And a practical recommendation falls out of it that anyone running evaluations can act on.

25:07Eric: That's a good place to land. Show notes have the paper and some related reading on circuit tracing and LLM-as-a-judge work — worth a pull if this episode caught you.

25:18Bella: And if you want the full transcript with definitions for every term baked in, plus links over to the other episodes that touch this same web of ideas — circuit-level interpretability, activation steering, evaluation reliability — that's all on paperdive.ai.

25:35Eric: Thanks for listening to AI Papers: A Deep Dive.
