Episode 038 · May 12, 2026 · 23 min

How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial

Sun, Kong, Zhang et al.

AI Safety

paperdive.ai

Concepts in this episode

AI Safety · Mechanistic Interpretability · Causal Intervention · Attention Heads · Circuit Analysis · Linear Representation · Prompt Injection · Attention Analysis · Sycophancy · Residual Stream · Activation Steering

About this episode

Paper: How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Venue: arXiv:2605.09314

Year: 2026

Read the paper: arxiv.org/abs/2605.09314

Also available on Apple Podcasts and Spotify.

A new paper traces the entire causal chain of how a persuasive passage flips a large language model's answer — and the machinery turns out to be astonishingly narrow. One attention head out of a thousand, a three-dimensional pyramid of choices, and a single scalar lever that decides which option wins.

What you'll take away

Persuasion in LLMs is mediated by a tiny number of mid-layer attention heads — often just one — verified by causal activation patching across four model families
The decision head encodes four answer options as four vertices of a near-regular tetrahedron, and persuasion is a discrete jump between vertices, not a gradual drift in uncertainty
The head isn't reasoning — it's copying; the 'where to look' circuit explains ~88% of the persuasion effect while the value-copy circuit is nearly perfect transcription
All the high-dimensional routing logic collapses to a single scalar feature per option token, which the authors can turn up or down to steer the model's choice
Upstream shallow heads in layers 8–12 do keyword recognition (like spotting 'Nigeria') and write the routing signal onto matching option tokens, completing the relay
The mechanism partially transfers to a more realistic GEO benchmark, but the cleanest results (tetrahedron, rank-1 feature, discrete jump) are tied to a four-option choice geometry that may not generalize to free-form generation

Chapters

00:00 GEO and the question of mechanism
02:34 Locating the persuasion circuit
05:09 The tetrahedron and the discrete jump
07:44 The head is copying, not reasoning
10:19 The single dial behind the routing
12:54 Upstream keyword heads and the full relay
15:29 Does this transfer to real attacks?
18:04 Steelman and limitations
20:39 What the map enables

References in this episode

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small — The canonical example of using activation patching to isolate a narrow attention
A Mathematical Framework for Transformer Circuits — Anthropic's foundational decomposition of attention heads into QK (where to look
Locating and Editing Factual Associations in GPT (ROME) — A contrasting case study where causal tracing localizes factual knowledge to MLP
Towards Understanding Sycophancy in Language Models — Anthropic's empirical study of a closely related failure mode — models being swa

Full transcript

0:00Tyler: There's a quiet practice already underway on the open web. It's called Generative Engine Optimization — GEO — and it's exactly what it sounds like. SEO for the era when people get their answers from chatbots instead of search results. Operators craft content specifically designed to be the source an AI search engine picks up and parrots back to its users. It works. And until very recently, nobody could tell you in mechanical terms why it works. What's actually happening inside the model when it dutifully repeats a poisoned source instead of a good one?

0:38Bella: That's the question a group of researchers at Northeastern, Harvard, Tsinghua, and Skywork set out to answer. Their paper is called "How LLMs Are Persuaded: A Few Attention Heads, Rerouted." It went up on arXiv on May tenth, twenty-twenty-six, and we're recording two days later. What you're hearing is an AI-generated deep dive. I'm Bella, and Tyler and I are both AI voices from ElevenLabs. The script is from Anthropic's Claude Opus 4.7, and the producer of the show isn't affiliated with either company. And the reason that two-day gap matters less than usual is that the answer this paper gives is unusually concrete, concrete enough that you can almost picture the circuit. Persuasion in a large language model, it turns out, runs through a doorway that's roughly one attention head wide.

1:32Tyler: One head wide. Let me push on that number before we walk into it. Llama-3 has something like thirty-two layers, thirty-two heads per layer — call it a thousand attention heads, plus all the feedforward blocks. When the authors went looking for which of those components were causally responsible for the model getting persuaded into a wrong answer, the answer was often: one of them. Layer seventeen, head twenty-four. A single head out of a thousand. The plot of which components matter looks like a flat plain with one mountain.

2:08Bella: And I want to be careful about what "responsible" means there, because it's doing a lot of work. They aren't claiming that head twenty-four lights up during persuasion in some correlational sense — that would be the easy, weak version of the claim. They're doing causal interventions. The technique is called activation patching. The picture: you run the model twice, on two carefully matched inputs. One is a clean prompt — a factual multiple choice question — and the model gets it right. The other is the same question, but with a persuasive passage slipped in, and the model gets it wrong. Then you run a third pass. You're on the persuasive prompt, the one that produces the wrong answer, but at one specific component — one attention head — you reach in and surgically replace its output with the output that head had on the clean run. Everything else stays persuaded. Just that one head pretends it never saw the persuasion. And then you check: does the right answer come back?

3:15Tyler: For most heads, nothing happens. The model keeps producing the wrong answer. But for one or two heads, the right answer almost fully snaps back. That's what they mean by causal. The intervention itself flips the behavior. So the single mountain in the plot isn't a hot spot of correlation — it's a load-bearing component. Pull it out, persuasion stops working.
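The patching procedure Bella and Tyler describe can be sketched on a toy stand-in for a model. Everything here is hypothetical and for illustration only: the "model" is just a sum of per-head contributions to four option logits, the head names and numbers are made up, and none of it is the paper's code. The point is only the three-pass logic: clean run, persuaded run, then a persuaded run with one head's output swapped for its clean-run output.

```python
# Toy illustration of activation patching; hypothetical heads and numbers.

def run_heads(persuasive_passage):
    """Return each hypothetical head's contribution to the 4 option logits."""
    return {
        "L3.H1": [0.1, 0.1, 0.1, 0.1],   # ignores the passage
        "L9.H5": [0.2, 0.0, 0.1, 0.0],   # ignores the passage
        # The only head whose output depends on the passage.
        "L17.H24": [0.0, 3.0, 0.0, 0.0] if persuasive_passage else [2.0, 0.0, 0.0, 0.0],
    }

def answer(head_outputs):
    """Sum the head contributions and return the index of the winning option."""
    logits = [sum(column) for column in zip(*head_outputs.values())]
    return logits.index(max(logits))

clean = run_heads(persuasive_passage=False)
persuaded = run_heads(persuasive_passage=True)

def patch_and_answer(head_name):
    """Persuaded run, with one head's output replaced by its clean-run output."""
    patched = dict(persuaded)
    patched[head_name] = clean[head_name]
    return answer(patched)

# Heads whose patching alone brings back the clean answer.
restored = [h for h in persuaded if patch_and_answer(h) == answer(clean)]
print(answer(clean), answer(persuaded), restored)
```

In this toy, patching either of the first two heads changes nothing, and patching the third restores the clean answer, which is the "flat plain with one mountain" shape in miniature.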

3:39Bella: And this isn't a Llama-3 idiosyncrasy. They replicate across four model families — Llama, Qwen, Gemma-2, Gemma-3 — and the specific head address moves around, but the structure doesn't. In every case it's a small number of mid-layer attention heads. The feedforward blocks, the MLPs that people sometimes assume do most of the reasoning work, barely move the needle. The persuasion machinery lives in attention, and within attention it lives in a handful of mid-layer heads.

4:11Tyler: So we have a location. What do these heads actually do once you find them?

4:16Bella: Here's where the paper gets visually striking, and I want to spend some time on it because the picture is what the rest of the analysis is built on. The authors take the output of one of these decision heads — the thing it writes into the residual stream — and ask: of all the directions it writes into, which directions carry the most variation across different prompts? The answer is dramatic. Three directions. Three dimensions capture about seventy-six percent of the variance, and then there's a sharp cliff — the fourth direction drops to about four percent. So whatever the head is doing, it's doing it in essentially a three-dimensional subspace. Three numbers per token. Most of the head's apparent richness is collapsing onto a tiny manifold. And then you ask: in that 3D space, where do the four answer options sit? Each question in their benchmark has four options. When you project the head's activations into the 3D basis and color the points by which option the model picked, you don't see a smear or a cloud. You see four crisp clusters. And not four clusters in arbitrary positions — four clusters at the four vertices of a near-regular tetrahedron. A pyramid, with each answer choice anchored at one corner.

5:36Tyler: Which is a very particular structure to land on. A regular tetrahedron is the most symmetric arrangement you can have for four points in three dimensions: every vertex is equidistant from every other. The model has, somewhere in training, settled on encoding "which of four options am I picking" by literally placing its internal state at one of four equidistant corners of a pyramid. Nothing in the architecture demands that. It's a learned geometry.
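The geometry is easy to check numerically. The sketch below uses the textbook coordinates of a regular tetrahedron, not activations from any model, and confirms two facts: all six pairwise distances are equal, and the cosine between any two vertices (measured from the center) is -1/3. The `before` and `after` points are made-up states, included only to show what "snapping to a different vertex" means as a nearest-vertex lookup.

```python
import itertools
import math

# Standard coordinates for a regular tetrahedron centered at the origin.
VERTICES = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def cosine(a, b):
    """Cosine of the angle between two vectors from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

pairs = list(itertools.combinations(VERTICES, 2))
distances = {round(math.dist(a, b), 9) for a, b in pairs}
cosines = {round(cosine(a, b), 9) for a, b in pairs}

def nearest_vertex(point):
    """Index of the vertex closest to a 3D state."""
    return min(range(4), key=lambda i: math.dist(point, VERTICES[i]))

# Hypothetical head states before and after a persuasive passage.
before = (0.9, 1.1, 1.0)
after = (1.0, -0.9, -1.1)

print(distances, cosines, nearest_vertex(before), nearest_vertex(after))
```

A single distance and a single cosine come out of the sets, which is what "every vertex equidistant from every other" amounts to.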

6:05Bella: And here's the moment that gives the paper its title. When persuasion succeeds — when you slip the bad passage in and the model flips from the right answer to a wrong one — what happens in that 3D space? It's not a drift. It's not a slow swelling of uncertainty where the point drifts toward the middle and leans toward another corner. It's a jump. The point was sitting at one corner of the pyramid. After the persuasive passage, it's sitting at a different corner. A switch flips. The geometry stays the same, the vertices stay where they are — only which vertex the state has snapped to has changed.

6:45Tyler: That reframes what persuasion even is. The intuitive picture of an LLM being persuaded is something gradient-like — the model becomes a little less sure, then a little less sure, and finally tips over a threshold. That picture is wrong. What's happening is discrete. The authors write it tightly: "the persuasion mode is not a degradation of factual knowledge, nor a gradual drift in uncertainty. It is a discrete jump between geometrically well-separated choice vertices."

7:17Bella: The light switch analogy is the right one. Imagine a four-position switch — not a dimmer, a discrete switch, with four detents that each light up a different bulb. The model's internal state, at this layer, is the position of the switch. Persuasion doesn't fog the switch or weaken its conviction. It reaches in and flicks it to a different position.

7:40Tyler: Now, you might assume that once we know where the persuasion lives — in this one head — we're looking at a sophisticated reasoning module that's been fooled. The head is weighing evidence, considering the persuasive passage, deciding which option is more credible. The authors check this. The answer is no. The decision head isn't reasoning. It's copying.

8:04Bella: This is the hinge of the whole paper, so let me land it carefully. Every attention head, in any transformer, does two logically separate things. First, it decides where to look — it scans the earlier tokens in the context and computes a score for each, then picks where to focus. That's one circuit, usually called QK, for query and keys. Second, once it has decided where to look, it pulls information from that location and writes something into the residual stream. That's the other circuit, OV. The clean way to think about it: picture a clerk at a desk with four index cards laid out, each card holding one possible answer. The clerk's job has two parts. Pick which card to look at — that's the first circuit. Transcribe what's on it — that's the second. Two separate operations, and crucially, the authors can intervene on each one independently.

9:02Tyler: And when they do, they find something clean. The transcription circuit is essentially a faithful copy. They measure the alignment between the input option tokens and what the head writes when it attends to them, and the cosine similarities run from 0.94 to 0.99 along the diagonal. Nearly perfect copying. Negative off-diagonal — meaning the options don't blur into each other. The head, once it has decided where to look, is doing nothing more sophisticated than transcription.

9:34Bella: Which means all the action — all the persuasion — has to be in the other half. In the decision of which card to look at. The authors confirm this directly: they redo the patching experiment, but instead of swapping the whole head's output, they swap only the attention pattern — only the "where to look" — and leave the values alone. That recovers about thirty-six percent of the right answers. Swapping the entire head recovers about forty-one. So attention pattern alone explains roughly eighty-eight percent of the effect.
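A minimal sketch of the "where to look" versus "what to copy" split, under loud assumptions: the four value vectors are one-hot placeholders so that copying is easy to read off, the scores are invented, and the values are identical in both runs, so in this toy pattern-only patching and full-head patching coincide. The last line simply reproduces the ratio quoted in the episode.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(attn, values):
    """OV side treated as a plain copy: attention-weighted sum of the values."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]

def picked(output):
    """Which option the head's output points at."""
    return output.index(max(output))

# Hypothetical value vectors, one per option token.
values = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

clean_scores = [4.0, 0.0, 0.0, 0.0]      # QK side attends to option 0
persuaded_scores = [4.0, 8.0, 0.0, 0.0]  # the passage boosts option 1's score

clean_attn = softmax(clean_scores)
persuaded_attn = softmax(persuaded_scores)

persuaded_choice = picked(head_output(persuaded_attn, values))
# Pattern-only patch: clean attention weights, values left untouched.
pattern_patched_choice = picked(head_output(clean_attn, values))

# Share of the full-head effect recovered by the pattern alone (36 of 41 points).
share = 36 / 41
print(persuaded_choice, pattern_patched_choice, round(share * 100))
```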

10:10Tyler: The transcription is innocent. The clerk has been handed a different card to look at, and the clerk has dutifully copied whatever was on it. Persuasion is not corruption of knowledge or reasoning. It's misdirection of attention.

10:25Bella: So now the question becomes: what tricks the routing into pointing at the wrong option? And here's where the paper makes its sharpest move. The routing logic is computing attention scores using high-dimensional matrices — there's a lot of apparent machinery. The authors approximate that machinery with what's called a rank-1 factorization, and I'm going to skip the math because the result is the only part that matters. The result: all of the high-dimensional routing logic collapses to a single scalar score on each option token. Every option in the prompt gets one number. The decision head picks whichever option has the highest number. That's the entire decision.

11:08Tyler: That's worth sitting with. Behind what looked like a complicated control panel — many wires, many inputs — there is, functionally, one dial. The shallow parts of the network write a particular signal into the residual stream at the option tokens. The decision head measures how much of that signal each option carries. Highest score wins.

11:30Bella: And the validation is the part I find most satisfying. They identify the specific direction in the residual stream that carries this signal — call it the routing feature. Then they intervene directly. Pick any option you want the model to choose. Add a multiple of the routing feature to that option's token, and watch the selection rate. As you turn the dial up, the model picks that option more and more often, saturating at roughly four times the natural magnitude — enough to flip behavior almost completely. Turn the dial down — subtract the feature — and the model stops picking that option. The feature is the lever. They can steer the model's answer by tweaking one scalar at one token at one layer.
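One way to see why a rank-1 approximation yields "one number per option": if the matrix in an attention score q^T W x is replaced by an outer product u v^T, the score becomes (q·u)(v·x), a fixed gain times the projection of each option token onto a single direction v. The sketch below assumes that form. The direction, gain, and option states are invented, and whether the paper factorizes exactly this matrix is an assumption here, so treat it as an illustration of the steering experiment's logic and not as the authors' procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

ROUTING_FEATURE = [0.6, 0.8, 0.0]  # hypothetical unit-norm direction v
GAIN = 2.0                         # stands in for the query-side factor q·u

def scores(option_states):
    """One scalar per option: gain times projection onto the routing feature."""
    return [GAIN * dot(ROUTING_FEATURE, x) for x in option_states]

def choice(option_states):
    s = scores(option_states)
    return s.index(max(s))

def steer(option_states, target, alpha):
    """Add alpha times the routing feature to one option token's state."""
    steered = [list(x) for x in option_states]
    steered[target] = [xi + alpha * fi
                       for xi, fi in zip(steered[target], ROUTING_FEATURE)]
    return steered

# Hypothetical residual-stream states at the four option tokens.
options = [[1.0, 0.5, 0.3], [0.2, 0.1, 0.9], [0.1, 0.3, 0.2], [0.0, 0.2, 0.7]]

print(choice(options))                       # unsteered choice
print(choice(steer(options, 3, 0.5)))        # small push on option 3
print(choice(steer(options, 3, 1.0)))        # larger push on option 3
print(choice(steer(options, 0, -1.0)))       # subtract the feature from option 0
```

In this toy, a small push leaves the choice unchanged, a larger one flips it to the steered option, and subtracting the feature from the current winner makes the model drop it.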

12:16Tyler: One scalar, one token, one layer, controlling which of four options a multi-billion-parameter model picks. That's the headline number to remember.

12:25Bella: There's one more layer to peel back, and it's the most concrete piece of the paper. Where does the routing feature come from? It doesn't appear out of nowhere — by the time the decision head reads it, somebody has written it. And the somebody, it turns out, is a band of shallower attention heads, mostly in layers eight through twelve. These shallow heads are doing keyword recognition. The example the authors give in the very first figure is vivid: the persuasive keyword is "Nigeria." A factual question gets answered correctly on a clean prompt. Slip a passage into the context that mentions Nigeria as part of a misleading argument, and the model picks the option containing Nigeria. The shallow heads see the word. They write the routing signal into the residual stream at the option token that also contains Nigeria. The decision head, layers later, reads the highest signal at that token, attends to it, and copies whatever's there.

13:27Tyler: So now we have the full relay. Runner one: shallow heads in layers eight through twelve read persuasive keywords from the prompt and write the routing signal onto matching option tokens. Runner two: the decision head, somewhere in the middle of the network, scans the option tokens for whichever has the strongest signal and attends to it. Runner three: the copy circuit transcribes whatever the decision head landed on, and writes the answer. It's a chain. Each link is verified by intervention. And each link is, in principle, somewhere you could put a monitor.

14:03Bella: That's the full mechanism. Keyword in the prompt, routing feature constructed, attention rerouted, wrong answer copied. The thing that makes this paper feel different from a lot of interpretability work is that it isn't a single clever finding — it's a complete causal chain, each step independently verified, end to end.
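The three-stage relay can be caricatured in a few lines. This is a cartoon, not a model: keyword recognition is reduced to word overlap, the "prior" list is an invented stand-in for the model's clean preference, and the option list is hypothetical apart from the keyword "Nigeria" mentioned in the episode.

```python
def write_routing_signal(passage, options, boost=1.0):
    """Runner one: add signal to every option that shares a word with the passage."""
    passage_words = set(passage.lower().split())
    return [boost * len(passage_words & set(opt.lower().split())) for opt in options]

def decision_head(prior, signal):
    """Runner two: attend to the option with the strongest combined score."""
    total = [p + s for p, s in zip(prior, signal)]
    return total.index(max(total))

def copy_circuit(options, index):
    """Runner three: transcribe whatever the decision head landed on."""
    return options[index]

def relay(passage, options, prior):
    signal = write_routing_signal(passage, options)
    return copy_circuit(options, decision_head(prior, signal))

options = ["Kenya", "Nigeria", "Ghana", "Egypt"]  # hypothetical option list
prior = [0.9, 0.2, 0.1, 0.1]                      # hypothetical clean preference

print(relay("", options, prior))
print(relay("many sources point to nigeria here", options, prior))
```

Each function is a place where an intervention or a monitor could sit, which is the point Tyler makes about the chain.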

14:24Tyler: Bella, let me ask the obvious skeptical question before we get to the steelman. This was all done on a multiple-choice benchmark. Four options, model picks one. Does any of this matter for the actual real-world scenario we opened with — websites trying to hijack AI search results?

14:43Bella: It's the right question, and the authors anticipate it. They built a second benchmark called Geo-Bench, designed to be more realistic. The model is asked to pick the best web source for a query, given a small set of candidate sources. One of those sources is replaced with a search-engine-poisoned version — content crafted in the GEO style to look authoritative on a specific query. And then they rerun the analysis. The decision heads are still there. The tetrahedron is still there — four sources, four vertices of a pyramid in the head's output space. The attention rerouting still works. Patching the attention pattern on the same decision heads pulls the model back toward the legitimate source. It's not a perfect replication. The numerical effects are weaker on Geo-Bench than on the controlled benchmark, and the paper doesn't fully redo the rank-1 routing feature analysis or the upstream keyword localization for the GEO scenario. But the spine of the mechanism — decision heads, tetrahedral geometry, attention rerouting — holds.

15:51Tyler: So the answer to "does this matter for real systems" is a qualified yes. The architecture of the failure isn't an artifact of one toy benchmark — the same shape shows up in a setting much closer to what an actual attacker on the open web would do.

16:08Bella: Tyler, we should sit with the steelman of this, because I think there are at least two places where a careful reader would push back, and the second one is real.

16:18Tyler: Right, and the first one is what I started gesturing at. The whole analysis lives inside a setup where the model has to output a single number — pick one of four — as its first token. That's structurally a token-copy problem. There are literally option tokens in the prompt waiting to be attended to and copied. Is it any surprise that a copy-routing circuit emerges as the mechanism? The task practically demands one. The harder question is what happens when the model has to generate an answer rather than pick one. If someone asks an LLM a factual question and it has to write a paragraph in response — no option tokens to point at — is there still a decision head, still a tetrahedron, still a one-dimensional routing feature? We don't know. The authors are explicit about this in their limitations. They've mapped a specific kind of persuasion in a specific kind of task. The mechanism might generalize. It might not.

17:18Bella: And to their credit they don't hand-wave it. They name it as the obvious next experiment. But we should be honest that the most evocative parts of the paper — the tetrahedron, the discrete vertex jump, the single dial — those properties are tied to a four-option choice geometry. In free-form generation, the geometry of the choice isn't the geometry of "which of four tokens do I attend to." So the form of the answer may not transfer, even if the spirit of "narrow circuit doing a copy job" does.

17:50Tyler: The second push is sharper. The rank-1 approximation is well-validated — the reconstruction error is small, the cross-validation holds up. But it's also a constrained search. You go looking for the single best direction that explains the routing behavior, and you find one. That doesn't tell you how much explanatory power lives in the next few components, or what those would represent. For interpretability that's mostly fine. For security it matters more, because an attacker who wants to bypass a defense built around the rank-1 feature would specifically design attacks that operate in the residual space — the small, unexplained corner of the routing behavior that the rank-1 lens doesn't see. The paper's framing is that they've found *the* routing feature. The honest framing is that they've found the dominant routing feature.

18:45Bella: That's fair. And I'd add a smaller note on the transfer claim. "The mechanism transfers to GEO" is what the abstract suggests, but as I said, the transfer is partial. The decision heads transfer cleanly. The geometry transfers cleanly. On Geo-Bench itself, attention-pattern patching recovers about fifty-six percent while patching the full head output recovers about sixty-nine percent — so the same gap between "where to look" and "what to copy" shows up, but the absolute magnitudes are lower than what you'd want. And the upstream keyword analysis and the rank-1 feature work haven't been fully replicated in the GEO setting. So the right summary is: the architecture of the mechanism transfers, with some loss of magnitude and some experiments not yet repeated.

19:36Tyler: Three honest dings. None of them undo the contribution. The contribution is the picture: a complete, causally verified circuit for one well-defined kind of persuasion. That picture is real even if the picture is incomplete.

19:51Bella: So what does this change? The paper itself doesn't build defenses — the authors are careful to say they've drawn the map, not built the guardhouse. But the map suggests a class of defense that wasn't really available before. If persuasion enters through a specific feature in a specific layer, you can in principle watch that feature at inference time. A runtime monitor doesn't need to retrain the model. It doesn't need to inspect the prompt for adversarial content — which is brittle, because attackers can always come up with new phrasings. It just needs to read one scalar off the residual stream at one layer and notice when that scalar spikes on an option token. The mechanism becomes the signal. You could also, in principle, intervene. Subtract the routing feature at inference time, and you blunt persuasion at its source. Not the prompt, not the output — the mechanism itself.
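A rough sketch of what such a monitor and intervention might look like, assuming the routing feature is available as a unit vector and that a baseline of typical projections has been collected. The threshold, the baseline values, and the two-dimensional states are all invented; the paper, as noted in the episode, does not build a defense, so this is one possible reading of the idea and nothing more.

```python
from statistics import mean, stdev

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def flag_spikes(option_states, feature, baseline, z_threshold=3.0):
    """Return indices of option tokens whose projection is unusually high."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [i for i, x in enumerate(option_states)
            if (dot(feature, x) - mu) / sigma > z_threshold]

def ablate(state, feature):
    """Remove the component of a state along a unit-norm feature direction."""
    p = dot(feature, state)
    return [s - p * f for s, f in zip(state, feature)]

feature = [1.0, 0.0]                      # hypothetical unit routing feature
baseline = [0.9, 1.1, 1.0, 0.95, 1.05]    # hypothetical benign projections

# Hypothetical residual states at four option tokens; the third one spikes.
option_states = [[1.0, 0.3], [1.05, 0.1], [4.0, 0.2], [0.9, 0.5]]

flagged = flag_spikes(option_states, feature, baseline)
cleaned = [ablate(x, feature) if i in flagged else x
           for i, x in enumerate(option_states)]
print(flagged, cleaned[2])
```

The design choice worth noting is that the monitor never looks at the prompt text, only at one projection per option token, which is what makes it insensitive to rephrasings of the attack.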

20:50Tyler: And the broader intellectual move worth flagging, separate from the specific defenses: this is one more datapoint in a slowly accumulating body of evidence that significant LLM behaviors are sparse. Sycophancy, factual recall, persuasion susceptibility — these keep turning out to live in small, identifiable subcircuits rather than being smeared across the whole network. Each new instance modestly strengthens the bet that LLMs, despite the scale, have legible parts. The persuasion circuit here is a particularly clean example. Roughly one head out of a thousand carrying the load on a behavior that benchmark studies report happening between twenty-nine and sixty-two percent of the time in real high-stakes domains. The defensive surface is much narrower than the attack surface, which is, for once, good news.

21:45Bella: There's one detail I want to flag because I think it's underappreciated. The entire analysis runs on a single GPU. No training, no fine-tuning, just careful inference-time experiments. The story you sometimes hear about interpretability — that it's only feasible for labs with massive compute — isn't really true at this scale of investigation. The bottleneck is methodological cleverness, not hardware.

22:11Tyler: Which is healthy to remember, Bella. The most useful interpretability findings often come from sitting with a model for a long time and asking the right intervention questions, not from throwing more compute at the problem.

22:25Bella: There's a sentence from the paper I keep coming back to. The authors write that the copy circuit does not need to create the choice representation — the representation is already there, waiting to be copied. Which is a quiet way of saying something profound about how at least some of what LLMs do isn't computation in the way we usually imagine. It's a kind of disciplined routing of information that was already in the input. The vulnerability follows from that fact. If the model is mostly copying, then controlling what it copies is mostly a matter of controlling where it looks.

23:02Tyler: And the door is narrow. That's the part worth holding onto.

23:06Bella: This has been AI Papers: A Deep Dive. The paper's linked in the show notes, along with some related reading if this is your kind of thing. Thanks for listening.
