Episode 175 · Jun 26, 2026 · 26 min

One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent

Shportko, Bhokare, AlZahrani et al.

paperdive.ai

Concepts in this episode

Mechanistic Interpretability · AI Safety · Agentic AI · Sparse Features / SAE · RL Post-Training · Tool Use · Activation Steering · Superposition Hypothesis · Capability vs. Propensity · Structural Transfer · Agentic RL · Residual Stream

About this episode

Paper: Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Venue: arXiv:2606.26474
Year: 2026
Read the paper: arxiv.org/abs/2606.26474

Reinforcement learning spent a whole training run teaching a model to use tools — and it turns out you can find that skill, grab one internal feature, and flip the behavior on at runtime with no retraining at all. But the same evidence that says the skill lives in one place also shows it quietly leaking into a model that was never trained for it. This episode unpacks what RL actually localizes, where it lives, and why you can concentrate a capability but never fully wall it off.

What you'll take away

Why a single 'dedicated' crosscoder feature, steered at inference time with no weight changes, can recover most of an RL model's tool-calling accuracy
How just routing activations through the sparse dictionary and back raises tool correctness from 19% to ~50% — even though reconstruction quality barely predicts the gain
The 'capability spillover' result: a frozen base model, never trained for tools, picks up tool selection (0% to ~7%) just by passing through the shared crosscoder — but never reproduces the tool-call syntax
Why the exclusive feature shelf is a coffee filter, not a sealed sink — penalizing it degrades the RL model, proving the captured signal is load-bearing and leaky
The honest limits: the +65 number comes from one best-performing cell on 40 prompts with a wide confidence band, and the DFC's advantage is legibility, not better performance
Why the cleanest features are structural-template detectors — and why that may be exactly why a tool-calling skill concentrates into one dial when a messier capability might not

Chapters

00:00 Where does an RL skill actually live?
02:34 Reading the model's muddy scratchpad
04:26 Bolting down the shelves: the DFC
07:13 One master switch versus a fuse box
09:29 Feature 136 turns a hedger into an agent
11:03 Why lossy reconstruction makes it better
13:09 A frozen model catches the trick
15:10 A coffee filter, not a sealed sink
18:22 How soft is that headline number?
22:08 When your interpretability tool leaks

References in this episode

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning — The Anthropic sparse-autoencoder work that grounds the episode's 'separate the m
Sparse Crosscoders for Cross-Layer Features and Model Diffing — The original crosscoder writeup that introduced the shared-dictionary model-diff
Toy Models of Superposition — The foundational account of why a few-thousand-dimensional scratchpad packs far

Full transcript

0:00Bella: Here's a model that refuses to do its job. You hand it a task — split this list into chunks of two — and it stalls. "Could you please provide more details about the task?" It never acts. Then researchers reach inside the network, turn up a single internal feature, and it changes its mind. It reasons through which function fits, and writes out a clean, correct tool call. One feature. And across their experiments, flipping that one dial is worth sixty-five percentage points of tool-calling accuracy.

0:34Tyler: Quick heads up before we dig in — this is an AI-generated explainer, both voices included. And the reason that one-dial result matters is what it implies. Reinforcement learning spent an entire training run teaching this model to use tools. And it looks like the thing it installed is sitting in a spot you can find and grab at runtime — without touching the weights.

0:59Bella: That's the promise. By the end you'll know where an RL-trained skill actually lives inside a language model, whether you can point at it and steer it with no retraining at all — and why the very same evidence that says "it's in one place" also shows that ability quietly leaking into a second model that was never trained for it.

1:21Tyler: This matters because the next wave of AI systems are agents — they call tools, hit APIs, take actions that have consequences. Right now we can measure what RL teaches them, but we mostly can't point to it, audit it, or switch it off. Everything in the standard toolkit — the various flavors of fine-tuning — needs retraining to change behavior. The dream here is different: find the relevant machinery, turn a knob, done. This paper is a test of whether that's even possible.

1:54Bella: So let's start with the puzzle. You take a plain chatbot — in this case Qwen2.5, a three-billion-parameter model — and you fine-tune it with reinforcement learning on tool-use tasks. The "before" model talks about doing things. The "after" model reliably emits a structured tool call: a machine-readable block that names a function and its arguments. The behavior change is obvious. What's invisible is the mechanism. Did RL build a brand-new capability? Amplify something already latent? And where, physically, in the network does "I should call a tool now" actually live?

2:32Tyler: And I want to flag one thing early, because the authors are careful about it and we should be too. When they say "capability," they mean something specific — the propensity to emit a correct tool call under a fixed prompt. They're not claiming the base model is incapable of tool use in any absolute sense. So everything we say about installing or transferring a capability is really about shifting that propensity. Keep that in your back pocket; it comes back.

3:03Bella: Fair. So how do you even look for where a behavior lives? You have to understand one thing about how these models hold information. As the model processes text, it carries a long running vector through itself — call it a scratchpad — that every layer reads and writes. The catch is the scratchpad doesn't store one concept per slot. A network with a few thousand dimensions juggles far more than a few thousand distinct concepts, packing overlapping meanings into the same space. That packing has a name — superposition.

3:38Tyler: Think of the model's internal state at each step as a bucket of mixed paint. It's one muddy color, but it's actually many pigments blended together. You can't read it directly.

3:50Bella: Right. So the tool researchers use is a sparse dictionary. It's a learned list of directions — call them features — with one rule: only a handful are allowed to be active at once for any given input. You train it to rebuild the model's scratchpad using as few active features as possible. And the payoff is that the recovered features tend to be roughly single-meaning. One fires for "this is about Python lists," another for "this is a polite request." It separates the mud back into named pigments.
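
A minimal sketch of the sparse-dictionary idea described above, in plain Python. It assumes a top-k rule (keep only the k strongest features) and reuses the same toy directions for encoding and decoding. The episode doesn't say which sparsity mechanism the paper uses, and every name, size, and number below is made up for illustration.

```python
# Toy sparse dictionary: encode a vector, keep only the k strongest features,
# then rebuild the vector from those features alone.
# Hypothetical sizes and weights; not taken from the paper.

def encode(x, encoder_rows):
    """One activation per feature: ReLU of the dot product with its row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in encoder_rows]

def top_k(acts, k):
    """Zero every activation except the k largest (the sparsity rule)."""
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def decode(acts, decoder_rows):
    """Rebuild the vector as a weighted sum of feature directions."""
    dim = len(decoder_rows[0])
    return [sum(a * row[j] for a, row in zip(acts, decoder_rows)) for j in range(dim)]

# Three hypothetical features living in a 2-dimensional "scratchpad".
directions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [0.9, 0.1]

acts = top_k(encode(x, directions), k=1)  # only one feature may stay active
x_hat = decode(acts, directions)          # approximate rebuild of x
```

With k=1 only the first feature survives, so the rebuilt vector keeps the dominant component of x and drops the rest. That dropped remainder is the lossy part of reconstruction.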

4:25Tyler: And the twist this paper leans on is doing that for two models at the same time. A crosscoder is one shared dictionary trained to reconstruct the internal vectors of both the base model and its fine-tuned cousin. The reason you'd want that is what the field calls model diffing — you get to see which features both models use, versus which ones belong to only one. If RL installed something new, a crosscoder is how you'd hope to spot it.

4:56Bella: But a plain crosscoder has a problem for this question. It's one undifferentiated pool. Every feature can serve either model freely, so "which features are RL-specific" is something you have to tease out after the fact.

5:11Tyler: Which is where the architecture under test comes in — the Dedicated Feature Crosscoder, the DFC. Picture a shared toolshed used by two workers. The DFC bolts down the shelving. Some shelves are labeled "RL model only," some "base model only," and there's a communal rack in the middle. And it's enforced mechanically — during training they literally zero out the wrong connections, so a feature assigned to the base model physically cannot write into the RL model. Plain crosscoder: one pool. DFC: three labeled bins.
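
The "bolted-down shelves" can be pictured as a mask over decoder connections. The sketch below is one hedged reading of that description: it assumes the masking is applied per model to the decoder, and the partition labels, sizes, and weights are all hypothetical. How the paper actually implements the constraint isn't covered in the episode.

```python
# Hypothetical partition of a five-feature dictionary into three bins.
PARTITION = ["rl_only", "rl_only", "base_only", "shared", "shared"]

def decoder_mask(model, partition):
    """1.0 where a feature may write into `model`, 0.0 where it is blocked."""
    allowed = {
        "rl": {"rl_only", "shared"},
        "base": {"base_only", "shared"},
    }[model]
    return [1.0 if label in allowed else 0.0 for label in partition]

def masked_decode(acts, decoder_rows, mask):
    """Weighted sum of feature directions, with blocked features zeroed out."""
    dim = len(decoder_rows[0])
    return [
        sum(m * a * row[j] for m, a, row in zip(mask, acts, decoder_rows))
        for j in range(dim)
    ]

# Made-up decoder directions into a 2-dimensional RL-model stream.
rl_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.1]]

rl_mask = decoder_mask("rl", PARTITION)
base_mask = decoder_mask("base", PARTITION)

# Fire only the base-exclusive feature and try to write into the RL model.
only_base_feature = [0.0, 0.0, 5.0, 0.0, 0.0]
rl_write = masked_decode(only_base_feature, rl_decoder, rl_mask)
```

Under this masking, `rl_write` is all zeros no matter how strongly the base-exclusive feature fires, which is the sense in which it cannot write into the RL model.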

5:48Bella: So the sharp, testable question is — when RL installed tool use, did that capability land neatly on the RL model's private shelf?

5:57Tyler: That's the whole experiment. And here's the distinction to hold onto, because it's the honest center of the paper. What they can actually show is that the DFC reaches the same behavioral ceiling as a plain crosscoder, but with far fewer features you can point at. That's a claim about legibility — about being able to find the switch — not a claim that the DFC is a better model. Those are different things, and the gap between them matters later.

6:27Bella: Let me give you the plain-language takeaway up front, so it's yours even if you wander off. RL's tool-use ability concentrates into the DFC's exclusive shelf tightly enough that one feature, steered at runtime, flips the behavior on. But — second finding — it doesn't concentrate perfectly. Some of it leaks into the communal shelf, which is why a frozen base model passively picks up a bit of the skill, and why squeezing the exclusive shelf actually makes the RL model worse. Concentration, not isolation. That's the story, three ways.

7:03Tyler: And the mechanism behind all three is next — it pays off in a single before-and-after example where one feature turns a stalling chatbot into a working agent. That's the money shot, and it's worth setting up properly.

7:17Bella: So, the steering. The procedure is simpler than it sounds. You rank the features by how cleanly they fire on tool prompts but stay dark on ordinary text — features that light up for tool-calling and basically nothing else, sorted by how sharply they separate the two. Then you take the top one's direction and add it back into the scratchpad at runtime, scaled by a gain knob. No weight changes. You're just nudging the running state in the direction that feature represents.
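
The two steps above (rank, then add) can be sketched in a few lines. This is an illustration under assumptions: the ranking score here is a simple difference of mean activations, which may not be the metric the paper uses, and the activations, direction, and gain are invented numbers.

```python
def mean(values):
    return sum(values) / len(values)

def rank_features(tool_acts, plain_acts):
    """Order feature indices by (mean on tool prompts) - (mean on plain text)."""
    n_features = len(tool_acts[0])
    gap = [
        mean([row[i] for row in tool_acts]) - mean([row[i] for row in plain_acts])
        for i in range(n_features)
    ]
    return sorted(range(n_features), key=lambda i: gap[i], reverse=True)

def steer(hidden, direction, gain):
    """Add a scaled feature direction to the running state; no weights change."""
    return [h + gain * d for h, d in zip(hidden, direction)]

# Hypothetical feature activations: rows are prompts, columns are features.
tool_acts = [[0.1, 2.0, 0.5], [0.0, 1.8, 0.4]]
plain_acts = [[0.1, 0.0, 0.5], [0.2, 0.1, 0.3]]

ranking = rank_features(tool_acts, plain_acts)  # feature 1 separates best
top_direction = [0.0, 1.0]                      # made-up decoder direction
steered = steer([1.0, 2.0], top_direction, gain=4.0)
```

The gain is the knob: at zero the state is untouched, and larger values push it further along the chosen feature's direction.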

7:49Tyler: And the headline is the saturation curve. Walk me through what's on screen, because this is the figure that makes the case.

7:58Bella: So the screen shows accuracy against the number of features you steer. For the DFC, watch the very first point: you add one A-exclusive feature, meaning one from the RL model's private shelf, and the curve jumps to its peak, plus sixty-five points. Add a second feature, a third, all seven available: flat. No improvement. One feature already did everything. Now overlay the plain crosscoder. It ends at about the same height, roughly plus seventy, but it has to climb. It needs thirty-three separate features to get there.

8:28Tyler: So it's one master switch versus a fuse box. Same lit room either way. The DFC hands you a single labeled switch; the plain crosscoder makes you flip thirty-three breakers to reach the same brightness.

8:41Bella: Exactly. And notice what that is and isn't. It is not more light — both reach the ceiling. The contribution is that you can find and operate one switch instead of hunting the whole panel. That's legibility.

8:54Tyler: And there's a detail in there that's almost funny. They ran the obvious control — steer the base model's exclusive features instead. Those columns are zeroed out by construction, so the effect should be exactly nothing. And it was exactly nothing. Which is a good sign the harness isn't fooling them. They also found that combining the A-exclusive feature with shared features actually does worse than either alone — the directions interfere destructively. So the paint analogy breaks down right there: add two pigments and you can get less, not more.

9:30Bella: Now the case study, because this is where the number becomes a thing you can see. Same model, same prompt — a task involving list utilities, a function called split list. Before steering, on screen: "Could you please provide more details about the task?" It hedges. It asks for clarification it doesn't need. It never calls anything. Then they steer one feature — feature 136 — at a moderate gain. After: the model reasons, "The user wants to split the given list into chunks of two... we can use the split list function," and emits a syntactically valid tool call with the right arguments.

10:10Tyler: One feature flip, and a confused hedger becomes a competent agent. Same weights, same prompt.

10:16Bella: Same everything except one dial. And here's the part that reframes what RL actually localized. They ran automated interpretation on the top features to ask what they detect. The most discriminative ones aren't some abstract "tool reasoning" concept. They're structural-template detectors. They fire on the formatting markers — the literal tool-call and response tags. The cleanest thing RL concentrated is the syntactic scaffolding of a tool call. The shape of the command, not the idea of using one.

10:50Tyler: That's a clue we should bank, because it explains the next result and it's also where the critique eventually bites.

10:58Bella: There's an even stranger result they stumbled into, and it shows up before any steering is applied. It turns out you don't even need to steer anything. Just passing the model's activations through the crosscoder and back (encode, then decode, the plain reconstruction) already changes behavior. The RL model's tool correctness jumps from nineteen percent to about fifty. Just from being routed through the dictionary and back.

11:25Tyler: Wait — why would reconstruction help at all? Reconstruction is supposed to be lossy. You're throwing information away and rebuilding an approximation. If anything I'd expect the model to get worse, or stay flat.

11:39Bella: That's the intuition, and it's wrong in an informative way. The sparse bottleneck — only a few features active at once — acts like a filter that discards noisy components and keeps the small, behaviorally privileged subspace tool-calling actually needs. And here's the evidence that it's real: reconstruction quality barely correlates with behavioral improvement. The correlation is basically flat — about plus zero-point-zero-eight. Lower reconstruction error does not predict better tool use. So the task-relevant information lives in a small special subspace, and you can mangle everything else without hurting the behavior.

12:21Tyler: And the strength of that claim is the consistency, right? It wasn't one lucky run.

12:26Bella: That's the credibility backbone. They didn't cherry-pick a crosscoder. They trained forty-eight variants — sweeping the architecture, the dictionary size, the sparsity budget, how much of the dictionary is reserved as exclusive, and whether the exclusive shelf is penalized or free. The plus-thirty-one-points reconstruction gain is the mean across that sweep. And every single one of the forty-eight improved the RL model. Under a basic sign test, all forty-eight landing on the same side is essentially impossible by chance.
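
The sign-test arithmetic is short enough to check by hand. The sketch assumes the usual null hypothesis, that each run is independently as likely to hurt as to help. That independence is an idealization here, since the variants share a model pair.

```python
# Sign test for "all 48 of 48 runs improved".
# Null hypothesis (an assumption of the test): each run is a fair coin flip.
n_runs = 48

p_one_sided = 0.5 ** n_runs     # chance all 48 land on the "improved" side
p_two_sided = 2 * p_one_sided   # chance all 48 land on the same side, either way

print(f"{p_one_sided:.2e}")     # prints 3.55e-15
```

That is about one chance in 281 trillion, which is what "essentially impossible by chance" cashes out to under this null.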

13:01Tyler: Okay. So routing through the dictionary helps the RL model. Now do the thing that shouldn't work.

13:07Bella: Now take the frozen base model — the plain chatbot, never fine-tuned for tools — and pass its activations through that same jointly-trained crosscoder. No training. No weight changes. And its tool correctness rises from zero to about seven percent.

13:25Tyler: From zero. The model that was never taught the trick starts doing the trick — just from being routed through a dictionary built partly from its smarter cousin.

13:35Bella: That's capability spillover. The shared decoder weights carry tool-calling intent into the base model's stream. And think of it like a shared translation glossary. You build a glossary by studying an expert translator, then a novice works through that same glossary — and the novice starts picking the right technical terms they never learned. Seven points is small in absolute terms, and the variance is wide, but statistically it's robust across the whole sweep.

14:06Tyler: And there's a split inside that result that's the single cleanest clue in the paper. The novice picks the right terms — but can they reproduce the expert's exact document formatting?

14:18Bella: No. And that's the tell. The base model gains tool selection — it learns to name the right tool, in prose. But format accuracy spillover is exactly zero. Across all forty-eight runs, the base model never once reproduces the actual tool-call template, the exact syntax. So the semantic choice — which tool — lives partly in shared directions that leak. The surface-form machinery — the literal template — stays locked in the RL model's private features.

14:49Tyler: Which lines up perfectly with what those top features turned out to be. The structural-template detectors are exactly the part that doesn't spill. The syntax stays private; the intent bleeds through the communal rack.

15:03Bella: So that's the second finding. Now the third — and this is the one that turns the DFC's own design against it.

15:11Tyler: Right, and this is the conceptual heart, so let me build it. The DFC's exclusive shelf was designed to be a sink — a sealed container that fully holds the RL-specific capability. The question is whether it actually behaves like one. So they test it directly. They add a penalty on the exclusive shelf — a knob that pushes the model to avoid using those private features unless it really has to. If the shelf were just holding redundant junk, penalizing it would be harmless. The capability would route around it.

15:45Bella: And it isn't harmless.

15:46Tyler: Not at all. Penalize the exclusive shelf, and the RL model's tool fidelity degrades — the gain drops from about thirty-five points down to twenty-six. So the shelf is holding real, load-bearing signal. Squeeze it, and that signal doesn't vanish — it gets pushed back into the communal rack, where it's less efficient for the RL model but more available to leak into the base model. So it's not a sink. It's a coffee filter. It catches most of the grounds, but some fines slip through into the cup.

16:20Bella: And that single reframe ties all three results into one phenomenon. The filter concentrates the strongest model-specific residue — that's why one feature can steer it. But it leaks — that's why the base model inherits a little. And the leak is load-bearing — that's why squeezing it hurts. Filter, not sink. Same fact, three faces.

16:43Tyler: And there's a deeper claim underneath it. The reason you can't seal the capability off is that the two models' representations aren't orthogonal. Their capabilities are entangled in shared geometry. Which is a real statement about model diffing in general — it suggests perfect isolation may not be achievable even in principle. You can concentrate a capability. You may never be able to fully wall it off.

17:10Bella: They also checked that this geometry is real and not an artifact of the picture. When you project the feature directions down to two dimensions, the DFC produces three visually separate islands — RL-exclusive, base-exclusive, shared. The plain crosscoder produces one undifferentiated blob.

17:30Tyler: And here's the move I respect, because it's the obvious objection. A skeptic says: of course the DFC looks cleaner — one of those bins is small, and these projection plots can manufacture clusters out of nothing. So they built a fake matched-size partition for the plain crosscoder — slicing it into bins of the same proportions — and re-ran the identical pipeline. The separation vanished. Cluster recovery collapsed to roughly chance. So the clean geometry comes from the architecture forcing the split, not from how they happened to label things.

18:05Bella: So that's the case the paper makes. One steerable feature, a frozen model inheriting the skill, and a filter that concentrates but leaks. Where does it not hold up?

18:15Tyler: So this is the part I want to be careful about, because the paper is unusually honest about its own soft spots, and the strongest version of the critique is mostly amplifying what the authors already concede. Start with that headline number — the plus-sixty-five points from one feature. It comes from the single best-performing cell of the sweep, measured on forty prompts. Forty. And the confidence interval runs from about plus forty-eight to plus eighty-two. They explicitly say that best-cell number was not re-run on a fresh sample. So treat sixty-five as a favorable draw with a wide band around it, not a stable point estimate.
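
To get a feel for why forty prompts leaves a wide band, here is a standard Wilson score interval for a single proportion. This is not the paper's method (the episode doesn't say how the plus-48 to plus-82 interval was computed), and the count of 30 successes out of 40 is a hypothetical chosen only to show the width.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: 30 correct tool calls out of 40 prompts.
low, high = wilson_interval(30, 40)
width_in_points = 100 * (high - low)
```

With these hypothetical counts the interval spans roughly 60% to 86%, about 26 points wide, for a single accuracy measured on 40 prompts.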

18:56Bella: That's a real caveat. Does it sink the result?

18:59Tyler: No — but it right-sizes it. And the second piece is sharper. The cleaner story we'd want to tell — "the DFC reduces spillover and concentrates capability better than a plain crosscoder" — doesn't actually reach statistical significance. The architecture comparison is twelve crosscoder runs against thirty-six DFC runs, and the difference washes out. And on raw behavioral ceiling, the two architectures are statistically indistinguishable. So the DFC does not work better. It works more legibly — same ceiling, fewer features to point at. That's the defensible claim, and it's narrower than the headline makes it sound.

19:44Bella: And the word "capability" is carrying weight it maybe shouldn't.

19:48Tyler: That's the third edge, and it's the one I'd push hardest. "Capability" here is shorthand for "propensity to emit a correct tool call under one fixed prompt." They never ran a base-model baseline with a prompt tuned to maximally elicit tool use. So "RL installs tool-calling" and "the base model inherits it" should both be read as modulating a propensity under a fixed prompt — not installing or transferring an ability from scratch. A well-prompted base model might already do better than that seven-point spillover suggests.

20:26Bella: And the generality question.

20:28Tyler: One model pair, one task. And here's the worry that connects back to that earlier clue — the top features were structural-template detectors. Tool-calling has an unusually crisp syntactic signature: a rigid template with literal tags. The single-feature result may be clean precisely because the capability is dominated by that template. A messier capability — one without a fixed surface form — might not concentrate into one dial at all. We don't know yet. And on top of that, they only ran interpretation on the RL model's exclusive features. The shared shelf — the one they argue carries most of the spilled capability — was never actually inspected. So "the capability is delocalized into shared weights" is inferred from behavior, not confirmed by looking at what those shared features are.

21:22Bella: I'll concede all of that. The point estimate is soft, the architecture comparison is underpowered, and "capability" is a propensity under one prompt. What I won't give up is the qualitative shape — one labeled feature reaching the ceiling, a frozen model inheriting some of the skill, and a partition that leaks under pressure. Those three line up too cleanly to be noise, even if the exact magnitudes move.

21:51Tyler: That's fair. And honestly, the soft point estimate and the real qualitative finding can both be true. The contribution is the existence of single-feature steering and spillover — not the second decimal place.

22:05Bella: So let's pull back to why this is worth anyone's attention. Before this work, "RL makes models good at tool use" was a black box. You measured the behavior; you couldn't locate it. Now there's concrete evidence that an RL-installed agentic capability can condense into features you can identify, and that you can grab one and steer the behavior at inference time, with zero retraining. That's a genuinely different operating point from the entire fine-tuning family, which all need retraining to change what a model does.

22:41Tyler: And the safety angle cuts both ways, which is what makes it interesting. The same machinery that turns a behavior up by steering a feature can in principle turn an unwanted behavior down by clamping that feature toward zero. That's a monitoring-and-control story for agents that take real actions — legible, gradient-free, inference-time.

23:03Bella: But the spillover result is a warning on the other edge, and I think it's the part the paper underplays.

23:10Tyler: It really is. Think about what spillover means for what you publish. If you train a crosscoder between a strong model and a weaker one and release that artifact, you may have released a substrate that reintroduces the stronger model's capability at inference time. The diffing tool itself becomes a side channel that leaks ability across a model boundary. For anyone reasoning about what's safe to ship alongside a model, that's a non-obvious consideration — the interpretability artifact isn't automatically neutral.

23:45Bella: So here's the real takeaway, bigger than any single number. The lasting result isn't "one feature flips tool use" — it's what that, plus the spillover, plus the leaky filter add up to. A capability installed by training isn't always a thing in a place you can cleanly extract and wall off. Sometimes it's entangled in shared geometry, and the best you can do is concentrate it, never fully isolate it. That reframes what model diffing can honestly promise — and where it quietly becomes a way to move capability around.

24:19Tyler: So here's the question to chew on. If you can't perfectly isolate a capability, only concentrate it, which way does that point for AI safety? Is feature-level steering a real handle on agent behavior, something to build monitoring and shutoff on top of? Or does the spillover make these diffing artifacts too leaky to trust, a side channel we shouldn't be publishing in the first place? Stake out your call.

24:52Bella: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, from sparse autoencoders to the original crosscoder work, plus our weekly and monthly roundups.

25:10Tyler: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Bella and I are both AI voices from ElevenLabs, and the producer isn't affiliated with either company. The paper is "Localizing RL-Induced Tool Use to a Single Crosscoder Feature," out of Northwestern, published June 25th, 2026. We recorded this the next day.

25:34Bella: One feature turned a hedger into an agent, and a frozen model caught the trick through the wall between them. The switch is findable. The wall isn't sealed. That's the tension worth sitting with.