Building Forgetting Into a Language Model With One Extra Line of Code
About this episode
What if you could delete everything a model knows about Harry Potter by flipping a switch, with no retraining, no weights changed, and the content structurally unreachable rather than just hidden? A new paper argues the long-assumed trade-off between models that learn well and models you can edit cleanly was never real. We walk through how the trick works, why it survives the attacks that break today's unlearning, and where the cleanness might be doing some quiet work.
What you'll take away
- Why today's post-hoc unlearning is a coat of paint — the 'forgotten' content comes flooding back in under ten fine-tuning steps
- The actual intervention: one extra line of code that masks a bank of 'sink' neurons; a pseudo-random seed decides which neurons each source gets (so six million Wikipedia articles each get their own switch with no growth in model size)
- How knowledge sorts itself automatically — unique facts migrate to a source's private sinks via training interference, while shared knowledge stays in the backbone, with no hand-labeling
- Why the relearning and adversarial-prompt attacks that broke old methods fail here: with a source switched off, the model behaves like one that never saw the content at all, closer to amnesia than scar tissue
- The capability cost rounds to zero — roughly 56% on standard benchmarks, statistically indistinguishable from a plain transformer
- The catch worth scrutinizing: the 'off' condition routes queries to the nearest surviving source, which may inflate how cleanly the architecture preserves related knowledge. It has also only been demonstrated at 1B parameters, and only for unlearning requests that respect pre-defined source boundaries
Chapters
- 00:00 The switch demo and why forgetting is hard
- 02:43 Why post-hoc unlearning fails
- 05:27 The apparent conflict between learning and forgetting
- 08:11 The mechanism: backbone, sink neurons, and seeds
- 10:55 Testing at scale: six million Wikipedia switches
- 13:39 The robustness tests on Harry Potter
- 16:23 Pushback: routing, taxonomy, and what the cleanness hides
- 19:06 Limitations and the data-attribution upside
References in this episode
- Who's Harry Potter? Approximate Unlearning in LLMs — The original Harry Potter unlearning paper using post-hoc fine-tuning, the exact
- Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning — Introduces NPO, one of the named post-hoc unlearning methods the episode critiqu
- TOFU: A Task of Fictitious Unlearning for LLMs — The benchmark that popularized the Truth Ratio metric this episode leans on to m
- Eight Methods to Evaluate Robust Unlearning in LLMs — Surveys relearning, compression, and jailbreak attacks that recover supposedly-f
Full transcript
0:00Bella: There's a claim in this paper that, a couple of years ago, would have gotten you politely shown the door in most machine learning labs. You can delete an entire book from a trained language model — wipe out everything it knows about, say, Harry Potter — without changing a single one of its weights. No retraining. You flip a switch, and the book is gone. And here's the demo that makes it real. They feed the model the actual opening line of the series — Mister and Missus Dursley, of number four, Privet Drive, were proud — and with the switch on, the model runs wild with it. Hogwarts, Dudley, characters that weren't even in the prompt. Then they flip the switch off, same exact prompt, and the model writes a perfectly fluent little passage about a city council meeting at some association of colleges and universities. No magic. No wizards. The Harry Potter knowledge isn't garbled or broken — it's just cleanly not there, and the model stays completely coherent. The paper is called "Natively Unlearnable Large Language Models," it went up on arXiv on June eleventh, twenty-twenty-six, and we're recording four days later. Quick note before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing — I'm Bella, and my co-host is Finn — are both AI voices from Eleven Labs, with no affiliation to either company. And that switch I just described is the whole paper in miniature. So the place to start isn't the switch itself — it's why forgetting in a language model has been such a nightmare up to now.
1:38Finn: And it really is a nightmare, in a way that's easy to underestimate from the outside. A model doesn't store facts in labeled files. During training it nudges billions of numbers — the weights — until it predicts text well on average. A single fact, like a character living at number four, Privet Drive, isn't sitting in one location. It's smeared across thousands of weights, and every one of those weights is simultaneously encoding thousands of other things. The paper puts it plainly: standard training entangles all your data sources. Gradient descent mixes them into a single shared set of weights. The way I picture it, it's like a blender — every training document gets poured in and pureed across the entire model. So there's no clean Harry Potter section you can cut out. The fact you learned from one book is tangled up with facts from a thousand other sources.
2:29Bella: Which means when you go in afterward and try to make the model forget one thing, you don't get to be surgical. You're working with a blended smoothie and trying to remove the strawberry.
2:40Finn: Exactly. And the standard way people do this today is what's called post-hoc unlearning — you take the finished model and fine-tune it to suppress the target content. Methods with names like Negative Preference Optimization, gradient ascent — they basically train the model to assign low probability to the stuff you want gone. And on the surface it works. You ask about the forgotten content, the model clams up. But here's the result that should make anyone nervous about relying on it. The authors take one of these post-hoc unlearned models, the kind that looks like it's forgotten Harry Potter, and they do a tiny bit of fine-tuning — a sprinkle of extra training. The forgotten content comes flooding back in under ten steps.
3:24Bella: Ten gradient steps. That's nothing.
3:27Finn: It's nothing. That's the whole point. It tells you the content was never actually gone — it was hiding under a thin coat of paint, and you can scratch through it almost instantly. And it's not just fine-tuning. Other work has shown the supposedly-forgotten content can come back if you just compress the model, or if you find a clever enough prompt. So for anything legally serious — a copyright takedown, a privacy-deletion request — "we mostly hid it, and a motivated person can recover it in ten steps" is not a defensible position. That's the bar this paper is trying to clear.
4:02Bella: So that sets up the real tension, and it's a genuinely deep one. Think about what makes forgetting easy versus what makes learning good. Forgetting is easiest when every source lives in its own separate box — totally disentangled, so you can lift one out without disturbing the rest. But learning is best when the model pools everything together, shares representations across all its sources. That's the entire reason big models work — knowledge from one place reinforces knowledge from another.
4:32Finn: And those two wishes point in opposite directions.
4:35Bella: They look like sworn enemies. Isolation gives you clean removal but kills the knowledge-sharing that makes the model smart. Sharing gives you a smart model but blends everything into an unremovable smoothie. And the field had basically accepted you have to pick a side. There were modular approaches — give every source its own dedicated expert module — and those are clean to remove, but they don't share knowledge and they completely fall apart when you've got millions of sources. You'd need millions of modules. The whole emotional payload of this paper is the claim that these two goals were never actually incompatible. We'd just never built an architecture that delivered both.
5:16Finn: Okay, so that's the promise. How do they actually pull it off?
5:21Bella: This is where it gets genuinely elegant, and I want to build the picture before the mechanism. Imagine a workshop. There's a big central bench in the middle with a shared set of tools that everybody uses — that's where the common, reinforced knowledge lives. But every worker also gets a personal locker, and they're only allowed to open their own. Now, inside a transformer, each layer has this component — the feed-forward block — that's basically a wide panel of artificial neurons. Picture a long row of light bulbs. Normally, every single training document lights up and adjusts the entire panel. What this architecture does is split the panel into two parts. A small section that's always on for every document — they call it the backbone, and that's your shared central bench. And then a huge bank of switchable bulbs — the sink neurons — where any given source is only allowed to light up a tiny, specific handful.
6:21Finn: And which handful a source gets — who decides that?
6:24Bella: A pseudo-random function of the source's identity. You take the source's name or ID, feed it into a random number generator as a seed, and out comes the pattern of which bulbs that source is allowed to touch. The analogy I love here is video game worlds. Some games can regenerate an entire enormous landscape from one short seed number — same seed, same world, every time, and you never have to store the map. The source ID is the seed. The mask — the on-off pattern of sink neurons — is the world it regenerates. So you never store a giant lookup table. You can recompute any source's mask instantly, on the fly, during training or at deployment.
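For readers who want to see the seed idea concretely, here is a minimal Python sketch. The function name `sink_mask`, the SHA-256 hashing step, the example source IDs, and the sizes (64 sink neurons, 4 active per source) are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import random


def sink_mask(source_id: str, num_sinks: int, num_active: int) -> list[int]:
    """Return a 0/1 mask over sink neurons derived only from the source ID."""
    # Hash the ID into a stable integer seed. Python's built-in hash() is
    # salted per process for strings, so it would not reproduce across runs.
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # A private generator seeded by the source ID picks the active neurons.
    rng = random.Random(seed)
    active = set(rng.sample(range(num_sinks), num_active))
    return [1 if i in active else 0 for i in range(num_sinks)]


# Hypothetical source IDs, chosen only for the demo.
mask_a = sink_mask("wiki/Alan_Turing", num_sinks=64, num_active=4)
mask_b = sink_mask("wiki/Alan_Turing", num_sinks=64, num_active=4)
mask_c = sink_mask("wiki/Turing_machine", num_sinks=64, num_active=4)

print(mask_a == mask_b)   # True: same ID, same mask, nothing stored
print(sum(mask_a))        # 4 active sink neurons
print(mask_c)             # a different ID draws its own pattern
```

Because the mask is recomputed from the ID each time, there is no per-source table to keep, which matches the "never store the map" point above.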
7:06Finn: And because it's effectively random, two articles on nearly identical topics still get unrelated patterns.
7:13Bella: Right — they don't collide, even if they're about the same thing. And here's the detail that makes it land for me. The actual architectural change, the entire intervention, is one extra line of code. A standard model's feed-forward pass, with a single added step that multiplies the neuron activations by that mask of ones and zeros, switching off all the bulbs that don't belong to the current source. That's it. Everything else in the transformer is untouched.
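The "one extra line" can be sketched as a toy feed-forward block in plain Python. Everything concrete here is an assumption for illustration: the ReLU nonlinearity, the tiny layer sizes, the random weights, and the hand-written masks. The paper's actual implementation may differ.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def masked_ffn(x, w_in, w_out, mask):
    """Feed-forward pass with a per-source 0/1 mask on the hidden neurons."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]   # standard up-projection + ReLU
    hidden = [h * m for h, m in zip(hidden, mask)]    # the added step: apply the mask
    return matvec(w_out, hidden)                      # standard down-projection


rng = random.Random(0)
d_model, n_backbone, n_sink = 4, 2, 6
n_hidden = n_backbone + n_sink
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(d_model)]
x = [0.5, -1.0, 0.25, 2.0]

# Backbone neurons are always on; only the sink part changes per source.
backbone = [1] * n_backbone
mask_source_on = backbone + [0, 1, 0, 0, 1, 0]   # hypothetical pattern for one source
mask_source_off = backbone + [0] * n_sink        # that source's sinks switched off

print(masked_ffn(x, w_in, w_out, mask_source_on))
print(masked_ffn(x, w_in, w_out, mask_source_off))
```

Note that no weight is edited when the source is switched off; only the mask passed into the forward pass changes.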
7:42Finn: Okay, wait. I want to make sure I've got this right, because something's bugging me. So Harry Potter gets quarantined into its own private locker from the very start — the shared backbone never learns it at all? It only ever lives in those dedicated sink neurons?
7:58Bella: That's the natural assumption, and it's not quite what happens — and the difference is the cleverest part of the whole paper. Nobody tells the model which facts are Harry-Potter-specific. There's no labeling step. What happens instead is a training dynamic that sorts the knowledge automatically. Think about a fact that only appears in one source. The shared backbone is active on every document, so it does get a gradient signal for that fact whenever Harry Potter shows up. But the backbone is also getting hammered with competing, interfering updates from every other source fighting over those same shared neurons. The sink neurons for Harry Potter, though, are active almost only when Harry Potter shows up. They see that same fact with far less interference. So the sinks fit the unique fact first — they're just a quieter room to learn it in.
8:50Finn: And once the sinks have captured it —
8:53Bella: The pressure on the backbone disappears. Once the fact is safely stored in the sinks, there's no more gradient signal pushing the backbone to hold onto it, and whatever faint trace was there just decays away. So you end up with this clean separation that nobody designed by hand. The backbone holds only what's reinforced across many sources — the genuinely shared knowledge. And each source's truly unique knowledge migrates into its own private sinks. So to make Harry Potter disappear, you don't ransack the workshop. You just lock that one locker — switch off those sink neurons. The shared tools on the central bench are completely untouched. You remove exactly what was unique to the source, and you leave everything it had in common with the rest of the world fully intact.
9:43Finn: That's a really satisfying mechanism, because the sorting is emergent. You're not paying a curator to decide what's unique. The interference does the work for free.
9:54Bella: That's the reframe the authors are most proud of, and I think it's the durable contribution even if this specific design gets replaced. Their line is to treat unlearnability as a property you design into a model during training, not a behavior you try to claw out of it afterward. Forgetting stops being a research project you launch after a takedown notice arrives. It becomes a toggle that was built in from day one.
10:20Finn: So let me carry the question everyone's now asking, which is — does it actually work, and does it work at any kind of real scale? Because elegant mechanisms have a way of falling apart the moment you point them at a real corpus. They run two very different tests. The big one is Wikipedia. They take a one-billion-parameter model and train it on roughly six million Wikipedia articles, and — this is the audacious part — they treat every single article as its own independent, separately controllable source. Six million switches.
10:54Bella: And the only reason that's even possible goes back to the seed trick.
10:58Finn: Exactly that. With a fixed pool of sink neurons and a fixed number lit per source, the number of distinct possible patterns is astronomically large. So millions of sources each get their own controllable pattern without the parameter count growing at all. The modular "one expert per source" approach would need six million modules. Here, the model is roughly the same size as a standard one. To measure whether a fact survives, they use something called the Truth Ratio. Don't worry about the formula — the intuition is the whole thing. It asks the model one question: when you look at the right answer versus a set of tempting wrong answers, which way do you lean? If it tilts toward the truth, the score is above one and they count that as "the model knows this." Near zero, the fact is effectively gone.
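The counting argument Finn gives is a binomial coefficient: choosing k active neurons from a pool of n gives C(n, k) possible masks. The pool size and active count below are hypothetical, since the episode does not state the paper's values; only the six million figure comes from the discussion.

```python
import math

n_sources = 6_000_000      # roughly the number of Wikipedia articles mentioned
pool_size = 16_384         # hypothetical number of sink neurons in a layer
active_per_source = 32     # hypothetical number of sinks lit per source

patterns = math.comb(pool_size, active_per_source)

print(patterns > n_sources)   # True: far more patterns than sources
print(len(str(patterns)))     # how many decimal digits the count has
print(math.comb(64, 4))       # 635376, even a tiny pool gives many patterns
```

The pool of sink neurons is fixed, so adding a source adds a new mask but no new parameters.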
11:48Bella: And the test is: turn off one article's sink, and watch what happens to its facts.
11:53Finn: Right. And they sort facts into two buckets that matter. Facts unique to that one article, and facts it shares with related articles. When they flip the sink off, the unique facts collapse — that Truth Ratio drops toward zero. But the shared facts survive, basically untouched. And critically, that pattern closely matches the gold standard, which is retraining the entire model from scratch without that article ever in the data.
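The intuition behind the Truth Ratio, as described in this episode, can be sketched as the probability of the correct answer divided by the average probability of the wrong ones. This is an assumption based on the spoken description only; the formal definition in the paper or in TOFU may normalize by answer length or orient the ratio differently, and the log-probabilities below are made up.

```python
import math


def truth_ratio(logp_correct: float, logp_wrong: list[float]) -> float:
    """Probability of the correct answer over the mean probability of wrong answers."""
    p_correct = math.exp(logp_correct)
    p_wrong = sum(math.exp(lp) for lp in logp_wrong) / len(logp_wrong)
    return p_correct / p_wrong


# Made-up numbers: with the sink on, the model favors the true answer.
sink_on = truth_ratio(math.log(0.6), [math.log(0.1), math.log(0.2)])

# Made-up numbers: with the sink off, the true answer gets almost no probability.
sink_off = truth_ratio(math.log(0.01), [math.log(0.3), math.log(0.5)])

print(round(sink_on, 3))    # above one: counted as "the model knows this"
print(round(sink_off, 3))   # near zero: the fact is effectively gone
```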
12:20Bella: Which is the comparison that actually counts. You don't just want the unique stuff gone — you want the damage profile to look exactly like a model that genuinely never saw the source.
12:31Finn: And the contrast with post-hoc methods is stark. Those gradient methods can't tell source-specific knowledge from topically-adjacent knowledge, so they degrade both at the same rate. They take the strawberry and a chunk of the banana. This thing slices out the strawberry alone.
12:48Bella: Now the Harry Potter study is the other test, and that's the robustness one — the one that goes after that ten-step problem you opened with.
12:57Finn: This is the part that convinced me there's something real here. So they train on a big web corpus plus all seven Harry Potter books as one source, and they confirm the basic switch works — that's the city council passage from the top of the show. But then they attack it. First attack: the relearning attack, the same one that broke post-hoc unlearning in under ten steps. They take the model with the Harry Potter sink switched off and they fine-tune, actively trying to coax the content back. And the relearning curve tracks a model that never saw Harry Potter at all.
13:31Bella: Tracks the never-trained baseline. Not the post-hoc model that snapped back instantly.
13:37Finn: Tracks the never-trained baseline. The analogy I'd reach for is the difference between scar tissue and amnesia. Post-hoc unlearning is a memory painted over — scratch it and it bleeds through. This is closer to genuine amnesia, because the content was never written into the shared substrate in a reachable way. Trying to relearn it looks exactly like teaching it fresh to a model that's never encountered it. And they don't stop there. They run an automated adversarial prompt attack — a tool that searches for the magic prompt suffix to jailbreak the content back out. They measure how much of the original text they can extract, and with the sink off, it matches the from-scratch retrained model. The content isn't lurking latently, waiting for the right key. It's structurally unreachable once the switch is off.
14:26Bella: And the cost of all this, in terms of the model's general ability?
14:30Finn: Basically nothing. Across a standard battery of reasoning and knowledge benchmarks, this architecture averages about fifty-six percent, and a plain standard transformer averages — also about fifty-six percent. Statistically indistinguishable. Within noise. You get the switchability for free, capability-wise.
14:48Bella: I'll just say — there's an aesthetic pleasure in a result where the headline intervention is one line of code and the capability cost rounds to zero. The paper is doing a lot with very little.
15:00Finn: It is. So now let me push, Bella, because there are a few places where I think the cleanness is doing some quiet work that deserves scrutiny.
15:08Bella: Go for it.
15:09Finn: The one that nags at me most is how they evaluate the "off" condition. When you switch off a source's sink, the model doesn't just go blank — they activate the next-closest source instead, the most semantically similar one it still has. The idea is to simulate a real query arriving for a source that no longer exists, so the model routes to whatever's nearest. Like calling the next department over when the one you wanted got shut down.
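The routing step Finn describes can be sketched as a nearest-neighbor lookup over sources that are still switched on. Cosine similarity, the three-dimensional embeddings, and the source names are all assumptions for illustration; the episode does not say how the paper measures "closest".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(query_vec, source_vecs, removed):
    """Pick the most similar source that has not been unlearned."""
    surviving = {s: v for s, v in source_vecs.items() if s not in removed}
    return max(surviving, key=lambda s: cosine(query_vec, surviving[s]))


# Hypothetical source embeddings.
sources = {
    "harry_potter": [0.9, 0.1, 0.0],
    "fantasy_genre_article": [0.7, 0.3, 0.1],
    "city_council_minutes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]

print(route(query, sources, removed=set()))             # harry_potter
print(route(query, sources, removed={"harry_potter"}))  # fantasy_genre_article
```

The second call shows the concern raised next: the query lands on a neighbor that may itself know the overlapping material.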
15:35Bella: And that's a reasonable thing to simulate.
15:37Finn: It is reasonable. But look at what it does to the result. The whole impressive finding is "shared and related facts survive when you unlearn." But of course they survive — you've just routed the question to a neighbor who genuinely knows the overlapping material. So how much of that preservation is the architecture cleanly isolating unique knowledge, versus how much is baked into the routing choice? The neighbor knowing the shared facts is partly why the shared facts look preserved. I don't think that fully separates the architecture's contribution from the evaluation setup.
16:11Bella: That's fair, and I'd concede the routing makes the preservation result look cleaner than a more adversarial setup might. But I'd separate two claims. The preservation-of-shared-knowledge result — yes, the routing is load-bearing there. The collapse-of-unique-facts result, and the robustness results — the relearning attack, the extraction attack — those don't lean on the routing in the same way. The unique facts collapsing toward zero, the content tracking a never-trained model under attack — that's about whether the information is structurally reachable, and the routing doesn't manufacture that.
16:50Finn: I'll grant the robustness results stand more independently. I'm still not convinced the "we preserve related knowledge" headline isn't partly an artifact of how they chose to route the off condition. That one stays open for me.
17:04Bella: That's a fair place to leave it. And there's a related bit of bookkeeping worth flagging in the same spirit. Some article-specific facts keep a high Truth Ratio even with the sink off — the unlearning didn't fully remove them. The paper handles this by creating a third category, "inferred facts" — things the retrained model could also predict from related knowledge, so it's fine that they survive. That's a defensible move. But it is a post-hoc category, and it absorbs exactly the cases where forgetting was incomplete. The cleaner the separation looks in the charts, the more it depends on you accepting that taxonomy.
17:43Finn: And then there are the limitations the authors are quite upfront about, which I appreciate. The biggest one: everything here is at one billion parameters. The entire mechanism rests on a particular balance of interference — sinks fitting facts before the backbone does. Whether that delicate balance holds at seventy billion, or four hundred billion, on far larger and more redundant corpora, is genuinely unknown. Plausible, but not demonstrated at frontier scale.
18:12Bella: And the second big one, which I think matters most for whether this ever ships: sources have to be defined before training. You partition your data into non-overlapping sources up front, and unlearning only works for requests that line up with those boundaries. But a real takedown might not respect your partition — it could span sources, or cut across one. The authors concede that multi-source queries are unhandled. A prompt that touches several sources just gets routed to the single closest sink. So this works beautifully for "forget this article" or "forget this book." For "forget every mention of this person, wherever it appears," you'd need to have anticipated that as a source boundary at training time.
18:57Finn: Which is a real gap, because the messy real-world requests are exactly the ones that don't come pre-sorted.
19:03Bella: There's also the honest caveat that the whole evaluation is fact-recall on fill-in-the-blank questions. That's a proxy. Verbatim regurgitation, or picking up an author's style, could in principle survive a recall test even when the facts are gone — though the generation demo and the extraction attack do push back on that worry pretty hard. The city-council continuation isn't just failing a quiz. It's fluent prose with no Harry Potter in it.
19:31Finn: And one more they flag without solving — they only studied pretraining from scratch. How you'd take one of these models and then do instruction tuning or reinforcement learning on top, without wrecking the ability to unlearn the original pretraining sources, is unsolved. They speculate about it. They don't implement it. So this is a research result, not a product.
19:54Bella: Which is the right frame to end on. There's one forward-looking thread the authors gesture at that I find genuinely exciting, though — almost as a side effect. Because flipping a source's sink off approximates retraining without that source, you can suddenly measure what any single source contributed. Compare the model with that sink on versus off, cheaply, no retraining. That's a door into principled data attribution — figuring out which training data actually mattered, tracing an output back to a responsible source, spotting the redundant or low-value data. They don't build it out, but it falls right out of the architecture.
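The on-versus-off comparison Bella mentions can be sketched as a difference of two evaluation losses. The `toy_loss` function is a stand-in invented for this example, not a real model, and the numbers are made up; in practice `eval_loss` would run the model on held-out text with the given mask.

```python
from typing import Callable, Sequence


def source_contribution(
    eval_loss: Callable[[Sequence[int]], float],
    mask_on: Sequence[int],
    mask_off: Sequence[int],
) -> float:
    """Loss increase when a source's sinks are switched off (larger = mattered more)."""
    return eval_loss(mask_off) - eval_loss(mask_on)


# Toy stand-in for a model evaluation: each active sink neuron lowers the loss
# by a fixed, made-up amount.
BASE_LOSS = 2.0
NEURON_GAIN = [0.0, 0.3, 0.0, 0.1]


def toy_loss(mask: Sequence[int]) -> float:
    return BASE_LOSS - sum(g * m for g, m in zip(NEURON_GAIN, mask))


delta = source_contribution(toy_loss, mask_on=[0, 1, 0, 1], mask_off=[0, 0, 0, 0])
print(round(delta, 3))
```

Both evaluations are plain forward passes with different masks, which is why no retraining is needed for the comparison.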
20:34Finn: And that's the bigger vision, right — controlling a model at the level of its data, not just its outputs. Whether or not this specific design survives, that reframe feels durable. Forgetting as a built-in property, not an afterthought.
20:48Bella: That's the thing I'll carry out of this paper. For years the assumption was that you have to choose — a model that learns well, or a model you can edit cleanly. The quiet argument here is that the choice was never forced. The two goals were compatible all along; we just hadn't built the architecture that held both at once. And the cost of finding that out turned out to be one extra line of code.
21:13Finn: At one billion parameters. With the routing question still hanging over the preservation result. But yeah — as a proof that the trade-off was never fundamental, it's a lovely piece of work.
21:25Bella: The paper is "Natively Unlearnable Large Language Models," out of Carnegie Mellon and Datology AI. The link's in the show notes, along with some further reading if you want to go deeper on the machine unlearning side.
21:38Finn: And if you want the full transcript with every term defined inline, plus the concept pages that connect this to the other episodes we've done, that all lives on paperdive.ai.
21:50Bella: Thanks for listening to AI Papers: A Deep Dive.