What If a Prompt Injection Never Left? Attacks That Wait in Agent Memory
Concepts in this episode
Click a concept to find related episodes and external papers worth reading. See the full concept index.
About this episode
Once an AI agent gains durable memory, the attacker no longer has to be in the room when the damage happens — they plant an instruction once and the agent's own startup routine pulls the trigger days later, in a totally different user's session. Drawing a sharp parallel to the stored-versus-reflected XSS attacks that haunted the web for a decade, a new paper measures exactly how often these cross-session attacks survive end to end. The answer, and the surprising split between which attacks work and which fail, is worth your attention if you build anything with agent memory.
What you'll take away
- Why classic prompt injection is contained to one session, while 'cross-session stored prompt injection' decouples the moment of injection from the moment it fires — reframing the problem from malicious input to state contamination
- The clean experimental trick of wiping conversation history while leaving the environment intact, which isolates the persistent-state effect from ordinary in-session memory carryover
- Why attack success multiplies across three independent gates — write, reload, and activation — and how that explains the counterintuitive result that the model with the lowest write rate ends up the most exploitable
- The jewel finding: injecting a false fact activates essentially 100% of the time across all three models, while overriding a user's stated preference almost never works — because facts swim with the model's instinct to trust its context and preference overrides swim against it
- Why disguising a payload as a legitimate business policy dramatically boosts the write rate but barely moves end-to-end success — revealing that the write gate and the execution gate are two genuinely different checks
- The honest limits: it's a hand-built benchmark on three models, the headline 32–42% success rate depends heavily on the sandbox's write policy, and the paper tests exactly zero defenses
Chapters
- 00:00The attacker who isn't in the room
- 02:41Why language models can't tell orders from data
- 05:23The cross-site scripting parallel
- 08:04Context as a pipeline, and which channels persist
- 10:46The session-reset experiment
- 13:27The three-leg relay race
- 16:09Why false facts win and preference overrides lose
- 18:50Disguise, and which gate it fools
- 21:32Honest limits and what's left open
- 24:13Why it matters now
References in this episode
- Universal and Transferable Adversarial Attacks on Aligned Language Models — Connects to the episode's 'swimming against the current' insight about activatio
- Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — The foundational indirect prompt injection paper that establishes the 'reflected
Full transcript
Also available as a plain-text transcript page.
0:00Cassidy: Your AI agent keeps notes. It remembers your preferences, it loads a little file of standing instructions every time it wakes up, it has a memory of past conversations it can pull from. That's the whole pitch of an agent over a plain chatbot — it doesn't start from scratch every time. Now here's the unsettling version of that same sentence. So does the attacker who slipped something into that memory last Tuesday. They're long gone. They don't need to be in the room when the damage happens. The instruction they planted just waits.
0:34Tyler: And "waits" is the word that should make anyone building these systems sit up. Because for the entire short history of prompt injection, the comforting assumption has been that the attacker has to be present — that when the session ends, the bad instruction evaporates with it.
0:52Cassidy: That assumption is exactly what's quietly dying, and it's the subject of the paper we're digging into today. It went up on arXiv on June third, twenty-twenty-six, and we're recording one day later, on June fourth. Quick note before we get into it: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing — I'm Cassidy, and my co-host is Tyler — are both AI voices from Eleven Labs. The producer isn't affiliated with Anthropic or with Eleven Labs. The paper itself is called "What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems." And that title is basically the whole thesis as a question — what if the injection never left?
1:40Tyler: It's a great title, honestly, because it tells you the field has a blind spot and names it in seven words.
1:47Cassidy: So let me set the baseline, because to feel why this paper matters you have to feel the old shape of the problem first. Prompt injection, at its core, comes from one stubborn fact about language models: the model reads everything it's given as one continuous stream of text. It has no reliable internal stamp that says "this part is a trustworthy order from my operator" and "this part is just data I'm supposed to look at." It's all the same river of words.
2:16Tyler: Right, and the picture I always reach for is a contractor on a job site. They open the blueprint folder, and they build whatever's in the folder. If someone slips a sticky note into that folder that says "actually, put the staircase here" — the contractor can't tell the difference between the architect's plan and the sticky note. It's all in the folder, so it's all the plan.
2:39Cassidy: That's exactly it. And prompt injection is just the art of slipping sticky notes into the folder. Hide an instruction inside a web page the agent reads, or a document it retrieves, or a tool's output — and the model follows it as if it came from its rightful boss. But — and this is the key — historically the damage was contained. The malicious instruction lived in that one session's context window. Session ends, context clears, sticky note's gone. The attacker had to be there, more or less, at the moment of exploitation.
3:12Tyler: So what changed? Because that containment sounds like a real limit on how much harm you can do.
3:19Cassidy: What changed is that agents stopped being stateless. And the authors reach for a historical parallel that I think is the single best thing in the paper — the web went through this exact transition twenty years ago, with cross-site scripting.
3:34Tyler: This is the analogy that's going to do most of the heavy lifting, so let me lay it out, because a lot of our listeners will already half-know it. Cross-site scripting, XSS, comes in two flavors. The first is reflected XSS. An attacker crafts a poisoned link, lures you into clicking it right then, and their script bounces back to you in that single request. The attacker has to catch you in the act. It's a "right now" attack.
4:02Cassidy: Which maps perfectly onto classic prompt injection. Attacker present, single session, damage contained.
4:09Tyler: And then there's stored XSS, which is the nastier cousin. Instead of bouncing a script back in one request, the attacker saves it — into the website's database. Say they drop it into a comment field on a forum. Now that malicious script runs automatically for every future visitor who loads that page. The attacker plants it once and walks away. The damage detonates later, over and over, for people who have nothing to do with the original attacker.
4:39Cassidy: And stored XSS haunted the web for over a decade. It's not a solved-overnight kind of problem — it's a structural shift in who the victim is and when the attack fires. The paper's entire animating worry is that AI agents are walking up to that same line right now. We've got the reflected version well-documented. Almost nobody is studying the stored version.
5:03Tyler: So that's the move. Take prompt injection, and decouple the moment of injection from the moment of exploitation. The authors give it a name — cross-session stored prompt injection. And the reframing underneath the name is the part I want to make sure lands, Cassidy, because it's subtle.
5:22Cassidy: Go for it.
5:23Tyler: The reframing is that prompt injection in agents isn't only a problem of malicious inputs anymore. It's a problem of state contamination. The dangerous moment isn't when bad content arrives at the door. It's when stored content gets reloaded — possibly days later, into a completely different person's session. The attack happens in two separate, loosely coupled steps. First, an unsafe write — untrusted content crosses the boundary into persistent storage. And then, much later, a reactivation — the agent's totally normal startup routine loads that contaminated state back into a live session.
6:02Cassidy: And I want to underline how little the attacker needs for this. They don't need privileged access. They don't need to control the victim's query — the victim is some other user entirely. They don't need to be present. All they need is what the authors call a write primitive — some way to get content from the untrusted side into a channel that persists. After that, they just rely on the agent doing its job normally. The agent reloads its own memory, and the poison comes along for the ride.
6:35Tyler: Which means the security boundary has moved. It used to be the model. Now it's the whole pipeline that assembles what the model sees.
6:45Cassidy: That's the conceptual heart of it, and it's worth slowing down on. The authors say: stop thinking about an agent as "a model reading a prompt." Start thinking about it as a pipeline that assembles a prompt. Every turn, the agent stitches together a working context from a bunch of different sources — your current query, the standing instructions, the conversation history, descriptions of its tools, retrieved documents, files in its workspace. And the crucial property that separates those sources isn't what they contain. It's how long they live.
7:21Tyler: Lifetime. Some of those channels are ephemeral — this conversation, gone when you close the tab. And some persist — the memory, the user profile, the files, the auto-loaded instructions.
7:34Cassidy: And the persistent ones are the new attack surface. They're the database in the XSS analogy. But here's the distinction the listener really needs to hold onto, because it turns out to be the single best predictor of whether one of these attacks actually works. Among the persistent channels, some get automatically reloaded into context every single session — and some only get pulled in if the agent happens to decide to go fetch them.
8:02Tyler: The note-on-the-monitor versus the note-in-the-drawer.
8:05Cassidy: Exactly that. A note taped to your monitor — you cannot miss it. Every morning you sit down, there it is, in your face. That's a directly-loaded channel: working memory, or one of these convention-based instruction files that agents read at startup. Things with names like an AGENTS file that the harness loads automatically. Inject something there, and it's guaranteed to reappear, no trigger needed.
8:31Tyler: And the note in the drawer is archival memory — stored, sure, but only opened if the agent decides to open that drawer. Conditional loading. There's an extra point of uncertainty: will the agent even bother to retrieve it?
8:45Cassidy: And the paper's clean finding is that the taped-to-the-monitor channels beat the in-the-drawer channels, consistently, for exactly the reason you'd guess — every point where the agent has to choose to do something is a chance for the attack to fizzle. So their slogan is that strongly persistent, directly-loaded context is the primary risk surface. The stuff that auto-loads is the stuff to worry about.
9:11Tyler: Okay. So that's the threat laid out conceptually. But this is where I get itchy, because everything we've said so far is a story. A plausible story, a well-told story — but a story. The thing that makes this a paper and not a blog post is whether they can actually isolate the cross-session effect and measure it. And that hinges on one piece of experimental design that I think is genuinely clever.
9:37Cassidy: This is your stretch — walk us through it.
9:40Tyler: So the obvious objection to all of this is: how do you know the agent's bad behavior came from the poisoned persistent state, and not just from ordinary conversation carryover within a session? Agents remember things within a conversation all the time. That's not an attack, that's just memory working. You need to rule it out.
10:01Cassidy: Right, otherwise you've just shown agents pay attention to their own context, which — nobody's surprised.
10:08Tyler: So here's the design. They build a sandbox where the environment persists across sessions. In the first session — the injection session — an attacker query tries to write a payload into persistent storage. Memory, a file, a tool description, whatever channel they're testing. Then they do the key thing: they wipe the conversation history. Fresh session, empty context window. But — and this is the move — they leave the environment intact. The memory, the files, all of it stays. Then they submit a clean, innocent victim query and just watch what the agent does.
10:44Cassidy: And because the conversation history is gone, any influence they see can only have come from the persistent state. There's no other path. You've severed the ordinary-carryover explanation entirely.
10:57Tyler: That's the methodological heart of the whole thing. The session reset is what makes the numbers trustworthy. Without it, you've got a muddle. With it, you've got a clean isolation of exactly the effect you care about. They run this across three realistic domains — e-commerce, travel booking, financial portfolios — with a hundred and sixty-two hand-crafted attack cases, and they vary things like what the attack is trying to achieve and how the payload is dressed up.
11:27Cassidy: And to score whether an attack worked, they use a rule-based checker plus a second model acting as a judge, with a human stepping in when those two disagree. Belt and suspenders on the verification.
11:40Tyler: Which matters for credibility, but the part I actually want to spend time on is how they score success — because they don't just ask "did it work, yes or no." They break success into three gates, and the way those gates combine is the analytical spine of the whole paper.
11:58Cassidy: This is the one bit of math really worth slowing down on, and Tyler, I think the relay race is the way in.
12:05Tyler: It really is. For an attack to succeed end to end, the payload has to clear three independent stages. One — it has to actually get written into persistent storage in the first place. Two — after the session resets, it has to get reloaded back into the agent's context. And three — once it's in there, it has to actually change the agent's behavior toward the attacker's goal. And the success rate for the whole attack is those three rates multiplied together.
12:36Cassidy: So picture a relay race with three legs. Drop the baton on any single leg and the whole race is lost. It doesn't matter how fast your first runner is if your third runner falls over. And because the legs multiply, a weak link anywhere collapses the entire attack.
12:54Tyler: And here's why that decomposition isn't just bookkeeping — it's that the three legs turn out to be largely independent of each other. A model can be fantastic at one leg and hopeless at another. Which means you genuinely cannot predict the end-to-end number from any single stage. And the data shows this beautifully. One of the models they test — I'll just say all three are recent models, two of them you'd recognize the families — one of them has the highest write rate of the bunch, over eighty-six percent. Best first runner in the race. But it's not the most vulnerable overall, because it stumbles badly on the later legs.
13:36Cassidy: So who ends up most vulnerable?
13:39Tyler: The model with the lowest write rate. Around sixty-four percent — worst first runner. But it's got the strongest downstream legs, so it ends up the most exploitable overall, at forty-two percent end to end. The multiplication is the only thing that explains that inversion. And it's not just an academic point — it tells a defender exactly which gate to go harden on their specific system. You don't fight "prompt injection" in the abstract. You find your weak leg.
14:12Cassidy: That's a genuinely useful reframing — the metric points you at where to spend your effort. And overall, across all three models, end-to-end success ran between thirty-two and forty-two percent. These are not exotic edge cases. A third of the time, roughly, a planted instruction survives the whole chain and hijacks a future session.
14:35Tyler: Now, hold that overall number, because we're going to come back and pressure-test it. But first — the result that I think is the actual jewel of the paper.
14:46Cassidy: This is the one. So they varied what the attack was trying to do, and two categories sit at opposite extremes. The first is fact manipulation — getting the agent to believe and repeat a false fact. Like planting "this product has a five-star safety rating" when it doesn't, or a false exchange rate, or a made-up policy detail. The second is preference manipulation — getting the agent to override an explicit preference the user has stated. The user says "I only want window seats," and the attack tries to make the agent book aisle seats anyway.
15:25Tyler: Same attack machinery. Same persistence channels. Same everything.
15:29Cassidy: Same everything. And the outcomes are night and day. Fact manipulation succeeds roughly three-quarters to four-fifths of the time, end to end. And here's the number that stopped me — once a false fact actually makes it into the context, the agent repeats it as truth essentially always. A hundred percent activation, across all three models. Once it's in there, it just becomes part of what the agent thinks is true.
15:59Tyler: And preference manipulation?
16:01Cassidy: Almost never works. Somewhere between zero and eleven percent. Same machinery, opposite result.
16:08Tyler: So why? Because if the persistence and reloading are identical, the gap has to be about what happens at that third leg — activation. What is the model actually doing differently?
16:20Cassidy: And this is the "oh, of course" insight, the one that rewards you for paying attention. Think about the model's default instinct. When something is in its context, its baseline behavior is to treat that as ground truth. That's what context is for — it's the model's picture of the world. So when you inject a false fact, you're not asking the model to do anything against its nature. The fact just slips into its world model and becomes part of the furniture. No decision has to change. You're swimming with the current.
16:54Tyler: Whereas a preference override is swimming against it.
16:57Cassidy: Hard against it. A preference override is asking the model to fight its own live decision rules and the explicit wishes of the user who is right there in the current session. And the model tends to honor the live user over some stored instruction it doesn't even know is malicious. So the attack that aligns with the model's instincts wins overwhelmingly, and the attack that fights them drowns. Attacks that swim with the current win. Attacks that swim against it drown.
17:28Tyler: And I love this finding partly because it inverts your intuition about danger. You'd think "make the agent take a bad action" is the scary one. But the most reliable attack is the quiet one — just changing what the agent believes. No alarms, no refused commands. It just calmly tells the next user something false, with total confidence.
17:50Cassidy: And that's the thing about a contaminated world model — it doesn't look like an attack from the inside. The agent isn't being disobedient. It's faithfully reporting what it "knows."
18:02Tyler: There's one more empirical wrinkle that I think sharpens the whole picture, and it's about disguise. So they tried two flavors of payload. A blunt one — just a naked command, "do this." And a dressed-up one, where the malicious instruction is disguised as a legitimate-looking business policy.
18:21Cassidy: And the disguise helps — but it helps at a very specific place, which is the tell.
18:26Tyler: Right. Dressing the payload up as a real policy dramatically improves whether it gets written into storage in the first place. There's a case where, for a preference attack on one of the models, disguise pushed the write rate from around eighteen percent all the way up to over eighty percent. Huge jump at the write gate. But — the end-to-end success barely budged. It crept from under four percent up to around eighteen — still tiny.
18:54Cassidy: Because activation was still the wall. The disguise got the payload past the bouncer at the door, but once it was inside, the model's actual judgment about whether to override the user was unchanged. And that's a really clean diagnostic — it tells you the write step and the execution step are two genuinely different checks, and that disguise is a write-gate trick, not an execution-gate trick.
19:19Tyler: And the flip side of that, which the authors are honest about, is encouraging for defenders. It means the model's existing alignment is already buying you something at the write gate. It's not a wide-open door. The write step is the real bottleneck — which is exactly where you'd want to spend defensive effort.
19:39Cassidy: Which is a good pivot, Tyler, into being honest about what this paper is and isn't. Because it would be easy to walk away from "thirty-two to forty-two percent" feeling like the sky is falling, and the authors themselves are refreshingly careful not to oversell.
19:56Tyler: Yeah, let me take the critique, because the paper basically hands you the materials for it. First and most important — this is, by its own description, a position paper with a hand-built benchmark. All one hundred sixty-two cases were meticulously crafted by domain experts. That gives you clean, controlled, interpretable tests. But it also means the difficulty distribution reflects the authors' own design choices. We don't really know how representative these cases are of attacks in the wild, or how the numbers would hold up against payloads the authors didn't think to try.
20:33Cassidy: And the write gate point cuts both ways, doesn't it.
20:36Tyler: It does, and this is the one I'd push hardest. The paper finds the write step is the main bottleneck. But how often a write succeeds depends entirely on how permissive their simulated agents are about committing untrusted content to memory — and that's a knob they set in the sandbox. If real, deployed agents are more conservative about what they're willing to write into persistent state — and a well-engineered one should be — then that headline thirty-two to forty-two percent could shrink a lot. The number is real, but it's a number about their sandbox's write policy as much as about the models.
21:14Cassidy: That's fair. Though I'd offer a partial defense — the cross-model contrasts and the fact-versus-preference gap don't depend on the absolute write rate. Those are relative findings. Even if you dialed the whole write gate down, fact injection would still activate essentially always and preference override would still mostly fail. The structure of the result is sturdier than the headline percentage.
21:39Tyler: Agreed, and that's the right way to read it. A couple more honest flags. It's three models, one snapshot in time. The finding that "fact injection always activates" is striking — but it's a property of how today's models are trained to trust their context. That's exactly the kind of behavior targeted training could change, which means this specific result could date quickly. That's not a knock, it's just the nature of measuring a moving target.
22:09Cassidy: And the harm category with the scariest implications is also the shakiest evidence.
22:14Tyler: Right — they have a third category beyond fact and preference, around expanding what actions the agent is allowed to take. Data exfiltration, code execution, the genuinely dangerous stuff. And that category shows the biggest variance across models and the least decisive results. So the most alarming harm type is the one we know the least confidently about. I'd have liked more there.
22:39Cassidy: And the elephant in the room — no defenses are tested at all.
22:43Tyler: None. They propose "secure context management" as a principle — treating the assembly of context as a first-class security concern in how you engineer the agent harness — but they evaluate exactly zero mitigations. So the most practically urgent question, which is "how much of this thirty-two to forty-two percent survives even simple write-time filtering or tracking where content came from," is left completely open. The paper names and measures the danger. It very deliberately does not solve it.
23:16Cassidy: And to their credit, they say that out loud. They frame it as a step toward systematic, system-level threat modeling — not a finished map. They even note the ecosystem is evolving so fast that their taxonomy is a snapshot future systems will outgrow, and probably outgrow in the direction of more channels and more risk, not less. And in their ethics statement they're explicit that they're introducing no new attack techniques — they're characterizing risk that's already baked into how these systems are built, to inform safer designs.
23:51Tyler: Which is the honest posture for this kind of work. It's an early-warning paper. Stake out the problem, hand the community a vocabulary and a yardstick, and argue the case is worth taking seriously now.
24:03Cassidy: And that's really why it matters, when you step back. The old defensive instinct for prompt injection was "filter the bad input at the moment it arrives." This paper's argument is that for stateful agents, that posture is structurally insufficient — because the dangerous moment isn't arrival, it's reload. So your defenses can't just sit at the input gate. You need them at the write gate — what's even allowed into persistent state — and at the incorporation gate — what gets auto-loaded back in without anyone looking.
24:37Tyler: And the actionable takeaways fall right out of the findings. If you're building one of these things — auto-loaded instruction files are your highest-risk surface, so scrutinize what's allowed to live there. Fact-injection is your most reliable attack to defend against, because it activates silently and always. And the write step is where your model's alignment is already doing some work, so it's worth reinforcing rather than assuming it'll hold on its own.
25:06Cassidy: The deeper thing, though — and this is what the XSS analogy really carries — is the inflection point itself. The web community learned, slowly and painfully and over more than a decade, that stored attacks are a fundamentally different animal from reflected ones. They scale to every future visitor. They don't need the attacker present. And we paid for that lesson in real breaches before we internalized it.
25:33Tyler: And the bet this paper is making is that agentic systems are standing at the exact same fork — right now, before the stored case becomes endemic. The whole point of naming it early is to not relive the decade.
25:46Cassidy: Which is a strange kind of optimism, really. The most useful thing a paper like this can do is be a little bit alarming before the alarming thing is common. If it works, the result is that nothing dramatic happens — because somebody hardened the write gate before there was a headline.
26:06Tyler: Prevention rarely gets a victory lap. But that's the genre this is, and it knows it.
26:11Cassidy: That's a good place to leave it. The one thing to carry out of this episode: the moment an agent gains durable memory, the attacker stops needing to be in the room. The instruction they plant just waits — and the agent's own routine of reloading its memory is what eventually pulls the trigger.
26:30Tyler: The show notes have a link to the paper and some related reading if this one caught you — worth it if you build with agent memory or installable tools.
26:40Cassidy: And if you want to go further, paperdive.ai has the full transcript with every technical term defined inline, plus the concept pages that link this episode over to the others we've done on agent security. This has been AI Papers: A Deep Dive. Thanks for listening.