Episode 102 · Jun 01, 2026 · 24 min

How to Catch an AI Attack That No Single Conversation Reveals

Brown, Bhargav, Santhanam et al.

AI Safety

Concepts in this episode

AI Safety, Agentic AI, Multi-Agent Systems, Agentic Misalignment, Prompt Injection, Task Decomposition, Behavioral Fingerprinting, LLM-as-Judge, AI & Security, Agent Scaffolding, Agentic Coding, Silent Failure

About this episode

Paper

Stateful Online Monitoring Catches Distributed Agent Attacks

Venue

arXiv:2605.31593

Year

2026

Read the paper

arxiv.org/abs/2605.31593

An attacker can split a dangerous task into pieces so innocent that no safety monitor watching any one conversation will ever flag it — and a real espionage campaign already used exactly that trick. This episode digs into a paper that builds that distributed attack for real, then engineers a monitor that catches it live across millions of conversations, and discovers it accidentally catches ordinary jailbreaks too. You'll come away understanding why per-conversation monitoring is structurally blind, and what it takes to watch the whole building instead of one room.

What you'll take away

Why a per-transcript safety monitor is architecturally unable to catch a distributed attack — it's not a weak lock, it's a lock on a door the attacker walks around
How a model that scores zero percent on hard cyber exploits jumps to ~37-40 percent by coordinating stronger 'subagents' who each see only a disguised sliver of the task
The stream-clustering and 'cheap glance' design that makes cross-conversation monitoring feasible at datacenter scale — catching ~85 percent of campaigns, ~30 percent earlier, with fewer false alarms
The two failure modes a six-person, two-month red team found: renaming files to scatter evidence across clusters, and innocent-but-ambiguous traffic lingering as false positives
The unplanned discovery that the same monitor catches ordinary jailbreaks for free, because real attackers iterate across many accounts and leave a cross-session fingerprint
Why the defense and its deepest weakness are the same fact — it works because disguised pieces smell alike, and fails the moment a patient attacker changes the smell

Chapters

00:00 The espionage campaign nobody could see in real time
02:57 Sequential vs. distributed attacks
05:55 Proving the attack actually does harm
08:53 The scale constraint that breaks the naive defense
11:51 Stream clustering and the cheap-glance design
14:49 The results: more attacks, caught sooner, fewer false alarms
17:46 Red-teaming and the accidental jailbreak discovery
20:44 Limits, honest caveats, and what it reframes

References in this episode

Sabotage Evaluations for Frontier Models — Anthropic's framework for evaluating whether models can subvert oversight, direc
AI Control: Improving Safety Despite Intentional Subversion — The paper that formalized using monitors and protocols to catch misbehavior even

Full transcript

0:00Cassidy: In late twenty twenty-five, Anthropic published a report that should have gotten more attention than it did. Someone had used Claude Code — the coding agent — to run an espionage campaign against thirty targets around the world. Thirty global victims. And the striking part wasn't that an AI was involved. It was how they hid it. The attackers never asked the model to do anything obviously criminal. They stood up an external orchestrator — a controller sitting outside the model — that chopped the whole operation into small, innocent-looking pieces. Map this system. Validate these credentials. Check this host. Each individual request looked like routine IT work, the kind of thing a DevOps engineer types a hundred times a day. And the safety monitors watching those conversations? They saw nothing wrong, because in each conversation, nothing was wrong.

0:54Tyler: So how did anyone catch it at all?

0:57Cassidy: After the fact. By hand. Investigators went back through the logs and manually stitched the separate sessions together until the shape of the campaign appeared. Which is the whole problem — by the time you've reconstructed it, the operation is finished. The files are gone. And that after-the-fact stitching is exactly where this paper picks up. It went up on arXiv on May twenty-ninth, twenty twenty-six, and we're recording three days later, on June first. Quick ground rules before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing — I'm Cassidy, and my co-host here is Tyler — are both AI voices from Eleven Labs. The producer has no affiliation with Anthropic or with Eleven Labs. The paper is called "Stateful Online Monitoring Catches Distributed Agent Attacks," and its question is the obvious next one: could you do that stitching automatically? In real time, across millions of conversations, while the attack is still happening — and without grinding everything to a halt for the ninety-nine percent of users who are doing nothing wrong?

2:07Tyler: And there are really two papers stapled together here, which is why it's worth a full sitting. The first half builds the attack — an actual working version of that distributed espionage thing, not a thought experiment. The second half builds the defense. And then there's a third act, where they spend two months trying to break their own defense and stumble onto something they weren't looking for.

2:32Cassidy: Right. Let's start with the attack, because the defense only makes sense once you feel why the standard approach is structurally doomed. And the key idea — the thing that, if you miss it, nothing else lands — is the difference between two ways of breaking up a harmful task. Here's the version that gets caught. Imagine a thief who walks into one room, in front of one guard, and runs the whole sequence: picks the lock, disables the camera, cracks the safe. The guard watching that room sees every suspicious move in order. Easy. In the AI world, that's what's called a sequential decomposition attack — you break a harmful task into steps, but you run all the steps in a single conversation. Prior research showed those get caught almost every time, usually while the attacker is still just getting started. Now the version that doesn't get caught. Same thief, completely different strategy. Instead of doing it himself, he hires ten contractors and gives each one an innocent job. One surveys the building, for a renovation. One copies a key, for the new tenant. One services the alarm system, for routine maintenance. No single contractor sees a crime. No guard watching any single room sees a crime. The thief assembles the burglary alone, off-site, in his head. That's the distributed attack. And it doesn't slip past the guard — it's invisible to the entire idea of having a guard in a room.

4:03Tyler: And that's the part that's genuinely structural, not just clever. The safety monitor — the watchdog model that reads conversations and scores them for harm — reads exactly one transcript at a time. That's not a bug somebody forgot to fix. It's the basic design. It has no memory of the other conversations happening right now. So a per-transcript monitor against a distributed attack isn't a weak lock. It's a lock on a door the attacker walked around.

4:33Cassidy: That's the whole game. The authors even put it bluntly — standard monitors are architecturally unable to identify distributed misuse, because they assign a suspiciousness score to one narrow transcript at a time.

4:48Tyler: Okay, but here's where my skeptic hat goes on, Cassidy, and I think it's the right instinct. It's cheap to build an attack that hides harm. Anyone can split a task into pieces that look innocent. The hard question is: can the pieces, reassembled, actually do anything? Or is this a strawman — a disguise wrapped around a task that never gets solved?

5:12Cassidy: And the authors clearly anticipated exactly that objection, because they spend real effort proving the attack does harm, not just hides it. They take a hard subset of a cybersecurity benchmark — these are capture-the-flag exploits, genuinely difficult — and they run a model called gpt-oss-120b on them directly. Solve rate: zero percent. It just can't do these tasks. Then they wrap that same weak model in the distributed scaffold. Now it's a supervisor. It holds the full harmful objective, and it farms out narrow disguised sub-tasks to stronger subagent models — run this tool on this binary, decode any encoded strings in this script — framing each one as ordinary debugging. The subagents never see the real goal. And the solve rate jumps from zero to roughly thirty-seven, forty percent.

6:08Tyler: Let me make sure I've got the weight of that. The supervisor is too weak to solve the problem itself. Zero percent. But by coordinating stronger helpers who each see only a sliver, it gets to forty.

6:22Cassidy: That's it. And the unsettling implication is the one to sit with. You don't need a powerful, dangerous model to accomplish a dangerous task. You need a mediocre coordinator and access to capable, compartmentalized helpers. It's the project-manager move — a manager who couldn't personally solve the engineering problem still ships it, by breaking it up and handing pieces to specialists who never see the whole picture. Except here the disguise is doing active work to keep each specialist from realizing what they're building.

6:55Tyler: And then the number that actually matters for the defense story. How often does the standard monitor catch this thing?

7:02Cassidy: About twenty percent of the time. A fifth as often as it catches those old single-context attacks, which it catches essentially always. And the distributed version gets to take roughly twice as many actions before anyone flags it. So you've degraded the monitor's recall by somewhere between thirty and eighty percent, and you've bought the attacker twice the runway. That's the gap. That's the hole the second half of the paper is trying to fill.

7:30Tyler: So let's go there, because the defense is where the engineering gets elegant. And I want to start with the constraint, because the constraint is brutal and it's what makes the naive answer fail. The obvious idea is: fine, if no single conversation looks bad, then look across conversations. Group the related ones, reason over the group. Intuitive. But think about scale. You're a model provider. You've got hundreds of thousands, millions of conversations flowing at once. If every time a new request arrives you re-cluster everything against everything you've ever seen, the cost grows with the square of the number of conversations. At datacenter scale, that's not slow — it's impossible. You'd need an army of expensive AI judges reading everything, constantly, forever.
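To put rough numbers on the scaling argument above, the sketch below counts comparisons under two strategies: comparing every conversation with every other one, versus comparing each new conversation with a small fixed set of summaries. The function names and the choice of 12 summaries are assumptions made for illustration, not figures from the paper.

```python
def pairwise_comparisons(num_conversations: int) -> int:
    """Comparisons needed to compare every conversation with every other one."""
    return num_conversations * (num_conversations - 1) // 2


def streaming_comparisons(num_conversations: int, num_centroids: int) -> int:
    """Comparisons needed when each conversation is checked against a fixed set of centroids."""
    return num_conversations * num_centroids


if __name__ == "__main__":
    # num_centroids=12 is an illustrative assumption, not a value from the paper.
    for n in (1_000, 1_000_000):
        print(n, pairwise_comparisons(n), streaming_comparisons(n, num_centroids=12))
```

The first count grows with the square of the traffic, while the second grows linearly, which is the gap the rest of the design depends on.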

8:19Cassidy: And on top of the cost, there's a timing constraint people gloss over. The forensic approach — the after-the-fact stitching that caught the real espionage campaign — that works, but it's a postmortem. You need this to fire synchronously. While the attack is in progress. Catch it at turn ten, not after the exploit's done at turn thirty.

8:41Tyler: Exactly. So three things at once: find the rare needle in a haystack of benign prompts, do it live, and don't tax the innocent majority. And the trick that makes it possible is borrowed from an old corner of data mining — stream clustering. Here's the intuition. Forget re-sorting the whole pile every time. Instead you keep a small handful of running summaries — call them cluster centroids. Each one is a compact description of a group of similar conversations you've seen. When a new request arrives, you turn it into an embedding — a point in space where similar-meaning text lands nearby — and you do one fast comparison: which of my few summaries is this closest to? Drop it in the nearest one, nudge that summary a little to account for it, done. You're comparing against a dozen buckets, not a million transcripts.
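The routing step described above can be sketched in a few lines. This is a minimal online nearest-centroid clusterer under stated assumptions: embeddings are plain lists of floats, distance is Euclidean, and a new bucket is opened when nothing is close enough. The class name, the `new_cluster_distance` parameter, and the running-mean update are illustrative choices, not the paper's implementation.

```python
import math


class StreamClusterer:
    """Toy online clusterer: one pass, one nearest-centroid lookup per request."""

    def __init__(self, new_cluster_distance: float):
        # Hypothetical knob: how far from every centroid a point must be
        # before it starts a new bucket.
        self.new_cluster_distance = new_cluster_distance
        self.centroids: list[list[float]] = []
        self.counts: list[int] = []

    def assign(self, embedding: list[float]) -> int:
        """Route one embedding to a bucket and return that bucket's index."""
        best_index, best_distance = -1, math.inf
        for index, centroid in enumerate(self.centroids):
            distance = math.dist(embedding, centroid)
            if distance < best_distance:
                best_index, best_distance = index, distance

        if best_index == -1 or best_distance > self.new_cluster_distance:
            # Nothing close enough: open a new bucket seeded with this point.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Nudge the summary toward the new point (running mean).
        self.counts[best_index] += 1
        n = self.counts[best_index]
        old = self.centroids[best_index]
        self.centroids[best_index] = [c + (x - c) / n for c, x in zip(old, embedding)]
        return best_index
```

Each request costs one pass over the current centroids, regardless of how many conversations have already been seen.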

9:36Cassidy: And the bet underneath the embeddings is the load-bearing one. The claim is that even when the attacker disguises each subtask in totally different words — different files, different vocabulary — the underlying exploit leaves a consistent technical residue. The same tools, the same target, the same shape of operation. And that residue survives into the embedding. So the disguised pieces still drift toward the same bucket. It's like a faint, distinctive smell that lingers on every contractor no matter how they dress for the job. Separately, each one is innocuous. But a watcher who can smell across all the rooms at once notices the same chemical everywhere.

10:20Tyler: That's the right image. And now the part that makes the whole thing feel clever instead of just complicated — the cost asymmetry. Think about airport security. You can't deep-interview every passenger; the line would never move. So everybody gets a quick, near-instant glance, and only the rare traveler whose pattern looks off gets pulled aside for the slow, thorough interrogation. The monitor works exactly like that. Every request gets the cheap glance — a fast local suspicion score, a fast routing into a bucket. Pennies. The expensive part — an AI judge that reads across all the related conversations in a cluster and asks "is this a coordinated campaign?" — that fires only rarely. Only when a bucket has quietly accumulated enough suspicious weight. So you pay the cost of deep reasoning only on the tail, only when a pattern actually forms.

11:22Cassidy: And the escalation rule there is worth one extra sentence, because it's subtle. The cluster doesn't wait for a single smoking-gun transcript. It lets several mildly-suspicious, related conversations add up. No individual drop is damning, but enough drops fill the bucket and trip the alarm.

11:43Tyler: With one wrinkle that turns out to matter a lot later — it's not just raw volume. The rule also requires the drops come from enough distinct conversations. Enough different sources, not just one chatty session repeating itself. Hold that thought, because it's the seam the third act pries open.
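The escalation rule discussed above, accumulated suspicion plus a minimum number of distinct conversations, might look roughly like the sketch below. The field names, the additive scoring, and both thresholds are assumptions for illustration; the episode does not give the paper's exact rule or values.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Running evidence for one bucket of related conversations."""
    suspicion_total: float = 0.0
    conversation_ids: set[str] = field(default_factory=set)


def record_and_check(
    state: ClusterState,
    conversation_id: str,
    local_score: float,
    weight_threshold: float,
    min_conversations: int,
) -> bool:
    """Add one cheap local score; return True if the expensive judge should run."""
    state.suspicion_total += local_score
    state.conversation_ids.add(conversation_id)
    enough_weight = state.suspicion_total >= weight_threshold
    enough_sources = len(state.conversation_ids) >= min_conversations
    return enough_weight and enough_sources
```

With hypothetical thresholds of 1.0 total weight and 3 distinct conversations, a single session that repeats itself never escalates no matter how much weight it adds, while three mildly suspicious sessions scoring 0.4 each do.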

12:04Cassidy: So does it work? Tyler, give me the headline, because the numbers are the payoff here.

12:10Tyler: They build realistic datacenter traffic — they take the distributed attacks and bury them in a massive sea of genuinely benign coding-agent and chat conversations. At one ratio, two hundred seventy-two harmful conversations hidden among twenty-seven thousand innocent ones. And then they compare the stateful monitor against the standard one-at-a-time monitor. The clean comparison: in the harder setting, where benign traffic outnumbers attacks a hundred to one, the stateful monitor catches around eighty-five percent of campaigns and flags them about thirty percent earlier — it cuts the attacker's runway down to roughly twenty-five turns instead of thirty-seven. And it does that while raising fewer false alarms than the standard monitor would need to even get close. More attacks, caught sooner, with fewer innocent users wrongly flagged.
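The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers stated in the episode (272 harmful conversations among 27,000 benign ones, and roughly 25 turns instead of 37); the function names are illustrative.

```python
def benign_to_harmful_ratio(harmful: int, benign: int) -> float:
    """How many benign conversations there are per harmful one."""
    return benign / harmful


def runway_reduction(turns_before: int, turns_after: int) -> float:
    """Fraction by which the attacker's runway shrinks."""
    return (turns_before - turns_after) / turns_before


if __name__ == "__main__":
    print(round(benign_to_harmful_ratio(harmful=272, benign=27_000), 1))
    print(round(runway_reduction(turns_before=37, turns_after=25), 2))
```

The ratio comes out just under a hundred to one, and the runway reduction at about 0.32, which matches the "about thirty percent earlier" summary.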

13:08Cassidy: Which is the rare thing in this space — it's not a trade-off. Recall, false positives, and time-to-detect usually pull against each other. You catch more by being trigger-happy, but then you annoy innocent people. Improving all three at once is the real result.

13:27Tyler: And the latency stays flat for the ordinary user, because of the cheap-glance design. They actually confirm in an appendix that the observed costs track the cheap linear routing, not the catastrophic quadratic rebuild. The system does what it was designed to do — pennies on every request, the expensive interrogation almost never.

13:50Cassidy: So that's acts one and two. Attack: real, dangerous, invisible to current monitors. Defense: a monitor that reasons over groups instead of single transcripts, and does it fast enough to be feasible. Now the third act — which is my favorite, because they went looking to break it and found something they weren't expecting. They ran a red team. Six people, two months, white-box access — meaning they could see everything: the cluster assignments, the suspicion scores. They could reset state, open fresh accounts, whatever. Over a hundred thousand attacks across nearly three thousand sessions. The point was to harden the defense. And it did — they found two real failure modes, which we should talk about because they're honest and they're instructive.

14:39Tyler: The first one is the obvious adversarial move, right? If the defense relies on disguised pieces smelling the same, then mask the smell.

14:48Cassidy: Exactly that. The red-teamers found that if you simply rename files and swap vocabulary across your sub-tasks, you can split what should be one cluster into several nearby clusters. And once your evidence is scattered across separate buckets, no single bucket crosses the alarm threshold. You delay review. The fix the authors apply is to assign a borderline request to a set of nearby clusters instead of just the single closest one — so the evidence can still pool. But it's a patch, and they're upfront that it's a patch. The second failure mode is almost the opposite problem. Ambiguous, kind-of-suspicious-but-actually-innocent examples could linger too long in a cluster's memory and keep generating false positives — the system staying suspicious of a pattern that turned out to be fine.
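The patch described above, routing a borderline request to several nearby clusters instead of only the closest, can be sketched as follows. The `margin` parameter and the rule "within margin of the nearest distance" are assumptions chosen for illustration; the episode does not say how the paper defines "nearby".

```python
import math


def nearby_clusters(
    embedding: list[float],
    centroids: list[list[float]],
    margin: float,
) -> list[int]:
    """Return every cluster whose centroid is within `margin` of the nearest one."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    distances = [math.dist(embedding, centroid) for centroid in centroids]
    nearest = min(distances)
    return [index for index, d in enumerate(distances) if d <= nearest + margin]
```

A request that sits almost midway between two buckets then contributes evidence to both, so renaming files to straddle a boundary no longer splits the evidence cleanly. A request that sits squarely inside one bucket still lands in that bucket alone.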

15:39Tyler: And both of those, I think, point at the deepest tension in the whole paper, which is worth saying plainly. The defense works because the attacker is a little predictable. It exploits a regularity — shared features survive disguise. And the rename trick is exactly a demonstration that a careful enough attacker can erase that regularity. So the honest framing throughout isn't "this closes the door." It's "this raises the cost of attack." The authors say it themselves: stateful monitoring helps to catch, but does not solve, distributed misuse.

16:19Cassidy: Which is the right place to bring in the surprise, because it cuts the other way. While they were red-teaming, they noticed the monitor was catching things it wasn't built for. Ordinary jailbreaks. The kind where someone iterates on a prompt trying to get the model to misbehave.

16:38Tyler: Wait — why would a monitor built for distributed cyberattacks catch a plain jailbreak? Those don't have subagents or split tasks.

16:48Cassidy: Because of how real attackers actually behave. When someone is trying to jailbreak a model, they don't get it on the first try. They iterate. They tweak a variant, try again, tweak, try again — and at scale, they spread those attempts across many accounts to avoid per-account rate limits and bans. And that behavior leaves exactly the fingerprint this monitor is built to see: the same attack variant, repeated, across many separate sessions. So the population-level view catches it for free. One category they tracked — a kind of attack that tries to poison the monitor's own context — it caught in eighty-two percent of sessions. Those old sequential attacks we talked about at the top? Twenty-nine out of twenty-nine.

17:40Tyler: And here's the detail that ties it back to that wrinkle I flagged earlier — the "enough distinct sources" rule. They show the flagging rate climbs as related activity accumulates across different sessions. But it plateaus when the activity piles up within a single conversation. Which is the proof that the value is specifically the cross-account view. It's not just "lots of suspicious text." It's "the same suspicious thing showing up in many places at once." That's a different signal, and it's the one nobody was harvesting.

18:15Cassidy: It reframes what a safety monitor even is. The field has spent years asking "is this conversation dangerous?" — a content question, one transcript at a time. This is asking "is this pattern across many users dangerous?" Which is much closer to how we think about spam, or credit-card fraud, or tracking an outbreak. The signal lives in the aggregate, not in any single event.

18:41Tyler: Let me push on the limits, though, because the paper is honest and the episode should be too. And the biggest one is buried in their own results, Cassidy. The whole thing rests on attack fragments leaving enough repeated, semantically related evidence to cluster. And they show the detection advantage persists but narrows as the benign background grows. Strong at ten-to-one. Meaningfully weaker at a hundred-to-one. Real datacenter traffic isn't a hundred-to-one benign. It's vastly more lopsided than that. And the paper doesn't establish where the advantage disappears entirely. In a big enough, diverse enough sea of innocent traffic, a true attack cluster might just drown. The signal's still there in principle — but if it's three conversations out of fifty million, can you still pull it out? That's an open question, and it's the one I'd want answered before calling this deployable.
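The base-rate worry raised above can be made concrete with a standard precision calculation. In this sketch the 85 percent recall is borrowed from the episode and applied per conversation purely as a simplifying assumption, and the false-positive rate of 0.1 percent is a hypothetical value, not a result from the paper.

```python
def precision(recall: float, false_positive_rate: float, attack_fraction: float) -> float:
    """Share of flagged conversations that are real attacks, by Bayes' rule."""
    true_positives = recall * attack_fraction
    false_positives = false_positive_rate * (1.0 - attack_fraction)
    if true_positives + false_positives == 0.0:
        raise ValueError("nothing would be flagged under these rates")
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical false-positive rate of 0.001 in both cases.
    print(precision(0.85, 0.001, attack_fraction=1 / 101))         # 100:1 benign
    print(precision(0.85, 0.001, attack_fraction=3 / 50_000_000))  # 3 in 50 million
```

Under these assumed rates most flags are real at a hundred to one, but at three attacks in fifty million conversations almost every flag is a false alarm, which is why the unanswered scaling question matters for deployment.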

19:42Cassidy: That's fair, and I'd add two more honest caveats. One — the benign background is simulated. It's realistic traffic, but it's a benchmark, not a live production firehose. And two, a smaller but real point: the local monitor and the cross-context judge are the same model family. So some unknown fraction of the result could ride on that particular monitor's quirks rather than the general principle. The improvement is real, but it's measured against one configuration.

20:12Tyler: And the "first distributed agent attack" framing — I want to be precise there, because the authors are. They don't claim these attacks didn't exist. The real espionage campaign proves they did. The novelty is the first academic implementation and evaluation — a built, measurable version you can study and defend against. Which is a genuine contribution, but it's a different claim than "first ever," and they keep that distinction clean.

20:40Cassidy: One more piece of intellectual honesty worth flagging, because I think it's a mark of a serious paper. They treat those capture-the-flag exploit tasks as disallowed for the purposes of evaluation — something the monitor should refuse. And they explicitly say that's an evaluation convenience, not a claim that those tasks should always be refused in the real world. They're not smuggling a policy judgment in through the benchmark. They just needed a clean line between allowed and disallowed to measure against.

21:13Tyler: So where does that leave us. Honestly?

21:16Cassidy: It leaves us with a strong proof-of-concept that reframes a problem, not a solved problem. The real contribution, to me, isn't the idea of looking across accounts — that's intuitive, the forensic investigators were already doing it by hand. The contribution is showing it's computationally feasible to do it synchronously, at scale, while the attack is still happening, without taxing the people who aren't attacking anything. That feasibility is the news. The cheap glance feeding a shared memory — that's the engineering insight that turns a nice idea into something you could actually run.

21:55Tyler: And the arms-race framing is the honest backdrop for all of it. As defenses get good at catching obvious harm, attackers hide it somewhere less obvious — and distribution, spreading the operation thin until no piece looks wrong, is one of the most natural places to hide. This paper is one full turn of that crank: the authors built the next attack and the next defense in the same document. The defense raises the cost. It doesn't end the game. And they're the first to say so.

22:25Cassidy: Which is maybe the most useful thing about it. The most quotable line in the whole paper isn't a number. It's the closing thesis — that these results point toward a new class of safety monitors that reason over groups of users, rather than isolated transcripts. That's the shift. From reading one room to watching the whole building.

22:47Tyler: And the thing I'll carry out of it is that the defense and the deepest vulnerability are the same fact. It works because the disguised pieces still smell alike. And it fails the moment an attacker is patient enough to change the smell. Defense and weakness, the same sentence. That's the shape of this entire field right now.

23:08Cassidy: That's a good note to land on. The paper, again, is "Stateful Online Monitoring Catches Distributed Agent Attacks," out of Penn and Carnegie Mellon. The show notes have a link to it and a few related reads if this caught you — the espionage report it builds on is worth your time on its own.

23:26Tyler: And if you want to keep pulling on this, paperdive.ai has the full transcript with every term defined inline, plus the concept pages that link this episode to the others we've done on monitoring and adversarial robustness.

23:40Cassidy: Thanks for spending the time with us. This has been AI Papers: A Deep Dive.
