Episode 024 · May 07, 2026 · 22 min

An AI Agent That Found 28 Zero-Days in Windows — And What Made It Work

Lee, Kim, Zhang

Binary Security

Concepts in this episode

AI & Security · Agentic AI · Agentic Vuln Discovery · Exploit Generation · Binary Analysis · Use-After-Free · Race Condition Exploits · Agent Scaffolding · Tool Use · Dynamic Analysis · Static Analysis · LLM Coding Agents

About this episode

Paper: Agentic Vulnerability Reasoning on Windows COM Binaries
Venue: arXiv:2605.05000
Year: 2026
Read the paper: arxiv.org/abs/2605.05000

Microsoft just paid $140,000 in bug bounties to an autonomous agent that found 28 previously unknown vulnerabilities in shipping Windows services and wrote working exploits for them. The same frontier models verified zero exploits with their default scaffolding and 26 with the right one — making this as much a story about tool design as about security.

What you'll take away

How slyp's two-stage 'scout then sapper' architecture goes from decompiled binary to a working proof-of-concept exploit against live Windows services
Why three purpose-built tool servers — binary explorer, COM inspector, live debugger — turn out to matter more than raw model capability
The headline result: 27 of 40 benchmark cases solved with full tooling, versus 0 of 40 for production coding agents on default settings
Real-world deployment numbers: 28 confirmed zero-days, 16 CVEs, three of them low-integrity-to-SYSTEM escalations
Why static analyzers cap out around 0.30 F1 on this bug class while semantic reasoning over decompiled code reaches 0.97
Honest limitations: benchmark circularity on most cases, 7–11 million tokens per case, and a 'verified crash' is not yet a weaponized exploit

Chapters

00:00 The bug class: races in Windows COM services
02:43 Why traditional tools struggle here
05:27 slyp's architecture: three tool servers behind the model
08:11 Scout then sapper: the two-stage pipeline
10:55 Benchmark results and the scaffolding lesson
13:38 Real-world deployment against Microsoft Windows
16:22 Steelman critiques
19:06 What generalizes beyond security

References in this episode

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Makes the same scaffolding-matters argument the episode highlights — that the in
Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models — Google Project Zero's framework for LLM-driven vulnerability research, a direct
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities — Earlier evidence for the offense-defense asymmetry the episode raises, focused o

Full transcript

0:00Finn: Microsoft just paid out one hundred forty thousand dollars in bug bounties — to an AI. Twenty-eight previously unknown vulnerabilities in Windows. Sixteen CVEs assigned. All confirmed by Microsoft's own security response team.

0:15Hope: And the system that found them is the subject of a paper called "Agentic Vulnerability Reasoning on Windows COM Binaries," posted to arXiv on May sixth, twenty-twenty-six — we're recording the day after. Quick note before we dig in: this is an AI-generated deep dive. I'm Hope, that's Finn — we're both AI voices from Eleven Labs, the script is from Anthropic's Claude Opus 4.7, and the producer isn't affiliated with either company. The paper is by Hwiwon Lee, Jongseong Kim, and Lingming Zhang. With that out of the way: the system they built is called slyp, and the thing it does that no prior system has done is go all the way from "I read the binary" to "here is a program that crashes the service on demand."

1:04Finn: Right. And the gap between those two things is enormous. Finding suspicious code is one problem. Writing a working exploit against a live, closed-source Windows service running with elevated privileges — that's a completely different problem. slyp closes both, autonomously, against shipping production binaries.

1:25Hope: Let's actually walk through what kind of bug we're talking about, because the elegance of the attack class is part of why this result lands the way it does. The authors anchor everything on one example — a method called SetPrintTicket inside a Windows service that handles printing. The method does five things: it reads a heap pointer from an object, it frees that pointer, it allocates a new buffer, it stores the new pointer back into the object, and it copies data into the buffer. All five steps happen on a shared field. None of them happen with a lock.

2:04Finn: And the service it lives in is registered as multi-threaded — meaning two different threads can be inside that method at the exact same time, on the exact same object.

2:15Hope: Right. Imagine two people sharing a single notepad. The procedure is: erase the current note, write a new one, read it back. If they take turns, no problem. If they start at the same time — person A erases, person B erases, person A writes, person B writes over it — you get nonsense, and worse, neither person knows anything went wrong. Now make that notepad a piece of memory holding a pointer, and make "erase" mean "free the buffer this points at," and you have the bug. There are two bad interleavings. In the first, thread one frees the old buffer, allocates a new one, and stores the new pointer back. Then thread two — which entered the method a moment later — reads that new pointer and frees it, because in its world, it's still on step two. Now thread one tries to write into the buffer it just allocated. But that buffer was freed by thread two a moment ago. The write lands in freed memory. That's a use-after-free.

3:18Finn: And in the second interleaving, both threads see a non-null pointer at the start, and both decide to free it. Same pointer, freed twice. That's a double-free.

3:29Hope: Both of these are memory corruption primitives. In a privileged Windows service, that means a normal user just escalated to SYSTEM. Game over for the machine.
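The two interleavings described above can be replayed deterministically in a toy model. This is a minimal sketch, not the real service code: the `Heap` class, the `set_print_ticket` generator, and the `run` scheduler are illustrative names, and the five steps are modeled as a Python generator that pauses after each step so a schedule can choose which thread moves next.

```python
class Heap:
    """Toy heap that records memory-safety violations instead of crashing."""

    def __init__(self):
        self.live = set()   # ids of buffers that are currently allocated
        self.next_id = 0
        self.events = []    # violations observed so far

    def alloc(self):
        self.next_id += 1
        self.live.add(self.next_id)
        return self.next_id

    def free(self, buf):
        if buf not in self.live:
            self.events.append("double-free")
        else:
            self.live.remove(buf)

    def write(self, buf):
        if buf not in self.live:
            self.events.append("use-after-free")


def set_print_ticket(obj, heap):
    """The five unlocked steps; each yield is a point where another thread may run."""
    old = obj["buf"]        # step 1: read the shared pointer
    yield
    if old is not None:
        heap.free(old)      # step 2: free it
    yield
    new = heap.alloc()      # step 3: allocate a new buffer
    yield
    obj["buf"] = new        # step 4: store the new pointer back
    yield
    heap.write(new)         # step 5: copy data into the buffer
    yield


def run(schedule):
    """Advance thread 1 or 2 by one step per schedule entry; return violations."""
    heap = Heap()
    obj = {"buf": heap.alloc()}
    threads = {1: set_print_ticket(obj, heap), 2: set_print_ticket(obj, heap)}
    for tid in schedule:
        next(threads[tid])
    return heap.events


serial = run([1] * 5 + [2] * 5)             # taking turns: no violation
uaf = run([1, 1, 1, 1, 2, 2, 1])            # thread 2 frees what thread 1 just stored
double_free = run([1, 2, 1, 2])             # both threads read, then both free
```

Taking turns produces no violation, while the second and third schedules reproduce the use-after-free and the double-free respectively. A real exploit cannot pick the schedule like this, which is why hitting the timing window is the hard part.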

3:40Finn: And races like this are notoriously hard for traditional tools. Fuzzers — the standard automated approach where you throw random inputs at a target until it crashes — fuzzers can't reliably control thread scheduling. The bug only fires when the timing lines up just right. You can fuzz for hours and never hit the window. Static analyzers like COMRace, the previous state-of-the-art for this exact bug class, look for suspicious patterns in code structure. They need handcrafted signatures, they produce floods of false positives, and they miss bugs whenever the binary doesn't fit their structural assumptions. Manual reverse engineering works, but doesn't scale. A single COM service can have dozens of entry points, and reasoning about every possible thread interleaving across all of them is what burns a senior researcher out in an afternoon.

4:35Hope: So the bet the authors make is: the part that's hard isn't running tools. That's mechanical work. The part that's hard is semantic reasoning over decompiled code — reading the pseudocode that comes out of a decompiler and understanding what the program is actually trying to do. What state is shared. What happens if two threads interleave through this sequence. That's the part LLMs have gotten genuinely good at. Everything else — pulling the decompilation, resolving virtual function calls, looking up COM metadata, compiling and running candidate exploits, attaching a debugger and reading the crash — all of that should be done by deterministic tools the model just calls.

5:20Finn: Hope, this is where the architecture gets interesting. Tell me about the three tool servers, because I think that's the load-bearing piece of the system.

5:30Hope: Three servers, each with a specific job. The first is binary exploration. It wraps IDA Pro and Hex-Rays — the industry-standard reverse-engineering toolkit — and exposes things like "decompile this function," "find every place that calls this function," "show me the cross-references to this address." But the most important thing it does is automatic vtable resolution. In a COM binary, every interface method call is an indirect jump through a function pointer table. The decompiler leaves those calls as unresolved addresses in the pseudocode. So if you don't fix that, the agent reading the code sees "call whatever's at this offset" and has to do detective work to figure out which actual function gets invoked. The binary explorer does that detective work automatically and annotates the pseudocode with the real function name. The agent never has to think about it.

6:26Finn: Which is exactly the point. The tools embed the boring grunt work, so the model spends its tokens on the part that requires reasoning.
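To make the vtable-resolution idea concrete, here is a minimal sketch of the annotation step. Everything specific in it is an assumption for illustration: the pseudocode shape `(*(this->vtbl + 0x18))`, the offsets, and the `PrintService::` names are hypothetical, and the real tool works on IDA Pro and Hex-Rays output rather than on regular expressions.

```python
import re

# Hypothetical vtable contents: slot offset -> resolved method name.
VTABLE = {
    0x18: "PrintService::SetPrintTicket",
    0x20: "PrintService::Release",
}

# Hypothetical shape of an unresolved indirect call in decompiler output.
INDIRECT_CALL = re.compile(r"\(\*\(this->vtbl \+ (0x[0-9a-fA-F]+)\)\)")


def annotate(line, vtable):
    """Replace an indirect call through a known vtable slot with its name."""
    def resolve(match):
        offset = int(match.group(1), 16)
        name = vtable.get(offset)
        # Unknown slots are left exactly as the decompiler printed them.
        return name if name is not None else match.group(0)

    return INDIRECT_CALL.sub(resolve, line)


resolved = annotate("(*(this->vtbl + 0x18))(this, ticket);", VTABLE)
unresolved = annotate("(*(this->vtbl + 0x40))(this);", VTABLE)
```

The point of doing this in a tool is that the agent reads a named call and never spends tokens working out which function sits behind an offset.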

6:34Hope: The second server is COM inspection. To write a working exploit, you need to know how to even talk to the service in the first place — what identifier to use to activate it, what interface to bind to, what method signatures look like, what the threading model is, what security settings apply. None of that lives in the target binary. It lives in the Windows registry and in metadata libraries scattered across the system. So they expose fourteen tools that pull all of that information live. And the unsung hero of this server is one specific tool that generates a compilable C++ template — the boilerplate to activate a COM object and call into it. The agent can ask for that scaffolding and get back working code it just needs to fill in. That tool matters more than it sounds, for a reason that's about to come up in the results. The third server is dynamic debugging. The agent submits C++ source. The server compiles it, deploys it onto a Windows VM running under QEMU, executes it against the live service, and any crash is captured by an attached debugger — WinDbg. They turn on a feature called page heap that makes race-induced corruption produce deterministic crashes at the exact faulting instruction. So the agent gets a real crash report, with a real call stack, and can iterate.

8:06Finn: How many iterations, typically?

8:08Hope: Three to ten compile-debug cycles per successful exploit. The agent writes code, it compiles, it runs, and the service doesn't crash. The agent reads the result, refines the timing or the call sequence, and tries again. It's the same loop a human exploit developer runs, just at machine speed and with no coffee breaks.
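The compile-debug cycle can be sketched as a bounded loop. This is an illustration under stated assumptions: `submit`, `succeeded`, and `revise` are hypothetical callables standing in for the debugging server, the success check, and the model's next revision, and the cap of ten cycles simply mirrors the upper end of the range quoted above.

```python
def refine_until_crash(source, submit, succeeded, revise, max_cycles=10):
    """Submit, inspect the report, revise, repeat.

    Returns (cycle_number, source) for the first successful attempt,
    or None if no attempt succeeds within max_cycles.
    """
    for cycle in range(1, max_cycles + 1):
        report = submit(source)           # compile, deploy, run, collect report
        if succeeded(report):
            return cycle, source
        source = revise(source, report)   # adjust timing or call sequence
    return None


# Stand-ins for a real harness: the "source" is just a counter here,
# and the fake service only crashes once the counter reaches 3.
fake_submit = lambda s: {"crashed": s >= 3}
fake_succeeded = lambda r: r["crashed"]
fake_revise = lambda s, r: s + 1

found = refine_until_crash(0, fake_submit, fake_succeeded, fake_revise)
gave_up = refine_until_crash(0, fake_submit, fake_succeeded, fake_revise, max_cycles=2)
```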

8:30Finn: And the whole pipeline is structured into two stages.

8:34Hope: Right. Stage one only gets the binary exploration tools. The agent reads decompiled code, follows call graphs, traces shared-state accesses, and produces a structured vulnerability report. Stage two takes that report and gets all three servers — binary exploration to re-check anything, COM inspection to figure out how to call the service, debugging to actually run exploit attempts. Each stage has focused context and clear success criteria, which matters because COM services are big and the agent's context window isn't infinite.

9:10Finn: So scout, then sapper. Stage one walks the terrain and writes the report. Stage two brings the explosives.
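One way to picture the two stages is as a gate on which tool servers each stage may call. This is a minimal sketch: the stage and server names are illustrative labels rather than identifiers from the paper, and the servers are plain callables instead of real tool-protocol endpoints.

```python
# Which tool servers each stage is allowed to use (names are illustrative).
STAGE_TOOLS = {
    "discovery": {"binary_explorer"},
    "exploit": {"binary_explorer", "com_inspector", "debugger"},
}


def call_tool(stage, server, request, servers):
    """Forward a request to a tool server only if the stage may use it."""
    if server not in STAGE_TOOLS[stage]:
        raise PermissionError(f"{server} is not available in stage {stage}")
    return servers[server](request)


# Stub servers that just echo what they were asked.
servers = {
    "binary_explorer": lambda req: f"decompiled:{req}",
    "com_inspector": lambda req: f"metadata:{req}",
    "debugger": lambda req: f"ran:{req}",
}

scout_view = call_tool("discovery", "binary_explorer", "SetPrintTicket", servers)
sapper_run = call_tool("exploit", "debugger", "poc.cpp", servers)
try:
    call_tool("discovery", "debugger", "poc.cpp", servers)
    scout_blocked = False
except PermissionError:
    scout_blocked = True
```

Keeping stage one away from the debugger is what keeps its context focused on reading code and producing the report.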

9:17Hope: That's a nice frame. The authors also add some long-horizon engineering around it — when the context window is about to overflow, they dump important findings to a structured file the agent can search later. They also have a middleware layer that catches the agent trying to give up too early and pushes it to keep working. Small things — but it's the kind of plumbing that turns a research demo into something that runs for an hour against a live service without losing its place.
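The "don't give up too early" middleware can be sketched as a small decision function. The episode does not describe how the paper's task verifier decides, so the rule below is purely an assumption for illustration: accept a stop when a result exists or the step budget is spent, otherwise send the agent back to work.

```python
def verify_stop(has_result, steps_used, step_budget):
    """Decide what to do when the agent says it is finished.

    Returns "accept" to let the agent stop, or "continue" to push it on.
    The rule is a hypothetical stand-in, not the paper's verifier.
    """
    if has_result:
        return "accept"        # a report or verified crash exists
    if steps_used < step_budget:
        return "continue"      # stopped early with nothing to show
    return "accept"            # budget exhausted; stopping is legitimate


early_quit = verify_stop(has_result=False, steps_used=12, step_budget=100)
finished = verify_stop(has_result=True, steps_used=12, step_budget=100)
out_of_budget = verify_stop(has_result=False, steps_used=100, step_budget=100)
```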

9:50Finn: OK. The system exists, the architecture is clean. What does it actually do? Because this is where the paper either earns the headline or doesn't.

10:00Hope: Two evaluations. There's a benchmark, and there's a real-world deployment. The benchmark has twenty Windows COM services with forty vulnerability cases and ground-truth labels. They evaluate slyp against two production coding agents — Codex from OpenAI and Claude Code from Anthropic — and against COMRace plus plus, their improved reproduction of the prior state-of-the-art static analyzer.

10:27Finn: And the headline numbers?

10:28Hope: On discovery, meaning just finding the bugs, not exploiting them, slyp hits an F1 score of zero point nine seven three. F1 is a single number between zero and one that balances how often the system's flagged bugs are real against how many real bugs it catches. Zero point nine seven means it's almost always right and almost never misses. The best baseline coding agent, given the same binary exploration tools, hits zero point nine five three. Close. With only basic decompilation and no extra tools, the baselines drop into the low zero-point-nines or below. And COMRace plus plus, the static analyzer, caps out at zero point two nine nine.
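For listeners who want the definition behind that number, F1 is the harmonic mean of precision and recall. The sketch below uses invented counts purely to show the arithmetic; they are not taken from the paper's benchmark.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, from raw counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)   # flagged bugs that are real
    recall = true_pos / (true_pos + false_neg)      # real bugs that were flagged
    return 2 * precision * recall / (precision + recall)


# Invented example: 9 real bugs flagged, 1 false alarm, 2 real bugs missed.
example = f1_score(true_pos=9, false_pos=1, false_neg=2)
```

Because it is a harmonic mean, a tool that floods the analyst with false positives scores low even when it misses very few real bugs.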

11:11Finn: Three times the F1 of the previous best automated approach. That's the gap between an editor reading for meaning and a grammar checker matching patterns. A grammar checker can flag a passive sentence; it can't tell you whether the sentence actually says what the author meant. Static analyzers are extraordinary at the patterns they're built for — they're not great at reasoning about whether a particular sequence of operations is actually unsafe in the context of this specific object's lifetime. That's the domain where semantic reasoning wins.

11:49Hope: But the real divergence comes in stage two — exploit generation. slyp at its best configuration solves twenty-seven of forty cases. About sixty-seven percent. The production coding agents, in their default setup with no extra tools, solve... zero. Out of forty. With just a basic submission harness — a way to actually run the code they generate — they get to thirteen of forty. With the full slyp toolset, Claude Code reaches twenty-six of forty. Same models. Dramatically different results.

12:23Finn: This is the point of the paper that travels furthest outside this domain. Pause on it. Same frontier model. Zero verified exploits with the default scaffold. Twenty-six verified exploits with the right scaffold. The gap isn't capability. It's the workshop.

12:41Hope: Right. Think of a skilled carpenter in an empty room with only their hands. They cannot build a cabinet, no matter how much they know. Give them a saw, a drill, a square, a workbench — suddenly they're productive. Give them a CNC machine and they're an order of magnitude faster. The model is the carpenter. The binary explorer, the COM inspector, the debugger — those are the tools. Production coding agents on frontier models verifying zero exploits without these tools is the empty-workshop result.

13:13Finn: And there's a corollary worth pulling out. The scaffolding gap is largest when the model is weakest. On the strongest configuration, ablating slyp's task verifier — that middleware that prevents the agent from giving up — barely moves F1, by about three thousandths of a point. But on a smaller model — they tested Haiku 4.5 — the gap between slyp's full pipeline and the baseline widens to zero point two zero eight in F1, with the task verifier highlighted as the component carrying most of that weight. On a weaker model, the scaffold is the entire game.

13:49Hope: That's a generalizable lesson. If you're building agents on a frontier model, you can probably get away with a thinner scaffold. If you're using a smaller, cheaper model, the scaffold is doing most of the work, and the difference between a good agent and a mediocre one is the harness around the model — not the model itself.

14:10Finn: OK. The benchmark numbers are striking. But the real-world deployment is where this lands as a security result, not just a methods paper.

14:19Hope: They run slyp as an iterative research campaign against production Microsoft Windows services. Twenty-eight previously unknown vulnerabilities. All confirmed by MSRC, the Microsoft Security Response Center. Sixteen CVEs assigned, two merged with existing patches, ten confirmed without a CVE. Twenty-four use-after-frees and four double-frees, exactly the bug class our SetPrintTicket example showed. Across nine services, including printing, filesystem deduplication, the Shell Infrastructure Host, Bluetooth, digital media, clipboard, network connections, and Windows Installer.

14:54Finn: And the privilege escalation breakdown is worth lingering on. Twenty-two of the twenty-eight escalate from low to medium integrity. Three from medium to SYSTEM. And three go from low integrity directly to SYSTEM. That last category means: a normal user, not an admin, becomes the most privileged account on the box. From a user-level program. Just by triggering one of these races.

15:19Hope: And the bounty Microsoft paid out for all of this is one hundred forty thousand dollars.

15:24Finn: Which sounds like a lot, but the right comparison is human researcher time. My intuition — and the paper doesn't claim this directly — is that a skilled human reverse engineer hunting Windows COM races might find a handful of these in a year. slyp found twenty-eight across nine services in one campaign. If this kind of pipeline becomes routine, the offense-defense math shifts in a way defenders should think about. The assumption has to become: any reachable race-condition pattern in privileged code is likely to be discovered.

15:59Hope: Finn, I want to take the steelman seriously here, because the result is striking enough that it deserves real scrutiny.

16:06Finn: Yes. Four pushbacks. The first is benchmark circularity. Twelve of the twenty objects in their evaluation contain vulnerabilities slyp itself discovered during development. The authors address this explicitly — they carve out a subset of eight cases that are 1-day vulnerabilities found independently by other researchers, that slyp never saw. slyp performs comparably on that clean subset, zero point nine four three F1 versus zero point nine seven three overall. That's a reasonable defense. But the clean subset is eight cases. Strong claims about generalization need a much larger held-out set assembled by a third party — and we don't have one.

16:49Hope: That's fair. The methodology is honest, the response to circularity is principled, and the result holds on the clean subset — but the sample size on the truly clean subset is small.

17:01Finn: Second pushback. The static-analyzer comparison. COMRace's source isn't public, so the authors reimplemented it as COMRace plus plus. They validate their reproduction against COMRace's published findings, but the three-times F1 gap is against a tool the same team built. The gap is large enough that it almost certainly survives any reasonable reproduction-quality concerns — but the listener should know the comparison is structurally not a clean head-to-head against an external system.

17:32Hope: Third pushback?

17:33Finn: The exploit verification criterion is binary, not weaponization-grade. A "verified POC" here means a debugger captured a crash whose call stack reaches the target function. That's not the same as a real exploit with reliable code execution at SYSTEM. The paper is honest that it's targeting memory corruption primitives, the foothold that exploits get built on. The gap between "crash" and "reliable code execution at SYSTEM" is real engineering work this agent isn't doing. Yet.
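The verification criterion as described is simple enough to write down. This is a sketch under assumptions: the crash report is modeled as a dictionary with hypothetical `crashed` and `call_stack` fields, and the frame names are made up. Real debugger output would need parsing first.

```python
def is_verified_poc(report, target_function):
    """True when a crash was captured and its call stack reaches the target."""
    return bool(report.get("crashed")) and target_function in report.get("call_stack", [])


# Hypothetical crash reports.
hit = {"crashed": True, "call_stack": ["memcpy", "SetPrintTicket", "RpcDispatch"]}
unrelated = {"crashed": True, "call_stack": ["memcpy", "SomeOtherMethod"]}
no_crash = {"crashed": False, "call_stack": []}

checks = [
    is_verified_poc(hit, "SetPrintTicket"),
    is_verified_poc(unrelated, "SetPrintTicket"),
    is_verified_poc(no_crash, "SetPrintTicket"),
]
```

Note what the check does not ask: whether the crash is controllable, repeatable, or usable for code execution.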

18:03Hope: And the fourth?

18:04Finn: Cost. Stage two burns between seven and eleven million tokens per case at the strongest configurations. Not per benchmark. Per case. That's expensive enough that running slyp at scale across the full Windows attack surface has nontrivial economics, and the paper doesn't compare those token costs against the cost of a senior researcher's time at equivalent yield — which is the comparison anyone deciding whether to actually deploy this would want.
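The scale of that token budget is easy to work out. The sketch below multiplies the quoted seven-to-eleven-million range by a number of cases; the price per million tokens is a placeholder parameter the caller must supply, not a figure from the paper or from any provider's price list.

```python
def stage_two_tokens(cases, low_per_case=7_000_000, high_per_case=11_000_000):
    """Low and high total token counts for a given number of cases."""
    return cases * low_per_case, cases * high_per_case


def dollars(tokens, usd_per_million):
    """Convert a token count to cost at a caller-supplied placeholder price."""
    return tokens / 1_000_000 * usd_per_million


# All forty benchmark cases at the quoted per-case range.
low, high = stage_two_tokens(40)
```

At forty cases that is roughly 280 to 440 million tokens for stage two alone, before any comparison with researcher time.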

18:34Hope: Those are honest critiques, and they're the ones the authors themselves flag. They also call out two more. Race timing remains the bottleneck — even the strongest configuration leaves thirteen of forty cases unsolved, and almost half of submission attempts compile cleanly but fail to crash, because the timing window is hard to hit. And the system is dependent on the decompiler. If IDA Pro mis-infers a type or a control flow path, the agent ends up reasoning about a function that isn't quite the one that runs.

19:09Finn: Hope, what's your read on where this leaves the field?

19:12Hope: Well, Finn, three things worth pulling out for a listener who isn't going to read the paper itself. The first is that closed-source binary analysis is now in reach for agents. Most prior work on agentic vulnerability discovery worked on source code or capture-the-flag challenges — controlled environments. slyp works on shipping Windows binaries with no source access, recovering enough semantic structure from decompilation plus public Microsoft debug symbols to reason about object lifetimes. The architectural pattern — binary explorer plus domain introspector plus live debugger, all behind a standardized tool protocol — generalizes. The authors expect it to transfer to other commercial off-the-shelf binaries with prompt and tool adjustments.

20:02Finn: Second?

20:03Hope: The economic asymmetry, which we already touched on. If Finn's intuition about human researcher rates is even roughly right, slyp finding twenty-eight vulnerabilities in a single campaign means defenders need to assume their reachable race-condition surface is much more exposed than it looked.

20:21Finn: And the third?

20:22Hope: The scaffolding lesson. The paper is unusually clean about isolating what each piece of the system contributes. Binary exploration tools make discovery work. COM inspection and debugging tools make exploit generation work. With a frontier model, the discovery gap between the full scaffold and a thinner one shrinks. With a weaker model, the scaffold is the whole story. That's a transferable empirical data point for anyone building agents in any domain, not just security.

20:52Finn: There's something I find genuinely useful in how this result lands. The before-picture: discovery of race-condition bugs in closed-source Windows services was a domain of expert tools that found a few bugs and missed many. The after-picture: an agentic system that didn't exist a few years ago is outperforming those tools by three times on detection and producing actual working exploits, with confirmation from Microsoft itself.

21:20Hope: And the part of the result I keep coming back to is the production coding agents on default settings verifying zero exploits, while the same models with the right tools verify twenty-six. The capability was sitting in an empty workshop.

21:36Finn: The paper is by Hwiwon Lee, Jongseong Kim, and Lingming Zhang, at the University of Illinois at Urbana-Champaign, out earlier this month. The show notes have a link to the paper and related materials.

21:49Hope: Thanks for listening to AI Papers: A Deep Dive.
