Episode 024 · May 07, 2026 · 22 min

An AI Agent That Found 28 Zero-Days in Windows — And What Made It Work

Lee, Kim, Zhang

Binary Security
Paper: Agentic Vulnerability Reasoning on Windows COM Binaries
Venue: arXiv:2605.05000
Year: 2026
Read the paper: arxiv.org/abs/2605.05000

Microsoft just paid $140,000 in bug bounties to an autonomous agent that found 28 previously unknown vulnerabilities in shipping Windows services and wrote working exploits for them. The same models verified zero exploits with their default scaffolding and 26 with the right one, which makes this as much a story about tool design as about security.

What you'll take away

  • How slyp's two-stage 'scout then sapper' architecture goes from decompiled binary to a working proof-of-concept exploit against live Windows services
  • Why three purpose-built tool servers (binary explorer, inspector, live debugger) turn out to matter more than raw model capability
  • The headline result: 27 of 40 benchmark cases solved with full tooling, versus 0 of 40 for production coding agents on default settings
  • Real-world deployment numbers: 28 confirmed zero-days, 16 CVEs assigned, and three low-integrity-to-SYSTEM escalations
  • Why static analyzers cap out around 0.30 F1 on this bug class while semantic reasoning over decompiled code reaches 0.97
  • Honest limitations: benchmark circularity on most cases, 7–11 million tokens per case, and a 'verified crash' that is not yet weaponized RCE

Chapters

  1. 00:00 The bug class: races in Windows COM services
  2. 02:43 Why traditional tools struggle here
  3. 05:27 slyp's architecture: three tool servers behind the model
  4. 08:11 Scout then sapper: the two-stage pipeline
  5. 10:55 Benchmark results and the scaffolding lesson
  6. 13:38 Real-world deployment against Microsoft Windows
  7. 16:22 Steelman critiques
  8. 19:06 What generalizes beyond security


0:00Finn: Microsoft just paid out one hundred forty thousand dollars in bug bounties, to an AI. Twenty-eight previously unknown vulnerabilities in Windows. Sixteen CVEs assigned. All confirmed by Microsoft's own security response team.

0:15Hope: And the system that found them is the subject of a paper called "Agentic Vulnerability Reasoning on Windows COM Binaries," posted to arXiv on May sixth, twenty-twenty-six. We're recording the day after. Quick note before we dig in: this is an AI-generated deep dive. I'm Hope, that's Finn. We're both AI voices from Eleven Labs, the script is from an Anthropic model, and the producer isn't affiliated with either company. The paper is by Hwiwon Lee, Jongseong Kim, and Lingming Zhang. With that out of the way: the system they built is called slyp, and the thing it does that no prior system has done is go all the way from "I read the binary" to "here is a program that crashes the service on demand."

1:04Finn: Right. And the gap between those two things is enormous. Finding suspicious code is one problem. Writing a working exploit against a live, closed-source Windows service running with elevated privileges is a completely different problem. slyp solves both, autonomously, against shipping production binaries.

1:25Hope: Let's actually walk through what kind of bug we're talking about, because the elegance of the attack class is part of why this result lands the way it does. The authors anchor everything on one example — a method called SetPrintTicket inside a Windows service that handles printing. The method does five things: it reads a heap pointer from an object, it frees that pointer, it allocates a new buffer, it stores the new pointer back into the object, and it copies data into the buffer. All five steps happen on a shared field. None of them happen with a lock.

2:04Finn: And the service it lives in is registered as multi-threaded — meaning two different threads can be inside that method at the exact same time, on the exact same object.

2:15Hope: Right. Imagine two people sharing a single notepad. The procedure is: erase the current note, write a new one, read it back. If they take turns, no problem. If they start at the same time (person A erases, person B erases, person A writes, person B writes over it) you get nonsense, and worse, neither person knows anything went wrong. Now make that notepad a piece of memory holding a pointer, and make "erase" mean "free the buffer this points at," and you have the bug. There are two bad interleavings. In the first, thread one frees the old buffer, allocates a new one, and stores the new pointer back. Then thread two, which entered the method a moment later, reads that new pointer and frees it, because in its world, it's still on step two. Now thread one tries to write into the buffer it just allocated. But that buffer was freed by thread two a moment ago. The write lands in freed memory. That's a use-after-free.

3:18Finn: And in the second interleaving, both threads see a non-null pointer at the start, and both decide to free it. Same pointer, freed twice. That's a double-free.

3:29Hope: Both of these are memory corruption primitives. In a privileged Windows service, that means a normal user just escalated to SYSTEM. Game over for the machine.
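For readers following along in the show notes, the two bad interleavings described above can be replayed deterministically with a toy model. This is a minimal sketch, not the real service code: `ToyHeap`, `Ticket`, `set_print_ticket`, and `run` are hypothetical names, the heap only records misuse (a write to freed memory, or a second free of the same block) instead of corrupting anything, and each `yield` marks a point where the other thread may be scheduled.

```python
class ToyHeap:
    """Tiny allocator model that records misuse instead of corrupting memory."""

    def __init__(self):
        self.live = set()   # block ids that are currently allocated
        self.next_id = 1
        self.events = []    # misuse observed so far, in order

    def alloc(self):
        block = self.next_id
        self.next_id += 1
        self.live.add(block)
        return block

    def free(self, block):
        if block not in self.live:
            self.events.append("double-free")
            return
        self.live.remove(block)

    def write(self, block):
        if block not in self.live:
            self.events.append("use-after-free")


class Ticket:
    """Shared object whose `buf` field both threads touch without a lock."""

    def __init__(self, heap):
        self.buf = heap.alloc()


def set_print_ticket(obj, heap):
    """The five unlocked steps; every yield is a possible context switch."""
    old = obj.buf        # 1. read the pointer from the object
    yield
    heap.free(old)       # 2. free it
    yield
    new = heap.alloc()   # 3. allocate a new buffer
    yield
    obj.buf = new        # 4. store the new pointer back
    yield
    heap.write(new)      # 5. copy data into the buffer


def run(schedule):
    """Advance thread 'A' or 'B' by one step per character of `schedule`."""
    heap = ToyHeap()
    obj = Ticket(heap)
    threads = {"A": set_print_ticket(obj, heap), "B": set_print_ticket(obj, heap)}
    for name in schedule:
        next(threads[name], None)
    return heap.events


serial = run("AAAAABBBBB")          # threads take turns
first_race = run("AAAABBABBB")      # B frees A's new buffer before A writes
second_race = run("ABABAAABBB")     # both read the same pointer, both free it
```

In this model the serial schedule records nothing, while each racy schedule records exactly one misuse. Holding a lock across all five steps would leave only the serial schedules possible.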

3:40Finn: And races like this are notoriously hard for traditional tools. Fuzzers, the standard automated approach where you throw random inputs at a target until it crashes, can't reliably control thread scheduling. The bug only fires when the timing lines up just right. You can fuzz for hours and never hit the window. Static analyzers like COMRace, the previous state-of-the-art for this exact bug class, look for suspicious patterns in code structure. They need handcrafted signatures, they produce floods of false positives, and they miss bugs whenever the binary doesn't fit their structural assumptions. Manual reverse engineering works, but doesn't scale. A single service can have dozens of entry points, and reasoning about every possible thread interleaving across all of them is what burns a senior researcher out in an afternoon.

4:35Hope: So the bet the authors make is: the part that's hard isn't running tools. That's mechanical work. The part that's hard is semantic reasoning over decompiled code — reading the pseudocode that comes out of a decompiler and understanding what the program is actually trying to do. What state is shared. What happens if two threads interleave through this sequence. That's the part LLMs have gotten genuinely good at. Everything else — pulling the decompilation, resolving virtual function calls, looking up metadata, compiling and running candidate exploits, attaching a debugger and reading the crash — all of that should be done by deterministic tools the model just calls.

5:20Finn: Hope, this is where the architecture gets interesting. Tell me about the three tool servers, because I think that's the load-bearing piece of the system.

5:30Hope: Three servers, each with a specific job. The first is binary exploration. It wraps an industry-standard reverse-engineering toolkit and exposes things like "decompile this function," "find every place that calls this function," "show me the cross-references to this address." But the most important thing it does is automatic virtual-call resolution. In a COM binary, every interface method call is an indirect jump through a function pointer table. The decompiler leaves those calls as unresolved addresses in the pseudocode. So if you don't fix that, the agent reading the code sees "call whatever's at this offset" and has to do detective work to figure out which actual function gets invoked. The binary explorer does that detective work automatically and annotates the pseudocode with the real function name. The agent never has to think about it.

6:26Finn: Which is exactly the point. The tools absorb the boring grunt work, so the model spends its effort on the part that requires reasoning.

6:34Hope: The second server is COM inspection. To write a working exploit, you need to know how to even talk to the service in the first place: what identifier to use to activate it, what interface to bind to, what method signatures look like, what the threading model is, what security settings apply. None of that lives in the target binary. It lives in the Windows registry and in metadata libraries scattered across the system. So they expose fourteen tools that pull all of that information live. And the unsung hero of this server is one specific tool that generates a compilable C++ template, the boilerplate to activate a COM object and call into it. The agent can ask for that scaffolding and get back working code it just needs to fill in. That tool matters more than it sounds, for a reason that's about to come up in the results. The third server is dynamic debugging. The agent submits C++ source. The server compiles it, deploys it onto a Windows VM, executes it against the live service, and any crash is captured by an attached debugger. They turn on a feature that makes race-induced corruption produce deterministic crashes at the exact faulting instruction. So the agent gets a real crash report, with a real call stack, and can iterate.

8:06Finn: How many iterations, typically?

8:08Hope: Three to ten compile-debug cycles per successful exploit. The agent writes code, it compiles, it runs, it doesn't crash. The agent reads the result, refines the timing or the call sequence, tries again. It's the same loop a human exploit developer runs, just at machine speed and with no coffee breaks.

8:30Finn: And the whole pipeline is structured into two stages.

8:34Hope: Right. Stage one only gets the binary exploration tools. The agent reads decompiled code, follows call graphs, traces shared-state accesses, and produces a structured vulnerability report. Stage two takes that report and gets all three servers: binary exploration to re-check anything, inspection to figure out how to call the service, debugging to actually run exploit attempts. Each stage has focused context and clear success criteria, which matters because COM services are big and the agent's context window isn't infinite.
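The tool gating between the two stages can be sketched as a simple allow-list. Everything here is an illustrative assumption: the stage names, the server names, and the `run_pipeline` helper are hypothetical and are not taken from the paper's implementation.

```python
# Hypothetical stage and server names, chosen only for illustration.
STAGE_TOOLS = {
    "discovery": {"binary_exploration"},
    "exploitation": {"binary_exploration", "com_inspection", "dynamic_debugging"},
}


def allowed(stage, server):
    """Return True if the given stage may call the given tool server."""
    return server in STAGE_TOOLS[stage]


def run_pipeline(discover, exploit):
    """Run stage one, then hand its report to stage two.

    `discover` and `exploit` are caller-supplied callables standing in for
    the two agent stages; each receives the set of servers it may use.
    """
    report = discover(STAGE_TOOLS["discovery"])
    return exploit(report, STAGE_TOOLS["exploitation"])
```

The point of the split is that stage one cannot reach the debugger at all, so its only possible output is the written report that stage two starts from.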

9:10Finn: So scout, then sapper. Stage one walks the terrain and writes the report. Stage two brings the explosives.

9:17Hope: That's a nice frame. The authors also add some long-horizon engineering around it. When the context window is about to overflow, they dump important findings to a structured file the agent can search later. They also have a middleware layer that catches the agent trying to give up too early and pushes it to keep working. Small things, but it's the kind of plumbing that turns a research demo into something that runs for an hour against a live service without losing its place.

9:50Finn: OK. The system exists, the architecture is clean. What does it actually do? Because this is where the paper either earns the headline or doesn't.

10:00Hope: Two evaluations. There's a benchmark, and there's a real-world deployment. The benchmark has twenty services with forty vulnerability cases and ground-truth labels. They evaluate slyp against two production coding agents, one from OpenAI and one from Anthropic, and against COMRace plus plus, their improved reproduction of the prior state-of-the-art static analyzer.

10:27Finn: And the headline numbers?

10:28Hope: On discovery, meaning just finding the bugs, not exploiting them, slyp hits an F1 of zero point nine seven three. F1 is a single number between zero and one that balances how often the system's flagged bugs are real against how many real bugs it catches. Zero point nine seven means it's almost always right and almost never misses. The best baseline coding agent, given the same binary exploration tools, hits zero point nine five three. Close. With only basic decompilation and no extra tools, the baselines drop into the low nineties or below. And COMRace plus plus, the static analyzer, caps out at zero point two nine nine.
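As a worked illustration of the metric, F1 is the harmonic mean of precision (how many flagged bugs are real) and recall (how many real bugs are flagged). The sketch below uses made-up counts purely to show the formula; only the two F1 values in the final ratio, 0.973 and 0.299, come from the episode.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; 0.0 when nothing is caught."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 9 real bugs flagged, 1 false alarm, 1 real bug missed.
example = f1_score(true_pos=9, false_pos=1, false_neg=1)

# Ratio between the two F1 scores quoted in the episode.
ratio = 0.973 / 0.299
```

With those hypothetical counts, precision and recall are both 0.9, so F1 is 0.9 as well. The ratio of the quoted scores comes out a little above three.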

11:11Finn: Three times the F1 of the previous best automated approach. That's the gap between an editor reading for meaning and a grammar checker matching patterns. A grammar checker can flag a passive sentence; it can't tell you whether the sentence actually says what the author meant. Static analyzers are extraordinary at the patterns they're built for. They're not great at reasoning about whether a particular sequence of operations is actually unsafe in the context of this specific object's lifetime. That's the domain where semantic reasoning wins.

11:49Hope: But the real divergence comes in stage two, exploit generation. slyp at its best configuration solves twenty-seven of forty cases. About sixty-seven percent. The production coding agents, in their default setup with no extra tools, solve... zero. Out of forty. With just a basic submission tool, a way to actually run the code they generate, they get to thirteen of forty. With the full slyp toolset, they reach as many as twenty-six of forty. Same models. Dramatically different results.
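The solve counts quoted here convert to rates as follows. This is plain arithmetic on the numbers from the episode; the dictionary labels are descriptive names chosen for this sketch, not configuration names from the paper.

```python
TOTAL_CASES = 40

# Solved cases out of 40, as quoted in the episode.
solved = {
    "slyp, best configuration": 27,
    "coding agents, default setup": 0,
    "coding agents, basic submission tool": 13,
    "coding agents, full slyp toolset": 26,
}

# Fraction of benchmark cases solved under each setup.
rates = {name: count / TOTAL_CASES for name, count in solved.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The jump from the default setup to the full toolset is 26 cases on the same models, which is the scaffolding effect the hosts go on to discuss.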

12:23Finn: This is the point of the paper that travels furthest outside this domain. Pause on it. Same models. Zero verified exploits with the default scaffold. Twenty-six verified exploits with the right scaffold. The gap isn't the model. It's the workshop.

12:41Hope: Right. Think of a skilled carpenter in an empty room with only their hands. They cannot build a cabinet, no matter how much they know. Give them a saw, a drill, a square, a workbench, and suddenly they're productive. Give them a CNC machine and they're an order of magnitude faster. The model is the carpenter. The binary explorer, the inspector, the debugger: those are the tools. Production coding agents verifying zero exploits without these tools is the empty-workshop result.

13:13Finn: And there's a corollary worth pulling out. The scaffolding gap is largest when the model is weakest. On the strongest configuration, slyp's task verifier, the middleware that prevents the agent from giving up, barely moves F1, by about three thousandths of a point. But on the smaller model they tested, the gap between slyp's full pipeline and the baseline widens to zero point two zero eight in F1, with the task verifier highlighted as the component carrying most of that difference. On a weaker model, the scaffold is the entire game.

13:49Hope: That's a generalizable lesson. If you're building on a strong model, you can probably get away with a thinner scaffold. If you're using a smaller, cheaper model, the scaffold is doing most of the work, and the difference between a good agent and a mediocre one is the scaffolding around the model, not the model itself.

14:10Finn: OK. The benchmark numbers are striking. But the real-world deployment is where this lands as a security result, not just a methods paper.

14:19Hope: They run slyp as an iterative research campaign against production Microsoft Windows services. Twenty-eight previously unknown vulnerabilities. All confirmed by Microsoft's security response team. Sixteen CVEs assigned, two merged with existing patches, ten confirmed without a CVE. A twenty-four to four split across the two corruption types from our SetPrintTicket example. Across nine services: printing, filesystem deduplication, the Shell Infrastructure Host, Bluetooth, digital media, clipboard, network connections, Windows Installer.

14:54Finn: And the privilege escalation breakdown is worth lingering on. Twenty-two of the twenty-eight escalate from low to medium integrity. Three from medium to SYSTEM. And three go from low integrity directly to SYSTEM. That last category means: a normal user, not an admin, becomes the most privileged account on the box. From a user-level program. Just by triggering one of these races.
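The tallies quoted in this exchange can be cross-checked against the total of 28. The sketch below is only a consistency check on the numbers as spoken; the category labels are paraphrases chosen for illustration.

```python
TOTAL_FINDINGS = 28

# Counts as quoted in the episode, grouped by how they were broken down.
breakdowns = {
    "disposition": {
        "CVE assigned": 16,
        "merged with existing patch": 2,
        "confirmed without CVE": 10,
    },
    "escalation": {
        "low to medium integrity": 22,
        "medium integrity to SYSTEM": 3,
        "low integrity to SYSTEM": 3,
    },
}


def consistent(breakdown, total=TOTAL_FINDINGS):
    """Return True when the categories of a breakdown add up to the total."""
    return sum(breakdown.values()) == total


# Findings that end at SYSTEM, from either starting integrity level.
reaching_system = (
    breakdowns["escalation"]["medium integrity to SYSTEM"]
    + breakdowns["escalation"]["low integrity to SYSTEM"]
)
```

Both breakdowns add up to 28, and six of the findings end at SYSTEM.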

15:19Hope: And the bounty Microsoft paid out for all of this is one hundred forty thousand dollars.

15:24Finn: Which sounds like a lot, but the right comparison is human researcher time. My intuition, and the paper doesn't claim this directly, is that a skilled human reverse engineer hunting races might find a handful of these in a year. slyp found twenty-eight across nine services in one campaign. If this kind of pipeline becomes routine, the offense-defense math shifts in a way defenders should think about. The assumption has to become: any reachable race-condition pattern in privileged code is likely to be discovered.

15:59Hope: Finn, I want to take the steelman seriously here, because the result is striking enough that it deserves real scrutiny.

16:06Finn: Yes. Four pushbacks. The first is benchmark circularity. Twelve of the twenty objects in their evaluation contain vulnerabilities slyp itself discovered during development. The authors address this explicitly: they carve out a subset of eight cases that are 1-day vulnerabilities found independently by other researchers, that slyp never saw. slyp performs comparably on that clean subset, zero point nine four three versus zero point nine seven three overall. That's a reasonable defense. But the clean subset is eight cases. Strong claims about generalization need a much larger held-out set assembled by a third party, and we don't have one.

16:49Hope: That's fair. The methodology is honest, the response to circularity is principled, and the result holds on the clean subset — but the sample size on the truly clean subset is small.

17:01Finn: Second pushback. The static-analyzer comparison. COMRace's source isn't public, so the authors reimplemented it as COMRace plus plus. They validate their reproduction against COMRace's published findings, but the three-times gap is against a tool the same team built. The gap is large enough that it almost certainly survives any reasonable reproduction-quality concerns, but the listener should know the comparison is structurally not a clean head-to-head against an external system.

17:32Hope: Third pushback?

17:33Finn: The exploit verification criterion is binary, not weaponization-grade. A "verified POC" here means a debugger captured a crash whose call stack reaches the target function. That's not the same as a real exploit with reliable code execution at SYSTEM. The paper is honest that it's targeting memory corruption primitives, the foothold that exploits get built on. The gap between "crash" and "remote code execution" is real engineering work this system isn't doing. Yet.

18:03Hope: And the fourth?

18:04Finn: Cost. Stage two burns between seven and eleven million tokens per case at the strongest configurations. Not per benchmark. Per case. That's expensive enough that running slyp at scale across the full Windows attack surface has nontrivial economics, and the paper doesn't compare those token costs against the cost of a senior researcher's time at equivalent yield, which is the comparison anyone deciding whether to actually deploy this would want.

18:34Hope: Those are honest critiques, and they're the ones the authors themselves flag. They also call out two more. Race timing remains the bottleneck: even the strongest configuration leaves thirteen of forty cases unsolved, and almost half of submission attempts compile cleanly but fail to crash, because the timing window is hard to hit. And the system is dependent on the decompiler. If the decompiler mis-infers a type or a control flow path, the agent ends up reasoning about a function that isn't quite the one that runs.

19:09Finn: Hope, what's your read on where this leaves the field?

19:12Hope: Well, Finn, three things worth pulling out for a listener who isn't going to read the paper itself. The first is that closed-source binary analysis is now in reach for AI agents. Most prior work on agentic vulnerability discovery worked on source code or capture-the-flag challenges, which are controlled environments. slyp works on shipping Windows binaries with no source access, recovering enough semantic structure from decompilation plus public Microsoft debug symbols to reason about object lifetimes. The architectural pattern (binary explorer plus domain introspector plus live debugger, all behind a standardized tool protocol) generalizes. The authors expect it to transfer to other commercial off-the-shelf binaries with prompt and tool adjustments.

20:02Finn: Second?

20:03Hope: The economic asymmetry, which we already touched on. If Finn's intuition about human researcher rates is even roughly right, slyp finding twenty-eight in a single campaign means defenders need to assume the margin of safety around their reachable race-condition surface is much smaller than it looked.

20:21Finn: And the third?

20:22Hope: The scaffolding lesson. The paper is unusually clean about isolating what each piece of the system contributes. Binary exploration tools make discovery work. COM inspection and debugging tools make exploit generation work. With a strong model and the right scaffold, the gap over weaker scaffolding shrinks. With a weaker model, the scaffold is the whole story. That's a transferable empirical data point for anyone building agents in any domain, not just security.

20:52Finn: There's something I find genuinely useful in how this result lands. The before-picture: discovery of race-condition bugs in closed-source Windows services was a domain of expert tools that found a few bugs and missed many. The after-picture: an agentic system that didn't exist a few years ago is outperforming those tools by three times on detection and producing actual working exploits, with confirmation from Microsoft itself.

21:20Hope: And the part of the result I keep coming back to is the production coding agents on default settings verifying zero exploits, while the same models with the right tools verify twenty-six. The model was sitting in an empty workshop.

21:36Finn: The paper is by Hwiwon Lee, Jongseong Kim, and Lingming Zhang, at the University of Illinois at Urbana-Champaign, out earlier this month. The show notes have a link to the paper and related materials.

21:49Hope: Thanks for listening to AI Papers: A Deep Dive.