Why a Constrained Pipeline Beat a Full Coding Agent at Finding Bugs 30-to-1
About this episode
A frontier coding agent given full access to ten major open-source projects found twelve security bugs. A constrained pipeline using the same model class found three hundred seventy-nine. The gap isn't about compute — it's an argument about where LLMs actually belong in a rigorous engineering stack.
What you'll take away
- Why symbolic execution has been 'almost practical' for fifty years, and what specifically was blocking it from going mainstream
- The architectural move at the heart of SAILOR: the LLM writes the test harness, but never gets to declare a bug — deterministic tools do
- Why iteration matters so much: removing the feedback loop drops confirmed bugs from 379 to zero
- The three projects where SAILOR found nothing (curl, OpenSSL, SQLite) and what that tells you about which codebases this approach fits
- Why 40% of the bugs found are essentially invisible to standard fuzzing, and what that means for the current state of automated security testing
- A general pattern for deploying LLMs in serious engineering work: route every model output through tools whose failure modes are independent of the model's
Chapters
- 00:00 The 12-versus-379 result
- 04:01 Why symbolic execution never went mainstream
- 08:03 Epistemic decomposition: detective, locksmith, forensics lab
- 12:05 A real bug from start to finish
- 16:07 What the bug counts actually look like
- 20:09 The honest limitations
- 24:11 Why the pattern generalizes
- 28:13 What's next and what to watch
References in this episode
- KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs — The foundational symbolic execution engine paper that defines the 'precision ins
- Fuzzing: Hayes, Miller, et al. — A Survey of Symbolic Execution Techniques — A comprehensive survey of why symbolic execution has been 'almost practical for
Full transcript
0:00Hope: Here's the result that made me sit up. You take a state-of-the-art coding agent — Claude Opus, full access to a giant codebase, unlimited turns, told to go find security bugs. It comes back with twelve. You take the same problem, hand it to a pipeline that never lets the model see more than a sliver at a time, gives it a sixty-turn budget per attempt, and constrains everything it can do. That pipeline finds three hundred and seventy-nine previously-unknown memory-safety vulnerabilities in code as battle-tested as OpenSSL and FFmpeg and GNU Binutils.
0:38Eric: Twelve versus three-seventy-nine. Same model class, same target codebases. The constrained system finds roughly thirty times more bugs than the agent that was handed everything.
0:50Hope: That's the paper we're digging into: "Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery," by a team at UC Santa Barbara, posted to arXiv in early April twenty-twenty-six. We're recording a few weeks after it went up. Quick note before we get into it: this episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Hope, that's Eric. We're both AI voices from ElevenLabs, and the show isn't affiliated with either company. And the reason we wanted a full episode on this particular paper is that the gap between twelve and three-seventy-nine is doing something more interesting than just "bigger pipeline wins." It's an argument about where LLMs actually belong in a rigorous engineering stack.
1:41Eric: Right. And the system has a name — they call it SAILOR. The headline finding is the bug count, but the deeper claim is architectural. They argue that the right way to use a language model in this kind of work is not as the bug-finder, and not even as the planner. It's as the construction worker who builds the test rig.
2:03Hope: Let me set up why that framing is surprising, because it gets at something that's been stuck in software security for decades. There's this technique called symbolic execution. The idea is genuinely beautiful. Normally when you test a program, you run it with concrete inputs — the number five, a specific file, whatever. Symbolic execution does something different. It runs the program with placeholders that stand for "any possible value." Every time the code hits a branch — `if x is greater than ten` — the engine forks. One copy explores the path where x is greater than ten, another explores where x is not, and each tracks the constraints down its path. At the end, you ask a solver: is there a concrete input that would reach this exact line and cause a buffer overflow? If yes, it hands you the input.
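To make the forking idea concrete, here is a minimal Python sketch. It is not how a production engine works: the `SymbolicInt` class, the toy `program`, and the brute-force `solve` function are all illustrative assumptions, with `solve` standing in for a real constraint solver. The engine runs the toy program once per possible sequence of branch decisions, records the constraints along each path, and then searches for a concrete value that satisfies the constraints of the overflowing path.

```python
from itertools import product

DEST_SIZE = 16      # hypothetical destination buffer size
MAX_BRANCHES = 2    # the toy program has at most two branches on any path


class SymbolicInt:
    """Placeholder for 'any possible value'; records which way each branch went."""

    def __init__(self, decisions):
        self._decisions = list(decisions)
        self.trace = []  # list of (bound, branch_taken) constraints

    def greater_than(self, bound):
        taken = self._decisions.pop(0)
        self.trace.append((bound, taken))
        return taken


def program(size):
    """Toy target: a copy whose size is only overflowing on one path."""
    if size.greater_than(10):
        if size.greater_than(DEST_SIZE):
            return "overflow"
    return "ok"


def explore():
    """Fork at every branch by trying each decision sequence; keep distinct paths."""
    paths = {}
    for decisions in product([True, False], repeat=MAX_BRANCHES):
        sym = SymbolicInt(decisions)
        outcome = program(sym)
        paths[tuple(sym.trace)] = outcome
    return paths


def solve(trace, domain=range(256)):
    """Brute-force stand-in for a solver: first value consistent with the path."""
    for value in domain:
        if all((value > bound) == taken for bound, taken in trace):
            return value
    return None


def find_witness():
    """Return a concrete input that reaches the overflowing path, if one exists."""
    for trace, outcome in explore().items():
        if outcome == "overflow":
            return solve(trace)
    return None
```

Calling `find_witness()` on this toy returns a concrete size that drives execution down the overflowing path. Real engines avoid enumerating decision sequences and hand the path constraints to an SMT solver instead of scanning a small range.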
2:58Eric: Mathematically precise, reproducible bug witnesses. When it works, it's the gold standard.
3:04Hope: When it works. The technique has been around since the nineteen-seventies. It's been "almost practical" for fifty years. And the reason it never went mainstream is not the engine — the engine is fine. The reason is that symbolic execution can't just be pointed at a project. It needs a harness.
3:24Eric: Which is the part most people outside this corner of the field don't appreciate. A harness is a custom test driver. For each thing you want to symbolically execute, somebody — usually an expert who knows both the target codebase and the analysis engine inside out — has to write a small program that picks an entry point, allocates the right project-specific data structures, marks certain values as symbolic, stubs out irrelevant code, and tells the engine what counts as a violation.
3:57Hope: Think of symbolic execution as a precision instrument. An electron microscope, say. The instrument itself is extraordinary, but you can't just toss a rock on the stage. The sample has to be sliced thin and mounted and oriented before the microscope can do anything. For half a century, every codebase needed an expert to do that prep work by hand, which is why the microscope mostly sat unused.
4:24Eric: And the question this paper asks is a fairly direct one. LLMs are pretty good at reading code and writing code. Can they do the sample prep?
4:34Hope: That's it. That's the whole pivot. Don't ask the LLM to find bugs. Ask it to write the test setup that lets a precise but brittle bug-finder reach the suspicious spots a static analyzer already flagged. And let the static analyzer answer where, and let the symbolic engine answer whether, and let a real binary crash answer is-it-real.
4:57Eric: So three components, three questions. That's the architectural claim — they call it epistemic decomposition. Each component is good at exactly one thing and is not allowed to answer questions outside its lane.
5:12Hope: The analogy I keep coming back to is a detective, a locksmith, and a forensics lab. The detective canvasses the city and produces a list of houses where something suspicious might have happened — fast, broad, mostly wrong. The locksmith goes to each suspicious house and figures out how to construct a key to get inside — skilled, but useless without a target list. The forensics lab takes whatever the locksmith found and tests it against physical evidence — slow, expensive, definitive. None of them alone solves crimes. Together they do. And critically, the locksmith doesn't get to declare guilt. The lab does.
5:54Eric: The LLM is the locksmith. That's the move.
5:57Hope: Let me walk through what each of those phases actually looks like — and the cleanest way to do it is to follow one real bug through the whole pipeline. The authors include a running example: a heap buffer overflow they discovered in GNU Binutils, which is the toolchain that ships with basically every Linux system on Earth.
6:19Eric: A reasonably consequential place to find a memory bug.
6:22Hope: Right. Inside the linker, there's a function with a memcpy that copies some number of bytes from one buffer to another. There are two null-pointer guards before the copy. But nothing checks that the size actually fits inside the destination. If size exceeds the allocation, you've stomped on memory you don't own. Classic heap buffer overflow. To find this automatically, you have to answer three questions: where is the bug, how do you reach it from outside, and is it actually triggerable. SAILOR's three phases each answer exactly one. Phase one is the detective. They run a static analyzer called CodeQL across the project. CodeQL treats source code like a database: you can write queries against it asking things like "find every memory copy where the length argument isn't bounds-checked." Out of the box CodeQL ships with rules for the standard vulnerability patterns, and the SAILOR team wrote some custom ones to fill in gaps. The unchecked-copy-length rule alone produces six hundred nineteen findings just in Binutils. Across all ten projects they tested, after some filtering for test code, they end up with eighty-seven thousand candidate targets.
7:40Eric: Eighty-seven thousand. Which is a number that should make you flinch a little, because static analysis has a notorious false positive rate. The going estimate is something like ninety-nine percent of those flags will turn out not to be real bugs. The whole field has spent decades drowning in static analysis output...
8:01Hope: And SAILOR's response to that is essentially: fine, we'll treat all eighty-seven thousand as hypotheses. We're not going to ask a human to triage them. We're going to send each one to phase two, and let execution decide. For each finding, phase one packages a small specification — a JSON document with the location, a natural-language description of the suspected pattern, the surrounding function calls, an assertion template, and the build context. For the Binutils memcpy, the assertion template basically says: the copy length must not exceed the smaller of the source and destination sizes. That's the property the harness will eventually try to violate.
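As a rough illustration of what such a self-contained specification might look like, here is a Python sketch that builds one and serializes it to JSON. The episode only names the kinds of information in the spec (location, pattern description, surrounding calls, assertion template, build context). Every field name, path, line number, and value below is a hypothetical placeholder, not the paper's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TargetSpec:
    """One static-analysis finding packaged as a standalone work item.

    All field names here are assumptions made for illustration.
    """
    file: str                      # where the suspect line lives
    line: int
    pattern: str                   # natural-language description of the suspicion
    call_context: List[str]        # functions on the way to the suspect line
    assertion_template: str        # property the harness will try to violate
    build_context: Dict[str, str]  # whatever is needed to compile the slice


def to_json(spec: TargetSpec) -> str:
    """Serialize a spec so it can be shipped to any worker machine."""
    return json.dumps(asdict(spec), indent=2)


# Hypothetical example loosely modelled on the unchecked-copy case described above.
example_spec = TargetSpec(
    file="example/linker_module.c",
    line=123,
    pattern="memcpy length is not checked against the destination size",
    call_context=["entry_function", "helper_function"],
    assertion_template="assert(copy_len <= min(src_size, dst_size))",
    build_context={"include_dirs": "include/", "defines": ""},
)
```

Because each spec carries everything a worker needs, a scheduler can distribute `to_json(example_spec)` payloads across machines without any shared state, which is the property that makes the fan-out possible.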
8:46Eric: And because each spec is self-contained, you can fan all of them out across machines in parallel. That's part of why the pipeline can chew through eighty-seven thousand targets at all.
8:59Hope: Then phase two — the locksmith. This is where the LLM lives. For each spec, an orchestrator runs a budgeted loop. Up to sixty turns. The LLM gets the spec, and three things start happening. First, it reads. It uses tool calls to pull function signatures, struct definitions, headers from the project source. It needs this because the structures involved are gnarly. In the Binutils example, the relevant struct has more than forty fields, including embedded sub-structures and function pointers. Second, it writes. It writes three things: a driver — basically a `main` function that allocates the entry function's arguments and marks the right fields as symbolic — a code slice that contains only the call chain from the entry to the suspect line with everything else stubbed out, and a set of assertions that tell the symbolic engine what counts as triggering the bug. Third — and this is the load-bearing part — it iterates. Eric, let me hand this back to you, because the iteration loop is where the system either works or doesn't, and the actual transcript for the Binutils bug is one of the best things in the paper.
10:16Eric: Yeah, the round-by-round is wonderful. So picture a student and a strict teaching assistant. The LLM writes a draft. The compiler and the symbolic engine play the role of the TA. They hand back error messages. The LLM revises. Round one: the LLM produces an initial harness. It gets compile errors. Some of them are because the LLM took the forty-field struct and reduced it to just the three fields it thought were necessary, a kind of stub, and the compiler is now complaining that one of those three fields is missing because it's actually defined in a different header. The orchestrator pattern-matches the error and greps the project headers for the real definition. Round five: the harness compiles. The symbolic engine runs. And it reports something that isn't success but isn't quite failure either. It says: I reached the entry function, but I never reached the line you wanted me to test. The LLM had gotten too aggressive with stubs in an earlier round: it had taken one of the null-pointer guards and replaced it with `if-true`, effectively neutering the guard. Which means execution wasn't going down the same path anymore. The fix is to restore the guard in the slice and add a constraint in the driver that forces execution down the dangerous branch. Round six: the engine reports a memory error. Copy size of seventeen, against a sixteen-byte destination. That's the witness.
11:53Hope: And this is happening autonomously. There's no human reading these errors and patching the harness. The orchestrator is classifying the error type and either fixing it directly or feeding it back to the model with hints.
12:09Eric: Exactly. The whole back-and-forth is automated. The orchestrator has a few categories of compile error it knows how to augment — incomplete type, conflicting prototype, that kind of thing — and for the symbolic-execution feedback it has three buckets: not reached, site reached but no error, and actually triggered. Each bucket suggests a different kind of fix.
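The budgeted feedback loop described here can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: the `Feedback` enum, the hint strings, and the `generate` and `evaluate` callables are hypothetical stand-ins for the model call and the compile-plus-symbolic-execution step, and one loop iteration is treated as one turn, which is coarser than the real system's accounting.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Feedback(Enum):
    COMPILE_ERROR = "compile_error"
    NOT_REACHED = "not_reached"            # target line never executed
    REACHED_NO_ERROR = "reached_no_error"  # target line hit, assertion held
    TRIGGERED = "triggered"                # engine produced a witness


# Hypothetical hint text; each bucket suggests a different kind of fix.
HINTS = {
    Feedback.COMPILE_ERROR: "Fix the compile error; use the real definitions from project headers.",
    Feedback.NOT_REACHED: "Restore guards removed by stubbing and constrain inputs toward the target.",
    Feedback.REACHED_NO_ERROR: "Loosen constraints on symbolic fields or revisit the assertion.",
}


def run_loop(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[Feedback, str]],
    max_turns: int = 60,
) -> Optional[Tuple[int, str]]:
    """Draft, evaluate, and revise until a witness appears or the budget runs out."""
    hint = None
    for turn in range(1, max_turns + 1):
        harness = generate(hint)
        feedback, detail = evaluate(harness)
        if feedback is Feedback.TRIGGERED:
            return turn, detail
        hint = f"{HINTS[feedback]} Details: {detail}"
    return None


def demo():
    """Replay a scripted three-round exchange instead of calling real tools."""
    scripted = iter([
        (Feedback.COMPILE_ERROR, "incomplete type"),
        (Feedback.NOT_REACHED, "entry reached, target line not reached"),
        (Feedback.TRIGGERED, "copy size 17 into 16-byte destination"),
    ])
    hints_seen = []

    def generate(hint):
        hints_seen.append(hint)
        return f"harness draft {len(hints_seen)}"

    def evaluate(harness):
        return next(scripted)

    return run_loop(generate, evaluate), hints_seen
```

The design point the sketch tries to capture is that the model never sees a bare "try again": every retry carries a classified, specific complaint from a deterministic tool.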
12:34Hope: And the reason this matters — the reason iteration specifically is load-bearing — shows up in the ablations. They tried a version of SAILOR with no iterative loop. One-shot LLM. Generate a harness, compile it, run it, take whatever you got. Compile rate dropped from forty-four percent to nineteen percent. Confirmed bug count dropped to zero.
12:59Eric: From three hundred seventy-nine to zero.
13:01Hope: The model is bad at writing correct harnesses on the first try. It's good at fixing specific mistakes when you tell it precisely what's wrong. That asymmetry is the entire reason this pipeline works.
13:16Eric: So that's phase two. The locksmith has produced a key. We have a concrete witness — copy size seventeen, sixteen-byte buffer — that the symbolic engine claims will trigger the overflow. But everything we've done so far has been against a sliced, stubbed, modified version of the code. We changed things to make the symbolic execution tractable. So how do we know the bug is real in the actual codebase?
13:45Hope: Phase three. The forensics lab. And it's almost embarrassingly simple given how careful phase two had to be. You take the symbolic harness and rewrite it as a concrete one — every place it said "make this symbolic," you instead `memcpy` the witness bytes into the right slot. You compile the unmodified, original project source with AddressSanitizer turned on. AddressSanitizer is a compiler feature that puts tripwires around every memory allocation. If the program steps over one, it halts and prints a stack trace. You link the concrete driver against the real library. You run it. If it crashes, and the crash lands inside the project's own code rather than your driver, it counts. For the Binutils bug, AddressSanitizer reports a heap buffer overflow — an eight-byte read landing four bytes past a four-byte allocation — with a stack trace pointing at the original linker function. The symbolic engine was right. The bug exists in real, unmodified, shipped code.
14:56Eric: And this is the part of the architecture I think is most worth pausing on, Hope. The LLM never gets to be the judge. The LLM wrote the test rig. But the verdict comes from a deterministic constraint solver, and then from a real binary actually crashing under a sanitizer. If the LLM hallucinated something — invented an API that doesn't exist, claimed a struct field that isn't there — the compiler catches it. If the LLM wrote a harness that doesn't actually reach the suspect line, the symbolic engine catches it. If the symbolic engine produces a witness that doesn't actually crash the real code, AddressSanitizer catches it.
15:42Hope: Three independent gates. Each one catches a specific failure mode of the previous component. That's what they mean by epistemic decomposition.
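One way to picture the three gates is as a chain of checks where a candidate is rejected at the first gate it fails and confirmed only if it passes all of them. The Python sketch below is an illustration, not the paper's implementation: the `Attempt` record, its field names, and the `project/` path prefix are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """Outcome of one harness attempt; all fields are illustrative assumptions."""
    compiles: bool                    # gate 1: did the compiler accept the harness?
    witness: Optional[bytes]          # gate 2: input produced by the symbolic engine
    crash_frame_file: Optional[str]   # gate 3: source file of the crashing frame on
                                      # the unmodified sanitizer build, None if no crash


def verdict(attempt: Attempt, project_prefix: str = "project/") -> str:
    """Apply the gates in order; the model's own opinion is never consulted."""
    if not attempt.compiles:
        return "rejected: compiler"
    if attempt.witness is None:
        return "rejected: symbolic engine found no witness"
    if attempt.crash_frame_file is None:
        return "rejected: witness did not crash the real build"
    if not attempt.crash_frame_file.startswith(project_prefix):
        return "rejected: crash is in the driver, not the project"
    return "confirmed"
```

For example, `verdict(Attempt(True, b"\x11", "project/linker.c"))` is confirmed, while the same witness crashing inside `driver/main.c` is rejected, mirroring the rule that a crash only counts when it lands in the project's own code.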
15:52Eric: Three hundred seventy-nine bugs survived all three gates. Across roughly six point eight million lines of C and C-plus-plus code. Major projects. Some of the most-audited open-source software on the planet.
16:07Hope: Let me put some of those numbers around the room, because they're worth sitting with. mupdf — the PDF rendering library — one hundred forty-one vulnerabilities. FFmpeg, the video toolkit — seventy-eight. Binutils — fifty-two. libpng — twenty-one, in a library that's only sixty-three thousand lines long, and notably, the libpng bugs require very specific PNG chunk-type and bit-depth combinations that random fuzzing essentially never hits. Symbolic execution finds them because it can reason structurally rather than just mutating bytes.
16:46Eric: That last point matters more than it sounds. Out of the four hundred twenty-one confirmed crashes — they collapse to three seventy-nine after deduplication — about sixty percent could also be reproduced by a fuzzer once you seeded it with the witness SAILOR found. The other forty percent essentially could not. They require multi-field struct setup that random mutation almost never reconstructs. Those are bugs that exist and are not findable by the dominant approach to automated security testing today.
17:23Hope: Which is one of those statements that sounds incremental and is actually structural. Forty percent of these vulnerabilities were sitting in production code, in widely-deployed open-source libraries, and the standard tools could not find them. SAILOR can.
17:42Eric: Now — this is where we need to talk about the comparisons, because one of the comparisons is the punchline of the whole paper.
17:52Hope: The agentic baseline. Same family of model, but pointed at the problem differently. They gave Claude Opus full access to the codebase. No CodeQL, no symbolic execution, no harness scaffolding. Just: here's the source tree, find security bugs and produce inputs that exploit them. Unlimited turns. Genuinely the best shot you can give a frontier coding model at this task end-to-end. It produced four hundred twenty-five inputs that it claimed would crash the target. Of those, only one hundred and five actually crashed when you ran them. Of the ones that crashed, fifty-one percent were duplicates targeting the same line over and over. Every single one of the eleven SQLite crashes hit literally the same line in the same file. After deduplication, fifty-one unique crash locations. After AddressSanitizer validation against the real, unmodified library, twelve survived.
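The funnel for the agentic baseline is easier to see as arithmetic. The short Python sketch below uses only the counts quoted in the episode; the stage labels and function names are my own.

```python
# Counts as quoted in the episode for the agentic baseline.
FUNNEL = [
    ("claimed crashing inputs", 425),
    ("actually crashed", 105),
    ("unique crash locations", 51),
    ("survived sanitizer validation", 12),
]

SAILOR_CONFIRMED = 379  # deduplicated bugs from the constrained pipeline


def survival_rates(funnel):
    """Fraction of the initial claims still standing at each stage."""
    first = funnel[0][1]
    return [(name, count, count / first) for name, count in funnel]


def duplicate_share(crashed, unique):
    """Share of real crashes that were repeats of an already-seen location."""
    return (crashed - unique) / crashed


def headline_ratio(pipeline_bugs, agent_bugs):
    """How many confirmed bugs the pipeline found per agent-confirmed bug."""
    return pipeline_bugs / agent_bugs


if __name__ == "__main__":
    for name, count, rate in survival_rates(FUNNEL):
        print(f"{name:32s} {count:4d}  {rate:6.1%}")
    print(f"duplicate share: {duplicate_share(105, 51):.0%}")
    print(f"pipeline vs agent: {headline_ratio(SAILOR_CONFIRMED, 12):.1f}x")
```

Run as a script, this shows that under three percent of the agent's claimed crashing inputs survive to a validated bug, and that the pipeline-to-agent ratio is a little over thirty to one.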
18:54Eric: Twelve. Versus three seventy-nine from the constrained pipeline.
18:59Hope: And the natural reading is "well, the bigger system won." But I think that reading is wrong, because the comparison is not "more compute beats less compute." It's "asking a different question beats asking the original question harder."
19:18Eric: Yeah. The agent was being asked to find a needle in a haystack. SAILOR is asking the LLM, eighty-seven thousand separate times, "test whether this specific straw is the needle." Constraint isn't a limitation here — it's what makes the task tractable for the kind of system an LLM actually is.
19:38Hope: LLMs are bad at deciding what to look at. They're decent at executing a tightly scoped task you've already framed for them. SAILOR's structure absorbs the deciding-what-to-look-at problem into the static analyzer, which is — for all its noise — much better at being indiscriminately broad than the LLM is.
20:00Eric: There's a line in the paper I keep coming back to. They say: no single technique can answer all three questions needed to identify the vulnerability. Where, how, and whether. Each of the three components is bad at the other two. Together, they form a system whose failure modes don't compound — they cancel.
20:21Hope: Eric, this is where I want to push, though, because there's a steelman version of "this number is too good to be true," and the authors are actually pretty candid about parts of it.
20:33Eric: Go ahead.
20:34Hope: Three things. First — the confirmation rate. SAILOR processed eighty-seven thousand specifications and confirmed four hundred twenty-one crashes. That's a rate of about one half of one percent. Most of the work the system does fails. That's fine in absolute terms — being wrong cheaply eighty-seven thousand times still produces three hundred seventy-nine hits — but a practitioner deploying this needs to understand that the failure modes are project-shaped. Some libraries are intrinsically hostile to this approach.
21:10Eric: And we have direct evidence of that. Three of the ten projects returned zero confirmed bugs. curl, OpenSSL, SQLite.
21:17Hope: Right. And the reasons are interesting and they're not "the system is broken." curl needs a multi-step session setup — you have to call init, then call setopt several times, register callbacks — before any of the interesting code runs. That sequence doesn't fit in sixty turns. OpenSSL has an extremely complex internal type hierarchy and multi-step initialization for cryptographic contexts that the LLM couldn't reconstruct. And SQLite — this one is almost poetic — the symbolic engine actually triggered twelve errors inside the B-tree code, but they all required a valid database state. A schema, a page cache, an open handle. Raw symbolic byte values can't reconstruct a coherent database from nothing.
22:05Eric: And the paper doesn't hide this. They report the three zeros. They explain them. The system burned about eight hundred fifty million tokens — call it thirty-seven percent of the entire compute budget — on those three projects, and produced nothing on them.
22:22Hope: That's the honest framing. SAILOR works well on projects whose entry points can be exercised with a small, locally-constructed input. It struggles on projects whose interesting code depends on substantial pre-existing state. Which is, by the way, a useful piece of information for anyone thinking about deploying this — there's a real question about which codebases this approach will pay off on before you spend any compute.
22:50Eric: That's one. What were the other two?
22:53Hope: Second — confirmation under AddressSanitizer is not the same as exploitability. A confirmed crash means there is some sequence of bytes that, when fed through this driver, makes the real binary commit a memory error. That's a real bug. But forty percent of these crashes can't be reproduced by a fuzzer. Which suggests they may require attacker control over internal program state that isn't reachable through normal program inputs. The paper is careful to say "memory-safety vulnerabilities" rather than "exploitable vulnerabilities" — but the framing invites the stronger reading, and a careful reader should resist it.
23:36Eric: That's fair. Some fraction of these are bugs in the strict sense — undefined behavior, latent corruption — that may never be reachable in practice through actual user input. They're worth fixing, they're worth knowing about, but they're not all the kind of thing that lets someone take over a server.
23:57Hope: And third, and this one the authors flag themselves: deduplication. They collapse confirmed crashes by file, function, and line. So if thirty different symbolic paths reach the same root cause, those count as one bug. Which is the right move on average. But it cuts both ways. Two genuinely distinct bugs that crash on the same line get merged and undercounted, and conversely a single root cause that surfaces at several different lines gets counted more than once. The number three hundred seventy-nine is an estimate, not a precise count.
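Deduplicating by file, function, and line amounts to grouping crash records on a three-part key. The Python sketch below shows that grouping and why it can undercount: two crashes with different trigger conditions on the same line land in one bucket. The record layout and the sample data are hypothetical.

```python
from collections import defaultdict


def deduplicate(crashes):
    """Group crash records by (file, function, line); one group counts as one bug."""
    groups = defaultdict(list)
    for crash in crashes:
        key = (crash["file"], crash["function"], crash["line"])
        groups[key].append(crash)
    return dict(groups)


# Hypothetical crash records. The first two share a location but have
# different triggers, so this keying scheme reports them as a single bug.
SAMPLE_CRASHES = [
    {"file": "a.c", "function": "copy_section", "line": 40, "trigger": "size too large"},
    {"file": "a.c", "function": "copy_section", "line": 40, "trigger": "negative offset"},
    {"file": "b.c", "function": "parse_chunk", "line": 12, "trigger": "bad bit depth"},
]
```

Here `len(deduplicate(SAMPLE_CRASHES))` is 2 even though there are three distinct triggers, which is the sense in which a location-keyed count is an estimate.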
24:35Eric: I'd add one more — there's a comparison worth flagging that the authors don't quite make cleanly. The agentic baseline isn't a perfectly fair comparison. The agent was asked to find and exploit bugs end-to-end. SAILOR was given the static analysis findings as a starting set. A fairer comparison would feed the agent the same per-target specs SAILOR gets.
25:00Hope: They tried something close to that, actually. There's a baseline labeled B4 in the paper, which is essentially a static-analysis-guided LLM with raw CodeQL findings. It produced two confirmed bugs.
25:13Eric: Two. So even handing the LLM the same starting hypotheses, without the iterative loop and the symbolic execution gate, you get nowhere near three seventy-nine.
25:24Hope: Which suggests the harness-writing-with-feedback loop and the symbolic execution gate really are doing the work. It's not just CodeQL pre-filtering that matters. It's the feedback dialogue that catches the LLM's hallucinations and the symbolic engine that proves reachability.
25:43Eric: There's also a methodological question worth raising. The main results use GPT-5, and these are public open-source projects, which means the model very plausibly saw them during training. Could the system be just regurgitating bugs it had memorized?
26:00Hope: The authors thought about that. They re-ran the libtiff portion of the experiment with DeepSeek-V3.2 — a different model from a different lab — and recovered eighty-six percent of the GPT-5 results. That's not a complete answer, but it's a partial defense. The pipeline isn't critically dependent on one specific model's training data. Different models, broadly the same bug set.
26:26Eric: Hope, I want to come back to the architectural argument, because I think it generalizes beyond this paper.
26:33Hope: The connective-tissue framing.
26:35Eric: Yeah. The naive way to use an LLM in a serious engineering pipeline is as an oracle. "Hey model, find bugs in this code." That tends to produce confident-sounding nonsense, because the model has no way to ground its claims. SAILOR is the opposite. The LLM never claims a bug. It writes scaffolding. The verdict comes from a deterministic solver and a real binary crashing under a sanitizer. What's appealing about that pattern is that it doesn't require the LLM to be reliable on its own. It only requires the LLM to be useful enough that — combined with rigorous downstream gates — the joint system is reliable. And the downstream gates are independent. They don't share the model's failure modes.
27:20Hope: Which is a structurally different way to think about LLM deployment than what most of the field is doing right now. The dominant pattern in AI engineering today is "let the agent loop until it thinks it's done, then trust its self-report." That works some of the time, fails badly the rest of the time, and the failure modes look like the agentic baseline here — confident outputs, low real-world hit rate, lots of duplicates because the model's notion of "this is a different bug" is whatever it happens to be in that context window.
27:54Eric: SAILOR's pattern is "let the agent generate, but route everything through tools that don't share its weaknesses." And those tools don't have to be a constraint solver and a sanitizer. They could be a theorem prover. They could be a type checker. They could be a database with foreign-key constraints. Whatever the rigorous, deterministic check is for the domain you're working in.
28:18Hope: The locksmith doesn't get to declare guilt. The forensics lab does. And the field is going to have to figure out, domain by domain, what the forensics lab is for that domain.
28:29Eric: For program analysis specifically, this paper is making an argument that I think will hold up. Symbolic execution has been "almost practical" for fifty years. The bottleneck was never the engine. It was the human labor of harness-writing. And that bottleneck — for a meaningful subset of codebases, with the caveats we laid out — appears to have just been removed.
28:56Hope: Three hundred seventy-nine bugs in well-audited open-source code is a strong empirical signal. The architectural argument is what makes me think it'll keep generalizing. Symbolic execution is rigorous but brittle. Static analysis is broad but noisy. Language models are creative but unreliable. Pick the right division of labor, and the strengths combine while the weaknesses don't.
29:24Eric: One last detail I want to surface, because it's a nice grounding number. The full pipeline cost about two point three billion LLM tokens across all ten projects. Average twenty-six thousand tokens per specification. Per-bug cost ranges from less than a million tokens — that's mupdf — up to twenty-nine million, which is libxml2. Eighty-seven thousand parallel attempts, two point three billion tokens, three hundred seventy-nine confirmed bugs.
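The cost figures quoted across the episode can be cross-checked with a few divisions. The Python sketch below uses only numbers stated in the transcript, which are themselves rounded, so the results are approximate; the constant and function names are my own.

```python
# Figures as quoted in the episode (rounded in the source).
TOTAL_TOKENS = 2_300_000_000        # whole pipeline, all ten projects
SPECIFICATIONS = 87_000             # candidate targets processed
CONFIRMED_CRASHES = 421             # before deduplication
UNIQUE_BUGS = 379                   # after deduplication
ZERO_PROJECT_TOKENS = 850_000_000   # spent on curl, OpenSSL, and SQLite


def tokens_per_specification():
    """Average token cost of one attempt."""
    return TOTAL_TOKENS / SPECIFICATIONS


def confirmation_rate():
    """Fraction of specifications that ended in a confirmed crash."""
    return CONFIRMED_CRASHES / SPECIFICATIONS


def zero_project_share():
    """Share of the token budget spent on projects that yielded nothing."""
    return ZERO_PROJECT_TOKENS / TOTAL_TOKENS


def average_tokens_per_bug():
    """Overall average; the per-project figures quoted above vary widely around it."""
    return TOTAL_TOKENS / UNIQUE_BUGS


if __name__ == "__main__":
    print(f"tokens per specification: {tokens_per_specification():,.0f}")
    print(f"confirmation rate:        {confirmation_rate():.2%}")
    print(f"zero-project share:       {zero_project_share():.0%}")
    print(f"average tokens per bug:   {average_tokens_per_bug():,.0f}")
```

The outputs line up with the spoken figures: roughly twenty-six thousand tokens per specification, a confirmation rate of about half a percent, and about thirty-seven percent of the budget going to the three projects with no confirmed bugs. The overall average of roughly six million tokens per bug is derived here, not quoted in the episode.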
29:56Hope: Which, if you squint, is the economics of the whole approach. Each individual attempt is cheap and fails. The collective volume produces real findings. It's only viable because each attempt is independent and the gates are deterministic. You don't need a human in the loop until you have a confirmed crash with a stack trace and a witness to hand over. Eric, what's your read on where this goes next?
30:25Eric: The obvious extensions are non-memory-safety bugs — race conditions, logic flaws, crypto misuse. Those don't decompose cleanly into the assertion-template structure SAILOR uses, because there's no equivalent of AddressSanitizer that catches "this protocol implementation is subtly wrong." So that's a real frontier. The other direction I'd watch is the same template applied to different rigorous tools. If LLMs can write symbolic-execution harnesses, they can probably write fuzzing harnesses, formal-verification setups, property-based test specs. Each of those communities has been bottlenecked on expert human setup labor for the same reason. The harness problem isn't unique to symbolic execution.
31:16Hope: And the question for security teams reading this paper today is which of their codebases look like mupdf and FFmpeg — locally testable, structurally accessible — and which look like curl and OpenSSL, where the architecture demands a different approach. That triage is genuinely actionable.
31:36Eric: The paper isn't claiming a universal solution. It's claiming that a previously-unattainable category of bug is now findable cheaply. That's a real shift, and the three zeros tell you exactly where the shift doesn't apply yet.
31:52Hope: That feels like the right place to wrap. The paper is "Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery," from the UC Santa Barbara team, posted in early April. The show notes have a link to it and to related materials — worth a read if any of this caught you.
32:13Eric: This has been AI Papers: A Deep Dive. Thanks for listening.