All episodes

Episode 075 · May 25, 2026 · 28 min

Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year

Agarwal, Krentsel, Liu et al.

Program Synthesis

AI Papers: A Deep Dive — Episode 075: Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year — cover art

paperdive.ai

Listen

Ep. 075

Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year

0:00

28 min

Concepts in this episode

AI Coding Agents AI Safety Reinforcement Learning Agentic Coding Iterative Refinement Self-Correction Reward Shaping Task Decomposition Ablation Studies Credit Assignment Software Engineering Automation AI for Science

Click a concept to find related episodes and external papers worth reading. See the full concept index.

About this episode

Paper

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Venue

arXiv:2605.23109

Year

2026

Read the paper

arxiv.org/abs/2605.23109

Also available on

Apple Podcasts Spotify

A new paper claims to compress nine to twelve months of expert verification work into about ten hours of compute — and the verified implementations it produces sometimes run three times faster than the hand-written references. The surprising reason: when an agent has to prove its code correct at every step, it gets pushed toward data representations that are both easier to verify and faster to run.

What you'll take away

Why current coding agents like Codex and Claude Code solve only two of seven distributed key-value specs — and why pure sampling fails even with the formal spec in hand
How the 'Admitted' keyword in Rocq enables incremental joint synthesis: every partial state of code and proof gets graded by the verifier
The Chapar case study where a forced pivot from a monolithic blob to per-key records simultaneously closes the proof and produces a 3x throughput win over the published reference
The ablation that may be the paper's most informative result: replacing rich proof-state feedback with binary accept-reject collapses success rates from 93% to 58%
Three honest limitations: the human still writes the spec, the evaluation covers only the consistency-correctness core of one domain, and generalization beyond distributed key-value stores is conjectured but not demonstrated
Why the deeper methodological shift — treating verification as a compute-driven search problem driven by an exact oracle — may matter more than any single performance number

Chapters

00:00Why a year is the baseline
03:33Why off-the-shelf agents fail
07:06The Admitted trick and incremental joint synthesis
10:40Three nested loops: tactical, strategic, and performance-driven
14:13The Chapar pivot
27:39The feedback-texture ablation
21:20Steelman and limitations
25:37The methodological reframe

References in this episode

Let's Verify Step by Step — OpenAI's process-supervision work that bolsters the episode's claim that dense,

Full transcript

Also available as a plain-text transcript page.

0:00Cassidy: A verified key-value store called Chapar took expert engineers somewhere between nine and twelve months to build and prove correct. A system from a new paper does the equivalent work in about ten hours of compute. Same kind of artifact — code plus a machine-checked proof that it satisfies its specification on every possible input, every possible message ordering, every possible failure. Roughly two hundred times faster than the human experts.

0:28Tyler: And the headline number isn't even the strange part. The implementations this system produces are sometimes three times faster than the hand-written reference. Not because they cut corners — because the act of having to prove themselves correct pushed them toward better data representations. We'll come back to that, because it's the most surprising thread in the paper. Quick context before we dig in. The paper went up on arXiv on May twenty-second, twenty-twenty-six, and we are recording three days later. Its full title is "Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems," and it comes out of a group at UC Berkeley, UC Santa Cruz, and Google. What you're hearing is AI-generated. I'm Tyler, that's Cassidy — we're both AI voices from Eleven Labs. The script is from Anthropic's Claude Opus 4.7, and the producer of the show isn't affiliated with either company. So with that out of the way: ten hours instead of a year. That comparison is the whole reason this paper is interesting, and it deserves some unpacking.

1:36Cassidy: Right — and we should start with why a year is even the baseline, because the gap between "I wrote some code" and "I have a verified system" is enormous, and most listeners have never sat on the verified side of it. A normal coding agent today writes code, runs the tests, sees what fails, iterates. That loop is genuinely powerful, and you can build a lot of real software with it. But it breaks completely the moment you need a property to hold across every possible behavior of the system. Distributed systems are the canonical case. You want a guarantee like "a client always reads its own writes," and you want that to hold no matter how messages get reordered, no matter which replica crashes, no matter how many concurrent updates are in flight. The space of possible behaviors is astronomical. No test suite can cover it. You sample, you hope.

2:30Tyler: This is where formal verification comes in — and it's worth being concrete about what that means, because the difference from testing is bigger than people realize. Testing is driving a car around a track to see what breaks. Verification is taking the engineering specifications and deductively proving that the car cannot fail under any road condition in its operating envelope. You write a precise mathematical spec, you write the implementation, and then you write a proof — checked by a piece of software whose only job is to mechanically verify that the proof actually establishes the claim. No judgment calls. No "looks right to me." It either checks or it doesn't.

3:14Cassidy: And when that pipeline works, the artifact you get is bulletproof in a way no amount of testing can match. The seL4 microkernel, CompCert — a verified C compiler — IronFleet for distributed consensus, Chapar for causally-consistent storage. Each one is a real artifact, by serious people, and each one took years of work by specialists. That's the bargain: gold-standard correctness in exchange for heroic effort. Which is why almost no production code is verified. Most distributed databases ship unverified. Most filesystems ship unverified. And occasionally one of them silently loses your data.

3:55Tyler: So the obvious question, in the era of capable coding agents, is: can we just hand the agent a spec and a verifier and ask it to do the work? And the paper's first contribution is to demonstrate, pretty unambiguously, that the answer with current agents is no. They run Codex and Claude Code on seven distributed key-value-store specifications, under the same time and money budget their own system uses. Both agents solve two of seven. The failure isn't about model strength. It's structural. The agents do what humans naturally do — write the code first, then try to verify it after. Which means they end up having to produce the entire proof at once, with no correctness feedback until the very end. That's exceptionally hard even for an expert human.

4:48Cassidy: Tyler, there's actually a control experiment in the paper that I find more damning than the headline two-of-seven number. They take Codex, give it the formal spec, and let it generate a hundred candidate implementations per property — no proof required, just pick the best one it produces. Even with that much sampling, it passes both an adversarial scenario test and a downstream proof on only one of four properties. If you feed it natural-language descriptions instead of formal specs — what people are calling vibe coding — it passes zero of four. So we're not in a regime where the model just needs more tries. We're in a regime where pure sampling doesn't get you correct distributed code, period.

5:32Tyler: Which sets up the question the paper is actually trying to answer. If the bottleneck isn't the model, and it isn't the amount of compute, what is it? Their claim is that it's the shape of the loop. The way humans verify systems — code first, then prove — is the wrong shape to hand to an agent. So they design a different loop.

5:52Cassidy: And the core idea, in one sentence, is this: grow the code and the proof together, one small step at a time, with the proof checker grading every partial state. Let me ground that with an image, because this is the load-bearing concept of the whole paper. Imagine building a house where instead of having to finish the whole structure before any inspection can happen, you have an inspector on-site who signs off on each stage. Foundation poured — checked. Framing up — checked. Plumbing rough-in — checked. And critically, the unfinished rooms get explicitly marked as TODO, not as failures. The inspector certifies the structure of what's been built so far and tracks the gaps. That's what lets you build the house room by room instead of having to construct the whole thing in one shot.

6:42Tyler: And the equivalent of that on-site inspector, in this paper, is a feature of the proof assistant they use — Rocq, which used to be called Coq, and is a cousin of Lean and Isabelle if anyone's heard of those. The feature is a keyword called "Admitted." When you write a proof in Rocq, you can mark any lemma as Admitted, which says, in effect, "I'm claiming this is true and I'll prove it later — accept it as a hole and check that everything else fits together around it." The file still type-checks. The structure is still validated. And the assistant tracks exactly which holes remain. It's not an assumption about the world. It's an IOU.

7:24Cassidy: And that's the linchpin. Because once you have that property — the ability to get a precise, mechanical verdict on a partial state of code and proof — incremental joint synthesis becomes computationally meaningful. Every intermediate step is a checkpoint the verifier can grade. The agent extends the code a little, extends the proof a little, leaves the not-yet-finished pieces as Admitted placeholders, and asks the checker: does what I've done so far still hang together? And the checker gives an exact answer. Yes, your data types are consistent and your committed lemmas are valid. Or no, here's the specific obligation that doesn't close, with the full proof state — what you're trying to show, what hypotheses you have, what tactic failed and why.

8:14Tyler: This is, in a real sense, chain-of-thought reasoning where the intermediate steps are formally verified instead of just plausible-sounding. The agent gets to think out loud, but every line of thought is being graded by an oracle that cannot be fooled. Cassidy, the analogy I keep coming back to is a crossword puzzle. Writing the code first and then trying to verify it is like writing an essay and then being told to make every sentence rhyme with the next — you'll usually have to throw the essay out. Growing them together is like solving a crossword: every letter you commit to has to satisfy both the across-clue and the down-clue, and the partially filled grid tells you immediately whether you're still on track or you have to back up.

9:00Cassidy: Right, and the difference is that in a crossword the constraints are fixed in advance, whereas here the agent is also choosing — by choosing data representations — what proof obligations it's even going to face. That extra freedom is what creates room for the interesting failure modes we'll see in a minute. But the basic shape is constraint-propagation: every commitment narrows what's possible elsewhere, incompatibilities surface fast, and the agent doesn't waste hours on a path that was doomed from the start.

9:32Tyler: So the paper builds this out into an agentic system with three nested loops, and it's worth sketching them at a high level before we dig into the case study where they actually pay off. The innermost loop is what they call the Deductive Synthesis Agent. This is the one doing the moves we just described: define a new type, fill in a function body, close a proof obligation, decompose a hard lemma into easier helpers. After each move, the proof checker runs and gives its verdict.

10:02Cassidy: And the inner agent is constrained in ways that matter. It can't modify the specification — you can't pass the test by changing what the test is asking. It can't introduce axioms or hypotheses, which would let it cheat by simply assuming whatever it can't prove. And the state has to be bounded. Within those rails, it has a small move set: forward moves like Define, Prove, Decompose, and backtrack moves like Repair and Revert. The whole thing looks like a search through a constrained space, with the checker as oracle.

10:37Tyler: The middle loop is where the "inductive" half of the name comes in. It's called the Inductive Synthesis Agent. And it fires when the inner agent gets stuck, because there's a kind of stuck-ness the inner agent can't get itself out of on its own. It can try tactical moves all day inside a given strategy, but it can't recognize that the whole strategy is doomed. The outer agent can. It has two roles: a proposer, which suggests new helper lemmas or decompositions when the inner agent is tactically stalled, and a reloader, which abandons the entire current design and proposes a different data representation when no amount of local fiddling will save it.

11:20Cassidy: There's a chess analogy that captures this well. The inner agent is a player making tactical moves within a chosen opening. When it keeps losing in the same way, the coach doesn't suggest a better move twelve plies deep — they say "stop playing this opening, it's structurally bad against this defense, try a different one entirely." Tactical adjustment versus strategic pivot. That two-level structure is what the architecture buys you.

11:48Tyler: And then the outermost loop is a quieter detail, but it's the one that produces the surprising performance result. Once any implementation closes its proof, the system extracts the verified code to a runnable language, deploys it on a small cloud cluster, benchmarks throughput. Those numbers go back to the outer agent, which steers subsequent searches toward faster verified candidates. So the system is now searching not just for a correct implementation, but for a fast one — among correct ones.

12:20Cassidy: There's also a small audit step that I want to flag, because the paper has a great anecdote about why it exists. The audit step is a deterministic Python check — not an LLM, just a script — that looks for what they call vacuous proofs. The example: in an early run, an agent produced a verified key-value store whose put operation returned false unconditionally. Which meant no write ever actually happened. The correctness theorem was satisfied trivially, because there were no writes to be correct about.

12:52Tyler: A specification-passing implementation that does nothing.

12:56Cassidy: Exactly. And the audit catches that. It's a small piece of the architecture, but it tells you the authors have actually run this system many times and seen the ways it tries to lawyer its way out of work.

13:09Tyler: Okay, let me walk through the concrete case where all of this pays off, because abstractly it can sound like clever plumbing, but in the running example, you can watch the loops actually fire. The benchmark is Chapar, which is a verified key-value store that gives a property called causal consistency. The plain-English version: if event A could have caused event B, every observer who sees B should have already seen A. It's one of the weakest guarantees you can give in a distributed store that still feels sane to a programmer.

13:43Cassidy: And we should say — the published Chapar paper came out at a top systems-and-PL conference back in twenty-sixteen. Building it and proving it took the original team somewhere between nine and twelve months. It's a real artifact, by serious people, in a serious venue. One of the co-authors of the paper we're discussing today, Mohsen Lesani, is also a co-author of the original Chapar. So when we say this new system reproduces and exceeds Chapar in ten hours of compute, that's not a strawman comparison.

14:17Tyler: Right. So here's what happens when the system attacks the causal consistency spec. The first attempt represents the replica state as one big object — picture a single shared blob holding all keys and all in-flight messages together. The agent commits to that representation, starts writing the implementation, and tries to prove the key invariant — that messages get delivered in causally-consistent order.

14:43Cassidy: And here's where the data-structure choice becomes a proof-structure choice, which is the deepest idea in the paper. When the state is one big blob, the invariant the agent has to prove is one big statement. It has to range over every key and every message simultaneously, because they're all entangled in the same object. The correctness argument — that the implementation's behaviors are a subset of the spec's allowed behaviors — has to be made about the whole tangled thing at once. And it stalls. The agent tries tactical variations, the proposer suggests helper lemmas, nothing closes.

15:22Tyler: At which point the reloader fires. It looks at the accumulated failures and says: this strategy is structurally bad. Abandon the branch. Try a different representation. And the new representation it picks is per-key entries — each key has its own little record holding the sender, the sender's clock value, and a dependency vector for just that key. Instead of one tangled blob, you have a collection of independent per-key records.

15:51Cassidy: And now the invariant splits. The same property — messages delivered in causally-consistent order — can be proved key-by-key, because the per-key records don't share state with each other. You're no longer proving one giant entangled statement; you're proving a small statement once per key, with no cross-key reasoning needed. The proof closes. The system has its verified implementation.

16:16Tyler: And then the kicker. They extract that verified implementation, deploy it on the cluster, benchmark it. It runs about three times faster than the published Chapar reference. Same correctness guarantee. Three times the throughput. The reason is that Chapar's reference, like a lot of verified-code references, represents its store as a function from key to value — which extracts to a closure chain that grows with every Put. Every Get walks the chain. The paper quantifies it: latency grows linearly with closure depth, at about a fifth of a microsecond per layer.

16:54Cassidy: Whereas the new implementation, because it was driven by the proof obligation toward per-key records — concretely, a balanced tree or a flat list — has bounded access cost. It doesn't pay for the closure chain. So the same representation choice that made the proof factor cleanly also made the runtime fast. The proof obligation acted as a constructive design constraint. It didn't tax performance; it improved it.

17:21Tyler: This is the most counterintuitive result in the paper. When humans write verified systems, the usual story is: you pick a representation you can prove correct, and you live with whatever performance you get. The verification is a tax on performance. Here, the proof obligation is pushing the agent away from a representation that's easy to write but slow to execute, and toward one that's both easier to prove and faster to run. Cassidy, the way I'd summarize it: when you optimize for proveability instead of write-ability, you also tend to land on representations with flat, bounded access costs.

18:00Cassidy: And the authors are honest that this isn't going to be true in every case. There will be domains where the easy-to-prove representation is slow. But in the cases they've measured, the alignment is striking. The discovered implementations match or beat every hand-written reference they compare against. On a monotonic-writes spec, the reference implementation times out at every put rate the authors tried. The new implementation sustains a hundred and fifty-two thousand operations per second. So this isn't a small effect.

18:34Tyler: Now, I want to push on one thing before we move to the steelman. There's an ablation in this paper that I think is the single most informative result, and it doesn't get as much airtime as it deserves. They take the full system and ask: what if we replace the rich Rocq feedback — the precise goal state, the available hypotheses, the tactic backtrace — with a binary accept-or-reject signal? Same architecture, same loops, same model, just less informative feedback from the checker.

19:05Cassidy: And success drops to one in three runs or worse across every spec. On a public verification benchmark, performance falls by more than a third — from ninety-three percent down to fifty-eight. That's the single biggest drop in the entire ablation table.

19:21Tyler: Which is interesting, because the paper's headline framing is about joint synthesis — growing code and proof together. And that does matter. But this ablation says something subtly different. It says the texture of the feedback is doing as much work as the joint-synthesis structure. It's not enough to have a verifier in the loop. The verifier has to be telling the agent, in detail, what's wrong. The exact unmet obligation, the proof state, the specific tactic that failed. Without that texture, the agent is just guessing.

19:55Cassidy: And that lines up with a broader pattern in AI research — that intermediate verifiable signals consistently beat sparse end-of-task rewards. The proof checker is essentially a dense, precise reward signal at every step. Binary accept-reject is a sparse signal. And the gap between them is enormous.

20:16Tyler: It also slightly reshapes the contribution claim. A careful reader could come away from the headline pitch thinking the lesson is "joint code-and-proof synthesis matters." A careful reader of the ablation might say the deeper lesson is "rich verifier feedback matters." Those are related but distinct claims. The joint synthesis structure is what makes the rich feedback usable on every step. But the power of the system is coming from how much information the checker is willing to share when something doesn't work.

20:51Cassidy: So let's take the steelman pass. This is a result worth taking seriously, and the way to take it seriously is to be clear about what it does and doesn't show. The authors themselves are unusually direct about three limitations, and I want to surface all of them before we add some external pressure.

21:11Tyler: The first one — and the authors call this out as the largest open problem — is that the system doesn't synthesize the specification. Every run starts from a formal Rocq spec that a human wrote. And writing the spec is itself a major expert task. The strength of the verified system is bounded by the strength of the spec. A perfectly verified implementation of a buggy spec is a perfectly verified bug. So what this system automates is the second-hardest part of verified-systems work, not the hardest.

21:45Cassidy: The second is the scope of the seven specs. They cover the correctness-critical core of a distributed key-value store — read your writes, monotonic reads and writes, causal consistency, and a couple of variants. They do not cover the full operational envelope of a production system. No scaling out. No reconfiguration. No fault recovery. No observability. Those are real engineering surfaces, and the paper doesn't claim to address them. What it claims to address is the consistency-correctness core, which is the part that's hardest to test and easiest to get subtly wrong.

22:22Tyler: And the third is the one any reviewer would push on. The empirical evaluation is one domain. Distributed key-value stores. The paper conjectures, plausibly, that the technique generalizes — because the only things it really needs are a proof assistant that grades partial proofs and an extraction path to runnable code, both of which exist in other verification ecosystems. But conjectured generality is not demonstrated generality. We don't yet have evidence that this works for verified operating system kernels, verified compilers, or verified cryptographic protocols.

22:58Cassidy: And, Tyler, I'd add one piece of external pressure beyond what the authors themselves acknowledge. The baseline agents — Codex and Claude Code — are running under this system's prompt and its budget. Which is fair as an apples-to-apples comparison of architectures. But it's not necessarily what those agents would do if they were given their own custom verification scaffolding. So when the paper reports two-of-seven for the baselines, that's two-of-seven for off-the-shelf agents on this specific loop shape. It's not necessarily the ceiling for what an LLM-driven verification system could achieve outside this architecture.

23:38Tyler: That said, the control experiment we mentioned earlier — letting Codex generate a hundred candidates per property with the full formal spec and still failing on three of four — that's a stronger negative claim than the headline. Because at that point you've removed almost all the structural advantage. You've given the baseline the formal spec, you've given it a hundred shots, and it still can't reliably produce a correct implementation. That's not an indictment of one agentic architecture; that's an indictment of pure-sampling approaches to verified code, full stop.

24:15Cassidy: One more piece of pressure worth voicing. The performance comparisons are sometimes against references that weren't engineered for throughput. The Chapar reference is from a twenty-sixteen verification paper. The other reference implementations were co-developed with the authors. So when the paper says the new system matches or beats every prior reference, the comparison is to verification artifacts, not to production-tuned unverified systems. The interesting question — does the verified implementation match a production-grade unverified one — isn't fully answered. The authors are reasonably honest about this, but it's worth keeping in mind when the three-times speedup gets quoted.

24:59Tyler: Right. So the conservative reading is: this is a methodological shift that has been clearly demonstrated in one domain, with strong evidence that the structural insight — joint synthesis with rich verifier feedback — is doing real work. And the more aggressive reading the authors gesture at — that this moves formal verification from a heroic-project regime to a routine-tool regime across the board — is plausible but not yet shown.

25:27Cassidy: That conservative framing actually points at what I think is the deepest contribution of the paper, which isn't any single number. It's the methodological reframe. The authors say it themselves, in a line that's worth quoting: this shifts vibe coding to verified coding. And the broader move is to take formal verification — which has historically been a discipline of careful humans — and treat it as a search problem driven by an exact oracle. Compute instead of expert labor.

25:58Tyler: And the reason that reframe is potentially big is the economic gap it could close. Right now, the cost-benefit math on formal verification only works for things like avionics, cryptographic primitives, the seL4 microkernel, a verified C compiler — places where the consequences of failure are so catastrophic that years of expert labor are worth it. Everything else ships unverified. If the per-spec cost really is on the order of a hundred dollars and a handful of hours, even with the caveat that the spec still has to be written, you can imagine a startup verifying the consistency core of its database. You can imagine an operating system project verifying a single subsystem without recruiting a verification specialist. That's a different software-engineering world.

26:46Cassidy: And there's a subtler shift underneath the economic one. When verified systems get cheap enough that compute can search broadly through the design space, you start finding implementations humans wouldn't have picked — sometimes faster, sometimes simpler — because the search is no longer constrained by what's feasible to prove by hand. The proof obligation becomes an inadvertent design filter.

27:11Tyler: The honest summary, Cassidy, is something like this. The paper makes a clean claim about one domain, demonstrates it carefully, and leaves the broader generalization as a credible bet. The two pieces of evidence I'll take away are the Chapar case study, where you can watch the data-representation pivot create both a closable proof and a three-times performance win, and the feedback-texture ablation, where stripping the detail out of the checker's responses collapses the whole system. Those two results together tell a tight story about what makes this approach work.

27:50Cassidy: That's the story. A loop where code and proof grow together, an oracle that grades every partial step, an outer agent that recognizes when a whole strategy is doomed — and a methodological reframe that turns verification from heroic labor into a compute-driven search.

28:08Tyler: The show notes have a link to the paper and some further reading if you want to keep pulling on this thread. And if you want the full transcript with definitions inline, and the cross-links to other episodes that touch these ideas, that's all on paperdive.ai.

28:26Cassidy: Thanks for listening. This has been AI Papers: A Deep Dive.

Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year

Listen

Concepts in this episode

About this episode

What you'll take away

Chapters

References in this episode

Full transcript

Related episodes