Episode 027 · May 08, 2026 · 30 min

When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure

Kamahori, Li, Peter et al.

Paper: VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Venue: arXiv:2605.06068
Year: 2026
Read the paper: arxiv.org/abs/2605.06068

What if the reason we use general-purpose serving frameworks like vLLM is just that bespoke ones used to be too expensive to write? A new paper points a team of coding agents at LLM serving and gets bespoke runtimes that match vLLM on its home turf and beat it by 2x — even 6x — on long-tail workloads it wasn't built for. We dig into whether the design-space bet actually holds up.

What you'll take away

  • Why 'generation-time specialization' revives an old systems argument (exokernels, unikernels) that was settled by economics rather than principle
  • The two-loop architecture — durable git/issue/memory state outside, role-separated Implementer/Judge/Evaluator agents inside — and why splitting roles structurally prevents an agent from talking itself out of correctness
  • How a bespoke stack beats vLLM-with-speculative-decoding by 2x on code-editing workloads by using the user's input file as the draft
  • Why the Show-o2-on-a-MacBook result (6.27x over PyTorch, within 7% of a kernel-perfect ceiling) is the cleanest demonstration of the long-tail argument
  • The real limitations: single-seed runs, a user-supplied correctness checker that's a quality bar not a proof, and a skills library that blurs 'specialization' with 'automated porting'
  • Why the paper's lasting contribution may be the architecture itself, not the speedup numbers

Chapters

  1. 00:00 The design-space bet
  2. 03:21 Keeping a long-horizon agent coherent
  3. 06:42 Separation of powers in the inner loop
  4. 10:03 Scenario B: predicted outputs for code editing
  5. 13:24 Scenario C: hybrid SSM/attention models
  6. 16:45 Scenario A: parity on vLLM's home turf
  7. 20:06 Scenario F: Show-o2 on a MacBook
  8. 23:27 The steelman: where the claims could break
  9. 26:48 What actually generalizes

0:00Bella: Imagine you want to deploy a model — let's say one of those new multimodal ones that interleaves text and diffusion image steps in a single — and you want to run it on a MacBook. You go shopping for a serving framework. vLLM doesn't support diffusion paths. doesn't include this model. The reference implementation is research-grade PyTorch that wasn't built to be served. There is, in a real sense, no general-purpose answer. The paper we're digging into today asks: in that situation, what if you just pointed a team of AI agents at the problem, and they wrote you a custom serving stack from scratch in fourteen hours — and what if that bespoke stack ran more than six times faster than your baseline?

0:48Eric: The paper is "VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?" — out of the University of Washington, posted to arXiv on May seventh, twenty-twenty-six, and we're recording the day after. What you're hearing is AI-generated. I'm Eric, and Bella and I are AI voices from Eleven Labs. The script is from Anthropic's , and the producer isn't affiliated with either company. The reason that one-day gap is worth flagging is that the paper itself is arguing for a fairly aggressive shift in how infrastructure software gets built — a shift in which the speed of synthesis is the whole point. So the fact that we can have this conversation a day later is sort of in the spirit of the thing.

1:35Bella: Right. And the bet at the heart of this paper is the kind of bet I find most interesting, because it's not really an empirical bet, it's a design-space bet. So let me set it up. For about a decade, the way we've built serving infrastructure for any new technology has followed one pattern. A small number of general-purpose frameworks emerge, they get hand-tuned over many engineer-years, and they become the default. For LLM inference today, that's , , . Each of them is a triumph of engineering. And each of them is shaped by what's *common*: dense decoder-only transformers, NVIDIA GPUs, generic chat workloads.

2:19Eric: So they're optimizing for the median deployment.

2:23Bella: Exactly. Which is the right move when per-deployment engineering is expensive — you build one really good general thing and you eat some inefficiency at the edges. The problem is the edges are getting bigger. New model families — hybrid SSM/attention models, multimodal models that mix text and images in weird ways. New hardware — Apple Silicon's unified memory, custom accelerators. New workload patterns — code editing where you can predict most of the output, with massive shared prefixes, streaming speech. The general framework either runs these badly, or doesn't run them at all, or needs substantial new engineering for each one. The framework's maintainers can't keep up, structurally, because they're aiming at the median.

3:13Eric: And here's the conceptual move. There's a classic systems-research idea — *specialize aggressively for each deployment* — that's been intellectually attractive forever. Exokernels in the nineties, unikernels in the twenty-tens. They make eloquent cases that generality has a tax: every layer of abstraction the framework needs to handle every possible deployment is overhead you're paying even when you don't need the flexibility. And those projects mostly didn't ship. Not because the argument was wrong, but because per-target engineering cost dwarfed the gains. You couldn't afford to write a custom kernel for every server in your fleet.

3:55Bella: And the bet this paper is making — the part that's worth taking seriously — is that AI coding agents have just changed the math on that. If a custom system that used to cost an engineer-year now costs a long afternoon of compute, then a bunch of design-space arguments that were settled in favor of generality come back open. The question isn't "is bespoke better than general" — bespoke was always better in principle. The question is "did we just become able to afford it."

4:26Eric: The off-the-rack-versus-tailored framing is the cleanest way I can think about this. Off-the-rack suits have to fit a wide range of body shapes — they're cut conservatively, with adjustment seams that add bulk. A bespoke suit is cut for exactly one person. No extra fabric, no compromises, no adjustability you don't need. Today's serving frameworks are off-the-rack with options. The paper is arguing that with AI tailors, bespoke just got affordable for everyone.

4:57Bella: And they call it, in one of the better one-liners in the paper, *generation-time specialization rather than runtime generality*. That's the thesis. When does specialization happen? Today, at runtime — a general engine has fast paths and configurations that turn on for particular models. Tomorrow, maybe, at generation time — when you deploy, an agent loop writes you a runtime exactly tailored to your situation, with no abstraction tax because there's no second deployment to be compatible with.

5:31Eric: Okay. So the bet's on the table. Now we need to ask whether the mechanism actually works.

5:37Bella: Right. And the mechanism is where the engineering substance is. Because building a whole serving system end-to-end is not what most coding-agent work has been doing. Most of the agentic optimization research targets a small surface — a single GPU kernel, a marked region of code, a single scheduling policy. Whole-system synthesis is different. It's multi-file, multi-component, and the right next move depends on which component is currently the bottleneck — which itself shifts as you optimize. A scalar fitness score, like in evolutionary search, can't encode that. A single conversation thread hits context limits within hours. And summarizing-and-starting-fresh — what people call — loses crucial detail and the agent drifts.

6:27Eric: This is the central pathology of long-horizon coding agents in general, right? Context windows are finite. Every approach to extending them — truncation, summarization, fresh starts — costs you something. You either lose detail, you drift, or you forget what you tried.

6:45Bella: Yes, and the architecture in this paper is essentially one specific answer to that pathology, tailored for system synthesis. There are two nested loops with very different state characteristics. The outer loop is the planner. It has rich, persistent state — git commit history, an issue backlog, a long-term-memory markdown file. That state survives across rounds, across context resets, across everything. The inner loop has three specialized agents that work in fresh contexts on focused tasks. Implementer, Accuracy Judge, Performance Evaluator. They never share a context. They hand off through artifacts.
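
A minimal sketch of the two-loop shape described above. Every name here (`DurableState`, `run_round`, the three role callables) is hypothetical and chosen for illustration; the episode does not describe the paper's actual interfaces. Plain lists stand in for the git history, issue backlog, and memory file.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DurableState:
    """Outer-loop state that outlives any single agent context."""
    commits: List[str] = field(default_factory=list)   # stand-in for git history
    backlog: List[str] = field(default_factory=list)   # stand-in for the issue backlog
    memory: List[str] = field(default_factory=list)    # stand-in for the memory file


def run_round(
    state: DurableState,
    implement: Callable[[str], str],
    judge: Callable[[str], bool],
    evaluate: Callable[[str], str],
) -> Optional[str]:
    """Run one round: pick an issue, then pass artifacts between isolated roles."""
    if not state.backlog:
        return None
    issue = state.backlog.pop(0)
    patch = implement(issue)      # Implementer only sees the issue text
    if not judge(patch):          # Accuracy Judge only sees the patch
        state.memory.append(f"{issue}: rejected by judge")
        return None
    hint = evaluate(patch)        # Evaluator runs only after the Judge passes
    state.commits.append(patch)
    state.memory.append(f"{issue}: accepted; hint: {hint}")
    state.backlog.append(hint)    # performance hints feed the next round
    return patch
```

Each role receives only the artifact handed to it, so nothing the Implementer "says" can reach the Judge. In this sketch a rejected issue is simply logged and dropped; a real planner would decide whether to re-queue it.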

7:26Eric: The git-as-memory move is the elegant part for me. Because every accepted code change is a git commit, and the outer planner reads from a structured backlog of issues, the "what should we try next" reasoning is anchored in durable structured artifacts, not in a chat transcript. Think of it like a long surgical operation. Surgeons can change shifts because there's a written record — what was done, what was tried and reversed, what worked. The next shift reads the record, not the previous surgeons' memories.

8:00Bella: And there's a subtle detail in the memory file that I want to flag, because it's load-bearing. The orchestrator needs to be able to distinguish *"this technique didn't work for this workload"* from *"the implementation was buggy."* Because without that distinction, an agent that fails to land an optimization once will either keep retrying it forever or abandon it forever — and neither is right. So the memory has to encode not just what happened, but why.
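
One way to picture "what happened, and why" is a memory record that tags each attempt with a cause. This is an illustrative sketch only: the `Outcome` categories, the `MemoryEntry` fields, and the retry cap `max_buggy_attempts` are assumptions, not the paper's memory format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    LANDED = "landed"
    INEFFECTIVE = "ineffective"  # ran correctly but did not help this workload
    BUGGY = "buggy"              # failed correctness, so the idea itself is untested


@dataclass
class MemoryEntry:
    technique: str
    outcome: Outcome
    note: str


def should_retry(technique: str, memory: List[MemoryEntry], max_buggy_attempts: int = 3) -> bool:
    """Retry buggy attempts a bounded number of times; never retry a settled technique."""
    entries = [e for e in memory if e.technique == technique]
    if any(e.outcome in (Outcome.LANDED, Outcome.INEFFECTIVE) for e in entries):
        return False
    buggy_attempts = sum(1 for e in entries if e.outcome is Outcome.BUGGY)
    return buggy_attempts < max_buggy_attempts
```

With only a pass/fail bit, the planner would have to treat both failure kinds the same way, which is exactly the retry-forever or abandon-forever trap.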

8:31Eric: Tell me about the inner trio. Why three roles and not one?

8:36Bella: This is the part of the design I think is genuinely insightful. The argument is essentially a separation-of-powers argument. In a courtroom, the prosecutor, the defense, and the judge don't huddle and negotiate — they have separate roles, separate information, structured handoffs. If you collapsed all three into one person, that person would have incentives to cut corners on the parts of their job that conflict with each other. The Implementer wants the optimization to land. The Accuracy Judge wants correctness — it runs the user's correctness checker against the reference implementation, and crucially, it inspects the diff for patterns. Things like "did you build a prompt-keyed completion cache that just memorizes the answers" or "did you add a fast path that bypasses inference entirely." Only after the Judge passes does the Performance Evaluator profile, drill down with platform tools, and emit performance hints.

9:35Eric: And the clever bit is that performance reasoning never overrides correctness reasoning, because they happen in different contexts. The Judge can't be talked out of its standards by an Implementer mid-conversation, because there is no conversation. There's just artifact handoffs.

9:53Bella: Right. A single agent doing all three jobs has incentive to relax its own correctness criteria when an optimization is hard to land. Splitting the roles into independent contexts removes that pressure structurally — not by trusting the agent to be honest with itself, but by making the dishonesty mechanically impossible.

10:14Eric: There's one more piece to the architecture worth naming. The skills library. They use Anthropic's "Agent Skills" format — focused chunks of expertise the agent can retrieve. The library distills knowledge from existing serving engines, from the research literature, from hardware quirks, from profiling tools. So when the Implementer is wiring up, say, a paged KV cache, it's not deriving the design from first principles. It's pulling a skill entry that summarizes how vLLM does it.

10:44Bella: And the practical implication is that adding support for a new model family or a new accelerator is now a content task — write a skill — rather than a code task — modify the framework. Which is part of why they think this approach scales beyond what hand-engineered runtimes can cover.

11:02Eric: Bella, this is also the part of the design where I think a skeptical reviewer would push hardest, and we should come back to it. But let's see the empirical evidence first. Because the architecture only matters if the bespoke systems actually beat the general ones.

11:19Bella: Yes. And the cleanest teaching example — the one that made the bet feel real to me — is what they call Scenario B. Code editing with predicted outputs. So here's the setup. You're using something like . You ask it to make a small change to a file. The model needs to output the *modified* file, but most of the modified file is going to be identical to the input file — most edits are local. Now think about how a normal serving system handles this. It generates the output token by token, even though the model is going to spit out hundreds of tokens that it just saw in the input. That's enormous wasted work.

11:59Eric: The fix is a variant of speculative decoding. The standard version of speculative decoding has a small *draft* model propose several tokens cheaply, and the big *target* model verifies them all at once in one batched pass. If the draft was right, you got several tokens for the cost of one big forward pass. If wrong, you fall back to normal decoding. The win is that GPU forward passes are massively parallel — verifying ten tokens at once is barely more expensive than generating one.

12:31Bella: And the predicted-output variant is the same idea, but the user supplies the draft. The user already has a near-copy of the answer — the input file. So you skip the draft model entirely. There's no draft compute at all. You just take the user's prediction, chunk it into blocks, and have the target model verify each block in a single batched pass. Where the prediction was right, you keep the tokens. Where it was wrong, you regenerate that stretch normally.

13:03Eric: Think of it like proofreading a colleague's document. You could read every word slowly and decide whether to change it. Or you could assume most of the text is fine, do a fast scan, and only slow down where you spot something off. The "fast scan" is the verification batch. The "assume the text is fine" is the user-supplied draft.
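
The verify-in-blocks idea can be sketched with a toy "model" that is just a function from context to next token. This is a simplified illustration, not the paper's implementation: `next_token`, `EOS`, and the pass counting are assumptions, and after a mismatch the cursor into the prediction simply advances by one token (a real system would need to re-align after insertions or deletions).

```python
from typing import Callable, List, Sequence, Tuple

EOS = -1  # hypothetical end-of-sequence marker for this toy example


def greedy_decode(next_token: Callable[[List[int]], int], prompt: Sequence[int],
                  max_new_tokens: int = 256) -> Tuple[List[int], int]:
    """Baseline: one target-model pass per generated token."""
    out: List[int] = []
    passes = 0
    while len(out) < max_new_tokens:
        passes += 1
        tok = next_token(list(prompt) + out)
        if tok == EOS:
            break
        out.append(tok)
    return out, passes


def predicted_output_decode(next_token: Callable[[List[int]], int], prompt: Sequence[int],
                            prediction: Sequence[int], block_size: int = 16,
                            max_new_tokens: int = 256) -> Tuple[List[int], int]:
    """Verify the user-supplied prediction block by block against the target model."""
    out: List[int] = []
    passes = 0
    cursor = 0          # position in the user's prediction
    finished = False
    while not finished and len(out) < max_new_tokens:
        draft = prediction[cursor:cursor + block_size]
        # Counted as ONE batched pass: a real engine scores every draft position
        # (plus one extra token) in a single forward pass. Here we loop instead.
        passes += 1
        for i in range(len(draft) + 1):
            tok = next_token(list(prompt) + out)
            if tok == EOS:
                finished = True
                break
            out.append(tok)   # always the target's own token, so output is exact
            cursor += 1
            if len(out) >= max_new_tokens:
                break
            if i < len(draft) and tok != draft[i]:
                break         # draft diverged: drop the rest of this block
    return out, passes
```

Because only the target model's own tokens are ever emitted, the output matches plain greedy decoding regardless of how good the prediction is; the prediction only changes how many passes it takes.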

13:25Bella: And the iteration for this scenario is the kind of thing that makes the loop feel real. Iteration two: the agent adds CUDA graphs — basically pre-recording GPU operations so they can be replayed without re-launching each call. That alone gets it to one-point-three-five times faster than vanilla decoding. Iteration three: it implements the predicted-output in sixteen-token blocks. That jumps to two-point-nine times. Then a long stretch of tuning. By iteration fourteen, blocksize tuning gets it to almost six times faster than vanilla — and crucially, two times faster than vLLM with conventional speculative decoding.

14:11Eric: The two-times-faster-than-vLLM-with-speculative number is the one that does the most work for me. Because vLLM *has* speculative decoding. They're not comparing against an unoptimized baseline. They're comparing against an optimization that requires running a draft model — and beating it by a factor of two by using the user-supplied draft instead.

14:34Bella: Right. The headline isn't "we beat the baseline by being clever." It's "we beat the *clever* baseline by being clever in a way the general framework couldn't be."

14:45Eric: Let me take the next one — Scenario C — because the win there is a different shape and I think it's worth the contrast. This is . There's a recent architecture trend where most layers in a model are SSM or linear-attention layers, with a fixed-size recurrent state, and only a few layers are full attention. Models like , -H, Olmo-Hybrid. The motivation is that full attention is expensive and most of what attention is doing can be done more cheaply. The serving challenge: in a normal transformer, the model's working memory for an in-progress conversation is the KV cache, which grows linearly with sequence length. In a hybrid model, only some layers have that. The other layers carry a fixed-size recurrent state that updates as tokens stream in. So when you want to do prompt caching — sharing a long prefix across many requests, which is huge for things like — you need to share *two different kinds of cache* in parallel. The KV cache for the attention layers, and the recurrent state for the SSM layers.
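
To make the "grows linearly" versus "fixed-size" contrast concrete, here is a back-of-the-envelope sketch. The function names and every dimension passed to them (layer counts, head sizes, state sizes) are hypothetical placeholders, not figures from the paper or from any specific model.

```python
def kv_cache_bytes(seq_len: int, n_attention_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values (the factor of 2) for every cached token in every attention layer."""
    return 2 * n_attention_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_ssm_layers: int, state_elems_per_layer: int,
                          bytes_per_elem: int = 2) -> int:
    """One fixed-size state per SSM layer, independent of how many tokens were consumed."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


def hybrid_prefix_bytes(seq_len: int, n_attention_layers: int, n_kv_heads: int,
                        head_dim: int, n_ssm_layers: int, state_elems_per_layer: int,
                        bytes_per_elem: int = 2) -> int:
    """Memory needed to snapshot BOTH caches at a shared-prefix boundary."""
    return (
        kv_cache_bytes(seq_len, n_attention_layers, n_kv_heads, head_dim, bytes_per_elem)
        + recurrent_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem)
    )
```

Doubling `seq_len` doubles the attention-layer term and leaves the SSM term unchanged. A shared prefix is only reusable if both pieces are captured at the same token boundary, which is the synchronization problem the scenario turns on.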

15:57Bella: And vLLM, as far as I can tell from the paper, doesn't share the efficiently across requests — first-class hybrid-KV support is recent and limited, and sharing the recurrent state requires snapshotting at prefix boundaries, which incurs significant memory overhead. So in practice, if you have a thirty-two-thousand-token shared prefix and a hundred requests, you end up either recomputing that prefix or paying a heavy memory cost per snapshot. Either way, enormous waste.

16:31Eric: The bespoke system implements both caches in parallel, properly synchronized, and gets a three-point-four-five times throughput improvement on a thirty-two-thousand-token shared-prefix workload. But I want to flag the iteration story here, because it's *less* clean than Scenario B and that's actually informative. Iterations one through six — six full rounds — fail the accuracy gates. The agent is wiring up the dual cache and getting subtle correctness bugs that the Judge catches. Iteration seven finally clears with continuous batched decode, getting two-point-four-five times. Iteration nine adds , gets to three-point-two-five.

17:17Bella: The six failed rounds matter. They're evidence that the Judge is doing real work. If correctness were trivially passing, you'd see speedups land on iteration one and stay landed. The fact that the Judge keeps sending it back means the role separation is catching genuine bugs that an Implementer-Judge-merged agent might have shipped.

17:40Eric: Right. That's the structural story. Now, before we go to the closer, we should mention the standard-setting result, because it's actually load-bearing for the whole argument.

17:53Bella: Scenario A. The steel-man test. Llama-three-point-one-eight-B on an H100 — the most standard, most commodity LLM serving deployment in the world. The case vLLM was *built* for. The question is: can the bespoke approach even match the general framework on its home turf?

18:12Eric: And the answer is yes. The generated system reaches parity with vLLM and beats by about five percent on throughput. Which sounds boring but defuses the most obvious objection: that bespoke systems trade reliability or quality for speed. They don't, at least on this case. The other thing in Scenario A worth mentioning is that the four request rates they tested at — eight, thirty-two, sixty-four, and a hundred-twenty-eight requests per second — were *not* pre-specified. The agent kept escalating to harder loads on its own after plateauing at easier ones. It basically self-administered a curriculum.

18:57Bella: That's a small detail but I love it. The agent didn't know what "good" meant, found a level it could solve, and kept raising the bar.

19:07Eric: Now — the closer. Scenario F. Show-o2 on a MacBook. This is the one where the long-tail argument stops being abstract. Show-o2 is a unified vision-language model. It does text generation , like a normal , but it does image generation through diffusion steps, all interleaved in a single . There is no general-purpose serving stack that runs this. vLLM doesn't support diffusion paths. There's a vLLM-Omni variant that handles some multimodal models but not this one. The reference implementation is research-grade PyTorch.

19:48Bella: So the comparison isn't "can VibeServe beat the optimized baseline." The comparison is "can VibeServe make this run at all, well."

19:58Eric: And on a MacBook the speedup is six-point-two-seven times over the PyTorch baseline. They get within about seven percent of a theoretical ceiling — what they call "fp16 kernel-perfect" — meaning if you could replace every operation with a perfectly-tuned half-precision kernel, you'd be only seven percent better than what the generated system actually achieved.

20:25Bella: That's astonishing.

20:26Eric: The for this one is also worth telling because the failures are vivid. The agent tried quantization on the compute-bound body of the model — regression, made it slower. Tried — produced NaNs. Tried PyTorch's compile mode — altered outputs in ways the Judge caught. Tried fp16 across the board — same. The breakthrough came from noticing an asymmetry: the body of the model was compute-bound but the head was bandwidth-bound. So int4 quantization, which trades compute cycles for memory bandwidth, only helped on the head. The other big win — they call it the — was realizing they could skip the unconditional branch on most diffusion steps and reuse a cached vector instead. Standard generic frameworks don't have that machinery because they don't even know there's a branch.
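
The "skip the unconditional branch and reuse a cached vector" idea can be sketched as follows. Two assumptions are being made loudly here: that the unconditional branch is combined with the conditional one in the usual classifier-free-guidance form `u + scale * (c - u)`, and that the cache is refreshed on a fixed schedule. Neither detail is given in the episode, and `guided_steps`, `refresh_every`, and the callables are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def guided_steps(cond_fn: Callable[[int], Vector], uncond_fn: Callable[[int], Vector],
                 num_steps: int, guidance_scale: float,
                 refresh_every: int) -> Tuple[List[Vector], int]:
    """Run guided steps, recomputing the unconditional branch only every few steps."""
    cached_uncond: Optional[Vector] = None
    uncond_calls = 0
    outputs: List[Vector] = []
    for step in range(num_steps):
        cond = cond_fn(step)                 # conditional branch runs on every step
        if cached_uncond is None or step % refresh_every == 0:
            cached_uncond = uncond_fn(step)  # the expensive call being skipped elsewhere
            uncond_calls += 1
        guided = [u + guidance_scale * (c - u) for c, u in zip(cond, cached_uncond)]
        outputs.append(guided)
    return outputs, uncond_calls
```

Reusing a stale unconditional vector is an approximation, so it can change the generated image. Whether that change is acceptable is precisely what a quality gate like the Judge's checker has to decide.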

21:26Bella: And on the H100, the same case sees a more modest improvement — about twenty percent better latency, not six times. Which is exactly what you'd expect, because the H100 stack is well-optimized. The specialization wins are biggest where the existing tooling is thinnest. Which is, when you think about it, the long-tail argument distilled into a single contrast.

21:53Eric: They cover six scenarios in total. We're skipping the streaming speech-recognition case and the MacBook -decoding case in detail — both show similar wins, around one-point-seven times faster on the streaming case and about two-point-six times on the JSON one. Same shape of story: generic stack misses some specific optimization, bespoke stack lands it.

22:19Bella: Okay. Eric, this is where I want you to push, because the steelman matters here and the paper is unusually candid about its limitations. What are the real worries?

22:30Eric: There are several, and I want to take them in order of how much they bite. The first one — the one the authors flag explicitly — is single-seed runs. Every scenario is reported from one agent-loop run. Coding agents are stochastic. Different runs might land on different optimizations or fail to land them at all. The variance of these headline numbers is unknown. We don't know how often the loop fails outright versus produces a working-but-mediocre system. For a paper making this strong a design-space claim, the absence of "we ran it ten times and here's the distribution" is a real gap. The second is the correctness gate. The whole architecture leans on the Accuracy Judge, but the Judge runs a *user-supplied* checker. The paper is upfront that fully verifying serving-system semantic accuracy is an open problem and out of scope. So the headline correctness claims are only as strong as each user's checker. The Show-o2 checker accepts images with PSNR above thirty-five decibels and SSIM above point-nine-eight against the baseline. That's a reasonable quality bar. But it's a quality bar, not a correctness proof, and the agent has incentive to find optimizations that satisfy *that exact bar*, including ones that wouldn't survive in a different evaluation regime.
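
For a sense of what a threshold gate like this measures, here is a minimal PSNR check using the standard definition, 10·log10(MAX² / MSE), over flat lists of 8-bit pixel values. Only the thirty-five-decibel threshold comes from the episode; the function names are hypothetical, and the SSIM half of the gate is omitted because it needs a windowed computation that does not fit a short sketch.

```python
import math
from typing import Sequence


def psnr_db(reference: Sequence[float], candidate: Sequence[float],
            max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two equally sized images."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_psnr_gate(reference: Sequence[float], candidate: Sequence[float],
                     threshold_db: float = 35.0) -> bool:
    """True when the candidate clears the PSNR bar against the reference."""
    return psnr_db(reference, candidate) > threshold_db
```

The gate is a similarity bound, not an equality check: any optimization that keeps the error under the threshold passes, which is why it reads as a quality bar rather than a proof.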

23:57Bella: And the question is connected to that. The Judge looks for specific known patterns of cheating — the prompt-keyed completion cache, fast paths that skip inference. But that's a known-pattern allowlist. Any system that explicitly looks for cheating can only catch the cheating it knows to look for.

24:18Eric: Right. The third worry is the skills library, and I think this is the most subtle one. The library distills knowledge from existing serving engines and reference implementations. The agent is allowed to inspect existing systems. The line between "specializing from scratch" and "porting and tweaking existing designs" is fuzzier than the framing implies. The authors do note this — they say reusing baselines doesn't get competitive performance in the long-tail scenarios, which is true — but the *standard* scenario, the parity-with-vLLM result, is suspicious in this regard. How much of that parity comes from the agent re-deriving vLLM's design choices, versus learning them from skill entries that summarize vLLM?

25:06Bella: Which doesn't necessarily undermine the *practical* argument — if you can press a button and get vLLM-quality serving for your custom workload, the world is better off — but it does soften the conceptual claim that this is fundamentally generation-time specialization rather than, say, automated porting.

25:26Eric: Fourth — compute cost. The "engineering cost has dropped" argument elides that fourteen to twenty-five hours of LLM-call time on , three hundred and sixty calls in Scenario A, isn't free. It's hours, not engineer-years, which is the right comparison. But it's also not free. The economics work great for high-volume production deployments where amortizing that cost over many requests is trivial. They look much less obvious for one-off or low-traffic deployments — which, somewhat ironically, is exactly the long-tail case the paper is making the strongest argument for.

26:05Bella: That's the cleanest tension in the paper for me. The long-tail deployments are the ones where bespoke serving matters most, and they're also the ones where you can least amortize the synthesis cost.
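
The amortization tension can be written as a break-even calculation. This is a sketch under stated assumptions: all costs are hypothetical inputs in the same currency, and `success_rate` models the single-seed worry by assuming independent runs repeated until one produces a usable system (so the expected number of runs is `1 / success_rate`). None of these figures come from the paper.

```python
import math


def break_even_requests(synthesis_cost: float, general_cost_per_request: float,
                        bespoke_cost_per_request: float,
                        success_rate: float = 1.0) -> float:
    """Requests needed before per-request savings repay the expected synthesis cost."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    saving = general_cost_per_request - bespoke_cost_per_request
    if saving <= 0:
        return math.inf  # the bespoke stack never pays for itself
    expected_synthesis_cost = synthesis_cost / success_rate
    return expected_synthesis_cost / saving
```

A high-traffic deployment clears any finite break-even point quickly; a low-traffic long-tail deployment may never reach it, and a low success rate pushes the threshold out proportionally.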

26:18Eric: And the last worry — the comparison baselines. Several scenarios compare against "vLLM with a plugin" or against the PyTorch reference. They don't compare against alternative bespoke implementations or against recently-released specialized engines. Where well-tuned alternatives exist, like 's deployed predicted-outputs system, they aren't in the comparison. So we're seeing "bespoke beats generic" rather than "bespoke beats other bespoke," which is a softer claim.

26:51Bella: All of that is fair. And to the authors' credit, almost every one of those critiques is acknowledged in their own limitations section. They don't oversell.

27:02Eric: They really don't. Which I appreciate. The paper reads as people genuinely trying to test a design-space hypothesis, not as people trying to win a benchmark contest.

27:13Bella: So let me try to land the conceptual point, because I think it survives the steelman. The interesting claim is *not* that VibeServe is going to replace vLLM. It probably isn't, at least not soon. The interesting claim is about which abstractions in our infrastructure software exist because they're the right abstractions, and which exist because we couldn't afford to specialize. If agents shift the cost curve even partially — if more cases like Show-o2 on a MacBook become tractable — then a bunch of design-space arguments that were settled by economics rather than by principle come back open. LLM serving is plausibly just the first domain where this becomes obvious. Compilers, databases, kernels, network stacks — they all have the same structure. A few general systems that paid an abstraction tax because per-target engineering was expensive.

28:12Eric: And the empirical evidence in the paper — six speedups in scenarios where the general framework was either suboptimal or didn't run the workload at all — is the kind of evidence that doesn't *prove* that case but does make it more concrete than the exokernel papers ever could. Because the exokernel papers were arguing in principle. This paper is arguing with a system that produced runnable code and passed correctness checks, in hours.

28:40Bella: The thing I'll be watching for is the variance question. Whether anyone reproduces these numbers across many seeds, across many problem instances, and whether the win rate is high enough that a deployment team can actually rely on this as part of a workflow. If you have to run the loop ten times to get one good system, the economics shift again. If you can run it once and it usually works, the design-space argument really is open.

29:08Eric: That's the right next experiment to be excited about.

29:12Bella: Eric, anything else worth flagging before we close?

29:15Eric: One small thing I want to leave with the listener. The paper's actual contribution isn't the speedups. The speedups are evidence. The contribution is the architectural answer to "how do you keep a coding agent coherent across many hours of structurally different work" — the outer planner with durable state, the inner trio with role separation in fresh contexts, the skills library. Whether that specific architecture generalizes beyond serving systems, or whether each domain needs its own scaffolding, is genuinely open. But it's the kind of engineering result that's worth more than its headline numbers, because it answers a question other people are going to ask.

29:58Bella: Right. The numbers are the demo. The architecture is the contribution. This was "AI Papers: A Deep Dive." The show notes have a link to the paper and to related materials if you want to go further. Thanks for listening.