Episode 063 · May 21, 2026 · 26 min

Why Web Agents Are Slow: A Compiler-Style Fix for Computer-Use Latency

Winston, Wang, Mirhoseini et al.

LLM Agents Web Automation
Paper
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
Venue
arXiv:2605.21470
Year
2026
Read the paper
arxiv.org/abs/2605.21470

Most of the time a web agent spends on your task is just the language model talking to itself, one screenshot at a time. A new Stanford paper argues that the slowness and brittleness of computer-use agents aren't a model problem at all but an architecture problem, and that decades-old compiler techniques can cut latency by ten times while improving accuracy.

What you'll take away

  • Why current web agents act like interpreters running one LLM call per step, and what a 'compiler' for agents would do differently
  • How state-based preconditions and postconditions on cached tools eliminate roughly half of all failures before runtime
  • The worked Taco Bell example where one candidate plan uses an LLM to compare two numbers and another uses Python's min — a 50x cost difference for the same task
  • When 'hedging with four parallel browser sessions' actually beats serial execution, and why heavy-tailed click latencies make the math work
  • Why this approach only pays off for repeated workloads on the same apps, and where cache staleness could quietly erode the savings
  • The honest limitations: a 25–90 minute offline setup per app, an LLM-based prediction inside the scheduler, and a 37-task benchmark partly curated to exercise scheduling

Chapters

  1. 00:00 Agents as interpreters, and the compiler reframe
  2. 03:16 State contracts and static verification
  3. 06:33 The Taco Bell example and cost-based planning
  4. 09:50 Heavy tails and the case for hedging
  5. 13:07 Offline setup vs. online speedup
  6. 16:23 Headline results and a fair baseline comparison
  7. 19:40 Where the approach is weakest
  8. 22:57 The durable principle underneath


0:00Bella: Here's a stat that should bother anyone who's ever watched a computer-use agent crawl through a web task. Seventy-three percent of the time it spends doing your task (booking the flight, ordering the food, filling out the form) is just the language model talking to itself. Take a screenshot, ask the LLM what to click. Take another screenshot, ask again. Click by click, the model is in the loop for every single step, which is most of where the latency lives and most of where the errors come from.

0:31Tyler: And the second stat that should bother you, sitting right next to it: somewhere between forty-five and fifty percent of the failures these agents have are tool-ordering mistakes. They click the wrong button. They type in the wrong field. They try to add something to a cart before they've navigated to the product page. Which is a category of bug that compilers solved decades ago for regular code.

0:56Bella: So a paper went up on arXiv yesterday, May twentieth, twenty-twenty-six, and we are recording on May twenty-first, one day later. It's called "Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling," out of Stanford, and it takes that observation seriously. Before we go further: this episode is AI-generated. I'm Bella, that's Tyler, we're both AI voices from Eleven Labs, the script was written by Anthropic's Claude, and the producer isn't affiliated with either of those companies. And the reason the paper matters is that it argues, pretty convincingly, that the slowness and the brittleness of computer-use agents aren't a model problem at all. They're an architecture problem.

1:42Tyler: Right. The frame the authors propose is that we've been building agents wrong, and the right model is sitting in a compilers textbook from the eighties. So we want to walk through that frame, the three pieces of the system they build on top of it, and then poke at where it doesn't quite hold up.

2:01Bella: Let me start with the reframe, because it's the spine of everything else. Think about how a normal program runs. There are two ways. One is an interpreter: it reads the first line, does it, reads the next line, does it, no planning, no optimization, just sequential execution one statement at a time. The other is a compiler: it looks at the whole program before running anything, figures out the best machine instructions to produce, checks for obvious bugs, and only then hands the result over to be executed. Compilers are faster because they can plan. They can hoist work out of loops. They can reject invalid programs before they ever run.

2:44Tyler: And current web agents are interpreters. Pure interpreters. There is no plan. There is only "look at the screen, decide one thing, do it, look again."

2:54Bella: Exactly. And the authors' move is to say: what if we treated the user's request, "order me the cheapest item from Taco Bell," as source code, and ran a compiler over it? The compiler would generate an actual program, in something Python-like, that orchestrates a mix of cached browser actions and occasional LLM calls only where genuine ambiguity remains. It would check that program for correctness statically. It would price candidate programs against each other and pick the cheapest. And then it would decide, at the end, how to actually execute the winning program: serially, in parallel, or with some kind of hedging.

3:36Tyler: They call the whole thing a JIT compiler for agents. JIT meaning just-in-time: you compile at the moment the user gives you the task, not weeks ahead. And the system has three big pieces that map cleanly onto things compilers have always done. Static verification. Cost-based optimization. Runtime scheduling. Let's take them in order, because each one is independently interesting.

4:01Bella: Tyler, you take the protocol piece — that's the static-verification half, and it's got the cleanest empirical story.

4:09Tyler: Okay. So the static-verification piece works like this. In any real compiler, before you run a function, the type system checks that the inputs match what the function expects. That's a contract. The authors take that idea and apply it to browser state. Every cached tool ("list restaurants", "add to cart", "go to details page", whatever) declares two things about itself. A precondition: the browser has to be in a particular state for me to run. And a postcondition: when I'm done, the browser will be in this new state. So "go_to_detail_page" might say "I require the page to be a restaurant list, and I guarantee that afterwards the page is a restaurant detail view."

4:53Bella: Which means before you ever run the program, you can walk through it and check that every tool gets called from a state where it's actually allowed to run.

5:03Tyler: Right. And the analogy I keep coming back to is a kitchen with tickets. The cook can't plate the entrée before the entrée is cooked. A good expediter sees that ticket out of order and rejects it at the pass — the food never goes out. The protocol does the same thing for browser tools. A plan that says "click the add-to-cart button" before there's anything in the page to click gets rejected before the browser session even starts.
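
The state-contract check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Tool` dataclass, the state names, the `TOOLS` table, and `verify_plan` are all hypothetical names invented for the example, and a browser state is simplified to a single string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    precondition: str   # browser state required before the tool may run
    postcondition: str  # browser state guaranteed once the tool finishes


# Hypothetical cached tools; the state names are made up for illustration.
TOOLS = {
    "list_restaurants": Tool("list_restaurants", "home", "restaurant_list"),
    "go_to_detail_page": Tool("go_to_detail_page", "restaurant_list", "restaurant_detail"),
    "add_to_cart": Tool("add_to_cart", "restaurant_detail", "cart"),
}


def verify_plan(plan, start_state="home"):
    """Walk the plan symbolically; no browser session is opened."""
    state = start_state
    for step, name in enumerate(plan):
        tool = TOOLS[name]
        if state != tool.precondition:
            return False, (
                f"step {step}: {name} requires {tool.precondition!r}, "
                f"but state is {state!r}"
            )
        state = tool.postcondition  # propagate the guaranteed state forward
    return True, f"ok, final state {state!r}"


good = verify_plan(["list_restaurants", "go_to_detail_page", "add_to_cart"])
bad = verify_plan(["list_restaurants", "add_to_cart"])
print(good)
print(bad)
```

The second plan is rejected at its second step because `add_to_cart` is called from the restaurant list rather than from a detail page, which is the out-of-order ticket from the kitchen analogy.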

5:30Bella: And here's where the number that opened the episode comes back. Forty-five to fifty percent of failures in standard web agents are exactly this kind of mistake. Tools being called in the wrong order. The protocol catches that whole class of bug. They report the failure rate dropping from eighty percent without enforcement to forty-three percent with it.

5:53Tyler: Which is enormous. That's nearly half the errors gone, and they're gone before runtime. The remaining failures shift toward things like syntax errors and type mismatches — bugs of a different and more tractable kind.

6:08Bella: Now there's something subtle here worth flagging. The contracts aren't about numbers or input types in the usual sense. They're about the state of the browser. The page you're on, what's visible, what's been filled in. That's a generalization of what existing tool protocols do. The standard tool-typing system right now checks types on tool inputs. This paper extends that idea from type contracts to state contracts. The tool says "this is the world I expect to be operating in, and this is the world I'll leave behind."

6:46Tyler: And once you have those contracts, the planner can do something a normal agent can't. It can generate many candidate plans, reject the ones that violate state-flow, and pick among the survivors. Which is the second piece. Bella, this is your turf.

7:03Bella: It is, and the worked example for this is unusually good audio material, so I want to set it up carefully. Imagine you ask the system: order me the least expensive item from Taco Bell. The planner spins up thirty-two parallel workers, each one asking a language model to write a small Python-like program that gets the task done. Each worker can call cached tools, can include LLM calls if it really needs to reason about something unstructured, and can use ordinary code — for-loops, comparisons, arithmetic.

7:38Tyler: And the workers don't always agree. You get a spread of candidate plans, and they vary wildly in quality.

7:45Bella: Wildly. The paper shows three of them. Plan A says: call "get store details" for the Taco Bell location and then order the cheapest item. But, and this is where the static check fires, "get store details" has a precondition that you've already navigated to a specific store. Plan A skips that step. The compiler walks the program, propagates the browser state symbolically, and at the moment it hits "get store details" it sees the state doesn't match the precondition. Plan A gets rejected without ever running. The browser never opens.

8:22Tyler: That's the analogy holding cleanly. A recipe that says "add the sauce" before listing what sauce to make — the kitchen rejects it before turning on the stove.

8:33Bella: Right. Now Plan B passes the static check. It's a valid program. It navigates to the Taco Bell page, lists the items, gets their prices. But then at the moment where it needs to pick the cheapest one, it makes a language model call. Something like "of these prices, which is the smallest number." Which is — I mean, it works, but it's absurd. You're asking a billion-parameter language model to do a comparison that takes a calculator a microsecond.

9:05Tyler: You're asking a Michelin chef which of two numbers is smaller.

9:10Bella: That is exactly the framing the paper invites. And the cost model the planner uses is built precisely to notice this. The way to think about it is: every tool call costs a dollar. Every LLM call costs a hundred dollars. And if either of those operations happens inside a loop, the cost gets multiplied by ten for every level of nesting. So a single LLM call inside a loop over fifty items isn't a hundred dollars; it's a thousand. The planner walks the control-flow graph of each candidate plan, totals up the costs, and ranks them.

9:48Tyler: And the goal of the cost model isn't to predict actual wall-clock time. It's just to put the plans in the right order — cheapest at the top.
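
A minimal sketch of that ranking-style cost walk, using the dollar analogy from the conversation rather than the paper's actual weights. The nested-tuple plan encoding, the step names, and `plan_cost` are hypothetical, and treating the one-dollar unit as a single cached tool call is an assumption.

```python
TOOL_COST = 1     # "a dollar": one cached tool call (assumed unit)
LLM_COST = 100    # "a hundred dollars": one LLM call
LOOP_FACTOR = 10  # multiplier applied per level of loop nesting


def plan_cost(steps, depth=0):
    """Total a plan's cost by walking its (hypothetical) nested structure."""
    total = 0
    for step in steps:
        kind = step[0]
        if kind == "tool":
            total += TOOL_COST * LOOP_FACTOR ** depth
        elif kind == "llm":
            total += LLM_COST * LOOP_FACTOR ** depth
        elif kind == "loop":
            total += plan_cost(step[1], depth + 1)
        # "code" steps (ordinary Python such as min) are treated as free
    return total


# Plan B asks an LLM to compare prices; Plan C uses plain code instead.
plan_b = [("tool", "open_store"), ("tool", "list_items"), ("llm", "which price is smallest?")]
plan_c = [("tool", "open_store"), ("tool", "list_items"), ("code", "min(prices)")]
llm_in_loop = [("loop", [("llm", "classify item")])]

ranked = sorted({"B": plan_b, "C": plan_c}.items(), key=lambda kv: plan_cost(kv[1]))
print([(name, plan_cost(plan)) for name, plan in ranked])
print(plan_cost(llm_in_loop))
```

Under these assumed weights Plan C costs 2 and Plan B costs 102, a ratio of roughly fifty to one, and the single LLM call inside one loop costs 1000. Only the ordering matters, which is why the numbers do not need to resemble seconds.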

9:58Bella: Exactly. It's a ranking function, not a latency predictor. Which is honest, and worth noting. Now back to the example. Plan B totals up to about ten units of cost in the paper's accounting — mostly that one LLM call to compare numbers. And then Plan C does the same task, but at the comparison step it just uses ordinary Python. A "min" over a list. No LLM. Plan C totals up to about a fifth of a unit of cost.

10:27Tyler: A fiftieth of Plan B.

10:28Bella: A fiftieth. Same task. Same result. The user gets the cheapest Taco Bell item either way. But one plan does it in code and the other burns a language model call on a thing a pocket calculator could do. And the planner picks Plan C automatically because the cost model knows LLM calls are expensive and ordinary code is essentially free.

10:52Tyler: This is the moment the whole reframe pays off, I think. Because once you see Plan C, you see what was wrong with the interpreter style of agent. The interpreter would have done Plan B every time. Every step is an LLM step, because the interpreter doesn't know any other way. You can't even ask the question "could this step be cheaper?" if you're not looking at the whole program at once.

11:18Bella: Right. And there's a beautiful structural point in the paper here. The static check and the cost accounting happen in the same walk of the control-flow graph. One pass through the program does both jobs — verify state-flow correctness, total up the cost. That's just elegant compiler engineering.

11:38Tyler: Okay. So at this point we have a valid, cost-optimized program. The static check has thrown out the impossible ones, the cost model has ranked the survivors, and Plan C is the winner. You'd think we're done.

11:53Bella: We are not done.

11:54Tyler: We are not done. Because there's a whole separate question, which is: how do you actually run Plan C? You've got, say, four cores available on the machine. Do you run the plan once, top to bottom? Do you split it into independent sub-tasks and parallelize? Or do you do something stranger: start four copies of the entire plan at once, run them in parallel browser sessions, and take whichever one finishes first?

12:23Bella: This is the scheduler. And this is where the paper gets genuinely counterintuitive, so take us through it.

12:30Tyler: Yeah, this is the part I find most interesting, honestly. The naive instinct is that "start four copies and throw three away" is pure waste. Why would you ever do that? You've done four times the work for one task. And for a lot of operations, that instinct is correct — it is just waste. But for some operations it isn't, and the reason it isn't has to do with the shape of how long things actually take. Most listeners have an intuition that the average time for something is a reasonable summary. Like, if clicking the add-to-cart button takes twenty-five seconds on average, then twenty-five seconds is about what you should plan for. For a lot of web UI elements, that intuition is wrong.

13:14Bella: Because the distribution is heavy-tailed.

13:16Tyler: Heavy-tailed. The paper has a specific example — the add-to-cart button on one of the food-delivery sites they tested. Average click time is about twenty-five seconds. But the actual range is six seconds to ninety-one seconds. Most of the time it's quick. Occasionally, for reasons that probably have to do with the site's backend or layout reflow or ad loading, it takes a minute and a half. And those occasional ninety-second clicks are dragging the average up.

13:45Bella: So if you run one plan and you happen to hit the long tail on that click, you're stuck.

13:51Tyler: You're stuck for a minute and a half. But — and this is the key insight — if you run four copies of the plan in parallel, in four separate browser sessions, the odds that *all four* of them hit the long tail are tiny. To lose, all four have to be slow at the same time. So you take whichever finishes first, and you've essentially sliced off the tail. The analogy I want to use here is racing four taxis to the airport. You're in a city where most taxi rides to the airport are twenty-five minutes but occasionally one takes ninety. You can't afford to miss your flight. So you call four taxis at once, you take whichever shows up first, and you let the other three go. You've wasted three fares. But the chance you get stuck with a ninety-minute trip is now essentially zero, because to lose you'd need all four taxis to be slow.

14:48Bella: And the punchline is that hedging only wins when there's high variance. If every taxi takes exactly twenty-five minutes, calling four is pure waste.
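
The arithmetic behind the four-taxi argument can be written down directly. This is a sketch under a loud assumption: the replicas are slow independently of each other, which a shared backend slowdown would violate. The ten percent tail probability and both function names are hypothetical.

```python
def prob_all_slow(p_slow: float, replicas: int) -> float:
    """Chance that every replica hits the slow tail, assuming independence."""
    return p_slow ** replicas


def hedged_latency(replica_times):
    """A hedged run finishes when the fastest replica finishes."""
    return min(replica_times)


p = 0.10  # hypothetical chance that one run hits the long tail
for k in (1, 2, 4):
    print(f"{k} replica(s): P(all slow) = {prob_all_slow(p, k):.4f}")

# High variance: one replica is stuck for 91 seconds, but another takes 6.
print(hedged_latency([25, 91, 6, 30]))
# No variance: four replicas buy nothing over a single run.
print(hedged_latency([25, 25, 25, 25]))
```

With no variance the minimum of four identical times is the same as one run, so the three extra sessions are pure waste, which is the taxi punchline.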

14:58Tyler: Pure waste. So the scheduler has to figure out, for any given plan, whether the steps in it have enough variance that hedging would help. And here's how it does it. During the offline phase, the one-time setup the system runs per application, it watches the app being explored and records how long every UI element takes to interact with. Every click, every form field, every navigation. It fits a distribution to each one. So by the time you're running a real task, it has, for every element on the page, a learned distribution of how long touching that element typically takes, including the variance.

15:39Bella: And then at task time, it uses those distributions to simulate.

15:44Tyler: Right. So given the plan the planner picked, it asks an LLM to predict which UI elements the plan will touch and roughly how many times. Then it runs a thousand simulations. For each simulation, it draws sample times from the learned distributions, computes how long the plan would take under serial execution, under parallel execution across independent subtasks, and under hedging with four replicas. After a thousand simulations, it has a distribution of expected latencies for each strategy. It picks the strategy with the lowest mean.
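
The strategy picker just described can be sketched as a small Monte Carlo simulation. Everything specific here is an assumption for illustration: latencies are modeled as lognormal with made-up `(mu, sigma)` parameters, the plan is given as a list of independent subtask groups rather than predicted by an LLM, and `choose_strategy` is a hypothetical name.

```python
import random
import statistics


def choose_strategy(groups, n_trials=1000, replicas=4, seed=0):
    """groups: list of independent subtasks, each a list of (mu, sigma) steps."""
    rng = random.Random(seed)

    def group_time(group):
        # One subtask runs its steps back to back.
        return sum(rng.lognormvariate(mu, sigma) for mu, sigma in group)

    def serial_time():
        # Serial execution runs every subtask one after another.
        return sum(group_time(g) for g in groups)

    samples = {"serial": [], "parallel": [], "hedge": []}
    for _ in range(n_trials):
        samples["serial"].append(serial_time())
        # Parallel: independent subtasks overlap, so the slowest one dominates.
        samples["parallel"].append(max(group_time(g) for g in groups))
        # Hedge: run whole-plan replicas and keep the first to finish.
        samples["hedge"].append(min(serial_time() for _ in range(replicas)))

    means = {name: statistics.fmean(times) for name, times in samples.items()}
    return min(means, key=means.get), means


# One subtask containing a high-variance step (hypothetical parameters).
heavy_tail_plan = [[(1.0, 0.1), (3.0, 1.0)]]
# Two independent subtasks with very consistent timings.
steady_plan = [[(2.0, 0.05)], [(2.0, 0.05)]]

print(choose_strategy(heavy_tail_plan)[0])
print(choose_strategy(steady_plan)[0])
```

In this toy setup the plan with the high-variance step favors hedging, while the steady plan with two independent subtasks favors plain parallel execution. The sketch ignores the cost of the extra browser sessions, which a real scheduler would also have to weigh.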

16:20Bella: The example the paper gives for when hedging wins is a task about ordering a sub sandwich under thirty dollars. The bottleneck step is finding and clicking the add button on a complicated menu page. The serial expected latency is around a hundred and thirty-three seconds. The hedge-with-four expected latency is around a hundred and fifteen. And the hedge wins about seventy-three percent of the simulated trials. Which isn't even close.

16:49Tyler: And the reason it isn't close is that one step has the heavy tail. Hedging tames it. For a different task, where every step is consistent, the simulation would say "serial is fine, don't waste the cores" and the scheduler would pick serial. The point is the decision is data-driven. It's not a hardcoded heuristic; it's a thousand-trial simulation per task.

17:13Bella: I want to pause on something, because we just said a few things that, if you don't separate them carefully, blur together. There are two phases here, and they're easy to mix up. One is offline: the one-time setup the system does per application. That takes somewhere between twenty-five and ninety minutes per app. During that phase, the system explores the app, records UI interactions, fits the latency distributions, and synthesizes the cached tool implementations. That happens once.

17:45Tyler: And after that, every task the user gives the system runs online, in real time, and the online phase is what gets the speedup numbers.

17:54Bella: Exactly. The ten-times-faster number you hear about isn't ten times faster than the offline phase. It's ten times faster than a baseline agent on the actual user-facing task. The offline phase is a cost you pay once, and the authors are explicit that this approach makes sense when you're hitting the same application repeatedly, automating workflows on apps your team uses every day, not when you want to do a one-off interaction with a site you've never seen.

18:25Tyler: Worth keeping clear, yeah. So let's talk about what those online numbers actually look like, because they're the reason the paper got attention.

18:34Bella: The headline. The system gets more than ten times the speed of the standard open-source baseline, the one that represents the screenshot-and-decide style of agent. Ten times. And it doesn't trade accuracy for speed. It improves accuracy by twenty-eight percentage points on top of being ten times faster.

18:54Tyler: Which is the kind of result that, when I first saw it, I assumed had to involve some catch. Faster *and* more accurate, by margins like that, usually means the baseline was a strawman.

19:06Bella: And the authors clearly anticipated that, so they ran a control that addresses it head-on. They took a frontier computer-use model, the strongest available at the time, and gave it the same synthesized cached tools that the system uses. So the baseline has all the tool quality but none of the cost-optimizing planning machinery. The system is still one and a half to two and a half times faster at the same accuracy. Which is the right comparison, because it isolates the contribution of the planning architecture itself from the contribution of just having better tools available.

19:44Tyler: And separately, the scheduler, the strategy picker, gets about two-point-four times speedup over OpenAI's computer-use agent and a nine-point accuracy improvement. The two components are roughly independent and they stack.

20:00Bella: One number I want to make sure lands. Across thirty-two candidate plans for the same task, the worst plan takes about five times longer than the best plan. And the standard interpreter-style agent doesn't choose; it just runs whatever plan-equivalent it happens to produce first. There's a five-x latency penalty sitting on the floor that current systems are paying every time, because they never thought to compare alternatives.

20:28Tyler: Okay. So this all sounds great, and I think the framing is genuinely powerful, but I want to put my skeptic hat on for a moment because the paper has some real limitations and the authors mostly acknowledge them.

20:42Bella: Please.

20:42Tyler: First, the offline cost. Twenty-five to ninety minutes per application is not nothing. It's a one-time cost, sure, and the authors note it parallelizes down to twenty or thirty minutes in some cases, but it means the approach is essentially confined to repeated workloads. If you're a customer-support team running thousands of tasks a week against the same five apps, this amortizes beautifully. If you want a general-purpose agent that can handle whatever website a user mentions, this isn't it. The paper is honest about this. Second, cache staleness. UI changes. Web apps redesign things. A/B tests roll out and roll back. Every time a button moves or a form gets restructured, some cached tool's contracts become wrong. The protocol catches the failure at runtime (the postcondition check fails and the system falls back to re-planning), but the authors don't really quantify how often this fires in real deployments over weeks and months. And for a system whose entire value proposition is amortizing a one-time setup cost, that's the load-bearing variable they haven't measured.

21:52Bella: That's a fair pull. The whole pitch is "twenty-five minutes upfront, then save ten times on every task" — and if the cache decays faster than the savings accumulate, the math changes.
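
The break-even reasoning here is simple arithmetic, sketched below. The setup times and the ten-times speedup come from the episode; the five-minute baseline task time and the cache-lifetime figures are hypothetical placeholders, as are the function names.

```python
import math


def break_even_tasks(setup_min, baseline_task_min, speedup):
    """Number of tasks needed before the saved time repays the offline setup."""
    saved_per_task = baseline_task_min - baseline_task_min / speedup
    return math.ceil(setup_min / saved_per_task)


def pays_off(setup_min, baseline_task_min, speedup, tasks_before_stale):
    """True if the cache survives long enough to recoup the setup cost."""
    return tasks_before_stale >= break_even_tasks(setup_min, baseline_task_min, speedup)


BASELINE_TASK_MIN = 5.0  # hypothetical: minutes per task for a baseline agent

print(break_even_tasks(25, BASELINE_TASK_MIN, 10))  # low end of the setup range
print(break_even_tasks(90, BASELINE_TASK_MIN, 10))  # high end of the setup range
print(pays_off(25, BASELINE_TASK_MIN, 10, tasks_before_stale=3))
print(pays_off(25, BASELINE_TASK_MIN, 10, tasks_before_stale=100))
```

Under these placeholder numbers the setup is repaid after 6 tasks at the low end and 20 at the high end. The unmeasured quantity is `tasks_before_stale`: if a UI redesign invalidates the cache before the break-even count is reached, the setup never pays for itself.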

22:04Tyler: Right. Third, and this is the most surprising one to me, the scheduler's prediction about which UI elements a plan will touch is itself made by an LLM call. So the whole simulation is conditioned on a language model getting the prediction right. If the LLM systematically misjudges what elements a plan will touch, the simulation is sampling from distributions for the wrong elements and the strategy choice could be wrong. The paper doesn't deeply evaluate how often that happens.

22:37Bella: And the cost model itself, even when it does work, is calibrated to rank correctly — not to predict absolute latency. The paired-comparison gap between best and worst plan is one-point-eight times, even though the mean ratio is five-point-three. Which means the model is good at picking the cheapest, but the spread between actual and predicted cost for any single plan could be wide. A deployment that trusts the ranking absolutely could be wrong in ways that are hard to notice.

23:09Tyler: And finally, the benchmark is small. Thirty-seven tasks across five web applications. Some of those tasks come unmodified from public benchmarks, but the authors also "manually curated" six additional tasks per application to ensure balanced coverage across scheduling strategies. Which is a reasonable thing to do when you're evaluating a scheduler — you need a mix of tasks where each strategy could win — but it does mean the scheduler is being tested partly on tasks designed to exercise scheduling. Generalization to truly arbitrary user requests is undertested.

23:46Bella: I think those are the honest objections, and none of them undermine the core contribution. The contribution isn't "we have a finished product that works on every website." The contribution is "here is a different way to think about the architecture of an agent, and here's evidence that the thinking is right."

24:07Tyler: Agreed. And I'd say the most durable thing in the paper isn't any of the specific mechanisms (the protocol, the cost-walking planner, the scheduler), though those are all clever. It's the principle underneath. Non-determinism in an agent should be a deliberate choice, not a default. Most of what current agents treat as "decisions for the LLM" are actually just data flow that ordinary code could handle. The LLM call should be reserved for the moments where genuine ambiguity exists and nothing else will do.

24:41Bella: And once you accept that principle, decades of compiler and systems research opens up. Static analysis. Cost-based optimization. Scheduling. None of these techniques are new in computing. What's new is noticing they map cleanly onto a problem people had been treating as fundamentally novel.

25:00Tyler: The Plan C moment, for me, is where this lands. You have a multi-step task that involves figuring out the cheapest item at Taco Bell. The interpreter-style agent does it by asking a language model fifteen times what to do next. The compiler-style agent looks at the whole task, realizes the comparison step is just "min" over a list, writes a program that uses one or two LLM calls and a bunch of ordinary code, and finishes in a fifth of the time. Both approaches produce the right answer. One of them is fifty times cheaper at the operation that matters.

25:37Bella: And the cost wasn't being paid for any good reason. It was being paid because the architecture didn't know how to ask the question.

25:46Tyler: Right. That's the shift. The architecture starts asking the question.

25:50Bella: I think that's where to leave it. The show notes have a link to the paper and some related reading on agent architecture if you want to go deeper. And if you want the full transcript with definitions inline, plus the concept pages that connect this episode to other things we've covered on agent design, that's all on paperdive.ai.

26:12Tyler: Thanks for listening. This has been AI Papers: A Deep Dive.