Why Your Coding Agent Stalls While the GPU Runs Hot
About this episode
Modern LLM serving stacks were built for chatbots, and agents are quietly breaking them — pinning GPUs at full utilization while users wait minutes for replies. A new paper from Duke argues the fix isn't bigger hardware but borrowing scheduling ideas from 1970s operating systems, and the measured speedups are hard to ignore.
What you'll take away
- Why throughput dashboards lie for agent workloads, and what 'goodput' — finishing within a multiple of a task's ideal time — actually measures
- The two pathologies that crater agent latency: KV cache thrashing during tool pauses, and CPU-GPU coupling that strands GPU capacity
- How MARS unifies scheduling and KV eviction under one priority order using a multi-level feedback queue lifted straight from classical OS design
- The headline numbers — up to 5.94x mean latency reduction on a controlled testbed, but only ~1.87x in a real OpenHands deployment — and why the gap matters
- Where the paper's framing is generously tuned: an alpha-of-three success bar, single-GPU experiments, baselines reimplemented inside MARS's stack, and a constructed long-context workload
- The broader shift the paper represents: LLM serving professionalizing into systems research, with sessions-as-processes and KV-cache-as-virtual-memory as the new vocabulary
Chapters
- 00:00 The busy-GPU, broken-agent puzzle
- 02:59 Throughput vs. goodput
- 05:58 Two pathologies: KV thrashing and CPU-GPU coupling
- 08:58 Inside MARS: observability, admission control, scheduling
- 11:57 The chunk-shrinking trick and other small cleverness
- 14:56 What the numbers actually show
- 17:56 Where the paper reaches
- 20:55 Serving as systems research
References in this episode
- Efficient Memory Management for Large Language Model Serving with PagedAttention — The vLLM paper that MARS builds on top of — essential context for understanding
- Autellix: An Efficient Serving Engine for LLM Agents as General Programs — The program-aware scheduler MARS positions itself against — the episode frames i
- MemGPT: Towards LLMs as Operating Systems — A kindred-spirit system in the OS-vocabulary-for-LLMs lineage the episode highli
Full transcript
0:00Juniper: Picture an LLM serving dashboard. GPU utilization pinned in the high nineties, tokens streaming out at full bore, throughput numbers that would have looked impossible two years ago. By every metric the dashboard is built to track, the machine is healthy. Now picture the actual coding agent running on top of that machine — pausing thirty seconds between every tool call, stalling for whole minutes on tasks that should take one, flat freezing the moment a few users show up at once. Both pictures are of the same hardware, at the same instant.
0:38Finn: That gap — busy GPU, broken agent — is the puzzle this paper sits inside. It's called "MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems," out of Duke, posted to arXiv in April. We're recording a few weeks later, in early May. Quick note before we dig in: this is an AI-generated deep dive — I'm Finn, and Juniper and I are AI voices from Eleven Labs, and the script is from Anthropic's Claude Opus 4.7. Neither company is involved in producing the show. And that gap between busy GPU and broken agent is the thing every coding agent you've heard of — Cursor, Devin, Claude Code, OpenHands — runs into some version of, even if their teams don't always say so out loud.
1:27Juniper: The paper's diagnosis is sharp. It says modern LLM serving stacks — vLLM, TGI, the whole open-source set — were designed for a world that doesn't exist anymore. That world was chat. A user types a prompt, the model generates a response, done. Pile up many independent short requests, batch them, optimize tokens per second. That problem is well solved. The trouble is, agents broke three assumptions inside it.
1:56Finn: Walk me through the three, Juniper.
1:59Juniper: First: a request is no longer one-shot. An agent generates a few hundred tokens, calls a tool — run this code, edit this file, search this repo — waits, gets a result, generates again. Maybe ten or twenty rounds. So what the scheduler used to think of as a single request is actually a long-lived session that pauses and resumes repeatedly. Second: contexts ballooned. We're not talking chat-scale prompts. The benchmarks in this paper sit at prompt sizes from a hundred thousand tokens up past a quarter-million — entire codebases plus accumulated tool traces. Third, and this is the one most people miss: the work isn't all on the GPU anymore. When the agent calls a tool, the GPU side pauses while the host CPU suddenly gets busy executing whatever the tool actually does. Same machine, two completely different kinds of work, stepping on each other's resources.
2:59Finn: And the existing schedulers each grab one piece of the elephant.
3:03Juniper: Right. Throughput-centric engines admit work in arrival order and have no idea a session even exists. Program-aware schedulers (Autellix is the one MARS cites) understand the agent's logical structure but are blind to physical resources. Tool-aware systems like InferCept and Continuum try to manage the KV cache during tool pauses, but they make their decision at the moment the tool starts, with a static heuristic, and then don't revisit it. So you end up with three classes of system, each correct about one axis and oblivious to the others.
3:41Finn: Before you go further — I want to ground the diagnostic the paper builds the rest of its case on, because everything turns on it. The paper draws a hard line between throughput and goodput. Throughput is what every dashboard measures: tokens per second, how busy the GPU is. Goodput is something different — it's the rate at which workflows actually finish within a reasonable time budget. And here's the trick: "reasonable" is scaled to how hard the task inherently is. The paper measures the ideal time a request would take if it were running alone, with nothing else in the system. Then it asks: did this request finish within, say, three times that ideal? If yes, it counts. If not, it doesn't, no matter how many tokens the GPU pumped out along the way.
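The goodput rule Finn describes can be written down in a few lines. This is a minimal sketch, not the paper's measurement code: the function name `goodput`, the `(ideal, actual)` tuple layout, and the sample numbers are all illustrative assumptions. Only the rule itself (a request counts if it finishes within alpha times its isolated ideal time) comes from the discussion.

```python
from typing import List, Tuple

def goodput(runs: List[Tuple[float, float]], window_s: float, alpha: float = 3.0) -> float:
    """Requests per second that finished within alpha times their ideal time."""
    on_time = sum(1 for ideal_s, actual_s in runs if actual_s <= alpha * ideal_s)
    return on_time / window_s

# Made-up sample: (ideal_seconds, actual_seconds) for four requests in a 100 s window.
runs = [(60.0, 150.0), (60.0, 200.0), (30.0, 80.0), (10.0, 45.0)]
print(goodput(runs, window_s=100.0, alpha=3.0))
print(goodput(runs, window_s=100.0, alpha=2.0))
```

With these made-up numbers, two of the four requests clear the alpha-of-three bar and none clears an alpha of two, even though the same amount of work was done in both cases.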
4:30Juniper: The kitchen analogy is the cleanest one. A restaurant kitchen at full tilt — every burner hot, every cook moving — looks productive. But if half the dishes get sent back, started over, sit on the pass cooling — the throughput of cooking activity is high while the goodput of meals delivered is low. Existing LLM servers are kitchens optimized for cooking activity. MARS is trying to optimize for meals served.
4:56Finn: And the line in the paper that captures the whole thesis is: throughput measures how busy the engine is; goodput measures whether the system remains on track. That's the sentence the rest of the paper exists to defend.
5:11Juniper: The empirical hook for that defense is one chart, and it's brutal. The authors plot baseline systems under increasing load. Token throughput stays high — the kitchen looks busy. Goodput collapses to near zero — almost nothing's actually completing within budget. The dashboard is lying to you. Or rather, the dashboard is telling you the truth about a quantity that no longer correlates with whether your users are happy.
5:38Finn: So why does it collapse? What's actually going wrong inside?
5:42Juniper: Two pathologies, both rooted in the agent shape we just walked through. The first is around the KV cache. When a transformer generates text, every token it has already seen leaves behind a "key" and "value" tensor that future tokens need to attend back to. Recomputing those from scratch every step would be wildly expensive, so serving systems cache them on the GPU. For chat, that cache is small and disposable. For an agent with a quarter-million-token context, the cache for one session is enormous, and it sits on the most precious resource the machine has — high-bandwidth GPU memory.
6:19Finn: And every time the agent pauses for a tool, you have a decision to make about that cache.
6:25Juniper: Exactly the decision. When a session pauses, do you keep its KV cache resident — paying memory rent on something idle — or evict it and pay to rebuild the prefix when the session wakes back up? The hotel-room analogy works well here. You've checked into a room, you've spread out your stuff, and now you're going sightseeing for the afternoon. Do you keep the room? Costs you a night's rate on something you're not using. Do you check out? Then when you come back you rebuild everything from scratch. If the hotel is empty, obviously keep the room. If it's overbooked and there are people waiting in the lobby, your idle room is the problem. The right answer depends on conditions you can only see at the moment of decision — not when you first checked in.
7:12Finn: Which is what tool-aware baselines get wrong. They commit to keep-or-evict at the moment the tool starts, and then conditions change.
7:21Juniper: Right. And the failure shows up viscerally. The paper measures time-to-first-token for each round of an agent loop, and reports that baseline systems consistently exceed a thousand seconds at the P99 tail. A thousand seconds. Round-by-round, at the ninety-ninth percentile. That's the difference between an agent that feels alive and one that's effectively frozen.
7:44Finn: The second pathology was the CPU-GPU coupling, right? Sessions blocked on tools strand GPU capacity.
7:50Juniper: Yeah. While a session is waiting for its tool to finish, it's holding a slot in the system but doing no GPU work. The GPU has cycles it can't use, because admission was already given to a session that's gone idle. Meanwhile the CPU side is suddenly hammered by the tool. The system is congested in two different places that the scheduler can't see at the same time.
8:14Finn: Okay. So now we know what's wrong. What does MARS actually do about it?
8:20Juniper: The answer is structurally simple even if the details are intricate. Three layers. The bottom layer is observability — the paper calls it the Unified Information Stream. The middle layer is admission control. The top layer is the internal scheduler. The thing that ties them together is one principle: the unit of optimization is the session, not the request, and the scheduler should see GPU pressure and CPU pressure simultaneously.
8:48Finn: Take the layers in order.
8:50Juniper: The bottom layer is plumbing, but it's load-bearing plumbing. Every phase boundary in an agent's life — GPU submit, first token, generation end, tool start, tool end — emits a structured event. On top of that, the system tracks how many tools are active and how long they're typically taking, with exponential smoothing so a single outlier doesn't throw the estimate. The clever pragmatic choice is what they don't measure. They reject byte-level memory counters in favor of block-level allocator state, because the KV cache lives in fixed blocks anyway. And they reject hardware CPU instrumentation in favor of just counting active tool invocations. Both choices keep the system portable across hardware.
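A rough sketch of the tool-side telemetry described here, under stated assumptions: the class name `ToolLoadTracker`, the `on_event` method, and the smoothing constant are hypothetical and not the paper's API. The event names mirror the phase boundaries mentioned in the episode, and the update is a plain exponential moving average.

```python
from typing import Dict, Optional

class ToolLoadTracker:
    """Counts active tool calls and smooths their observed latency."""

    def __init__(self, smoothing: float = 0.2) -> None:
        self.smoothing = smoothing            # weight given to the newest sample (assumed value)
        self.active_tools = 0                 # stand-in for CPU pressure
        self.avg_tool_s: Optional[float] = None
        self._started: Dict[str, float] = {}  # session id -> tool start time

    def on_event(self, session_id: str, kind: str, t: float) -> None:
        if kind == "tool_start":
            self._started[session_id] = t
            self.active_tools += 1
        elif kind == "tool_end" and session_id in self._started:
            latency = t - self._started.pop(session_id)
            self.active_tools -= 1
            if self.avg_tool_s is None:
                self.avg_tool_s = latency
            else:
                a = self.smoothing
                # Exponential smoothing: one outlier moves the estimate by only a fraction a.
                self.avg_tool_s = (1 - a) * self.avg_tool_s + a * latency
        # gpu_submit, first_token and generation_end are ignored in this sketch.
```

Counting active invocations rather than reading hardware counters is what keeps a tracker like this portable, which is the trade-off Juniper points to.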
9:37Finn: So this is the air-traffic-control tower that can finally see runways and weather and how long planes have been circling, all at once.
9:46Juniper: That's the right frame. And once you have that visibility, the next two layers can be coordinated. The middle layer — admission control — does two things. It sorts the waiting queue based on what regime the system is in. Normally, favor small sessions. When the CPU is congested with running tools, favor sessions that are heavy on the GPU side, because those won't add CPU load. When KV memory is tight, pack with awareness of memory cost. The second piece of the middle layer is the one that controls how much new work gets admitted at all, and it uses AIMD.
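One way to picture the regime-dependent queue ordering is the sketch below. It is an interpretation, not the paper's policy: the `Waiting` record, the `gpu_share` estimate, the regime labels, and the greedy packing rule for the memory-tight case are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waiting:
    name: str
    kv_blocks: int    # KV blocks the session would need (size proxy, assumed)
    gpu_share: float  # hypothetical estimate of time spent on GPU rather than in tools

def order_queue(queue: List[Waiting], regime: str, free_kv_blocks: int = 0) -> List[Waiting]:
    if regime == "cpu_congested":
        # Prefer GPU-heavy sessions, since they add little tool load to the CPU.
        return sorted(queue, key=lambda s: s.gpu_share, reverse=True)
    by_size = sorted(queue, key=lambda s: s.kv_blocks)
    if regime == "kv_tight":
        # Greedy packing: admit smallest first, skipping anything that no longer fits.
        packed, budget = [], free_kv_blocks
        for s in by_size:
            if s.kv_blocks <= budget:
                packed.append(s)
                budget -= s.kv_blocks
        return packed
    return by_size  # normal regime: favor small sessions

queue = [Waiting("a", 400, 0.9), Waiting("b", 100, 0.3), Waiting("c", 250, 0.6)]
print([s.name for s in order_queue(queue, "normal")])
print([s.name for s in order_queue(queue, "cpu_congested")])
print([s.name for s in order_queue(queue, "kv_tight", free_kv_blocks=360)])
```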
10:24Finn: AIMD being the same control loop that runs every internet connection — additive increase, multiplicative decrease.
10:32Juniper: Exactly the same shape. When everything looks healthy on both the GPU and CPU sides, the admission window grows linearly. The moment either subsystem shows stress, the window gets cut multiplicatively — slammed. The driving-in-fog intuition gets at why the asymmetry. When you can see clearly, you accelerate gently. When you spot brake lights, you don't taper off; you brake hard. Cautious going up, aggressive coming down. That asymmetry is what makes it stable when you can't directly observe the resource you're sharing.
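The additive-increase, multiplicative-decrease loop can be sketched as follows. The class name `AimdWindow` and every constant (step, backoff factor, floor, ceiling) are illustrative assumptions; the episode only states that the window grows linearly when both sides look healthy and is cut multiplicatively when either shows stress.

```python
class AimdWindow:
    """Admission window: additive increase, multiplicative decrease."""

    def __init__(self, initial: float = 4.0, step: float = 1.0, backoff: float = 0.5,
                 floor: float = 1.0, ceiling: float = 64.0) -> None:
        self.size = initial
        self.step = step
        self.backoff = backoff
        self.floor = floor
        self.ceiling = ceiling

    def update(self, gpu_healthy: bool, cpu_healthy: bool) -> int:
        if gpu_healthy and cpu_healthy:
            # Cautious going up: grow by a fixed step.
            self.size = min(self.ceiling, self.size + self.step)
        else:
            # Aggressive coming down: stress on either side shrinks the window by a factor.
            self.size = max(self.floor, self.size * self.backoff)
        return int(self.size)  # number of sessions that may be admitted

window = AimdWindow()
for gpu_ok, cpu_ok in [(True, True), (True, True), (True, False), (False, False)]:
    print(window.update(gpu_ok, cpu_ok))
```

Two healthy ticks raise the window from 4 to 6; a single stressed tick then halves it to 3, and a second one takes it to the floor.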
11:08Finn: And then the top layer is the internal scheduler, which is where the genuinely novel work is.
11:14Juniper: This is where I want to spend time. The internal scheduler replaces first-come-first-served — which is what most LLM servers do by default — with a multi-level feedback queue. MLFQ. This is a 1960s-and-70s OS idea, and the analogy that makes it click for most people is your laptop. When you click a button, the operating system gives that work a quick shot at the front of the line. If it finishes in milliseconds, you never noticed the click had to wait. If it turns out to be heavy — compiling something, running a simulation — the OS demotes it to a lower-priority queue so the next click can also feel instant. Add a rule that anything waiting too long gets promoted back up so nothing starves forever. That's why your laptop with fifty processes running feels fluid.
12:07Finn: So agent sessions get treated the same way.
12:10Juniper: Sessions get scheduled exactly the way processes do on your laptop. A session arrives small — small initial KV footprint, no accumulated service — and lands at high priority. If it finishes quickly, the user feels a snappy response. If it keeps consuming GPU, it gets demoted. If it's been waiting a long time at lower priority, it gets a bounded promotion. Short interactive agent calls jump the queue; long heavy ones make steady progress in the background.
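A toy multi-level feedback queue makes the demote-and-promote behavior concrete. This is a textbook-style sketch rather than MARS's scheduler: the `Session` and `Mlfq` names, the per-level quanta, the promotion threshold, and the use of GPU seconds as the service measure are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Session:
    name: str
    level: int = 0       # 0 is the highest priority
    served: float = 0.0  # GPU seconds used at the current level
    waited: float = 0.0  # seconds spent waiting since last run or promotion

class Mlfq:
    def __init__(self, quanta: Tuple[float, ...] = (1.0, 4.0, 16.0),
                 promote_after: float = 30.0) -> None:
        self.quanta = quanta
        self.promote_after = promote_after
        self.sessions: List[Session] = []

    def add(self, session: Session) -> None:
        self.sessions.append(session)  # new sessions start at level 0

    def pick(self) -> Session:
        # Highest priority wins; min() keeps arrival order on ties.
        return min(self.sessions, key=lambda s: s.level)

    def account(self, ran: Session, gpu_seconds: float) -> None:
        last = len(self.quanta) - 1
        for s in self.sessions:
            if s is ran:
                s.served += gpu_seconds
                s.waited = 0.0
                if s.served >= self.quanta[s.level] and s.level < last:
                    s.level += 1   # used up its quantum: demote
                    s.served = 0.0
            else:
                s.waited += gpu_seconds
                if s.waited >= self.promote_after and s.level > 0:
                    s.level -= 1   # bounded promotion: one level at a time
                    s.waited = 0.0
```

A session that burns through its quantum drops a level, so a newly arrived short call is picked ahead of it; the waiting rule then lifts the demoted session back up one level so it cannot starve.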
12:41Finn: And here's where I'd flag what I think is the cleanest piece of the design — Juniper, the part that surprised me on the read. The same priority structure that decides what runs next also decides what gets evicted from KV cache when memory's tight. Lower-priority sessions become the eviction candidates. And among those candidates, the ones with the largest KV footprints go first, because they release more memory immediately.
13:10Juniper: That's the move. Two decisions that fight each other in most systems — what to schedule next versus what to evict — get unified under one ordering. The scheduler isn't fighting itself. And the KV residency decision becomes a continuous runtime decision rather than a one-shot guess. A pinned context can be reclaimed mid-session if active requests need the memory more than the idle session needs to stay warm. Back to the hotel analogy — the room reservation can be revoked if the lobby fills up, even though you said you'd be back.
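The unified ordering can be illustrated with a single sort key. This is a sketch under assumptions: the tuple layout `(name, level, kv_blocks)`, the function names, and the sample values are hypothetical. The rule itself follows the discussion: lower-priority sessions are evicted first, and among equals the largest KV footprint goes first.

```python
from typing import List, Tuple

SessionInfo = Tuple[str, int, int]  # (name, priority level with 0 as highest, KV blocks held)

def eviction_order(sessions: List[SessionInfo]) -> List[SessionInfo]:
    # Lowest priority (largest level number) first, then largest footprint first.
    return sorted(sessions, key=lambda s: (-s[1], -s[2]))

def choose_victims(sessions: List[SessionInfo], blocks_needed: int) -> List[str]:
    victims, freed = [], 0
    for name, _level, kv_blocks in eviction_order(sessions):
        if freed >= blocks_needed:
            break
        victims.append(name)
        freed += kv_blocks
    return victims

sessions = [("a", 0, 500), ("b", 2, 100), ("c", 2, 900), ("d", 1, 300)]
print(choose_victims(sessions, blocks_needed=950))
```

Because the scheduler would run sessions in the reverse of this order, the session about to be scheduled is never the one about to be evicted.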
13:43Finn: There's one more trick I want to make sure we cover, because it's surprisingly clever for something this small. When MARS tries to allocate KV memory for the next chunk of work and it doesn't fit — there's not enough contiguous space — the obvious response is to preempt: evict somebody, free a slab, retry. MARS instead shrinks the chunk. It breaks the requested allocation down into smaller pieces, all the way down to a single block if it has to, until something fits.
14:12Juniper: The suitcase analogy. Your suitcase won't quite close, the naive move is to repack everything from scratch. MARS instead folds things smaller until the lid latches. You make slower progress on this one item — service granularity drops — but you don't have to throw out the suitcase and start over. It converts what would be a hard preemption failure into a graceful slowdown.
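The chunk-shrinking idea can be sketched as a small helper. The function name `fit_chunk` and the halving schedule are assumptions made for illustration; the episode only says the requested allocation is broken into smaller pieces, down to a single block if necessary.

```python
def fit_chunk(requested_blocks: int, free_blocks: int) -> int:
    """Return how many KV blocks to allocate now, or 0 if not even one block fits."""
    chunk = requested_blocks
    # Shrink (by halving, an assumed schedule) instead of preempting another session.
    while chunk > free_blocks and chunk > 1:
        chunk //= 2
    return chunk if chunk <= free_blocks else 0

print(fit_chunk(16, 5))   # shrinks until the chunk fits in the free space
print(fit_chunk(16, 0))   # nothing fits, so the caller must fall back to eviction
```

The remaining blocks of the original request would be processed on later scheduling steps, which is the slower-but-steady progress the suitcase analogy describes.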
14:36Finn: And the design of that whole stack is honestly modest in implementation terms. Five thousand-some lines of Python on top of vLLM. They didn't touch the lower-level batching and attention paths. This is a relatively small intervention that produces large effects.
14:52Juniper: How large? That's where the empirical thread lives. Let me hand the numbers to you, Finn — they're worth setting up carefully.
15:00Finn: The headline number is almost six times faster. On a controlled testbed with Qwen3-Coder-30B, MARS produces up to a five-point-nine-four-times mean latency reduction against the strongest baseline at each load point. Now, the authors also report a separate result on GPT-OSS, the larger 120B model, where mean latency improves by up to seven-point-five-six times — but worth flagging that this larger figure is on a shortened-input workload the authors had to construct because GPT-OSS-120B's context window caps at 131K tokens, so it's not exactly the same setup as the 5.94 number. So they're related but not directly stackable. The interpretation the authors offer is that bigger models have more memory pressure, which means smarter KV management pays bigger dividends.
15:50Juniper: But the number you wanted to lead with for honest real-world impact was different.
15:55Finn: Right. The controlled-testbed numbers are the ceiling — they tell you what the architecture can do under conditions designed to expose the differences. The number that probably matters more for people building products is the OpenHands deployment number. They wrap MARS into a real coding-agent framework end-to-end and measure task completion. There the gain is up to one-point-eight-seven times — nearly twice as fast on real tasks. Meaningful, but not transformative. The authors are honest that framework overhead absorbs latency outside the serving backend.
16:32Juniper: There's one image from the results I want to put in front of the listener, because it's the kind of thing you don't forget. They graph KV cache evictions per second over the course of a run. Baseline systems do somewhere between ten and over a thousand block evictions per second, continuously, throughout the entire run. MARS does aggressive eviction at the initial load spike and then suppresses it almost completely. Panicking continuously versus pacing yourself. The baselines aren't unhealthy because they fail at any one moment — they're unhealthy because they thrash without ever stabilizing.
17:09Finn: That's a good place to move into the critique. Because as much as I think the framing is genuinely sharp, there are several places a careful reviewer would press.
17:20Juniper: Press them, Finn.
17:21Finn: First, the goodput numbers in the headline use an alpha of three — meaning a request counts as successful if it finishes within three times its ideal isolated time. That's a pretty wide bar. If your ideal time is a minute, three minutes still counts. The paper does include alphas of one and two in ablations, but the dramatic collapse-vs-survival framing rests on alpha equals three, where the baselines look worst. A more demanding bar would compress the apparent gap. Not by everything — the gap is real — but the magnitude of the reported win is partly a function of where you draw the line.
17:59Juniper: That's fair. What else?
18:00Finn: Second, every baseline in the comparison was reimplemented inside MARS's modified vLLM stack. That's the right methodological choice for fair comparison — you don't want to be benchmarking different infrastructures with different overheads. But it does mean the baselines aren't running with their authors' best-tuned configurations. If Autellix or Continuum were deployed under their original design assumptions, the gaps could narrow. Third, every experiment is single GPU. The whole conceptual frame — holistic visibility unlocks better scheduling — gets harder to maintain when telemetry has to be coordinated across multiple nodes. Real production deployments are multi-replica with load balancing, and live KV migration between nodes is expensive. The authors flag this as future work, candidly. But the central claim hasn't yet been tested where most production systems actually live.
18:52Juniper: And the workload mix is constructed.
18:54Finn: Yeah, the input-length regimes — they have four of them — are sampled from a pool of benchmarks specifically to construct progressively heavier long-context loads. That's reasonable methodology. It also means the workload is curated to stress the dimension MARS is best at. A workload dominated by short, uniform agent calls might show smaller gains, because most of the gains come from being smart about long-lived sessions and large KV states.
19:20Juniper: The authors also flag a regime where their own opportunistic chunk-shrinker actually hurts.
19:26Finn: They do. At the lightest workload, half a request per second, the overhead of running the opportunistic co-scheduler exceeds its benefit. They call it a minor performance inversion. It's an honest acknowledgment that the architecture has regimes where it's adding cost without value, and they don't yet have an adaptive heuristic for when to engage it. That's the kind of detail that gets papered over more often than it should.
19:51Juniper: And the discussion section is unusually candid in general. They acknowledge fairness is sacrificed for global progress — MARS is not the system you'd run if you were a multi-tenant cloud serving competing customers who need isolation. They acknowledge no multi-GPU scaling story. They acknowledge sessions are modeled as linear loops, so anything DAG-shaped — tree-of-thoughts, multi-agent collaboration, fork-join pipelines — would need richer primitives they suggest but don't build.
20:23Finn: Voicing those alongside the steelman is the right posture. The paper is candid; we should be candid with the listener about what it doesn't claim.
20:32Juniper: Where does this fit in the broader arc of the field, in your read, Finn?
20:37Finn: There's a quiet shift underway in how LLM serving is conceived, and MARS is a clean example of it. The first generation of serving systems treated inference as a specialized kind of batch processing — many independent requests, optimize the throughput of the GPU, done. That worked beautifully for chat. But as the workloads on top of LLMs got more elaborate — long contexts, retrieval pipelines, tool-using agents — the serving layer started having to track state, sessions, and resources that span beyond a single GPU computation.
21:12Juniper: And the vocabulary borrowed for that shift is operating-systems vocabulary.
21:17Finn: Exactly. Sessions as processes. KV cache as virtual memory. Tool calls as syscalls. Schedulers that have to coordinate across heterogeneous resources. MARS is explicitly importing classical OS techniques — feedback queues, congestion control, working-set thinking — into a stack that historically was built more by ML engineers than systems engineers. The bet is that as agent workloads become the dominant commercial use of LLMs, the bottlenecks will increasingly look like the bottlenecks the OS community already learned to manage in the seventies and eighties, and the answers will rhyme with what worked there. MemGPT and AIOS are kindred-spirit systems in that lineage. MARS adds a concrete demonstration that the rhyming actually produces measurable wins.
22:04Juniper: There's a second thread I'd add to that. The field is still working out what to even measure. Tokens per second is easy and visible but doesn't track user-facing experience for agent workloads. Tail latency per request is closer but doesn't account for how hard the task inherently is. Defining a goodput metric scaled by intrinsic task difficulty is part of a larger move toward evaluations that reflect whether users are happy rather than whether GPUs are warm. That metric work might end up being more durable than the specific architectural choices in MARS.
22:40Finn: That's a good thing to leave the listener with. The architecture might evolve. The diagnostic frame — the gap between busy GPU and useful work — is going to outlive any single system that closes it.
22:53Juniper: For a listener building with agents, the practical takeaway is that the bottleneck in your stack probably isn't model size or GPU memory in isolation. It's coordination. With the same hardware and the same model, the paper reports roughly two-to-six times lower mean latency just by being smarter about admission, scheduling, and KV retention. That's the kind of factor that turns an agent product from impractical to viable.
23:19Finn: For a listener watching the field, MARS is a sign that LLM serving is professionalizing into something that looks more like systems research and less like ML engineering. The next few years of serving papers are going to look more like this one — explicit OS framings, careful telemetry, scheduling disciplines borrowed from the canon — than like the throughput-maximizing batching papers of the last cycle.
23:44Juniper: The paper is from ee-FAY Wang, han-CHUNG yeh, and a team at Duke. Show notes have a link to it and related materials. Worth a read if this episode caught you.
23:54Finn: Thanks for listening to AI Papers: A Deep Dive.