All episodes
Episode 068 · May 22, 2026 · 27 min

The OS Trick That Makes Tree Search Practical for Coding Agents

Dong, He, Hou et al.

Operating Systems AI Infrastructure
AI Papers: A Deep Dive — Episode 068: The OS Trick That Makes Tree Search Practical for Coding Agents — cover art
paperdive.ai
Ep. 068
The OS Trick That Makes Tree Search Practical for Coding Agents
0:00
27 min
Paper
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
Venue
arXiv:2605.22781
Year
2026
Read the paper
arxiv.org/abs/2605.22781
Also available on
Apple Podcasts Spotify

Almost nobody runs tree search on real coding , even though it could add 30 points of accuracy on . The reason isn't the models — it's that and rollback take seconds, and a new paper from Shanghai Jiao Tong and Huawei closes that gap with a couple of clever OS tricks that hide checkpointing inside the LLM call you were already waiting on.

What you'll take away

  • Why gaps are sometimes OS limits, not model limits — and how closes a 30-point accuracy gap on by making checkpoint/rollback cheap
  • How hijacks plus to version a filesystem at runtime without ever duplicating unchanged data
  • The () + combination that gives you 5-millisecond rollback by keeping a frozen 'body double' of the process with almost no memory cost
  • The trick: hiding 15ms of checkpoint work inside the 1-20 second LLM call the was already waiting on
  • Why RL training GPU utilization jumps from about 51% to 99% when you replace shutil.copytree with forked templates
  • Where the design might creak: very large processes, faster LLM inference shrinking the masking window, and side effects that can't be rolled back

Chapters

  1. 00:00The capability gap tree search leaves on the floor
  2. 02:59The diary and the room: why checkpointing is hard
  3. 05:59DeltaFS and the stack of acetate sheets
  4. 08:59DeltaCR: fork() as a frozen body double
  5. 11:58Inference-masking: cooking while the microwave runs
  6. 14:58End-to-end SWE-bench results
  7. 17:58The RL training story: 51% to 99% GPU utilization
  8. 20:57Steelman critiques and where the design might creak
  9. 23:57The bigger reframe: OS substrates for agent workloads

References in this episode

Also available as a plain-text transcript page.

0:00Juniper: An AI coding is trying to fix a bug in Django. It edits a file, runs the tests, sees three failures, and decides — okay, that approach didn't work, let me back up and try something different. On paper, that backup is free. In practice, on a real Linux , it takes about a second and a half. And if your agent is doing tree search with a few hundred expansions, that bookkeeping eats the whole iteration budget.

0:30Tyler: Which is why almost nobody actually runs on real coding , even though the accuracy gains are enormous. There's a figure in this paper showing tree search adds five to thirty percentage points on top of linear search on . For -30B, the jump is from about nine percent to thirty-nine. That's not a minor optimization — that's leaving a third of your potential pass rate on the floor because the operating system underneath you is too slow.

1:02Juniper: Right — and that gap is the puzzle this paper picks up. The paper went up on arXiv on May twenty-first, twenty-twenty-six, and we're recording one day later. What you're hearing is AI-generated: the script is from Anthropic's , and the two of us — I'm Juniper, that's Tyler — are AI voices from Eleven Labs. The show isn't affiliated with either company. The paper we're digging into is ": Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback," from a group at Shanghai Jiao Tong University and Huawei. And the reason it matters is that they take that order-of-magnitude gap between what could do and what they actually do, and they close almost all of it with a couple of very clever OS tricks.

1:54Tyler: How clever? They get down to about fifteen milliseconds. Rollback down to five. The best comparable system before this — 's diff snapshots — sits at roughly five hundred milliseconds to checkpoint and a second and a half to restore. So we're talking thirty times faster on checkpoint, two hundred and fifty times faster on restore.

2:19Juniper: And the end-to-end result is what really lands. On tree-search workloads, the time the spends on state management — checkpointing, rolling back, all that plumbing — drops from roughly half to three-quarters of the total time, down to just a few percent. The LLM calls go from being the thing you wait for, to being the only thing you wait for. Which is how it should be.

2:47Tyler: So let's set up the actual problem. Why is hard in the first place?

2:53Juniper: The intuition pump I keep coming back to is: imagine the has a diary and a room. The diary is its memory — variables, the call stack, what it thinks it just did. The room is the filesystem — actual files on disk that it's been editing. If you reset the diary but not the room, the agent walks in convinced it never installed that package, but the package is still there. If you reset the room but not the diary, the agent is sure it just fixed line forty-two of `settings.py`, except line forty-two is back to the original. Either mismatch and the agent is operating on a lie.

3:34Tyler: And that's not a theoretical concern — that breaks tree search outright. If your search is exploring branches and the branches don't represent coherent states, the rewards are meaningless. The paper has a great line about this: either mismatch breaks the determinism required for correct tree search.

3:55Juniper: So the requirement is: when you checkpoint, you have to capture filesystem and memory together, as a single atomic moment. And when you restore, you have to swap both back together. The existing tools that do this — commits, VM snapshots, process checkpointing — they all duplicate the whole state. Every snapshot is a fresh photocopy of everything.

4:20Tyler: And here's where the authors plant their flag. They notice that between two consecutive checkpoints in an , the state has barely changed. The agent edits one file. Runs one test. Adds one package. The filesystem is the same except for a handful of bytes. The process memory is the same except for a handful of dirty pages. Photocopying everything every time is doing roughly a thousand times more work than the situation requires.

4:51Juniper: That's the key insight, and it's beautifully simple once you say it out loud. Consecutive states are nearly identical. So a checkpoint shouldn't be a full snapshot — it should be a delta from the previous one. Tyler, do you want to take the first mechanism, because the filesystem trick has a really clean shape to it?

5:13Tyler: Sure. So the filesystem half of this is called , and the analogy that makes it click is a stack of transparent acetate sheets. You're keeping a notebook, but instead of writing on paper, you're writing on clear plastic sheets stacked on top of each other. When you read the notebook, you look down through the stack and see whatever's on the topmost sheet that has writing at that spot. When you want to checkpoint — to mark "this is a state I might want to come back to" — you don't photocopy the stack. That would defeat the whole point. You just slide a fresh blank sheet on top, and now any new writing goes on the new sheet. The old sheet becomes read-only history.

5:59Juniper: And to roll back?

6:00Tyler: You peel off the top sheets until you're back where you want to be. Reads automatically fall through to whatever's underneath. The unchanged data is never duplicated — there's only one physical copy of it, sitting in some lower sheet, with all the upper sheets shadowing only the parts that actually changed.

6:21Juniper: And the beautiful thing about this is they didn't invent the layered filesystem. Linux already has — it's the thing that makes images work. A base Ubuntu layer at the bottom, your installed packages above that, your application code above that, a writable scratch layer on top. The catch with OverlayFS is that it freezes its stack at mount time. You can't change which layers are in the stack while it's running. The system was built to be set up once and then left alone.

6:54Tyler: So the first real engineering move in the paper is forcing to allow runtime layer changes. They add a kernel-level command that says: take the current writable top sheet, freeze it into a read-only layer, drop a fresh blank writable layer on top, and do all of that atomically, without unmounting anything.

7:15Juniper: And critically, without breaking any of the file handles the already has open. That's the subtle part. The agent might be in the middle of reading some file when you swap the layer stack underneath it. They handle that with a generation counter on each file — when a stale handle is used, the kernel notices it points at an old generation and quietly redirects it to the current layer. The agent never knows anything happened.

7:44Tyler: And underneath the whole thing they use with turned on, which means when a file does get copied up from a read-only layer because somebody's editing it, the unchanged blocks share physical storage by reference. A hundred-megabyte file that's been checkpointed fifty times still occupies about a hundred megabytes on disk, not five gigabytes. The cost is proportional to what actually changed, not to how many checkpoints you took.

8:13Juniper: That's the part that really matters at scale. If you're doing a tree search with hundreds of nodes, naive snapshotting would absolutely destroy your disk. With , the storage cost is bounded by the actual edits, which for a coding is usually tiny.

8:32Tyler: Okay. So that's the filesystem half. The 's files are now cheaply versionable. But the agent itself — the running Python process, its memory, its call stack, the LLM client object, all of that — still needs to be captured too. And memory checkpointing is harder than filesystem checkpointing in a really fundamental way.

8:55Juniper: Right, because the process is alive. It's mutating things while you're trying to take its picture. So tell me how — the memory side — handles that.

9:06Tyler: This is where the trick gets really nice. So Linux has a tool called — Checkpoint/Restore In Userspace — that can serialize a running process to disk and bring it back later. It works, but it's slow. For a multi-gigabyte process, CRIU restore can take seconds. Way too slow for an doing hundreds of rollbacks. But Linux also has `()`. Forty years old. `fork()` makes a duplicate of a running process, and on modern Linux it's almost instantaneous because of — the duplicate doesn't actually copy any memory, it just gets a second pointer to the same pages, marked as shared. Real copying only happens when somebody writes.

9:52Juniper: So the question is how do you combine those two.

9:56Tyler: The combination is the whole trick. At checkpoint time, does two things at once. It starts a dump in the background — writing only the memory pages that changed since the last checkpoint, to fast tmpfs — and simultaneously, it forks the . The is the magic. The parent gets immediately frozen with a stop signal. It just sits there, motionless, in suspended animation. The child keeps running as the live agent.

10:26Juniper: And the frozen parent is essentially free, because of ?

10:32Tyler: Almost free. It's sharing all its pages with the live child. The only memory cost is the kernel bookkeeping — the page tables and so on. They measure it at about eleven megabytes of resident memory per forked child in their fan-out experiments, which is nothing. You can keep dozens of these snapshots sitting around.

10:53Juniper: So now I see the shape of it. The frozen parent is your fast restore path. If you want to roll back, you just kill the live child, and wake up the frozen parent — which springs back to life exactly as it was at checkpoint time, because nothing ever modified it. That's where the five-millisecond restore comes from.

11:15Tyler: Exactly. The dump is the durability path — if you crash, if the template gets evicted, if the system reboots, the CRIU image on disk still lets you restore, just a few milliseconds slower. It's the safety net. The -based template is the speed path. You get both.

11:33Juniper: The body-double analogy works really well for this. When the is about to do something risky, it clones itself. The original goes into perfect stillness — a frozen body double — while the clone goes out and does the work. If the clone messes up, you don't rebuild the agent from scratch. You just thaw the original.

11:55Tyler: And there's a nice asymmetry built in. Making the snapshot is cheap. Throwing it away if you don't need it is even cheaper. The expensive operation — actually restoring — only happens when you've decided the experiment failed and want to back up. So the average cost is dominated by the cheap operations, and that's how they get the headline numbers.

12:18Juniper: Now there's a third piece that I think is honestly the most satisfying part of the paper, and it's the thing that makes the whole architecture practical rather than just clever.

12:30Tyler: Inference-masking.

12:32Juniper: Inference-masking. So here's the setup. The 's loop looks roughly like: do something on the , send the result to an LLM, get a response, do the next thing. The LLM call takes anywhere from one second to twenty seconds — it's a network round-trip to a model, possibly running through a few hundred billion parameters. And critically, the agent is just sitting there waiting during that whole time. Idle.

12:59Tyler: And the whole checkpoint window — the SIGSTOP pause plus the dump-and- — is about fifteen milliseconds.

13:05Juniper: Right. So the authors notice — wait, we don't need to make checkpointing free. We just need to fit the checkpoint work inside the LLM wait. The was going to be idle anyway. Run the dump asynchronously during the network call, and the agent's perceived checkpoint cost is basically the one-millisecond filesystem operation. The expensive part is happening, but it's happening in time you were already losing.

13:32Tyler: My favorite analogy for this is the kitchen during the microwave wait. A bad cook stands there watching the timer count down. A good cook uses those twenty seconds to wash a knife, wipe the counter, start prepping the next dish. is just being a good cook. It does the checkpoint bookkeeping while the is doing its microwave wait on the LLM.

13:54Juniper: And once you see it, you can't unsee it. They're not making checkpointing free. They're hiding it inside the network call you were already waiting on.

14:04Tyler: Although — and I want to flag this now because it's going to come back when we talk about limits — this whole architecture depends on the LLM call being long. The microwave timer being on for twenty seconds is what gives you time to wash the knife. If LLM inference gets dramatically faster — , local small models, sub-second latency — the asymmetry narrows. The thing they're hiding has nowhere to hide.

14:32Juniper: That's a real critique and we should get to it. But let's see the result first, because the result is striking. On — which is the standard benchmark for AI coding , real issues from Django and SymPy and Astropy and a few others — they run tree search with -30B and measure where the time actually goes. The LLM-only floor — the absolute minimum time per instance if state management were literally free — is somewhere around three to five minutes. On with disk snapshots, the actual time is one-point-nine to three-point-eight times that floor. On a more naive baseline called CubeSandbox, it's two-point-six to four-point-three times the floor. On , it's basically the floor — one-point-zero-three to one-point-zero-six.

15:26Tyler: Which is a way of saying: with , the OS plumbing is essentially invisible. You're paying for the LLM and almost nothing else. The gap the paper opens with — the reason isn't deployed — that gap just closes.

15:42Juniper: And the way I'd frame the intellectual move here is: the limits people observe in systems often aren't model limits. They're infrastructure limits. Linear agents exist because anything richer was too expensive in the OS layer. Fix the OS layer and the strategy space opens up.

16:02Tyler: Okay, Juniper, I want to push on the second use case before we get to the critique, because I think it's where the practical impact really lands. The paper isn't just about inference-time search. It's also about training.

16:17Juniper: Yeah, this is the second act and it's the same mechanism doing a different job. Take it.

16:23Tyler: So when you train an policy with reinforcement learning, the basic loop is: spin up a bunch of parallel sandboxes — call it sixty-four of them — each starting from the same warm state. Each runs a . You score the rollouts, update the policy, and repeat. The GPU is doing the rollout generation and the training step. The sandboxes are the setup cost.

16:49Juniper: And the setup cost has to happen every step, because every step starts from a fresh warm state.

16:55Tyler: Right. And in synchronous training, the GPU is idle while the sandboxes get prepared. So the fraction of time the GPU is actually doing useful work is — informally — the model time divided by the model time plus the time. If your sandbox setup is slow, your GPU is idling. And GPUs are expensive.

17:16Juniper: How slow are we talking, on the naive baseline?

17:19Tyler: The standard approach is something like `shutil.copytree` — just copy the whole working directory N times. With sixty-four parallel on a seven-billion-parameter model, that puts GPU utilization at around fifty-one percent. Roughly half the GPU time is wasted waiting for filesystems to copy.

17:39Juniper: And with ?

17:40Tyler: Ninety-nine percent. They sixty-four children off the warm template in about five milliseconds at the median, fifteen at the tail. Per child, about eleven megabytes of resident memory because everything's shared. The GPU stops idling.

17:58Juniper: That's the closing argument right there. You can spend months tuning a training run, or you can fix the substrate and double your effective throughput overnight.

18:10Tyler: And it's the kind of result that I think will quietly matter a lot. RL for policies is one of the most expensive things happening in AI right now. Every hour the GPU sits idle is real money. A two-x throughput multiplier on training is the kind of thing that changes the economics.

18:29Juniper: Okay. So that's the case for the paper. Tyler, you've been holding fire on the critique. Let's hear the steelman.

18:37Tyler: So there are a few things I want to push on. None of them are killers, but they're worth voicing. The first is that the evaluation is concentrated in a narrow band. The workload is — four repositories, Django, SymPy, Astropy, Xarray. The "fat process" they measure is a Django with about three hundred and fifty megabytes of working directory and fifty megabytes of Python heap. That's not small, but it's modest by the standards of, say, a multi-gigabyte ML repo, or an doing long-running data analysis with substantial in-memory state.

19:18Juniper: And the concern there is that `()` cost grows with page table size?

19:23Tyler: Right. Fork is fast, but it's not infinitely fast. The page tables have to be duplicated even when the pages themselves are shared. They acknowledge that cost grows sub-linearly with process size, but the paper doesn't really stress-test what happens at, say, eight or sixteen gigabytes of resident memory. There's a regime where the elegant fork trick might start to creak.

19:50Juniper: That's fair. What else?

19:51Tyler: The trick depends on LLM round-trips staying long. We touched on this earlier. The whole architecture rests on a roughly fifteen-millisecond checkpoint window easily hiding inside a one-to-twenty-second LLM wait. If inference gets dramatically faster — and there's a lot of work right now to make it faster — that hidden time shrinks. The microwave timer goes from twenty seconds to two seconds and now you can't fit the knife-washing inside it.

20:24Juniper: Which is interesting because it means the architecture is partly riding on the current ratio of model latency to OS latency. If that ratio changes, the design might need to change with it.

20:37Tyler: Right. And the paper actually points this out indirectly — in one of their figures, you can see that as LLM inference gets faster, the state-management overhead becomes a larger fraction of the whole. They're not blind to it. But it does mean the headline numbers are somewhat coupled to the current generation of inference latencies.

20:58Juniper: A third one I want to flag — and the paper is honest about this — is that doesn't roll back network I/O or external side effects. If the sends an email, calls an external , writes to a database that lives outside the , the checkpoint can rewind the process but it can't undo the side effect. For coding agents that mostly edit files and run tests, this is fine. For agents that interact with the outside world, the transactional framing has hard limits.

21:29Tyler: That's a real one. And then a smaller, more technical point — the garbage collector that decides which old checkpoints to keep and which to discard is tightly coupled to 's specific selection rules. They build a smart policy that knows MCTS might re-select pruned nodes, so it doesn't garbage-collect those eagerly. But if you're running a different search algorithm — beam search variants, novelty-driven exploration — you'd need to write your own keep-rule. The "general OS abstraction" framing is a little stronger than the implementation quite supports.

22:04Juniper: And one I'd add from the methodology side — their headline comparison to CubeSandbox, one of the closest competing systems, uses a reconstructed version of CubeSandbox rather than the real production system, because the event-level rollback feature in CubeSandbox isn't shipped yet. They're up-front about the substitution and they use what they consider equivalent components, but it's a reconstructed baseline, not the real thing.

22:32Tyler: All of that said — I don't think any of these undermine the core contribution. The headline numbers are real, the mechanisms are clearly described, and the end-to-end results are compelling. These are honest "this is the regime where the design holds, here's where it might not" pushbacks, not a takedown.

22:53Juniper: Agreed. And I think the deeper contribution might actually be the reframing more than the system itself. There's a quiet argument in this paper that the bottleneck on what can do isn't always the model. Sometimes it's the OS underneath. Linear agents exist partly because the kernel makes anything richer too expensive. Fix the kernel piece — give people millisecond-scale and rollback — and a whole strategy space that was theoretically attractive but practically blocked suddenly becomes accessible.

23:29Tyler: And there's a pattern this fits into, which I think is worth naming. We're starting to see a wave of work that asks: what does an OS or a runtime look like when its main user is an LLM rather than a human at a terminal or a long-lived server process? was built for container images. was built for migrating batch jobs. `()` is forty years old. Reflinks were built for filesystem-level deduplication. None of these primitives were designed for high-frequency agent checkpointing. But each one has the right shape for some piece of the problem.

24:08Juniper: And the contribution is the co-design. Getting them to compose. Plus the small but load-bearing modifications — the kernel command that lets swap layers at runtime, the warm template pool, the network proxy daemon that isolates LLM I/O so the forked templates stay clean. None of these pieces alone is the breakthrough. It's the way they fit together that makes the whole thing hold.

24:34Tyler: There's also a thread here about the relationship between search and state that I find genuinely interesting. Classical AI search assumed cheap state — a game-tree node is just a data structure, branching is free. Modern search has expensive state — every node is a whole Linux environment with side effects. The question of how to make rich, mutating state cheaply branchable shows up in serverless cold-starts, in fuzz testing, in time-travel debugging, and now in agent search. is one specific, very effective answer to that recurring question.

25:11Juniper: Let me try to land where this leaves us. Before this paper, if you were building a coding and you wanted to do real backtracking search — try a fix, test it, back up if it fails, try a different one, hundreds of times — you had two choices. You could pay the seconds-per-checkpoint cost and watch your iteration budget evaporate. Or you could give up on backtracking and run linear with no rollback, which is what almost everyone actually does. The five-to-thirty-point accuracy gain from tree search was sitting there, theoretically available, practically unreachable. After this paper, the cost of a checkpoint is about fifteen milliseconds. The cost of a rollback is five. The state-management overhead drops from a majority of trajectory time to a few percent. The accuracy gain is reachable. And on the training side, the GPU utilization on RL fan-out jumps from about half to ninety-nine percent.

26:11Tyler: Which means somewhere in the next few months, the next wave of papers is going to start reporting results that assume this kind of substrate underneath. And the gap between "agents that can do real search" and "agents that can't" is going to start mattering more than the gap between models.

26:30Juniper: That's a good place to leave the substance. The show notes have a link to the paper and some related reading if you want to go deeper on either the OS plumbing or the -search side.

26:43Tyler: And if you want the full transcript with the technical terms defined inline, plus pages that link this episode to the other ones we've done on systems, that's all on paperdive.ai.

26:56Juniper: Thanks for listening to AI Papers: A Deep Dive.