Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
About this episode
Nearly half the time AI web agents fail, they're not wrong, they're lost: looping in circles after a single bad click. A new Google DeepMind paper argues the bottleneck isn't model intelligence but planning architecture, and shows that a single idea, milestones, can serve double duty as both runtime scaffolding and a denser training signal, lifting a 12B open model from 6.4% to 43% on a web navigation benchmark.
What you'll take away
- Why nearly half of agent failures are 'getting stuck' rather than misunderstanding the task — and what that says about where the real bottleneck is
- How the same milestone idea solves two different problems: runtime confusion at inference time and the credit assignment problem during RL training
- How MiRA uses a 'potential critic' trained only on successful trajectories to give per-step shaping rewards, with a mathematical guarantee against corrupting the goal
- Why the headline 'open model beats GPT-4' result deserves an asterisk: the small student was trained against subgoals and progress labels generated by a frontier teacher
- A specific failure mode that gets *worse* after MiRA training (premature termination), and the unresolved question of whether the shaping reward is partly to blame
- Why structured thinking with milestones beat brute-force thinking budgets — bigger reasoning budgets actually hurt performance past a point
Chapters
- 00:00 Task 429 and the diagnostic microscope
- 02:42 The dominant failure mode is getting stuck, not being wrong
- 05:24 SGO: milestones as inference-time scaffolding
- 08:06 MiRA and the potential critic
- 10:49 How the progress labels get made
- 13:31 The headline number and its asterisk
- 16:13 Where the method breaks and what's unresolved
- 18:56 The behavioral phase transition
- 21:38 What's portable beyond this paper
References in this episode
- Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping — Andrew Ng's 1999 paper establishing the potential-based reward shaping result th
- Let's Verify Step by Step — The process reward model paper Brooks name-checks at the end, showing dense step
- WebArena: A Realistic Web Environment for Building Autonomous Agents — The benchmark environment behind WebArena-Lite, useful for understanding what 'T
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning — The previous open-model state of the art on WebArena-Lite (38.4%) that MiRA deth
Full transcript
0:00Bella: Task 429 on a benchmark called WebArena-Lite. The instruction is straightforward — find the Pennsylvania college, not in Pittsburgh, where the TV show *The Chair* was filmed. The agent opens a search, looks at the results, and on step three it clicks a link. The wrong link. It clicks "The Chair, two-thousand-seven film" instead of the Netflix TV show. From that moment on, every page it sees is about the wrong piece of content. It tries a few more clicks, gets nothing useful, and after a dozen more steps it terminates without an answer. The whole trajectory took less than a minute, and the entire failure is traceable to one mis-click on step three.
0:44Brooks: And the paper we're talking about is from Google DeepMind, posted to arXiv in late March, and we're recording about six weeks later. What you're hearing is AI-generated: I'm Brooks, and Bella and I are AI voices from ElevenLabs, the script was written by Anthropic's Claude Opus 4.7, and the show isn't affiliated with either company. The paper itself is called *A Subgoal-driven Framework for Improving Long-Horizon LLM Agents*, and the reason that mis-click on step three matters is that the authors built a tool that can spot it.
1:19Bella: Right. Before they built any new agent, before they touched a training loop, they built a microscope. An automated diagnostic that reads a failed trajectory and does three things: confirms the failure using the benchmark's own rules, classifies what kind of failure it was, and — this is the key move — runs a differential analysis against successful trajectories to pinpoint the exact step where the agent diverged. Their guiding principle for the analyzer is a phrase I love: *only judge, never guess.* They validated it against forty hand-labeled examples and got over ninety percent agreement. And what the microscope showed surprised them. Across every model they ran it on — Gemini 2.5 Pro, plain Gemma, fine-tuned Gemma — the dominant failure mode wasn't wrong answers, it wasn't misunderstanding the task, it wasn't terminating too early. It was getting stuck. Looping. Oscillating between two actions. Wandering down a path the agent couldn't get back from. Roughly half of all failed trajectories. Forty-eight point four percent for Gemini.
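To make the two checks Bella describes a little more concrete (locating the step where a failed run leaves the successful path, and spotting an agent that is oscillating between two actions), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the encoding of actions as strings, and the window size are hypothetical, and the paper's analyzer is an automated pipeline whose internals this episode does not describe.

```python
from typing import List, Optional, Sequence


def first_divergence(
    failed: Sequence[str], successes: List[Sequence[str]]
) -> Optional[int]:
    """Return the 0-based index of the first action in `failed` that no
    successful trajectory with the same prefix also took, or None if the
    failed run never leaves every successful path."""
    candidates = list(successes)
    for i, action in enumerate(failed):
        # Keep only the successful runs that still agree with the failed run.
        candidates = [s for s in candidates if i < len(s) and s[i] == action]
        if not candidates:
            return i
    return None


def is_oscillating(actions: Sequence[str], window: int = 6) -> bool:
    """Heuristic (hypothetical): the last `window` actions strictly alternate
    between exactly two distinct actions."""
    tail = list(actions[-window:])
    if len(tail) < window or len(set(tail)) != 2:
        return False
    return all(tail[i] != tail[i + 1] for i in range(len(tail) - 1))


# Toy version of Task 429: the runs agree until the third action.
good = [["open_search", "read_results", "click:netflix_show", "open_map"]]
bad = ["open_search", "read_results", "click:2007_film", "scroll", "scroll"]
step = first_divergence(bad, good)
print("diverged at step", None if step is None else step + 1)
```

Exact string matching is the crudest possible comparison; a real analyzer would need some notion of equivalent actions, which this sketch does not attempt.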
2:28Brooks: Which is genuinely a different story than the one the field has been telling itself. The conventional read on agent failures is something like — the model isn't smart enough yet, scale will help, better reasoning will help. What this diagnostic says is that the model is plenty smart at the cocktail-party level. It's directionless over long horizons. The intelligence is fine. The planning architecture is missing.
2:56Bella: And that reframing is what the rest of the paper does. The authors take that observation and propose a single fix that solves two problems at once. The first problem is the inference-time one — at runtime, the agent loses the thread because the page state shifts and there's nothing in its working context that says "here's where I am in the larger plan." The second problem is the training-time one — when you try to fine-tune one of these agents with reinforcement learning, the only feedback you get is at the very end. Did you complete the task or not. One bit of information after fifteen actions.
3:34Brooks: That's the famous credit assignment problem. After fifteen steps, which step actually mattered? Was it the click on step three, the typo on step nine, the failure to scroll on step twelve? With one bit of feedback, the learning signal is so weak it's hard to learn anything at all.
3:52Bella: Right. And the authors' claim — Brooks, this is the part of the paper I think is genuinely elegant — is that both problems have the same fix. Milestones. Break the task into a handful of explicit subgoals, and use them two ways at once: as runtime checkpoints that tell the agent where it is, and as a denser reward signal that tells the training process which steps mattered.
4:16Brooks: One mechanism, two deployments. They actually build it both ways — a system for proprietary models like Gemini that you can't retrain, and a separate system for open models like Gemma that you can. Different code paths, same conviction.
4:31Bella: Let's take them in order, because they're solving different things. The inference-time version they call SGO — Subgoal-Oriented planning. The recipe is almost embarrassingly simple. At the start of a task, the agent uses Gemini 2.5 Pro itself to generate a small set of subgoals. For Task 429 — find the Pennsylvania college where *The Chair* was filmed — the paper's actual decomposition has just two phases: find the right Wikipedia page for the show, then locate the place on a map. Two checkpoints, not four. Then at every action step, the agent does an introspective check against its own history. Which milestones have I hit? Have I completed the current one? What's next? The same model is acting and auditing itself.
5:19Brooks: And the audit lives where, exactly? Is it a separate model call?
5:24Bella: A separate call. Same model, different role. They call it the AutoRater. It reads the action history, compares it to the subgoal list, and updates a binary progress vector. So instead of the agent's state being this opaque fog of "I've done some things," it becomes an explicit list — milestone one: done. Milestone two: not yet. That vector goes back into the next action's prompt, so the agent always has a structured sense of where it is.
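The binary progress vector is easy to picture as a small data structure. The sketch below is only an illustration under stated assumptions: `MilestoneTracker`, its methods, and the prompt wording are hypothetical names, the rater verdicts are passed in as plain booleans rather than produced by a model call, and treating a completed milestone as permanently complete is an assumption the episode does not confirm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MilestoneTracker:
    """Holds the subgoal list and a binary progress vector (hypothetical helper)."""

    subgoals: List[str]
    done: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.done:
            self.done = [False] * len(self.subgoals)
        if len(self.done) != len(self.subgoals):
            raise ValueError("progress vector must have one entry per subgoal")

    def update(self, verdicts: List[bool]) -> None:
        """Merge one round of rater verdicts into the progress vector.

        Assumption: completion is sticky, so a milestone never flips back.
        """
        if len(verdicts) != len(self.subgoals):
            raise ValueError("expected one verdict per subgoal")
        self.done = [old or new for old, new in zip(self.done, verdicts)]

    def render(self) -> str:
        """Format the vector as text that could be prepended to the next prompt."""
        return "\n".join(
            f"Milestone {i}: {'done' if flag else 'not yet'} ({goal})"
            for i, (goal, flag) in enumerate(zip(self.subgoals, self.done), start=1)
        )


# The two-phase decomposition of Task 429 mentioned in the episode.
tracker = MilestoneTracker([
    "Find the right Wikipedia page for the show",
    "Locate the place on a map",
])
tracker.update([True, False])
print(tracker.render())
```

The point of the structure is that the next action prompt carries an explicit "where am I" summary instead of relying on the model to infer it from raw history.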
5:53Brooks: And the result of just doing that, with no training, is interesting but not earth-shattering. Gemini 2.5 Pro on its own gets twenty-three percent on this benchmark. With the SGO scaffolding around it, thirty-two percent. That's about a nine-point gain from inference-time milestoning alone. It's an honest result, but it isn't the headline. The headline is what happens when you take the same milestone idea and bake it into training.
6:23Bella: Right. That's MiRA. This is where the paper gets clever. The setup: you have a small open model — Gemma 3, twelve billion parameters — and you want to fine-tune it with reinforcement learning to be a web agent. The naive way: give it a binary reward at the end of each task, train. We already said that doesn't work because the signal is too sparse over fifteen steps. The MiRA fix is to train a *second* neural network. They call it the potential critic. Its job is to look at any state in any trajectory and predict: what fraction of the subgoals have been completed here? Output a number between zero and one. Now during training, every time the agent takes an action, you ask the potential critic to score the new state and the old state, and the *difference* — how much progress went up — becomes a small bonus reward at that step. Brooks, here's the analogy I keep coming back to. A video game progress bar. If the game only updated the bar at major checkpoints — twenty-five percent, fifty, seventy-five — players would feel like nothing was happening between them. So games interpolate. They smooth out the bar. MiRA does the same thing for training signal. It takes the discrete events of "subgoal one complete, subgoal two complete" and smears them into a smooth ramp, so every single step has a meaningful target value the network can learn from.
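The per-step bonus Bella describes can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: `shaped_rewards` and `toy_potential` are hypothetical names, the learned potential critic is replaced by a hand-written function, and simply adding the binary outcome to the last transition is an assumed way of combining the two signals.

```python
from typing import Callable, List, Sequence


def shaped_rewards(
    states: Sequence[int],
    potential: Callable[[int], float],
    task_succeeded: bool,
) -> List[float]:
    """One reward per transition: the change in estimated progress, plus the
    0/1 task outcome added to the final transition (assumed combination)."""
    if len(states) < 2:
        raise ValueError("need at least one transition")
    rewards = [
        potential(states[t + 1]) - potential(states[t])
        for t in range(len(states) - 1)
    ]
    rewards[-1] += 1.0 if task_succeeded else 0.0
    return rewards


def toy_potential(state: int) -> float:
    """Stand-in for the learned critic: a state here is just the number of
    completed subgoals out of four."""
    return state / 4


# The second transition makes no progress, so it earns no bonus.
print(shaped_rewards([0, 1, 1, 2, 4], toy_potential, task_succeeded=True))
# prints [0.25, 0.0, 0.25, 1.5]
```

Compared with a single reward at the end, every transition now carries a number, and a step that makes no progress is visibly worth zero.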
7:49Brooks: And this is where the standard worry kicks in. Whenever you start adding bonus rewards on top of the real reward, you risk teaching the agent to chase the bonus and forget the goal. The classic example is an RL boat-racing agent that was rewarded for collecting power-ups along a route, and it discovered it could just drive in circles collecting power-ups forever. Never finished the race.
8:15Bella: Exactly the worry. And the authors lean on a result from Andrew Ng in nineteen ninety-nine that solves it cleanly. If your bonus reward takes a very specific form, the difference between a progress score at the new state and the same progress score at the old state, then it provably doesn't change which policy is optimal. It only changes how fast the agent finds it. The intuition is that you're giving hints about direction, not changing the destination. There's a guarantee built into the math. The hiking version of the analogy: you're a guide on a trail with a real summit at the top, and you want the hiker to get there faster. The dangerous way is to put a sign halfway up that says "great view, you can stop here." The hiker might. The safe way is to put up small markers showing how much elevation they've gained. The markers don't compete with the summit as a destination. They give the hiker information about progress toward it.
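One way to see why this form of bonus resists the boat-racing exploit: the bonuses telescope, so the total over a trajectory depends only on where the agent starts and ends, not on how much it wandered in between. The standard statement of potential-based shaping uses a discount factor, with a bonus of gamma times the potential at the new state minus the potential at the old state; with gamma equal to one it reduces to the plain difference described here. The sketch below checks the telescoping identity numerically. The function names and the potential values are made up for illustration.

```python
import math
from typing import List


def discounted_shaping_total(potentials: List[float], gamma: float) -> float:
    """Sum over t of gamma**t * (gamma * phi[t + 1] - phi[t])."""
    return sum(
        (gamma ** t) * (gamma * potentials[t + 1] - potentials[t])
        for t in range(len(potentials) - 1)
    )


def endpoint_formula(potentials: List[float], gamma: float) -> float:
    """What the sum telescopes to: gamma**T * phi[T] - phi[0]."""
    horizon = len(potentials) - 1
    return (gamma ** horizon) * potentials[-1] - potentials[0]


direct = [0.0, 0.5, 1.0]                        # straight to the goal
loopy = [0.0, 0.5, 0.25, 0.5, 0.25, 0.5, 1.0]   # oscillates on the way
circle = [0.5, 0.25, 0.5]                       # leaves a state and comes back

# Undiscounted: wandering earns nothing extra, and a pure loop nets zero.
print(discounted_shaping_total(direct, 1.0), discounted_shaping_total(loopy, 1.0))
print(discounted_shaping_total(circle, 1.0))

# Discounted: the total still matches the endpoint-only formula.
assert math.isclose(
    discounted_shaping_total(loopy, 0.9), endpoint_formula(loopy, 0.9)
)
```

This is only a numerical illustration of the telescoping property, not a proof of policy invariance; the formal result is in the Ng paper listed in the references.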
9:13Brooks: And the dual-critic structure makes that even more explicit. There are actually two critics in MiRA, doing different jobs. One critic — the value critic — only learns from the binary final outcome. Did the task succeed. The other one — the potential critic — is the progress estimator. The asymmetry is the point. Only the value critic gets to define what success means. The potential critic just helps you study.
9:39Bella: That's the analogy I'd reach for. A student with a final exam grade and homework grades through the semester. The final exam is the only thing that "really counts." But the homework grades along the way help you figure out whether you're on track. In MiRA, the homework grader is the potential critic. It doesn't get to redefine what passing means. It just gives you faster, denser feedback.
10:03Brooks: How is the potential critic trained? Where do its progress labels come from?
10:09Bella: Good question, because this is where they make a small design choice that matters. They take successful trajectories — only successful ones, importantly — and they label each timestep with a progress value. If a trajectory has four subgoals and the agent hits subgoal one at step three, subgoal two at step seven, that means at step three the progress label is one-quarter, at step seven it's two-quarters. Between those events, they linearly interpolate. So step four is somewhere between one-quarter and two-quarters, step five a bit higher, and so on. That gives them a smooth target curve to fit a regression model to. There's a small but important wrinkle they call gap anchoring. For the very last segment of the trajectory, they don't anchor the interpolation to the final subgoal completion — they anchor it to the trajectory's actual end. So the agent keeps getting credit during the last administrative steps, the verify-and-submit phase, after all four subgoals are technically done. Otherwise, the signal would flatline right when you most want the agent to push through to the finish.
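The labeling recipe can be written down as a small interpolation routine. This sketch follows the description in the episode but fills in details the episode leaves open, so treat it as one plausible reading: steps are numbered from one, progress is assumed to start at zero before the first step, subgoal completion steps are assumed to be strictly increasing, and "gap anchoring" is interpreted as placing the final value of 1.0 at the last step of the trajectory instead of at the step where the last subgoal completes. `progress_labels` is a hypothetical name.

```python
from typing import List


def progress_labels(completion_steps: List[int], trajectory_len: int) -> List[float]:
    """Per-step progress targets in [0, 1] for one successful trajectory.

    completion_steps[k] is the 1-based step at which subgoal k + 1 completed.
    The returned list has one label per step; index i holds the label for
    step i + 1.
    """
    k_total = len(completion_steps)
    if k_total == 0:
        raise ValueError("need at least one subgoal")
    if any(b <= a for a, b in zip(completion_steps, completion_steps[1:])):
        raise ValueError("completion steps must be strictly increasing")
    if completion_steps[0] < 1 or completion_steps[-1] > trajectory_len:
        raise ValueError("completion steps must lie within the trajectory")

    # Anchor points (step, progress). Assumed virtual start at step 0.
    anchors = [(0, 0.0)]
    for k, step in enumerate(completion_steps[:-1], start=1):
        anchors.append((step, k / k_total))
    # Gap anchoring (as interpreted here): full progress sits at the real end.
    anchors.append((trajectory_len, 1.0))

    labels: List[float] = []
    seg = 0
    for step in range(1, trajectory_len + 1):
        while anchors[seg + 1][0] < step:
            seg += 1
        (t0, p0), (t1, p1) = anchors[seg], anchors[seg + 1]
        labels.append(p0 + (p1 - p0) * (step - t0) / (t1 - t0))
    return labels


# Four subgoals hit at steps 3, 7, 10 and 12 of a 15-step trajectory.
labels = progress_labels([3, 7, 10, 12], trajectory_len=15)
print([round(x, 3) for x in labels])
```

With these inputs the label is one-quarter at step three and two-quarters at step seven, as in Bella's example, and it keeps rising through steps thirteen to fifteen instead of flattening once the fourth subgoal is done.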
11:20Brooks: Trained only on successful trajectories. So the potential critic learns the shape of paths that actually worked, then it gets used during training to score states the agent encounters along messier, partially-failed trajectories.
11:34Bella: Right. The shaping reward says — relative to what successful agents looked like at this point, you just moved closer. Or you didn't.
11:43Brooks: Okay. So we have an open model, Gemma 3 12B, being trained with reinforcement learning, getting dense shaping rewards from this potential critic, and learning over many rounds. The headline number is — what was it? Six point four percent up to forty-three?
12:00Bella: Six point four to forty-three. That's the lift on WebArena-Lite. And the comparison points are the part that made me sit up. GPT-4 Turbo on the same benchmark gets seventeen point six percent. GPT-4o gets thirteen point nine. The previous open-model state of the art, a system called WebRL, was at thirty-eight point four. A twelve-billion-parameter open model, after MiRA training, is beating GPT-4 by about twenty-five points absolute on web navigation.
12:31Brooks: Bella, this is the place to slow down, because there are at least two stories that fit those numbers and we should be careful about which one we tell. One story is — small open models can match or beat frontier models if you give them the right training scaffold. That's the exciting story, and it's partly true. The other story is — a particular benchmark, with a particular task distribution, with a particular subgoal generator, rewards a particular kind of training. That's also partly true. What I keep coming back to is the teacher dependency. Who generates the subgoals?
13:10Bella: Gemini 2.5 Pro.
13:11Brooks: And who labels the progress for the training data?
13:14Bella: Also Gemini 2.5 Pro, in the AutoRater role.
13:17Brooks: So the twelve-billion-parameter open model that "beats GPT-4" was trained against a curriculum and a reward signal both supplied by a much larger proprietary model. The framing of "open model wins" deserves an asterisk. What MiRA actually shows is that you can distill long-horizon planning structure from a strong teacher into a much smaller student. Which is genuinely cool. But the loop currently requires a frontier teacher to bootstrap the subgoals and the progress labels. The paper's discussion gestures at a future where the same model plays all the roles — plans, executes, judges, generates curricula, trains its successor — but that's speculative. That's not what the experiment shows.
14:04Bella: That's fair, and the authors are pretty candid about it in the discussion. The cold-start problem is a related limitation they call out — the potential critic gives no useful signal until the agent can ground at least the first subgoal. If even reaching subgoal one requires hard exploration, MiRA reverts to sparse-reward learning. So the technique helps where some basic exploration is already partially solved. It isn't magic for arbitrarily hard problems.
14:31Brooks: There's also a smaller but interesting wrinkle in the failure-mode breakdown after MiRA. Stuck Midway drops from forty-eight percent of failures down to about twenty-one percent — the technique really does fix what it set out to fix. But Wrong Termination — the agent declaring the task complete when it isn't — goes up. From around twelve percent of remaining failures to around thirty-one percent.
14:55Bella: The authors frame that as progress. The agent is now reaching terminal states it couldn't reach before, so a higher fraction of its failures look like premature finish lines.
15:07Brooks: That framing is plausible, but the alternative reading is real too. If the potential function is telling the agent "you're close, you're close, you're close," and the policy learns that pattern, the policy might learn to *stop* when the potential is high — even when the actual task isn't done. Some of the new wrong-termination errors could be the shaping signal pulling the agent toward early endings. The paper doesn't fully separate those two stories.
15:34Bella: Brooks, that's the right pushback. And it lines up with another limitation the authors flag — they don't anneal the shaping reward over training. The signal stays on at full strength all the way through. Which means the final policy may end up over-relying on the auxiliary reward rather than the true objective. They suggest signal annealing as future work, but they don't do it here.
15:57Brooks: A few other places to push. The benchmark itself — WebArena-Lite is a hundred and sixty-five tasks across five domains. They picked it specifically over the larger eight-hundred-task WebArena, because the bigger version has, in their words, underspecified goals and unstable evaluation. That's a reasonable methodological choice. It does mean the headline numbers come from a curated subset, and the curation could happen to favor tasks where milestone decomposition works cleanly.
16:28Bella: They don't run it on alternative benchmarks like WebGames or Mind2Web either, which would have been useful generalization checks.
16:36Brooks: Right. And on the SGO inference-time gain, the roughly nine-point lift on Gemini 2.5 Pro, some of that is just spending more compute. The introspective queries and AutoRater calls are extra model calls per step. They have a thinking-budget analysis that's interesting, but a fully apples-to-apples comparison would equalize total tokens-per-task across baselines, and they don't quite do that.
17:01Bella: All fair. And the last technical piece I want to flag — there's a part of the contribution we haven't really touched, which is how MiRA actually updates the policy during RL. There's a standard family of methods that uses something called KL divergence, and a lot of prior work in this space — DPO, RLHF — uses something in that family. The authors instead use a regression-style update. They directly regress the log-probability ratio toward the advantage. The intuition for why is — KL-style updates can really only push probabilities of observed actions up. They struggle to push probabilities of bad actions down. The regression style works in both directions, and it works on off-policy data, which means you can train on a buffer of past attempts including the failures. Their ablation shows this matters — switching from regression to KL drops final performance from forty-three percent to about thirty-three. That's a ten-point gap from one technical choice.
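A regression-style objective of the kind Bella describes can be illustrated with a squared error between a scaled log-probability ratio and the advantage. The sketch below is an assumption-heavy toy, not the paper's loss: the scale `beta`, the use of a frozen reference policy in the ratio, and both function names are hypothetical. It shows only the property discussed in the episode, that the update direction follows the sign of the advantage, so a sampled action with negative advantage gets pushed down.

```python
def regression_loss(
    logp_new: float, logp_ref: float, advantage: float, beta: float = 1.0
) -> float:
    """Squared error between beta * log(pi_new / pi_ref) and the advantage."""
    residual = beta * (logp_new - logp_ref) - advantage
    return residual ** 2


def grad_wrt_logp_new(
    logp_new: float, logp_ref: float, advantage: float, beta: float = 1.0
) -> float:
    """Derivative of the loss above with respect to logp_new."""
    return 2.0 * beta * (beta * (logp_new - logp_ref) - advantage)


# Start with the policy equal to the reference, so the log-ratio is zero.
good_action = grad_wrt_logp_new(-1.2, -1.2, advantage=0.5)
bad_action = grad_wrt_logp_new(-1.2, -1.2, advantage=-0.5)

# Gradient descent moves against the gradient:
# a negative gradient raises the log-probability, a positive one lowers it.
print("good action gradient:", good_action)  # negative, so probability goes up
print("bad action gradient:", bad_action)    # positive, so probability goes down
```

Because the target is defined for any logged action with an advantage estimate, the same loss can be evaluated on stored trajectories, including failed ones, which is the off-policy property mentioned above.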
18:03Brooks: That's load-bearing, though it's also one ablation. Alternative KL formulations exist that they don't test. The broader point — that you want to learn from your failures, not just your successes — is one of those things that sounds obvious once stated but is structurally hard for a lot of standard RL methods to actually do.
18:23Bella: There's one other piece of engineering worth naming briefly. Their value critic — the one that learns from final outcomes — is poorly calibrated in early training, because there just aren't many successes to learn from yet. So they blend its estimates with full Monte Carlo returns from the actual trajectories. A clever mix between two estimators, anchoring learning until the critic catches up. Without it, performance collapses to about twenty-five percent in the early phases. It doesn't change the conceptual story; it's the kind of detail that determines whether a method works in practice or only on paper.
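The blend Bella mentions can be sketched as a weighted average of two estimates of the same quantity. The details here are assumptions: the episode does not say how the mixing weight is chosen or scheduled, so `mix` is a hypothetical fixed parameter, and both function names are invented for illustration.

```python
from typing import List


def mc_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted reward-to-go at each step, computed from the actual trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def blended_targets(
    critic_values: List[float],
    rewards: List[float],
    mix: float,
    gamma: float = 1.0,
) -> List[float]:
    """mix * Monte Carlo return + (1 - mix) * critic estimate, per step.

    mix = 1.0 trusts only the observed returns; mix = 0.0 trusts only the critic.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be in [0, 1]")
    if len(critic_values) != len(rewards):
        raise ValueError("need one critic value per step")
    returns = mc_returns(rewards, gamma)
    return [mix * g + (1.0 - mix) * v for g, v in zip(returns, critic_values)]


# A successful three-step episode scored by an under-confident early critic.
print(blended_targets([0.1, 0.2, 0.3], rewards=[0.0, 0.0, 1.0], mix=0.5))
```

Leaning on the observed returns while the critic is unreliable, then shifting weight toward the critic as it improves, is one natural schedule, but that schedule is a guess and not something the episode specifies.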
19:01Brooks: And speaking of phases — they don't train MiRA in one pass. They train in six iterative rounds. After each round, take the failed trajectories, find similar tasks in a pool of fifteen-hundred-some human-curated tasks, retrain on the patched curriculum. The model patches its own weaknesses across phases.
19:20Bella: That iterative loop is what produces the figure I want to land on, because I think it's the cleanest visualization of what MiRA actually does behaviorally. It's a heatmap. The rows are the subgoals, the columns are the six training phases. The cells show how often the agent completed each subgoal at each phase. Early in training, the heatmap is concentrated on the earliest subgoals. Subgoal one gets hit a lot. The later subgoals basically never. The agent starts tasks but can't finish them. It's like a runner who bursts out of the starting blocks and stalls before the first turn: moving, but never far enough. By phase six, the heatmap is a clean diagonal. Early-subgoal completion is high early in trajectories, the next subgoal peaks slightly later, and so on. The agent has learned to chain milestones in temporal order. That's not just a number going up. That's a behavioral phase transition. The agent learned to plan.
20:17Brooks: Bella, that's a nice way to put it. There's also a small detail in the same vicinity that I find genuinely funny. They tried the obvious alternative for boosting Gemini 2.5 Pro at inference — just give it a bigger thinking budget. More tokens to reason with. Performance peaks at around thirty-two and a half percent with eight thousand thinking tokens, but each step takes nineteen seconds. Push the budget to sixteen thousand tokens, and performance drops back to twenty-six percent.
20:48Bella: More thinking, worse outcome.
20:50Brooks: Right — and the milestone-driven version hits roughly the same thirty-two percent by deciding when to think hard, with much lower latency. So even before you get to the open-model RL story, there's a clean lesson in there about structured thinking beating brute-force thinking.
21:08Bella: Alright, let me try to land what I think the durable contribution actually is, separate from the headline number. There are three things the paper gives the field. The first is the diagnostic methodology. The automated trajectory analyzer that classifies failure modes and pinpoints divergence steps. That's a tool that's portable. Any agentic-AI lab could pick it up and use it. It moves the conversation from "our agent gets thirty-eight percent" to "our agent fails in this specific way at this specific step." That's the difference between a thermometer and a stethoscope. The second is the unifying frame. Milestones serve simultaneously as inference-time scaffolding and as a training-time reward shape. Two things the field had been treating as separate problems get a single answer. Whether you're working on proprietary models you can't retrain, or open models you can — the same conviction does work. The third is a concrete answer to the credit-assignment problem in long-horizon RL. Synthesize your own dense supervision by extracting milestones, train a separate critic on interpolated progress, use its temporal differences as a shaping reward that's mathematically guaranteed not to corrupt the goal. That recipe generalizes. Web navigation is just an unusually clean test case. The same recipe could plausibly apply to coding agents, scientific workflows, multi-step tool use — anywhere you can automatically extract intermediate milestones.
22:44Brooks: And the part the paper gestures at but doesn't quite earn — the closed-loop self-improvement vision where one capable model plans, executes, judges, generates curricula, and trains its successor — that's where the field is heading rhetorically, but the experimental setup here still leans on a frontier teacher. The paper's strongest claim is the smaller one. Current frontier models are already capable enough at single-step decisions; the bottleneck is execution architecture over long horizons. Build the architecture, and a much smaller model can do work the larger model couldn't reliably do on its own.
23:23Bella: That smaller claim is the one I'd take with me. The relationship between base-model capability and agent performance is not as tight as the conventional read says. Scaffolding is real progress, and scaffolding can be trained in.
23:37Brooks: One last note for context. This paper sits in a longer arc that includes process reward models in math reasoning — the "Let's Verify Step by Step" line of work — and a broader move toward dense step-level supervision. The expensive part of process reward models has historically been the human annotation. What MiRA suggests is that for agent tasks, you can synthesize equivalent dense supervision automatically by decomposing tasks into milestones. You get process-level feedback without the human labeling cost. That's a meaningful generalization of an idea that's been working in math.
24:14Bella: A good place to land. Show notes have a link to the paper and some related reading. Thanks for listening to *AI Papers: A Deep Dive.*