Episode 189 · Jul 02, 2026 · 24 min

Why Phone Agents Ace the Test and Crash on Your Actual Phone

Team, Qu, Luan

GUI Agents Mobile AI
Paper: Xiaomi-GUI-0 Technical Report
Venue: arXiv:2606.31410
Year: 2026
Read the paper: arxiv.org/abs/2606.31410

An open AI model scores 70% on the industry-standard phone-control benchmark, and 33% the instant you put it on a real device. This episode unpacks how Xiaomi doubled that real-world number by doing the counterintuitive thing: hunting for their agent's failures on hundreds of physical phones and treating the wreckage as the most valuable training data they had.

What you'll take away

  • Why standard benchmarks systematically overstate performance — and why the abnormal states you most need to train on (login walls, fraud checks, captchas) can't be reproduced in a simulator at all
  • The 'failure flywheel': instead of keeping successes and discarding failures, Xiaomi mines failures for recovery data, keeping the wrong step in the model's context so it learns to climb back from a mess it already made
  • How a teacher model with 'dual controls' grabs the wheel only when the student drifts, then hands control back — producing recovery that success-only corpora can never contain
  • The three-stage training that refuses to reward clever reasoning until basic format and validity checks pass — dense feedback first, sparse full-task feedback last
  • Why basic UI operations are now saturated (everyone scores ~100%) while Safety and Reflection — knowing when NOT to proceed — remains unsolved across every model, frontier systems included
  • The honest catch: the headline 72%-vs-33% gap is measured on a benchmark the same team designed, built, and scored, and the recovery behavior is distilled from a stronger closed model

Chapters

  1. 01:53 Why does the lab lie?
  2. 04:21 Turning a phone farm into a classroom
  3. 06:15 Keep the mistake in the context
  4. 10:02 Grading format before genius
  5. 15:23 Does any of it actually work?
  6. 18:28 Measured on a yardstick they own
  7. 21:38 Is reality the only honest teacher?


0:00Juniper: A student driver aces the test — perfect marks on the closed course, every maneuver clean. Then you hand them the keys on a real road, at night, in the rain, and they're in a ditch before the first mile. That is not a story about driving. It's the story of nearly every mobile AI that crushes its benchmark and then falls apart on your actual phone. And the numbers make it concrete. An open model scores seventy percent on the industry-standard phone-control test — and thirty-three percent when you put it on a real device with a real account. Same kinds of tasks. Half its success, gone the instant it touches reality.

0:40Eric: Quick note before we start: this is an AI-made explainer, both voices included.

0:46Juniper: That collapse is the whole subject of today's paper, a technical report out of Xiaomi called Xiaomi-GUI-0. And the claim underneath it stings a little: the benchmarks we've all been celebrating are testing in a world that doesn't exist. By the end you'll understand how they dragged that number from the low thirties up to seventy-two, and why the fix was to stop treating a phone's messy failures as a deployment headache and start treating them as the single most valuable training data they had.

1:18Eric: And here's why this matters even if you don't follow this field at all. The next wave of AI isn't the chatbot. It's the agent that actually does things. Books the train. Manages the cart. Pays the bill. The bottleneck may not be how smart the model sounds in a demo. It's whether it survives a captcha, an expired login, a fingerprint prompt on the payment screen: all the junk that never shows up in the lab.

1:45Juniper: So let's start with the obvious question: why does the lab lie? A GUI agent, quickly: it's an AI that drives your phone the way you do. It looks at the screen as an image, reads your instruction, "add these earbuds to my cart", and then taps, swipes, and types. No special back-door access from the app maker. It uses the same buttons you do.

2:07Eric: Which is exactly why it's so exposed. If you had a clean programmatic hook into the app, you'd never see a captcha. But this thing is at the mercy of whatever pixels actually land on screen.

2:19Juniper: Right. And the standard benchmarks run inside emulators: simulated phones on a server. Cheap, resettable, and sanitized. Clean pages, predetermined states, and every task politely starts from the home screen. Real phones are a hostile, shifting environment. Your session expires mid-task. A permission dialog jumps in front of you. A risk-control system flags you as suspicious. And the cruelest part: many commercial apps actively detect emulators and refuse to run inside them, specifically to block this kind of automation.

2:53Eric: Which creates this trap the paper leans on hard. The abnormal states you most need to train an agent for, the login walls, the fraud checks, are the exact ones you physically cannot reproduce in a simulator. The apps won't even open there.

3:09Juniper: So the reframe is the whole paper in one move. Instead of training on clean, successful demonstrations in emulators and praying the model survives the real world, Xiaomi made hundreds of physical phones the primary training environment, and built the infrastructure to harvest the agent's own failures on those phones as the core signal. The headline: seventy-two percent on their real-device benchmark, roughly double the best comparable open model, and competitive with much larger closed systems.

3:40Eric: I want to flag one thing early, because a sharp viewer is already forming it. That seventy-two-versus-thirty-three gap — the star result — is measured on a benchmark this same team designed and built. That doesn't sink it, but hold onto it. We'll come back to exactly how much it can bear.

4:00Juniper: Fair flag, Eric. Let's earn the number before we stress-test it. And it starts with the least glamorous part: the hardware. Because you cannot run a failure-harvesting loop on real devices if you don't have a fleet of real devices under programmatic control. So they built one: hundreds of physical phones spanning nearly ten brands, dozens of tablets, even in-vehicle systems, covering the hundred most-trafficked apps. These are Chinese apps, by the way: think of them as roughly their TikTok, their Amazon, their Google Maps. Real work was done on real apps, so we'll keep the names, but map them as we go.

4:42Eric: And real phones are miserable to run at scale, right? They go offline, they lose login state, they overheat into cooldown.

4:50Juniper: Exactly, which is why it's a hybrid. They profile each app by behavior. Apps that run fine under virtualization get routed to a pool of sandboxes for cheap, reproducible bulk work. Apps that demand real accounts and live networks stay on physical hardware. One scheduling detail is genuinely clever: instead of pushing a task out to a device, idle devices pull tasks that match their current state: the right apps installed, logged in, low on risk flags. So you never assign a job to a phone that's about to drop offline. And there's a low-latency remote channel where a human can reach in, tap through an expired login or clear a captcha, and keep a device warm and schedulable.
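A minimal sketch of the pull-based scheduling idea described above. Everything here is an illustrative assumption rather than the paper's implementation: the `Task` and `Device` fields, the `max_risk` threshold, and the first-match policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    app: str
    needs_login: bool = True


@dataclass
class Device:
    device_id: str
    installed: set = field(default_factory=set)
    logged_in: set = field(default_factory=set)
    risk_flags: int = 0  # hypothetical counter of risk-control warnings
    online: bool = True

    def can_run(self, task: Task, max_risk: int = 1) -> bool:
        """True only if this device's current state fits the task."""
        if not self.online or task.app not in self.installed:
            return False
        if task.needs_login and task.app not in self.logged_in:
            return False
        return self.risk_flags <= max_risk


def pull_task(device: Device, queue: list):
    """Called by an idle device: claim the first queued task it can run right now."""
    for i, task in enumerate(queue):
        if device.can_run(task):
            return queue.pop(i)
    return None  # nothing suitable; the device stays idle


queue = [Task("t1", "shopping"), Task("t2", "maps", needs_login=False)]
phone = Device("d1", installed={"maps"})
claimed = pull_task(phone, queue)
print(claimed.task_id if claimed else None)
```

Because the device decides what it can take, a phone without the right app or login never receives that task in the first place; the unclaimed task simply waits in the queue.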

5:37Eric: So the infrastructure exists to guarantee one thing: that data collection, training, and evaluation all share the same messy distribution as real deployment. Which sets up the part that actually makes this paper interesting. You've got a farm of phones and an agent that fails on them constantly. What do you do with the failures?

6:00Juniper: And this is where I hand it to you, because the answer inverts everything.

6:05Eric: It does. Think about how most data pipelines work: the conventional flywheel. The agent runs, you keep the successful trajectories, you throw away the failures, you retrain on the wins. It's a student who only ever re-reads the questions they already got right. Xiaomi does the reverse. They deliberately go hunting for failures, and they mine them for something a success-only corpus can never contain: how to notice you've gone wrong, and how to climb back. There are two pieces to it. The first is a rule they call the "first key error." When a task fails, the mistakes cascade. One wrong tap drops you on the wrong page, and now every action after that is conditioned on a broken state, so most of the later errors aren't real decisions, they're just consequences of the first wrong turn. Take the wrong highway exit and every turn after is off, even if each one is "correct" given where you wrongly are. So annotators replay a failed run, find only the first decisive wrong step, and supply the right action plus a one-line reason.
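One way to picture a "first key error" training sample is sketched below. The field names, the `is_key_error` flag, and the choice to make the post-error screen the prediction point are assumptions for illustration, not the paper's schema. The one detail taken from the episode is that the wrong step stays in the history.

```python
from dataclasses import dataclass


@dataclass
class Step:
    screen: str
    action: str
    is_key_error: bool = False  # set by a human annotator during replay


def build_recovery_sample(run, correct_action, reason):
    """Turn a failed run into one sample anchored at the first key error."""
    idx = next((i for i, s in enumerate(run) if s.is_key_error), None)
    if idx is None or idx + 1 >= len(run):
        return None  # nothing annotated, or no screen observed after the error
    return {
        # The history keeps the wrong step itself, not a cleaned-up prefix.
        "history": [(s.screen, s.action) for s in run[: idx + 1]],
        # The broken state the wrong step led to; later steps are ignored.
        "screen": run[idx + 1].screen,
        "target_action": correct_action,
        "reason": reason,
    }


failed_run = [
    Step("home", "open_app(shop)"),
    Step("shop_home", "tap(orders)", is_key_error=True),
    Step("orders_page", "scroll(down)"),
    Step("orders_page", "tap(first_order)"),
]
sample = build_recovery_sample(
    failed_run, "press_back()", "Orders is the wrong page; go back and search."
)
print(sample["history"][-1], "->", sample["target_action"])
```

Everything after the flagged step is dropped, which is where the annotation savings come from, and also where signal could be lost if a later mistake were independent of the first.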

7:13Juniper: And critically — they leave the wrong step in the history, don't they? They don't clean it up.

7:19Eric: That's the move. They keep the mistake in the context. So the model isn't learning "here's the right action from a clean start." It's learning "here's the right action from inside a mess you already made." Recovering from a bad state is a genuinely different skill than doing it right the first time, and this is data that teaches the different skill. But the piece that makes the whole philosophy click is the second one. Watch the screen for this. The student model is driving a real phone: tap, swipe, tap. Off to the side, a stronger teacher model is scoring every single step as it happens. For a while the scores are fine. Then the line drops, and stays low, several steps in a row. The student has clearly wandered off the path, and it hasn't noticed. Right there, the teacher grabs the wheel. It takes over for a bounded handful of steps, demonstrates how to recover from that exact broken state back onto a workable path, and then hands control straight back to the student.

8:23Juniper: It's a driving instructor with dual controls. The student does almost all the driving, so most of the trajectory stays in the student's own distribution. But the moment they drift off course, the instructor steers just enough to save it, then lets go.

8:40Eric: That's the image exactly. And what comes out the other end isn't a thumbs-up or thumbs-down label. It's a rich record: here's the error, here's why it's wrong, here's the sequence that climbs out of it. The authors are blunt that this is what conventional flywheels miss — by keeping only successes, they produce almost no data that bridges an error state back to a correct path. That bridge is the whole point.
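The dual-control loop can be sketched roughly as follows. This is a toy under stated assumptions: the `threshold`, `patience`, and `takeover_len` parameters, the callable signatures, and the integer "state" are all hypothetical stand-ins, not values or interfaces from the paper.

```python
def rollout_with_takeover(student, teacher, score, env_step, state,
                          max_steps=20, threshold=0.5, patience=3, takeover_len=4):
    """Student drives; after `patience` low-scoring steps in a row the teacher
    drives for `takeover_len` steps, then control returns to the student."""
    log, low_streak, teacher_left = [], 0, 0
    for _ in range(max_steps):
        if teacher_left > 0:
            driver, action = "teacher", teacher(state)
            teacher_left -= 1
        else:
            driver, action = "student", student(state)
            # The teacher-side scorer grades every student step as it happens.
            low_streak = low_streak + 1 if score(state, action) < threshold else 0
            if low_streak >= patience:
                teacher_left, low_streak = takeover_len, 0
        log.append((driver, state, action))
        state = env_step(state, action)
    return log


# Toy environment: the state is a position, and "right" is always the good move.
log = rollout_with_takeover(
    student=lambda s: "left",   # a student that is stuck making the wrong move
    teacher=lambda s: "right",
    score=lambda s, a: 1.0 if a == "right" else 0.0,
    env_step=lambda s, a: s + (1 if a == "right" else -1),
    state=0, max_steps=10, patience=3, takeover_len=2,
)
drivers = [driver for driver, _, _ in log]
print(drivers)
```

The log records who was driving at each step, so the teacher's bounded spans can later be extracted as recovery demonstrations that start from states the student itself reached.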

9:06Juniper: And the one honest caveat inside the mechanism — the "first key error" rule assumes later mistakes are basically always downstream of the first. That's intuitive, but the paper asserts it more than it proves it. If the recovery itself introduces a fresh, independent mistake, that rule might quietly throw away signal.

9:28Eric: Noted, and I'll sharpen that later. But as an organizing principle it's clean: the highest-value thing you can show a sequential decision-maker is how to get itself unstuck.

9:41Juniper: Okay. So we've got the phones, and we've got a way to manufacture recovery data. The technical core is how they actually train on it: three stages, and the payoff is a reward design that refuses to pay the model for clever reasoning until the basics check out. Let me set up the two flavors of learning first, because the whole pipeline rides on the difference. There are two ways to teach this model. Supervised fine-tuning is imitation: you show it correct examples and it copies them. Learning to drive by watching someone else drive. Reinforcement learning is learning by doing: the model tries things, gets a grade, and adjusts. And that grade is the reward: a number that says "that was good" or "that was bad." The trick across the whole pipeline is that they move from dense feedback to sparse feedback.

10:38Eric: And that distinction — dense versus sparse — is the spine of the three stages, so let's make it concrete. Dense feedback means you can judge a single move on its own: that tap was malformed, or that reasoning contradicts the action it took. Sparse feedback means you only find out at the very end whether the whole twenty-step task actually succeeded. Sparse is realistic, but it's brutal to learn from — you failed after twenty steps, now which step was to blame?

11:11Juniper: So stage one is plain imitation. Supervised fine-tuning on something like 1.2 million step-level examples teaches the output format and the basic operations. Then it gets more interesting.

11:25Eric: Stage two is Step RL, and here the feedback is dense: you're grading one response in isolation. Can I tell this action is wrong just by looking at the single move plus its reasoning? And this is where the reward design is quietly elegant. The naive way to score a response is to rate it on several dimensions (is it well-formed, is the action valid, is the reasoning coherent) and blend them with weights. But now you've got fragile weights and scales you can't calibrate. So instead they use what's basically an assembly-line inspection, cheapest checks first, and you stop at the first failure. Check one, a cheap parser: is the response even well-formed? If not, big penalty, done. Check two, still cheap: is the action structurally valid, does the reasoning have the right shape? Fail, smaller penalty, done. Only if both pass do they call the expensive part, an AI acting as judge, to ask the real question: does the reasoning actually match the state, and do the reasoning, the action, and the rest of the response all line up? Pass everything, positive reward.

12:41Juniper: So two things fall out of that ordering for free.

12:44Eric: Right — it bounds the cost, because the pricey judge only ever runs on responses that are already clean. And it enforces necessary conditions before sufficient ones. You never get rewarded for sophisticated reasoning while you're violating a basic rule. You can't earn points for a beautiful essay that's written in the wrong language.
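The assembly-line ordering can be sketched as a short-circuiting reward function. The JSON response format, the action vocabulary, the reward magnitudes, and the zero reward for a judge failure are all assumptions made for this example; the episode only gives the ordering and the relative size of the penalties.

```python
import json

ALLOWED_ACTIONS = {"tap", "swipe", "type", "back", "stop"}  # hypothetical action set


def cascade_reward(raw, judge, r_malformed=-1.0, r_invalid=-0.5, r_fail=0.0, r_pass=1.0):
    """Cheapest checks first; return at the first failure."""
    # Check 1 (cheap): does the response parse at all?
    try:
        resp = json.loads(raw)
    except json.JSONDecodeError:
        return r_malformed
    if not isinstance(resp, dict):
        return r_malformed
    # Check 2 (cheap): known action type and non-empty reasoning.
    action, reasoning = resp.get("action"), resp.get("reasoning")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return r_invalid
    if not isinstance(reasoning, str) or not reasoning.strip():
        return r_invalid
    # Check 3 (expensive): the judge only ever sees responses that are already clean.
    return r_pass if judge(resp) else r_fail


judge_calls = []


def toy_judge(resp):
    """Stand-in for a model-based judge; records each call so cost is visible."""
    judge_calls.append(resp)
    return "cart" in resp["reasoning"]


rewards = [
    cascade_reward("not json", toy_judge),
    cascade_reward('{"action": "fly", "reasoning": "x"}', toy_judge),
    cascade_reward('{"action": "tap", "reasoning": "tap the cart icon"}', toy_judge),
]
print(rewards, len(judge_calls))
```

In this toy run the judge is invoked once for three responses, which is the cost bound the ordering buys.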

13:08Juniper: And then stage three is where it goes real.

13:11Eric: Stage three is Agentic RL, and the difference from Step RL is just the feedback granularity: this is the sparse end. Now the grade is on the whole trajectory: did the multi-step task actually succeed? That's what forces long-horizon planning, carrying memory across pages, and recovering after a mistake. And the design principle they name is exactly the teaching intuition: don't grade someone on a whole road trip before they can reliably start the car. Each stage bootstraps the next.

13:46Juniper: There are two more bits of machinery here worth a plain gloss, because they show up in the results. One is the training objective itself. The one-line version: instead of nudging the model token by token, it grades the whole answer as one package, the reasoning plus the tap it produces, because a GUI action is right or wrong as a unit, not word by word.

14:11Eric: And it grades on a curve, essentially: it grades relative to a group. Sample a handful of tries for the same screen, score each one against the group average. Beat your peers, you get reinforced; fall below them, you get suppressed.
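Grading on a curve reduces to a small piece of arithmetic, sketched here with plain mean-centering. Published group-relative methods often also divide by the group's standard deviation; that is omitted because the episode only mentions the group average. The reward values are made up.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Score each sampled response against the average of its own group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four sampled responses for the same screen, each graded as a whole response.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # positive values are reinforced, negative ones suppressed
```

Each advantage is a single number for the entire response, so the reasoning and the action it produces are pushed up or down together rather than word by word.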

14:27Juniper: The other is curriculum sampling, which is the most human piece of the lot. They track how often the current model succeeds at each task, and shift practice time from the easy tasks toward the ones it's currently flunking — and "hard" is defined relative to where the model is right now, so the target keeps moving as it improves. A tutor who keeps drilling whatever you're failing this week.
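A rough sketch of difficulty-aware sampling follows. The rule used here, weighting each task by its current failure rate with a small floor, is an assumption chosen for simplicity; the episode does not give the paper's actual formula. Task names and rates are invented.

```python
import random


def curriculum_weights(success_rates, floor=0.05):
    """Weight tasks by current failure rate; the floor keeps easy tasks in rotation."""
    raw = {task: max(1.0 - rate, floor) for task, rate in success_rates.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}


def sample_batch(success_rates, k, rng):
    """Draw k practice tasks, favoring the ones the model currently fails."""
    weights = curriculum_weights(success_rates)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=k)


# Success rates measured on the current checkpoint (made-up numbers).
rates = {"open_settings": 0.95, "book_train": 0.40, "cancel_membership": 0.10}
weights = curriculum_weights(rates)
batch = sample_batch(rates, k=8, rng=random.Random(0))
print(weights)
print(batch)
```

Recomputing `rates` after each training round is what makes the target move: a task that was hard last week loses weight as soon as the model starts passing it.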

14:54Eric: So the recap before we look at results: the infrastructure guarantees real-world data, the flywheel manufactures recovery data out of failures, and the three-stage training teaches format first, single-step correction second, and full-task success last. Now, does any of it actually work?

15:15Juniper: Let's do it as a prediction. If the thesis is right, that training on the real distribution is what matters, then the model should look ordinary on the sanitized tests and pull away exactly where reality bites. And that's the pattern. On the public grounding benchmarks, Xiaomi-GUI-0 is middle of the pack; it even trails several models. But on RealMobile, their real-device benchmark, it hits seventy-two percent, against thirty-three for that comparable open model we opened with.

15:48Eric: And this is the beat I'd rewind for. Xiaomi-GUI-0 is a thirty-billion-parameter model, and a sparse one, with only about three billion parameters active at a time. On RealMobile, that model beats one frontier system sitting at sixty percent. It beats another at fifty-eight. It more than doubles one of the Claude versions sitting at thirty-three.

16:11Juniper: Wait — a mid-sized open model beating those frontier systems? On what grounds?

16:17Eric: On the grounds that it was trained on the exact distribution the benchmark draws from, and they weren't. Which is real, and it's also — I'll say it now — the seam in that comparison. Hold that thought.

16:30Juniper: And two results in there tell the honest story better than the headline. First: on basic UI operations, the taps and swipes, the foundation, Xiaomi-GUI-0 scores a hundred percent, matching the top proprietary models. The paper's own read is that basic operation no longer tells capable agents apart. It's saturated. The game has moved entirely to reasoning, memory, and recovery.

16:55Eric: Which is the irony worth sitting on. Everyone can drive the car now. Nobody's competing on whether you can tap the button anymore — they're competing on what you do when the road disappears.

17:08Juniper: And the second result is the one that keeps this from being a press release. The weakest domain, for every single model tested, is Safety and Reflection. That's the domain that asks whether an agent knows when not to proceed. Refusing to sign up for a paid membership it wasn't authorized to buy. Recognizing that a task is just impossible, like planning a cycling route from Beijing to New York.

17:34Eric: And nobody's close. Xiaomi-GUI-0 leads the open models there at about forty-four percent, but even the strongest frontier system only manages sixty-two and a half. The skill of self-restraint, of knowing when to stop, is unsolved across the board. That's a rare thing for a technical report to just... say out loud about its own numbers.

17:58Juniper: One more detail I liked, on that safety instinct: for payment tasks, the whole system stops execution at the final confirmation page. It does all the reasoning, gets you to the checkout, and then refuses to actually spend your money. The restraint is baked into the system, not just hoped for.

18:18Eric: So the results hold up on their own terms. Which is exactly where I want to push, because "on their own terms" is doing a lot of work. Here's the reservation, and I think it's the fair one. The most dramatic claim in this paper — that seventy-two-versus-thirty-three real-world gap — rests on a benchmark the same team designed, built, and scored. RealMobile is a hundred tasks. Some sub-dimensions have six or seven tasks in them. They report averages over four runs specifically to handle what they call the large evaluation variance — which is itself a tell that at this scale the numbers are noisy. It's not disqualifying; they clearly aimed for rigor, with executable rules and veto conditions for irrecoverable errors. But the headline is measured on a yardstick they own.
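A back-of-the-envelope check on why a hundred tasks is noisy. This treats each task as an independent pass/fail trial, which is a simplifying assumption and ignores the four-run averaging; the sub-dimension size of seven comes from the episode, while the 50% rate used for it is only a worst-case illustration.

```python
from math import sqrt


def binomial_se(p, n):
    """Standard error of a success rate p estimated from n independent tasks."""
    return sqrt(p * (1.0 - p) / n)


overall = binomial_se(0.72, 100)  # headline score over the full benchmark
small = binomial_se(0.50, 7)      # a sub-dimension with only seven tasks
print(f"overall: +/- {overall:.3f}, seven-task slice: +/- {small:.3f}")
```

Under these assumptions the headline figure carries an uncertainty of a few points per run, while a seven-task slice can swing by far more, which fits the caution about reading too much into the sub-dimension scores.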

19:13Juniper: And to be fair to them, they didn't hide the counter-evidence. On the public benchmarks they don't lead.

19:21Eric: They don't, and that's the second piece. They wave it off as distribution mismatch, English-and-desktop tests versus their Chinese-mobile target, which is plausible. But it means "we beat everyone" only holds on the metric they control. Then there's the teacher. The whole recovery flywheel distills from a stronger closed model grabbing the wheel. So how much of this is genuinely learned robustness, versus imitating a better model's judgment on exactly the dimension where the teacher is doing the correcting? The student may not be able to surpass the thing it's learning recovery from.

20:02Juniper: And the beating-bigger-models line runs into the same wall you flagged earlier.

20:08Eric: It does. The frontier systems were tested cold, as general-purpose models, on tasks drawn from a distribution Xiaomi-GUI-0 was purpose-built and trained on. Beating them there is real and it's useful, but it partly reflects task-specific training, not a clean head-to-head on equal footing. And the "first key error" assumption we flagged back at the flywheel, that later mistakes are always derivative: the efficiency gain is real, but the claim that it drops no signal is asserted, never measured. None of this makes the work wrong. It makes it a strong engineering-and-methodology contribution wearing a slightly triumphant benchmark.

20:51Juniper: And I'll concede all of that. It is an assembled system: the training objective, curriculum sampling, judge models, and teacher distillation are mostly borrowed. The novelty isn't a new algorithm. It's the organizing principle and the sheer infrastructure investment, which, honestly, is the part nobody can cheaply replicate. You need the phone farm.

21:15Eric: And that's the thing I can't fully resolve, so I'll leave it open: the most impressive result is also the least independently verifiable, because the only people who can build the environment to check it are the people who already built one.

21:31Juniper: Which is the right note to zoom out on. Because the durable takeaway here isn't the model, and it isn't even the seventy-two percent. It's the reframe. For an agent operating in a stateful, high-consequence world, learning how to recover from your own mistakes is a categorically different skill than learning how to do things right, and the training data for that skill does not exist unless you're willing to fail on real hardware and go pick through the wreckage. Static corpora over-represent correct actions in correct states, so models glide when they're on track and flail the moment they've slipped. Deliberately manufacturing the recovery data attacks the actual bottleneck.

22:18Eric: And that generalizes way past phones. Any agent in an environment with real consequences: the same logic applies. The failures aren't the noise around the signal. They might be the signal.

22:32Juniper: So here's the question to leave you with, because the whole paper turns on it. Is the path to trustworthy agents building expensive real-device farms like this one, accepting that reality is the only honest teacher? Or is that brute force a crutch, and the real win goes to whoever builds a simulator faithful enough that you never have to touch a real phone? If you've shipped anything that has to survive contact with a live app, you probably already lean one way. Drop where you land in the comments.

23:07Eric: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, plus our weekly and monthly roundups.

23:21Juniper: Quick housekeeping on the way out: this script was written by Anthropic's Claude, Eric and I are both AI voices from ElevenLabs, and the producer isn't affiliated with either company. The paper is the Xiaomi-GUI-0 Technical Report, posted yesterday, June 30th, 2026. We're recording July 1st.

23:42Eric: So the next time an agent aces its benchmark, ask the only question that counts: not how it does on the closed course, but whether it can find its way back after it takes the wrong exit. That's the road that's actually hard to drive. See you in the next one.