Episode 115 · Jun 04, 2026 · 24 min

Teaching a Phone Agent to Reason Silently, And Keeping It Honest

Yang, Hu, Hao et al.

Mobile GUI Agents

Concepts in this episode

AI Agents · AI Efficiency & Cost · Training Methods · Chain of Thought · Computer-Use Agents · Agent Benchmarks · Parallel Sampling · Inference Cost · Multimodal Models · Ablation Studies · Supervised Fine-Tuning · Iterative Refinement · Long-Horizon Tasks

About this episode

Paper

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Venue

arXiv:2606.04627

Year

2026

Read the paper

arxiv.org/abs/2606.04627

Also available on

Apple Podcasts · Spotify

Good mobile AI agents write a paragraph of reasoning before every tap, which makes them smart but painfully slow. This episode unpacks MIRAGE, which moves that reasoning into silent hidden vectors, parallelizes it with a century-old numerical trick, and forces it to stay sharp by predicting the next screen, matching the quality of written reasoning at roughly a fifth of the cost.

What you'll take away

Why stripping reasoning out of an agent doesn't just remove a bonus but actively drops it below the untouched base model (42.9 to 31)
How APLR borrows Jacobi iteration to parallelize sequential latent reasoning with a provable guarantee that the first K thought-slots are exact
The trick that keeps invisible reasoning honest: a throwaway 'world model' head that forces the silent slots to predict the next screen's features during training only
How the ablation table tells the whole thesis in five numbers, with the world model recovering the chain-of-thought score (52.6) to the decimal
Where the headline 'matches chain-of-thought' claim is fragile: it rests on a tie at a single benchmark number, and the slot-specialization story is shown correlationally, not proven
Why the latent scratchpad isn't free: dropping from nine slots to three craters success from 52.6 to 32.8

Chapters

00:00 The cost of agents that narrate every tap
03:01 Reasoning without words
06:02 APLR and the Jacobi iteration trick
09:03 The world model that keeps silent reasoning honest
12:04 Two-stage training and why ordering matters
15:05 The ablation table, five numbers that carry the argument
18:06 Where the claims are fragile
21:07 What travels beyond phones

References in this episode

Training Large Language Models to Reason in a Continuous Latent Space — The 'Coconut' paper named in the episode as MIRAGE's direct ancestor — the work
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents — The live on-device benchmark of 116 task instances across 20 apps that anchors e
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The foundational case for the visible 'show your work' reasoning that MIRAGE tri
AndroidControl: A Dataset for Mobile Device Control — The static, ground-truth-action benchmark behind the episode's 'cleanest single

Full transcript

0:00Juniper: Picture an AI agent driving your phone for you. You tell it, "send Alice an email," and it starts tapping through the interface — opening the mail app, finding the compose button, typing. But here's the thing about the good versions of these agents: before every single tap, they write out a little paragraph of reasoning. "I see the inbox. The compose button is in the bottom right. Tapping it should open a blank email." Then they tap.

0:26Finn: And that narration genuinely helps. It's the same trick that made language models better at math — make them show their work before they answer. The catch is the cost. That paragraph is roughly a hundred words the model has to generate one word at a time, each word a full pass through the network, before it's allowed to do anything at all.

0:47Juniper: Right, and the numbers are kind of brutal when you lay them out. The competing agents in this paper emit around ninety-six to a hundred-and-two tokens per step, and they take something like three to four-and-a-half seconds from the first token to the last — per action. Now string fifteen actions together to finish one task. Every tap comes with its own little essay. The whole thing drags.

1:11Finn: So the paper we're talking about today gets that number down to twenty-one tokens per step, and latency down to about one-point-eight seconds. Same task quality. A fraction of the talking.
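Show-notes sketch: the per-step figures quoted above can be rolled up into a rough per-task cost. This is only back-of-envelope arithmetic on the numbers from the episode. It assumes the illustrative fifteen-step task the hosts mention and the same step count for every agent, neither of which is a measurement from the paper.

```python
# Back-of-envelope per-task cost from the per-step figures quoted in the episode.
STEPS_PER_TASK = 15  # assumption: the illustrative task length the hosts mention


def per_task(tokens_per_step, seconds_per_step, steps=STEPS_PER_TASK):
    """Return (total tokens, total seconds) for one task of `steps` actions."""
    return tokens_per_step * steps, seconds_per_step * steps


# Competing agents: about 96 to 102 tokens and 3 to 4.5 seconds per step.
baseline_low = per_task(96, 3.0)
baseline_high = per_task(102, 4.5)
# MIRAGE: 21 tokens and about 1.8 seconds per step.
mirage = per_task(21, 1.8)

print("baseline tokens, seconds:", baseline_low, "to", baseline_high)
print("MIRAGE tokens, seconds:  ", mirage)
print("MIRAGE tokens as a share of the low baseline:", round(21 / 96, 2))
```

Under these assumptions the token bill per task drops from roughly fifteen hundred to a little over three hundred, which is where the "roughly a fifth" framing comes from.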

1:22Juniper: And before we get into how: the paper went up on arXiv on June third, twenty-twenty-six, and we're recording the very next day, June fourth. Quick note on what you're hearing: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices (me, Juniper, and my co-host Finn) are both AI voices from ElevenLabs. The producer isn't affiliated with either company. The paper itself is called "MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models." And that word, mirage, is earning its keep, because the whole move here is making the reasoning vanish from view while it's still happening.

2:04Finn: So that's the puzzle in one line. Keep the benefit of step-by-step reasoning, drop the cost of writing it down. Can the model think hard before it acts — but think silently, inside its own head, instead of typing out every thought?

2:20Juniper: That's exactly the framing. And the honest version of the question has a second half, which is the part I find more interesting. If the reasoning is now invisible — if it never becomes words you can read — how do you make sure it's still actually reasoning? How do you stop it from quietly collapsing into mush?

2:40Finn: Hold that second question, because it's where the cleverest idea in the paper lives. But let's build the picture from the bottom. What does it even mean for a model to "think silently"?

2:52Juniper: Okay. So here's the mental image. When you do arithmetic in your head — say you're splitting a check — you don't narrate "carry the one" out loud. The intermediate steps happen somewhere inside, and only the final number comes out of your mouth. The intermediate work is real; it's just not externalized.

3:11Finn: And a language model is actually already doing something like that. Everything it "knows" at any moment lives in these high-dimensional internal vectors — hidden states. When it writes out a chain of reasoning, it's taking that internal computation and translating it into words. That translation is extra work, and it's lossy.

3:32Juniper: That's the key reframe. Reasoning doesn't have to be words. So MIRAGE does this: it reserves a small handful of continuous hidden vectors — the paper calls them latent slots — as a kind of silent scratchpad. The model thinks in those slots, and crucially, those slots never get turned into vocabulary. They never become text. The action the agent takes gets conditioned on those silent vectors instead of on a written-out rationale.

4:00Finn: There's a direct ancestor here worth naming. A paper called Coconut, from a couple years back, was the one that first taught models to reason in continuous vectors instead of words. MIRAGE is standing on that shoulder. But — and this is where it gets good — Coconut had a real problem.

4:18Juniper: Which was?

4:19Finn: Speed. The "proper" way to fill those latent reasoning slots is serial. You compute slot one. You write it into the sequence. Then you compute slot two — which is now allowed to look back at slot one. Then slot three, which can see one and two. And so on. One full forward pass through the network per slot.

4:38Juniper: So you've moved the reasoning out of words, which saves you the cost of decoding all those tokens one by one, but you're still grinding through the slots one at a time. You traded the verbosity tax for a different sequential tax.

4:51Finn: Exactly. And this is the first genuinely clever contribution. The authors call it APLR — Approximate Parallel Latent Refinement. And the insight comes from noticing a structural fact about how these models work. The attention is causal, which just means each position can only look backward — at earlier positions, never later ones. So slot three can depend on slots one and two, but one and two can never depend on three. The dependencies all point one direction.

5:20Juniper: A one-way street.

5:21Finn: A one-way street. And once you see that, you can borrow a trick that's been sitting in numerical methods for over a century — Jacobi iteration. Here's the analogy I'd hold onto. Picture a row of people, each trying to fill in a number, and each person's correct answer depends only on the people to their left.

5:40Juniper: Okay, so the slow way is: leftmost person finishes, then the next person looks left and finishes, then the next —

5:47Finn: One at a time, all the way down the row. That's the serial method. The fast way is: everybody writes down a guess simultaneously. Then everybody revises at the same time, using the latest guesses from the people to their left. And you just repeat that a few rounds.

6:02Juniper: And the beautiful part is what happens round by round. The leftmost person depends on nobody, so after round one, they're already correct — guaranteed.

6:11Finn: Right. And after round two, the second person is correct, because the only thing they needed was the first person, who locked in last round. After round three, the third person locks in. So the guarantee is exact: after K rounds, the first K people in the row are provably, exactly right. Only the people further down the line are still carrying some error.

6:34Juniper: So with three parallel passes, the first three thought-slots are mathematically identical to what the slow one-at-a-time method would have given you.

6:43Finn: That's the proposition they prove, and it's unusually clean for a systems paper. Information advances by exactly one slot per round. The cost no longer scales with how many slots you have — it scales with how many refinement rounds you run, which they set to three. Reasoning capacity and compute just got decoupled.
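Show-notes sketch: the "first K slots are exact" property can be checked on a toy version of the row-of-people analogy. This is not the paper's APLR code. The function `slot_update` is a hypothetical stand-in for whatever the network computes for one slot, and the only property it shares with the real thing is that slot i depends solely on slots to its left. In a real model each refinement round would be one batched forward pass; here a list comprehension plays that role.

```python
import math


def slot_update(i, earlier):
    """Toy stand-in for one slot's computation; it sees only earlier slots."""
    return math.tanh(0.1 * (i + 1) + 0.3 * sum(earlier))


def serial_fill(n_slots):
    """Reference method: fill slots one at a time, left to right."""
    slots = []
    for i in range(n_slots):
        slots.append(slot_update(i, slots))
    return slots


def parallel_refine(n_slots, n_rounds, init=0.0):
    """Jacobi-style method: recompute every slot at once from the last round."""
    slots = [init] * n_slots
    for _ in range(n_rounds):
        previous = slots
        slots = [slot_update(i, previous[:i]) for i in range(n_slots)]
    return slots


exact = serial_fill(9)
approx = parallel_refine(9, n_rounds=3)
errors = [abs(a - e) for a, e in zip(approx, exact)]

print("errors on the first three slots:", errors[:3])
print("largest error in the tail:", max(errors[3:]))
print("nine rounds reproduce the serial result:", parallel_refine(9, 9) == exact)
```

Each round locks in one more slot from the left, so three rounds leave the first three slots identical to the serial result while the remaining six still carry error. That leftover tail is what the next part of the conversation is about.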

7:03Juniper: But you said it yourself — only the front of the row is exact. There's a tail. The slots further down still carry error after three rounds.

7:12Finn: And that tail error is the whole reason the second big idea exists. Juniper, this is the part you flagged earlier — the worry about the reasoning quietly going lazy.

7:22Juniper: Yeah. So think about what the model is being trained on. The main training signal is: did you pick the right action? Did you tap the right button? That signal is great at correcting any error that would change which button you tap. But it's completely blind to errors that don't change the next action.

7:41Finn: And where do those invisible errors live?

7:44Juniper: In the tail. In exactly those later slots that APLR didn't fully refine. The action loss doesn't supervise them, because they don't move the immediate action token. So you've got these slots that are under-corrected by the parallel shortcut and under-supervised by the training signal. That's the recipe for representational mush.

8:04Finn: So how do they keep them honest?

8:06Juniper: This is the move I think is the heart of the paper. They bolt on a second little head — built from something called a Q-Former, which is just a lightweight attention module that probes a representation with learnable questions. And its job, during training only, is to take the silent thought-slots and predict what the next screen will look like.

8:27Finn: When you say predict the next screen — predict it how? Are they rendering a future screenshot?

8:34Juniper: No, and this is important, because "world model" makes people imagine a system vividly painting future phone screens pixel by pixel. That's not it. They predict the features of the next screen — a compressed description in the model's own internal feature language, not the actual image. The analogy I like: instead of painting a photorealistic picture of the kitchen you're about to walk into, you just jot down "kitchen, fridge on the left, table in the center." Same information you need, none of the rendering cost. No image generator required.

9:10Finn: And it's checked against the real next screen.

9:13Juniper: Against the real next screenshot, run through the model's own frozen vision encoder, used as a fixed target. So the thought-slots are forced to encode not just "what to tap" but "what will the world look like after I tap it." And here's the elegant part — that predict-the-future task naturally lives in the later slots. The exact slots that were under-supervised. So the world-model head delivers dense correction signal precisely where action-imitation was leaving a gap.

9:43Finn: It's a targeted patch. Not "let's supervise everything harder" — it's "let's supervise the specific part the shortcut left soft."

9:52Juniper: And the counterintuitive kicker: the entire world-model head gets thrown away at inference. It exists only during training. The cleanest way I can frame it is a factory quality inspector — someone who stress-tests the parts the fast assembly line didn't fully finish, makes sure they behave, and then is absolutely not shipped inside the product. At deployment, the agent just fills its silent slots and decodes the action. No rationale text, no world model, nothing.
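Show-notes sketch: one way to picture the training-only head is as an extra term in the loss that is simply absent at inference. The code below is a minimal illustration, not the paper's objective. The squared-error distance, the `aux_weight` value, and all the numbers are assumptions made for the example, and the Q-Former itself is left out: its output is represented by the `predicted_feats` list.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the gold action under a softmax over logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def mean_squared_error(predicted, target):
    """Average squared distance between predicted and target feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def training_loss(action_logits, gold_action, predicted_feats, next_screen_feats,
                  aux_weight=0.5):
    """Action imitation plus an auxiliary next-screen feature term."""
    action_term = cross_entropy(action_logits, gold_action)
    world_term = mean_squared_error(predicted_feats, next_screen_feats)
    return action_term + aux_weight * world_term


def choose_action(action_logits):
    """Inference path: no world-model head, just take the best-scoring action."""
    return max(range(len(action_logits)), key=lambda i: action_logits[i])


# Hypothetical values for a single training example.
logits = [2.0, 0.5, -1.0]         # scores for three candidate actions
predicted = [0.2, 0.9, 0.1, 0.4]  # head's guess at the next screen's features
target = [0.0, 1.0, 0.0, 0.5]     # frozen-encoder features of the real next screen

print("training loss:", round(training_loss(logits, 0, predicted, target), 4))
print("action at inference:", choose_action(logits))
```

The design point the hosts describe shows up in the shape of the code: `training_loss` needs the next screen's features, while `choose_action` never touches them, so nothing about the head has to ship with the agent.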

10:22Finn: Training wheels you take off — but specifically training wheels for the wheel that was wobbling.

10:29Juniper: That's it. And to put the whole architecture together, because there are three pieces the listener has to hold at once: there's the vision-language model backbone — the eyes and hands that read the screenshot and the goal. There's the latent slots — the silent scratchpad. And there's the Q-Former — the inspector that shapes the scratchpad during training and then disappears. The whole thing is trained in two stages, and the ordering matters.

10:59Finn: Walk through the two stages, because the ordering is doing real work.

11:04Juniper: Stage one is totally normal. You fine-tune the model to produce visible, written-out reasoning — the old expensive way — but in a structured three-part format. Observation: what's on the screen. Rationale: why this action. And predict: what the screen will look like afterward. That third field is the conceptual bridge — it's teaching the model to think forward, in words, first.

11:29Finn: So stage one teaches the shape of good reasoning, out loud.

11:33Juniper: Right. Then stage two does the surgery. You rip out that written thought block and replace it with the learnable latent slots, and you keep training. You're gradually migrating the reasoning from words into hidden space — instead of asking the model to discover a continuous thought-space completely cold. It already knows what good reasoning looks like; now it learns to do it silently.
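Show-notes sketch: the difference between the two stages is easiest to see in what the model is asked to emit. The formats below are invented for illustration. The field labels follow the observation, rationale, predict structure described in the episode, but the exact strings, the `<latent_i>` placeholders, and the `tap(compose_button)` action syntax are hypothetical. In the real system the slots are continuous vectors, so the placeholders only mark positions.

```python
N_SLOTS = 9  # slot count the episode quotes for the four-billion model


def stage_one_target(observation, rationale, prediction, action):
    """Stage one: reasoning is written out in three labeled fields."""
    return (
        f"Observation: {observation}\n"
        f"Rationale: {rationale}\n"
        f"Predict: {prediction}\n"
        f"Action: {action}"
    )


def stage_two_target(action, n_slots=N_SLOTS):
    """Stage two: the written block is replaced by latent slot positions."""
    slots = "".join(f"<latent_{i}>" for i in range(n_slots))
    return f"{slots}\nAction: {action}"


verbose = stage_one_target(
    "I see the inbox.",
    "The compose button is in the bottom right.",
    "Tapping it should open a blank email.",
    "tap(compose_button)",
)
silent = stage_two_target("tap(compose_button)")

print(verbose)
print(silent)
```

Both targets end in the same action. What changes between stages is only whether the thinking in front of it is text or a fixed budget of slot positions.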

11:58Finn: And that ordering turns out to be load-bearing — which brings us to the single best-told story in this paper. The ablation table. Juniper, this is the part where the whole argument either holds together or doesn't, and it's worth going slow.

12:14Juniper: Go for it.

12:15Finn: So they run this on AndroidWorld — and quick grounding, because it's not a household name. AndroidWorld is a live, on-device benchmark. A hundred-and-sixteen real task instances across twenty actual apps, and it measures end-to-end success — did the task actually get done. So these are real success rates, not proxy scores. The base model, untouched, scores forty-two-point-nine. That's your starting line. Now, step one of the ablation: strip out the reasoning entirely. Just train it to map screen straight to action, no thinking at all. What do you think happens?

12:53Juniper: I'd guess it drops a little. You've removed a helpful crutch.

12:57Finn: It collapses to thirty-one. It gets worse than the untouched base model. Stripping out reasoning doesn't just remove a bonus — it actively degrades the model below where it started.

13:09Juniper: That's a striking result, and it tells you something deep. Reasoning isn't a bolt-on accessory. The model's competence is built around it. It's like taking an experienced driver and forbidding them from ever looking ahead or planning — you don't just lose the planning, the retraining to act on pure reflex damages skills they already had.

13:32Finn: Exactly. So that's the floor. Now climb back up. Add explicit, written chain-of-thought — the expensive verbose way — and you jump to fifty-two-point-six. That's the prize. That's what we're trying to match without paying for it. Now move the reasoning into silent slots the slow serial way, Coconut-style: fifty-point-nine. Almost all of it. Silent reasoning nearly matches spoken reasoning.

13:58Juniper: And then APLR — the parallel shortcut.

14:01Finn: APLR alone, no world model, slips to forty-eight-point-two. So the parallel approximation costs you something — that tail error we talked about is real, it shows up as a few points of lost performance. And then you add the world model back in. Full MIRAGE: fifty-two-point-six.

14:19Juniper: It recovers the gap exactly. Right back to the explicit chain-of-thought number.

14:25Finn: To the decimal. Removing reasoning makes it worse than nothing. Explicit reasoning helps. Silent reasoning nearly matches. The parallel shortcut loses a little. And the world model closes the gap precisely. That's the entire thesis of the paper told in five numbers.
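Show-notes sketch: the five ablation numbers just read out, laid side by side with the untouched base model. The scores are the AndroidWorld success rates quoted in the episode. The row labels are paraphrases, not the paper's own names for the configurations.

```python
BASE = 42.9  # untouched base model on AndroidWorld, as quoted in the episode

ablation = [
    ("no reasoning (screen to action)", 31.0),
    ("explicit chain-of-thought", 52.6),
    ("serial latent slots", 50.9),
    ("APLR, no world model", 48.2),
    ("full MIRAGE (APLR + world model)", 52.6),
]

for name, score in ablation:
    print(f"{name:34s} {score:5.1f}   {score - BASE:+5.1f} vs. base")

scores = dict(ablation)
recovered = scores["full MIRAGE (APLR + world model)"] - scores["APLR, no world model"]
print("points recovered by the world model:", round(recovered, 1))
```

Read top to bottom, the last column goes negative once, then positive four times, and the final row lands on the same score as the explicit chain-of-thought row.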

14:42Juniper: And the cost side of that trade is the payoff. Matching that fifty-two-point-six, MIRAGE cuts the tokens per step from around a hundred down to under thirty. The bigger eight-billion-parameter model goes even higher, from forty-seven-point-six up to fifty-seven-point-eight, on roughly a quarter of the tokens.

15:02Finn: There's one detail in those eight-billion numbers I want to flag, because it preempts the obvious objection. You might think: maybe MIRAGE is just winning by taking shortcuts — fewer steps, less to go wrong. But the eight-billion model actually uses slightly more steps than its baseline. Almost fourteen versus about twelve-and-a-half. It takes a longer path and still wins. So the gain isn't from cutting corners on the task — it's from reasoning better per step.

15:31Juniper: And on the static benchmark, AndroidControl — that's the one with ground-truth action sequences — the low-level action accuracy goes from about seventy-five percent to ninety-one percent, using one-sixth the tokens. That's the cleanest single line in the whole results section.

15:49Finn: One more number that I think is genuinely important, because it tells you the latent space isn't free. They tried shrinking the scratchpad. Drop the four-billion model from nine latent slots down to three, and success craters from fifty-two-point-six to thirty-two-point-eight.

16:07Juniper: So the silent reasoning genuinely needs room. Three slots isn't enough capacity to hold the thought.

16:13Finn: Right — it's not a magic compression where you get the reasoning for free in a tiny space. The thought needs real continuous bandwidth. And similarly, that third refinement pass earns its place: the eight-billion model goes from forty-six-point-six with two passes to fifty-seven-point-eight with three. The "first K slots are exact" guarantee isn't just pretty math — that third round is buying you eleven points of success.

16:41Juniper: Okay. So that's the case for the defense, and it's a strong one. But you've been the skeptic-in-residence all episode, Finn — where does this not hold up as well as the headline suggests?

16:54Finn: A few places, and I want to be specific, because these are real pushbacks, not nitpicks. The first one: that headline claim — "matches explicit chain-of-thought" — rests on a tie at a single number. Full MIRAGE and explicit chain-of-thought both land on exactly fifty-two-point-six on AndroidWorld. A tie is a fair thing to claim. But the serial-latent version was fifty-point-nine and APLR-alone was forty-eight-point-two, sitting just below. So the margin by which "silent matches spoken" actually holds is a few points, and it leans heavily on this one benchmark configuration.

17:33Juniper: And the world model is doing real work to close that specific four-point gap — which is a point in the paper's favor, but it also means the claim is more fragile than "they're identical, done."

17:45Finn: Second pushback, and this is the one I'd press hardest. The story we told — that the latent slots specialize into observation, rationale, and prediction — is suggestive, not demonstrated. They visualize the slots and show that early, middle, and late slots cluster in separate regions, and that action types like tap and swipe and type form distinct groups. It's a nice picture. But the authors are commendably honest that it does not prove the slots carry that three-part structure.

18:17Juniper: Because clustering is correlational. To actually show slot five encodes the prediction, you'd want to reach in, perturb that specific slot, and measure whether the predicted-next-screen part degrades the way your theory says it should. Causal intervention, not just a map of where things land.

18:36Finn: Exactly. And they don't do that intervention. So the interpretive story is appealing and consistent with the evidence, but it's partly inferred. Worth holding loosely.

18:47Juniper: There's a third one I think is the most conceptually interesting, and it's about that word "world model." We were careful to say it predicts features, not pixels. But there's a subtle issue hiding in the choice of target.

19:01Finn: Say more, because I think this is the sharpest critique available.

19:06Juniper: The prediction target is the next screen's features, as seen through the model's own frozen vision encoder. So the thing it's predicting lives on the same manifold as the thing it's taking in — it's the model's own representation language on both ends. Which raises a fair question: is this teaching genuine environment dynamics — how the interface really behaves — or is it just teaching the model to be self-consistent in its own feature space? The bar for calling something a "world model" here is lower than pixel-level prediction or true generative imagination of the future.

19:44Finn: And the "describe the room instead of painting it" framing cuts both ways. Describing the room in your own private vocabulary is genuinely cheaper — but it's also an easier test than actually rendering it. The authors frame it as a deliberate design choice, and it clearly works as a regularizer. But a reviewer is right to ask what's really being learned.

20:08Juniper: To their credit, they're upfront about the boundaries. They list the limitations plainly. It's supervised-only — the agent learns to imitate good trajectories, not to optimize for success through trial and error, so there's no reinforcement learning in the loop. The world modeling is feature-level, not pixel-level. And it only looks one step ahead — next-frame prediction, not a long horizon.

20:34Finn: And the one they flag but explicitly don't tackle: safety. This is an agent that taps and types on your phone autonomously. Before anything like this ships, you need real guardrails — privacy, action confirmation, the ability to stop it doing something irreversible. They name it and move on, which is the honest thing to do, but it's a large door left open.

20:58Juniper: And the tail-error guarantee, as clean as it is, is a local result. The math says which slots are exact under small errors near the training distribution. If the agent hits a screen wildly unlike anything it trained on, the algebra still holds, but whether the learned maps stay well-behaved out there is just untested. It's two benchmarks, at four and eight billion parameters. A research demonstration, not a deployed product.

21:28Finn: Which is the right thing to be honest about. The gains are real and the ablation is genuinely beautiful. But "matches chain-of-thought at a fifth of the cost" is a claim about this setting, at this scale, on these apps.

21:43Juniper: So let me try to pull out what I think actually travels beyond mobile phones — because that's where I land on why this matters. Two ideas. The first is the reframe of where reasoning should live. The field's default assumption has been: reasoning equals text you generate. MIRAGE is a clean argument that reasoning can instead be a structured budget of hidden computation — and that you can keep that hidden computation honest by tying it to a concrete predictive task, rather than letting it become an unsupervised black box. The pairing is the thing to remember: silent reasoning, made accountable by forcing it to predict the future.

22:25Finn: And the second idea is APLR on its own terms. Take a problem everyone assumes is inherently sequential — each thought depends on the last — and notice that the one-way dependency structure lets you parallelize it with a provable correctness guarantee. The first K answers lock in, front to back. That's not a fact about phones. That's a fact about any causal, triangular computation. I'd expect to see that trick show up somewhere completely unrelated within a year.

22:56Juniper: For anyone building agents that touch real interfaces — phones, browsers, desktops — the practical message is blunt. The difference between an agent that narrates a paragraph before every tap and one that thinks silently is the difference between something that feels like it's buffering and something that feels responsive. That's not cosmetic. That's whether the thing is usable at all.

23:21Finn: And the deeper bet underneath it — that the future you're trying to predict is the best teacher for the reasoning you can't see — that's an old idea in machine learning wearing a new outfit. It's nice to see it land somewhere this concrete.

23:37Juniper: That's MIRAGE. The reasoning didn't get worse when it went quiet — it just stopped charging you by the word.

23:44Finn: The show notes have a link to the paper and a few related reads if this caught you — the latent-reasoning lineage especially is worth pulling on.

23:53Juniper: And if you want the full transcript with every term defined inline, plus the concept pages that link this episode to the others we've done on reasoning and agents, that all lives on paperdive.ai.

24:06Finn: Thanks for spending the time with us.

24:08Juniper: This has been AI Papers: A Deep Dive.
