How Mobile World Model Guides GUI Agents?

Weikai Xu; Kun Huang; Yunren Feng; Jiaxing Li; Yuhan Chen; Yuxuan Liu; Zhizheng Jiang; Heng Qu; Pengzhi Gao; Wei Liu; Jian Luan; Xiaolin Hu; Bo An

arXiv:2605.10347·cs.AI·May 12, 2026

TL;DR

This paper investigates how the representation a mobile world model uses to predict future states affects GUI agent performance, and whether generated trajectories are useful for training and test-time guidance.

Contribution

It trains world models across four modalities (delta text, full text, diffusion-based images, and renderable code) and evaluates their downstream impact on AITZ, AndroidControl, and AndroidWorld, yielding insights into their use for training and online guidance.

Findings

01

Renderable code reconstruction offers high fidelity and effective supervision.

02

Text-based feedback is more robust for out-of-distribution execution.

03

Generated trajectories improve task performance but do not match the original data distribution.

Abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three…
