Code2World: A GUI World Model via Renderable Code Generation
Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

TL;DR
Code2World models GUI environments by generating renderable code for the next screen, which improves visual fidelity and structural control for autonomous agents and yields top-tier UI prediction and higher downstream navigation success rates.
Contribution
The paper presents Code2World, a vision-language model that generates renderable code for GUI prediction. It addresses data scarcity with a new AndroidCode dataset and uses reinforcement learning to improve visual-semantic fidelity.
Findings
Achieves top UI prediction performance rivaling GPT-5 and Gemini-3-Pro-Image.
Significantly improves navigation success rates, boosting Gemini-2.5-Flash by +9.5%.
Demonstrates the effectiveness of code-based GUI modeling for autonomous agents.
Abstract
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs for code prediction, we first perform SFT as a cold start for format layout following, then…
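The abstract mentions refining synthesized HTML through a visual-feedback revision mechanism but does not spell out the procedure here. The following is a minimal sketch of what such a loop could look like, not the authors' implementation. The function `revise_with_visual_feedback`, the callables `render`, `similarity`, and `revise`, and the `threshold` and `max_rounds` parameters are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RevisionResult:
    html: str      # best HTML found so far
    score: float   # similarity of its rendering to the target screenshot
    rounds: int    # number of revision attempts made


def revise_with_visual_feedback(
    initial_html: str,
    target: Any,
    render: Callable[[str], Any],              # hypothetical: HTML -> rendered image
    similarity: Callable[[Any, Any], float],   # hypothetical: score in [0, 1]
    revise: Callable[[str, float], str],       # hypothetical: model proposes a revision
    threshold: float = 0.9,                    # assumed acceptance threshold
    max_rounds: int = 3,                       # assumed revision budget
) -> RevisionResult:
    """Render the code, compare it with the real screenshot, and revise
    until the rendering is close enough or the budget is spent."""
    html = initial_html
    score = similarity(render(html), target)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        candidate = revise(html, score)
        candidate_score = similarity(render(candidate), target)
        rounds += 1
        # Keep a revision only if it renders closer to the target.
        if candidate_score > score:
            html, score = candidate, candidate_score
    return RevisionResult(html=html, score=score, rounds=rounds)
```

In practice `render` would wrap a headless browser and `similarity` an image or embedding metric. Passing them in as callables keeps the sketch independent of any particular library.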
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Games
