Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng; Li'an Zhong; Yi Wang; Rui Dai; Kaikui Liu; Xiangxiang Chu; Linyuan Lv; Philip Torr; Kevin Qinghong Lin

arXiv:2602.09856·cs.CV·February 11, 2026

TL;DR

Code2World models GUI environments by generating renderable code rather than text or pixels. This improves visual fidelity and structural control for autonomous agents, achieves state-of-the-art UI prediction, and raises navigation success rates.

Contribution

The paper presents Code2World, a vision-language model that predicts the next GUI state by generating renderable code. It addresses data scarcity with a new AndroidCode dataset and uses reinforcement learning to improve visual-semantic fidelity.

Findings

01. Achieves top UI prediction performance, rivaling GPT-5 and Gemini-3-Pro-Image.

02. Improves downstream navigation success rates, boosting Gemini-2.5-Flash by +9.5%.

03. Demonstrates the effectiveness of code-based GUI modeling for autonomous agents.

Abstract

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model gives agents human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to achieve high visual fidelity and fine-grained structural controllability at the same time. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address data scarcity, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format layout following, then…
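The abstract mentions a visual-feedback revision mechanism but this excerpt does not describe how it works. As a rough illustration only, the sketch below shows one generic way a render, compare, and revise loop could be structured. Every name here (`revise_until_faithful`, `pixel_agreement`, `toy_render`, `toy_revise`), the threshold, and the toy similarity metric are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical toy type: a "screenshot" is a grid of integers.
Image = List[List[int]]


@dataclass
class RevisionResult:
    html: str      # best candidate HTML found
    score: float   # its similarity to the target, in [0, 1]
    rounds: int    # revision round in which it was found (0 = initial draft)


def pixel_agreement(a: Image, b: Image) -> float:
    """Fraction of matching cells; a toy stand-in for a real visual metric."""
    total = sum(len(row) for row in a)
    same_shape = len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))
    if total == 0 or not same_shape:
        return 0.0
    same = sum(p == q for x, y in zip(a, b) for p, q in zip(x, y))
    return same / total


def revise_until_faithful(
    target: Image,
    draft_html: str,
    render: Callable[[str], Image],
    revise: Callable[[str, Image, Image], str],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> RevisionResult:
    """Render a draft, score it against the target, and revise while it falls short."""
    html = draft_html
    rendered = render(html)
    best = RevisionResult(html, pixel_agreement(rendered, target), 0)
    for round_idx in range(1, max_rounds + 1):
        if best.score >= threshold:
            break
        # The reviser sees both the current render and the target (assumed design).
        html = revise(html, rendered, target)
        rendered = render(html)
        score = pixel_agreement(rendered, target)
        if score > best.score:
            best = RevisionResult(html, score, round_idx)
    return best


# Toy renderer: summarizes a page as counts of two tag types (not a real browser).
def toy_render(html: str) -> Image:
    return [[html.count("<button>"), html.count("<input>")]]


# Toy reviser: appends whichever element the render is missing.
def toy_revise(html: str, rendered: Image, target: Image) -> str:
    if rendered[0][0] < target[0][0]:
        return html + "<button></button>"
    if rendered[0][1] < target[0][1]:
        return html + "<input>"
    return html


result = revise_until_faithful([[2, 1]], "<button></button>", toy_render, toy_revise)
print(result)
```

In a real pipeline, `render` would presumably be a headless browser and `revise` a model call conditioned on the rendered and target screenshots; both are passed in as callables here so the loop itself stays independent of those choices.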

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Games