Generative Visual Code Mobile World Models
Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

TL;DR
This paper introduces gWorld, a visual world model for mobile GUIs that predicts the next screen as renderable web code rather than pixels, which gives high-fidelity visuals and precise text rendering and lets it outperform larger models.
Contribution
The paper presents gWorld, the first open-weight visual GUI world model based on renderable code generation, along with a data synthesis framework and extensive evaluation results.
Findings
gWorld outperforms larger models in accuracy across benchmarks.
Scaling training data improves model performance.
Component-wise pipeline improvements enhance data quality.
Abstract
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual…
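As a rough illustration of the paradigm described in the abstract, the sketch below shows the shape of one world-model step: a screenshot and an action go in, and web code for the next screen comes out. Every name here (`Action`, `build_prompt`, `predict_next_state`, `stub_vlm`, `visible_text`) is hypothetical and is not taken from the gWorld release; the model is replaced by a stub that returns fixed HTML, and the final render-to-pixels step, which would need a browser engine, is left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action format; the paper's real schema may differ."""
    kind: str        # e.g. "tap" or "type"
    target: str      # free-text description of the UI element
    text: str = ""   # payload for "type" actions


def build_prompt(action: Action) -> str:
    """Turn the action into an instruction for the (assumed) VLM."""
    detail = f' with text "{action.text}"' if action.text else ""
    return (
        f"Given the current screenshot, the user performs {action.kind} "
        f"on {action.target}{detail}. "
        "Output the next screen as a single HTML document."
    )


def predict_next_state(
    screenshot: bytes,
    action: Action,
    vlm: Callable[[bytes, str], str],
) -> str:
    """One world-model step: (screenshot, action) -> web code for the next screen."""
    return vlm(screenshot, build_prompt(action))


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes from the predicted markup."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> List[str]:
    """Return the text nodes of the predicted screen, in document order."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def stub_vlm(screenshot: bytes, prompt: str) -> str:
    """Stand-in for a real model call; always returns the same tiny page."""
    return "<html><body><h1>Settings</h1><button>Wi-Fi</button></body></html>"


if __name__ == "__main__":
    html = predict_next_state(b"", Action("tap", "the Settings icon"), stub_vlm)
    print(visible_text(html))
```

Because the prediction is markup, the text on the predicted screen exists as exact strings in the output and can be checked directly, which is one way to read the abstract's point about precise text rendering.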
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Generative Adversarial Networks and Image Synthesis
