Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

TL;DR
This paper explores discrete diffusion vision-language models (DVLMs) for GUI grounding, showing that a hybrid-masked variant outperforms its linear-masked counterpart and is competitive with autoregressive models across multiple datasets.
Contribution
It adapts the discrete diffusion model LLaDA-V to GUI grounding, introduces a hybrid masking schedule that combines linear and deterministic masking, and evaluates the result against linear-masked and autoregressive baselines.
Findings
Hybrid masking improves grounding accuracy by up to 6.1 points in Step Success Rate (SSR).
Diffusion models with increased steps and data reduce latency and improve accuracy.
Diffusion-based models are a promising alternative for GUI grounding tasks.
Abstract
Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over…
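To make the idea of combining linear and deterministic masking concrete, here is a minimal sketch, not the authors' implementation. It assumes the linear component masks each response token independently with probability t, and that the deterministic component always masks a fixed set of positions. The serialized target format, the `[MASK]` string, the function names, and the choice of which positions are forced are all hypothetical; the text above does not specify them.

```python
import random

MASK = "[MASK]"  # placeholder mask token (hypothetical)


def linear_mask(tokens, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def hybrid_mask(tokens, t, forced_positions, rng):
    """Linear masking plus a deterministic rule.

    Positions listed in forced_positions are always masked; every other
    position is masked with probability t, as in linear_mask.
    """
    forced = set(forced_positions)
    return [
        MASK if (i in forced or rng.random() < t) else tok
        for i, tok in enumerate(tokens)
    ]


# Hypothetical serialized target: an action followed by box coordinates.
target = ["CLICK", "[", "120", ",", "48", ",", "260", ",", "92", "]"]

# Assumption for illustration only: force-mask the last two coordinates.
forced_positions = [6, 8]

rng = random.Random(0)
noised = hybrid_mask(target, t=0.3, forced_positions=forced_positions, rng=rng)
print(noised)
```

With t set to 0, only the forced positions are masked, which isolates the deterministic part of the schedule; with an empty `forced_positions` list, `hybrid_mask` reduces to `linear_mask`.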
