Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Shrinidhi Kumbhar; Haofu Liao; Srikar Appalaraju; Kunwar Yashraj Singh

arXiv:2603.26211 · cs.CV · March 30, 2026

TL;DR

This paper explores the use of discrete diffusion vision-language models for GUI grounding, showing they outperform linear-masked variants and are competitive with autoregressive models across multiple datasets.

Contribution

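It adapts the discrete diffusion vision-language model LLaDA-V to GUI grounding, introduces a hybrid masking schedule that combines linear and deterministic masking, and shows that the resulting model outperforms its linear-masked variant while remaining competitive with autoregressive approaches.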

Findings

1. Hybrid masking improves grounding accuracy by up to 6.1 points in Step Success Rate (SSR).

2. Diffusion models with increased steps and data reduce latency and improve accuracy.

3. Diffusion-based models are a promising alternative for GUI grounding tasks.

Abstract

Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over…
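The abstract names a hybrid schedule that combines linear and deterministic masking but does not spell out the mechanics in the portion shown here. The sketch below is one possible reading, not the authors' implementation. It assumes that linear masking hides each response token independently with a probability drawn uniformly at random, and that deterministic masking hides a fixed set of positions, here the bounding-box coordinate tokens. The token format, the `<mask>` string, the `p_det` mixing probability, and all function names are hypothetical.

```python
import random

MASK = "<mask>"  # placeholder mask token (hypothetical name)


def linear_mask(tokens, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def deterministic_mask(tokens, positions):
    """Mask exactly the given positions and leave every other token visible."""
    hidden = set(positions)
    return [MASK if i in hidden else tok for i, tok in enumerate(tokens)]


def hybrid_mask(tokens, box_positions, p_det, rng):
    """Use deterministic masking with probability p_det, else linear masking.

    p_det and box_positions are illustrative assumptions, not values
    taken from the paper.
    """
    if rng.random() < p_det:
        return deterministic_mask(tokens, box_positions)
    t = rng.random()  # masking ratio drawn uniformly from [0, 1)
    return linear_mask(tokens, t, rng)


# Toy response: an action followed by four bounding-box coordinates.
# The serialization format is assumed for illustration only.
response = ["click", "(", "120", ",", "48", ",", "260", ",", "96", ")"]
box_positions = [2, 4, 6, 8]  # indices of the coordinate tokens

rng = random.Random(0)
masked = hybrid_mask(response, box_positions, p_det=0.5, rng=rng)
print(masked)
```

Under this reading, the deterministic branch forces the model to reconstruct the full box from the action and the punctuation alone, while the linear branch keeps the usual random corruption that a masked diffusion model trains on. How the paper actually mixes the two is not stated in the text above.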
