DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Hang Wu; Hongkai Chen; Yujun Cai; Chang Liu; Qingwen Ye; Ming-Hsuan Yang; Yiwei Wang

arXiv:2507.00008·cs.AI·September 8, 2025



TL;DR

DiMo-GUI is a training-free, modality-aware visual reasoning framework for GUI grounding. It reasons over textual and iconic elements separately and progressively zooms into regions around its initial predictions to resolve ambiguous cases, improving accuracy without additional training.

Contribution

The paper presents a training-free approach that separates a GUI's textual and iconic elements and applies hierarchical region refinement to improve grounding.

Findings

1. Consistent performance improvements on GUI grounding benchmarks.
2. Effective disambiguation in visually cluttered GUI layouts.
3. No additional training or annotations required.

Abstract

Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually…
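The zoom-in refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `focal_region`, `refine`, the `shrink` factor, and the fixed number of `steps` are all hypothetical, and `predict` stands in for any vision-language model call that takes a crop box and returns a point in full-image coordinates. The abstract says refinement is applied when predictions are ambiguous or incorrect; this sketch simplifies by always zooming a fixed number of times.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def focal_region(center: Point, bounds: Box, shrink: float = 0.5) -> Box:
    """Return a box `shrink` times the size of `bounds`, centered on `center`
    and shifted as needed so that it stays inside `bounds`."""
    left, top, right, bottom = bounds
    width = (right - left) * shrink
    height = (bottom - top) * shrink
    # Clamp the top-left corner so the region never leaves the parent box.
    x0 = min(max(center[0] - width / 2, left), right - width)
    y0 = min(max(center[1] - height / 2, top), bottom - height)
    return (x0, y0, x0 + width, y0 + height)


def refine(
    predict: Callable[[Box], Point],  # hypothetical model wrapper
    image_size: Tuple[float, float],
    steps: int = 2,
    shrink: float = 0.5,
) -> Tuple[Point, Box]:
    """Predict on the full image, then repeatedly zoom into a smaller
    region centered on the latest prediction and predict again."""
    region: Box = (0.0, 0.0, image_size[0], image_size[1])
    point = predict(region)
    for _ in range(steps):
        region = focal_region(point, region, shrink)
        point = predict(region)
    return point, region
```

In this sketch each candidate region is centered on the previous prediction, so a reasonable first guess keeps the target inside the crop while the element of interest occupies a growing share of the model's input. A modality-aware variant could run `refine` once with a text-focused prompt and once with an icon-focused prompt; how the two results would be combined is not specified in the truncated abstract and is left open here.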


Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Data Visualization and Analytics