Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou; Ruohan Wang; Boyuan Zheng; Yanan Xie; Cheng Chang; Yiheng Shu; Huan Sun; Yu Su

arXiv:2410.05243 · cs.AI · June 18, 2025 · 2 cites

TL;DR

This paper introduces UGround, a universal visual grounding model trained on a large GUI dataset, enabling GUI agents to perceive and interact with interfaces visually, outperforming existing models and agents.

Contribution

The paper presents a novel visual grounding approach for GUI agents, including a large dataset, a simple training recipe, and a model that surpasses existing methods in multiple benchmarks.

Findings

1. UGround outperforms existing visual grounding models by up to 20% absolute.
2. Agents with UGround, which rely only on visual input, outperform state-of-the-art agents.
3. The approach demonstrates the feasibility of human-like visual navigation in GUI agents.
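
To make the "up to 20%" figure concrete, the sketch below separates an absolute gain (percentage points) from a relative one. It assumes a common convention for scoring grounding, namely that a predicted click counts as correct when it lands inside the target element's bounding box; that convention, the helper names `hit` and `accuracy`, and all coordinates are illustrative assumptions, not results or code from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def hit(point: Point, box: Box) -> bool:
    """Assumed scoring rule: the predicted point must fall inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their paired target box."""
    if not boxes or len(preds) != len(boxes):
        raise ValueError("need one prediction per non-empty list of boxes")
    return sum(hit(p, b) for p, b in zip(preds, boxes)) / len(boxes)


# Made-up toy targets and predictions, only to exercise the arithmetic.
boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50),
         (60, 60, 70, 70), (80, 80, 90, 90)]
baseline_preds = [(5, 5), (25, 25), (0, 0), (0, 0), (0, 0)]    # 2 of 5 hit
improved_preds = [(5, 5), (25, 25), (45, 45), (0, 0), (0, 0)]  # 3 of 5 hit

baseline = accuracy(baseline_preds, boxes)
improved = accuracy(improved_preds, boxes)

# Absolute gain is a difference in percentage points;
# relative gain divides that difference by the baseline accuracy.
absolute_gain = (improved - baseline) * 100
relative_gain = (improved - baseline) / baseline * 100
print(f"absolute: {absolute_gain:.0f} points, relative: {relative_gain:.0f}%")
```

In this toy case the same change reads as 20 points absolute but 50% relative, which is why the qualifier "absolute" matters when comparing grounding models.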

Abstract

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a…
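
The grounding step the abstract describes, mapping a referring expression to coordinates and then acting at the pixel level, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `Grounder` signature, the normalized `[0, 1]` coordinate output, the `toy_grounder` lookup table, and the action dictionary are all hypothetical and are not the UGround API or output format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class Screenshot:
    """Hypothetical stand-in for a raw GUI screenshot; only its size matters here."""
    width: int
    height: int


# Assumed interface: (screenshot, referring expression) -> normalized (x, y) in [0, 1].
Grounder = Callable[[Screenshot, str], Tuple[float, float]]


def to_pixels(shot: Screenshot, norm_xy: Tuple[float, float]) -> Tuple[int, int]:
    """Convert normalized coordinates to a pixel position clamped to the screen."""
    x = min(max(int(norm_xy[0] * shot.width), 0), shot.width - 1)
    y = min(max(int(norm_xy[1] * shot.height), 0), shot.height - 1)
    return (x, y)


def toy_grounder(shot: Screenshot, expression: str) -> Tuple[float, float]:
    """Lookup-table placeholder standing in for a trained visual grounding model."""
    known = {
        "blue Submit button at the bottom": (0.5, 0.75),
        "search box in the top bar": (0.25, 0.05),
    }
    return known[expression]


def plan_click(
    shot: Screenshot, expression: str, grounder: Grounder
) -> Dict[str, Union[str, int]]:
    """Turn a planner's referring expression into a pixel-level click action."""
    x, y = to_pixels(shot, grounder(shot, expression))
    return {"action": "click", "x": x, "y": y}


shot = Screenshot(width=1280, height=720)
action = plan_click(shot, "blue Submit button at the bottom", toy_grounder)
print(action)
```

The point of the design is that nothing in the loop reads HTML or an accessibility tree: the only inputs are the screenshot and a natural-language description, and the only output is a screen position.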

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- This paper reframes GUI interaction as a pure visual grounding problem, challenging the conventional wisdom that additional textual representations are necessary.
- The authors develop a novel way to generate diverse referring expressions (REs) by categorizing them into visual, positional, and functional types, and they introduce an innovative hybrid data synthesis pipeline that combines rule-based and LLM-based approaches.
- This paper includes a comprehensive agent evaluation c

Weaknesses

- UGround relies on an external LLM planner and cannot operate independently as a GUI agent without training on downstream tasks. When combined with the scaling curve in Figure 5 on Web-Hybrid, it becomes challenging to enhance agent performance by merely increasing grounding data. Instead, improvements depend on the external LLM planner, which may limit the potential of the SeeAct-V framework.
- In the current model architecture, the authors have increased the input image size to 36 grids of C

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. Technical Innovation:
   - Novel hybrid synthesis pipeline combining rule-based and LLM-based approaches
   - Successful demonstration of cross-platform generalization without platform-specific training
   - Effective vision-only framework that eliminates dependency on HTML/accessibility trees
2. Experimental Rigor:
   - Comprehensive evaluation across multiple platforms and settings
   - Strong performance improvements (up to 20% absolute improvement in standard setting)
   - Thorough ablation studies on tra

Weaknesses

1. Data Efficiency:
   - Heavy reliance on large-scale synthetic data
   - Potential redundancy in web-based training data
   - Room for improvement in data deduplication and grouping
2. Limited Coverage:
   - Lack of desktop UI data in training
   - Incomplete handling of long-tail elements
   - Platform-specific icons and elements not fully addressed
3. Dependencies:
   - Reliance on external planner
   - No end-to-end training with downstream tasks
   - Limited standalone capability as a GUI agent

Reviewer 03 · Rating 5 · Confidence 4

Strengths

This paper proposes a vision-based GUI agent to avoid the limitations of language-based approaches. A large-scale dataset is collected through a carefully designed data collection method and used to train the visual grounding model. The paper is well-structured and clearly articulated, and it demonstrates the effectiveness of the proposed method.

Weaknesses

Limitations in Completeness

1. This paper compares *UGround* with other models in Table 2 and shows UGround’s universal grounding capability. However, these methods differ in both their model settings and the training data used. An ablation study is missing to clarify the contributions of the model design (specifically the image resolution setting) and the training data to UGround's performance.
2. Same issue as in 1. This paper proposes 3 types of REs for GUI elements. An ablation study is mi

Code & Models

Repositories

OSU-NLP-Group/UGround
Official

Models

Datasets

mlfoundations/Click-100k
dataset · 1.2k dl

Taxonomy

Topics: Speech and dialogue systems · Social Robot Interaction and HRI