Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

TL;DR
This paper introduces UGround, a universal visual grounding model trained on a large GUI dataset, enabling GUI agents to perceive and interact with interfaces visually, outperforming existing models and agents.
Contribution
The paper presents a novel visual grounding approach for GUI agents, including a large dataset, a simple training recipe, and a model that surpasses existing methods in multiple benchmarks.
Findings
UGround outperforms existing visual grounding models for GUI agents by up to 20% absolute.
Agents with UGround outperform state-of-the-art agents while using only visual input, even though those agents also rely on text-based input.
The approach demonstrates the feasibility of human-like visual navigation in GUI agents.
Abstract
Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a…
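To make the grounding step concrete, here is a minimal sketch of how a textual coordinate prediction could be turned into a pixel-level click and checked against a target element. It is an illustration only: the `(x, y)` output format, the 0 to 1000 normalization scale, and the helper names `parse_point`, `to_pixels`, and `hits` are assumptions made for this example, not details confirmed by the abstract.

```python
import re
from typing import Tuple

# Matches the first "(x, y)" pair in a string; integers or decimals.
_POINT = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(text: str) -> Tuple[float, float]:
    """Extract the first '(x, y)' pair from a model's text output (assumed format)."""
    match = _POINT.search(text)
    if match is None:
        raise ValueError(f"no coordinate pair found in {text!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(
    point: Tuple[float, float],
    image_size: Tuple[int, int],
    scale: float = 1000.0,  # assumed normalization range, not taken from the source
) -> Tuple[int, int]:
    """Map a normalized point to pixel coordinates, clamped to the image bounds."""
    x, y = point
    width, height = image_size
    px = min(max(int(x / scale * width), 0), width - 1)
    py = min(max(int(y / scale * height), 0), height - 1)
    return px, py


def hits(pixel: Tuple[int, int], box: Tuple[int, int, int, int]) -> bool:
    """Return True if the pixel lies inside the (left, top, right, bottom) box."""
    left, top, right, bottom = box
    x, y = pixel
    return left <= x <= right and top <= y <= bottom


if __name__ == "__main__":
    click = to_pixels(parse_point("(500, 250)"), image_size=(1920, 1080))
    print(click, hits(click, (900, 250, 1000, 300)))
```

A point-in-box check like `hits` is one simple way to score a predicted click against an element's bounding box; whether the paper's benchmarks use exactly this criterion is not stated in the text above.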
Peer Reviews
Decision: ICLR 2025 Oral
- This paper reframes GUI interaction as a pure visual grounding problem, challenging the conventional wisdom that additional textual representations are necessary.
- The authors develop a novel way to generate diverse referring expressions (REs) by categorizing them into visual, positional, and functional types. They also introduce an innovative hybrid data synthesis pipeline that combines rule-based and LLM-based approaches.
- This paper includes a comprehensive agent evaluation c
- UGround relies on an external LLM planner and cannot operate independently as a GUI agent without training on downstream tasks. When combined with the Scaling Curve in Figure 5 on Web-Hybrid, it becomes challenging to enhance agent performance by merely increasing grounding data. Instead, improvements depend on the external LLM planner, which may limit the potential of the SeeAct-V framework.
- In the current model architecture, the authors have increased the input image size to 36 grids of C
1. Technical Innovation:
   - Novel hybrid synthesis pipeline combining rule-based and LLM-based approaches
   - Successful demonstration of cross-platform generalization without platform-specific training
   - Effective vision-only framework that eliminates dependency on HTML/accessibility trees
2. Experimental Rigor:
   - Comprehensive evaluation across multiple platforms and settings
   - Strong performance improvements (up to 20% absolute improvement in standard setting)
   - Thorough ablation studies on tra
1. Data Efficiency:
   - Heavy reliance on large-scale synthetic data
   - Potential redundancy in web-based training data
   - Room for improvement in data deduplication and grouping
2. Limited Coverage:
   - Lack of desktop UI data in training
   - Incomplete handling of long-tail elements
   - Platform-specific icons and elements not fully addressed
3. Dependencies:
   - Reliance on external planner
   - No end-to-end training with downstream tasks
   - Limited standalone capability as a GUI agent
This paper proposes a vision-based GUI agent to avoid the limitations of language-based approaches. A large-scale dataset is collected through a carefully designed data collection method and used to train the visual grounding model. The paper is well-structured, clearly articulated, and demonstrates the effectiveness of the proposed method.
Limitations in Completeness
1. This paper compares *UGround* with other models in Table 2 and shows UGround’s universal grounding capability. However, these methods differ in both their model settings and the training data used. An ablation study is missing to clarify the contributions of the model design (specifically the image resolution setting) and the training data to UGround's performance.
2. Same issue as in 1. This paper proposes 3 types of REs for GUI elements. An ablation study is mi
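The reviews refer to three types of referring expressions (visual, positional, and functional) and to a rule-based component in the data synthesis pipeline. As a rough illustration of what rule-based RE templates could look like, the sketch below builds one expression per type from element metadata. The `Element` fields, the templates, and the function name are hypothetical and chosen for this example; they are not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Element:
    """Hypothetical metadata for one GUI element (fields are assumptions)."""
    tag: str                 # e.g. "button"
    text: str = ""           # visible label, if any
    color: str = ""          # dominant color, if known
    neighbor_text: str = ""  # text of an adjacent element
    aria_label: str = ""     # accessibility description of the element's purpose


def referring_expressions(el: Element) -> Dict[str, str]:
    """Build up to one RE per type using simple string templates."""
    res: Dict[str, str] = {}

    # Visual: describe how the element looks (color, label).
    if el.text or el.color:
        desc = " ".join(part for part in (el.color, el.tag) if part)
        if el.text:
            desc += f' labeled "{el.text}"'
        res["visual"] = f"the {desc}"

    # Positional: describe the element relative to a neighbor.
    if el.neighbor_text:
        res["positional"] = f'the {el.tag} next to "{el.neighbor_text}"'

    # Functional: describe what the element is used for.
    if el.aria_label:
        res["functional"] = f"the {el.tag} used to {el.aria_label.lower()}"

    return res


if __name__ == "__main__":
    button = Element(
        tag="button",
        text="Submit",
        color="blue",
        neighbor_text="Cancel",
        aria_label="Submit the form",
    )
    for kind, expression in referring_expressions(button).items():
        print(f"{kind}: {expression}")
```

Separating the expressions by type in this way would also make the ablation requested by the reviewer straightforward to set up, since each type can be dropped from the training mix independently.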
Code & Models
- 🤗 osunlp/UGround · model · 23 dl · ♡ 24
- 🤗 osunlp/UGround-V1-7B · model · 393 dl · ♡ 20
- 🤗 osunlp/UGround-V1-2B · model · 1.4k dl · ♡ 10
- 🤗 osunlp/UGround-V1-72B · model · 21 dl · ♡ 4
- 🤗 1ForrestW1/pground-endpoint1 · model
- 🤗 vocaela/Vocaela-500M · model · 40 dl · ♡ 3
- 🤗 mlx-community/UGround-V1-2B-bf16 · model · 20 dl
Taxonomy
Topics: Speech and dialogue systems · Social Robot Interaction and HRI
