GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao

TL;DR
GUI-Actor introduces a coordinate-free, attention-based visual grounding method for GUI agents that improves accuracy, generalization, and efficiency by learning to align visual patches with action tokens and using a verifier for candidate selection.
Contribution
The paper presents GUI-Actor, a novel coordinate-free visual grounding approach that leverages attention mechanisms and a grounding verifier, outperforming prior methods and enabling efficient fine-tuning.
Findings
Outperforms prior state-of-the-art methods on multiple GUI grounding benchmarks.
Generalizes well to unseen screen resolutions and layouts.
Achieves high performance with minimal fine-tuning of the action head.
Abstract
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions…
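As a rough illustration of the coordinate-free idea described in the abstract, the sketch below scores every visual patch against a single action-token vector, normalizes the scores with a softmax, and returns the highest-weighted patches as candidate action regions. It is a toy sketch under stated assumptions, not the authors' implementation: the paper's action head is learned, whereas this uses raw scaled dot products, and the names `patch_attention`, `top_k_regions`, and `patch_center` are hypothetical.

```python
import math

def patch_attention(actor, patches):
    """Softmax over scaled dot products between one actor vector and each patch vector."""
    scale = math.sqrt(len(actor))
    scores = [sum(a * p for a, p in zip(actor, patch)) / scale for patch in patches]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_regions(weights, grid_w, k=1):
    """Return the k highest-weight patches as (row, col, weight), best first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i // grid_w, i % grid_w, weights[i]) for i in ranked[:k]]

def patch_center(row, col, grid_h, grid_w):
    """Center of a patch as (x, y) in normalized [0, 1] screen coordinates."""
    return ((col + 0.5) / grid_w, (row + 0.5) / grid_h)

# Toy 2x3 grid of 2-d patch embeddings (assumed values, row-major order).
# The last patch, at row 1 / col 2, points the same way as the actor vector.
actor = [1.0, 0.0]
patches = [[0.0, 1.0], [0.1, 0.9], [0.0, 1.0],
           [0.2, 0.5], [0.0, 1.0], [3.0, 0.0]]

weights = patch_attention(actor, patches)
candidates = top_k_regions(weights, grid_w=3, k=2)
best_row, best_col, _ = candidates[0]
click_xy = patch_center(best_row, best_col, grid_h=2, grid_w=3)
```

Because the output is a distribution over patches rather than a coordinate string, several regions can carry weight at once, which is what allows more than one candidate to be proposed and later ranked.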
Code & Models
- 🤗 microsoft/GUI-Actor-7B-Qwen2.5-VL · model · 313 dl · ♡ 24
- 🤗 microsoft/GUI-Actor-7B-Qwen2-VL · model · 88 dl · ♡ 39
- 🤗 microsoft/GUI-Actor-2B-Qwen2-VL · model · 428 dl · ♡ 19
- 🤗 microsoft/GUI-Actor-3B-Qwen2.5-VL · model · 342 dl · ♡ 10
- 🤗 microsoft/GUI-Actor-Verifier-2B · model · 345 dl · ♡ 13
- 🤗 qianhuiwu/GUI-Actor-7B-Qwen2.5-VL-LiteTrain · model · 4 dl · ♡ 1
- 🤗 qianhuiwu/GUI-Actor-3B-Qwen2.5-VL-LiteTrain · model · 5 dl · ♡ 1
- 🤗 vocaela/Vocaela-500M · model · 40 dl · ♡ 3
- 🤗 lukas-agentix/GUI-Actor-7B-Qwen2.5-VL · model
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Gaze Tracking and Assistive Technology
Methods: ALIGN
