GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu; Kanzhi Cheng; Rui Yang; Chaoyun Zhang; Jianwei Yang; Huiqiang Jiang; Jian Mu; Baolin Peng; Bo Qiao; Reuben Tan; Si Qin; Lars Liden; Qingwei Lin; Huan Zhang; Tong Zhang; Jianbing Zhang; Dongmei Zhang; Jianfeng Gao

arXiv:2506.03143 · cs.CL · June 4, 2025

Open Access · 9 Models · 1 Dataset

TL;DR

GUI-Actor introduces a coordinate-free, attention-based visual grounding method for GUI agents that improves accuracy, generalization, and efficiency by learning to align visual patches with action tokens and using a verifier for candidate selection.

Contribution

The paper presents GUI-Actor, a novel coordinate-free visual grounding approach that leverages attention mechanisms and a grounding verifier, outperforming prior methods and enabling efficient fine-tuning.
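The verifier step mentioned above can be pictured as a re-ranking pass over the regions the grounding model proposes. The sketch below is illustrative only: the summary here does not say how the verifier scores a candidate, so `select_with_verifier`, `toy_verifier`, the fixed target point, and the assumption that the verifier returns one scalar per candidate are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) pixel position of a candidate click target


def select_with_verifier(
    candidates: Sequence[Point],
    instruction: str,
    verifier: Callable[[str, Point], float],
) -> Point:
    """Return the candidate that the verifier scores highest for the instruction."""
    if not candidates:
        raise ValueError("at least one candidate region is required")
    return max(candidates, key=lambda point: verifier(instruction, point))


def toy_verifier(instruction: str, point: Point) -> float:
    """Hypothetical stand-in for a learned verifier.

    It ignores the instruction and simply prefers points close to a fixed
    target, so that the selection logic can be exercised without a model.
    """
    target = (126.0, 42.0)
    return -((point[0] - target[0]) ** 2 + (point[1] - target[1]) ** 2)


candidates = [(98.0, 42.0), (126.0, 42.0), (14.0, 14.0)]
chosen = select_with_verifier(candidates, "click the Save button", toy_verifier)
```

Keeping the verifier behind a plain callable means the proposal step and the scoring step can be swapped independently, which is the only design point this sketch is meant to show.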

Findings

01. Outperforms state-of-the-art on multiple benchmarks.
02. Generalizes well to unseen screen resolutions and layouts.
03. Achieves high performance with minimal fine-tuning of the action head.

Abstract

One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions…
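To make the idea of aligning one dedicated token with visual patch tokens concrete, here is a minimal NumPy sketch of attention-style scoring over a patch grid. It is not the paper's action head: the scaled dot-product form, the function names, the 4×6 grid, the 28-pixel patch size, and the one-hot patch features are all assumptions chosen so the example is small and deterministic.

```python
import numpy as np


def actor_attention(actor_state: np.ndarray, patch_states: np.ndarray) -> np.ndarray:
    """Softmax over scaled dot products between one token vector and N patch vectors."""
    dim = actor_state.shape[-1]
    logits = patch_states @ actor_state / np.sqrt(dim)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def top_k_regions(weights: np.ndarray, grid_hw: tuple, patch_size: int, k: int = 2) -> list:
    """Map the k highest-weighted patches to (center_x, center_y, weight) in pixels."""
    _, cols = grid_hw
    order = np.argsort(weights)[::-1][:k]  # patch indices, highest weight first
    regions = []
    for idx in order:
        row, col = divmod(int(idx), cols)
        center_x = (col + 0.5) * patch_size
        center_y = (row + 0.5) * patch_size
        regions.append((center_x, center_y, float(weights[idx])))
    return regions


# Toy setup (hypothetical): a 4x6 patch grid with one-hot patch features.
grid_hw = (4, 6)
num_patches = grid_hw[0] * grid_hw[1]
patch_states = np.eye(num_patches)

# The token state points mostly at patch 9 and, more weakly, at patch 10.
actor_state = 5.0 * patch_states[9] + 2.0 * patch_states[10]

weights = actor_attention(actor_state, patch_states)
regions = top_k_regions(weights, grid_hw, patch_size=28, k=2)
```

Because the output is a distribution over patches rather than a coordinate string, several high-weight patches can be returned as candidate regions from a single pass, which matches the "one or more action regions" wording above.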

Code & Models

Datasets

cckevinn/GUI-Actor-Data (dataset, 756 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Gaze Tracking and Assistive Technology

Methods: ALIGN