SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng; Qiushi Sun; Yougang Chu; Fangzhi Xu; Yantao Li; Jianbing Zhang; Zhiyong Wu

arXiv:2401.10935 · cs.HC · February 26, 2024 · 1 citation

TL;DR

SeeClick is a visual GUI agent that automates tasks from screenshots alone, avoiding the dependence on structured data (e.g., HTML) that limits most existing GUI agents. It is strengthened with GUI grounding pre-training and evaluated on a new grounding benchmark, ScreenSpot.

Contribution

The paper presents SeeClick, a novel visual GUI agent that uses screenshots, along with GUI grounding pre-training and a new benchmark, ScreenSpot, to improve task automation across devices.

Findings

01 SeeClick outperforms baselines on ScreenSpot.

02 Advancements in GUI grounding improve downstream task performance.

03 Pre-training enhances visual GUI agent capabilities.

Abstract

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that…
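The abstract defines GUI grounding as locating a screen element from an instruction. As a rough illustration, the sketch below scores a grounding model by checking whether each predicted click point lands inside the target element's bounding box. This scoring rule, the normalized coordinate convention, and the names `Box` and `click_accuracy` are assumptions made for the example; this listing does not describe how ScreenSpot is actually scored.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box of a target element (hypothetical type).

    Coordinates are assumed to be normalized to [0, 1] relative to the
    screenshot width and height.
    """
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border counts as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(points: Sequence[Tuple[float, float]],
                   boxes: Sequence[Box]) -> float:
    """Fraction of predicted click points that fall inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(points, boxes))
    return hits / len(points)


if __name__ == "__main__":
    predictions = [(0.5, 0.5), (0.9, 0.9)]
    targets = [Box(0.4, 0.4, 0.6, 0.6), Box(0.0, 0.0, 0.2, 0.2)]
    print(click_accuracy(predictions, targets))
```

A point-in-box check is one simple choice among several. An overlap-based measure between predicted and target boxes would be an alternative when the model outputs regions rather than points.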

Code & Models

Repositories

njucckevin/seeclick (PyTorch, official)

Models

vocaela/Vocaela-500M
model · 40 downloads · 3 likes

Videos

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents· underline

Taxonomy

Topics: Interactive and Immersive Displays · Virtual Reality Applications and Impacts