SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

TL;DR
SeeClick is a visual GUI agent that automates tasks from screenshots alone, avoiding the dependence on extracted structured data (such as HTML) that limits existing agents. It advances GUI grounding through dedicated pre-training and a new benchmark.
Contribution
The paper presents SeeClick, a novel visual GUI agent that uses screenshots, along with GUI grounding pre-training and a new benchmark, ScreenSpot, to improve task automation across devices.
Findings
SeeClick outperforms baselines on ScreenSpot.
Advancements in GUI grounding improve downstream task performance.
Pre-training enhances visual GUI agent capabilities.
Abstract
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that…
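GUI grounding, as the abstract defines it, means locating a screen element from an instruction. One simple way to score it is to check whether a predicted click point lands inside the target element's bounding box. The Python sketch below is an illustration only, not the authors' evaluation code: the `Box` class, the `click_accuracy` function, and the normalized 0-to-1 screen coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a target element (normalized coordinates)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click counts as a hit when it falls inside or on the box edges.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Two made-up samples: the first click hits its target, the second misses.
    preds = [(0.5, 0.5), (0.9, 0.9)]
    boxes = [Box(0.4, 0.4, 0.6, 0.6), Box(0.0, 0.0, 0.2, 0.2)]
    print(click_accuracy(preds, boxes))
```

A point-in-box check is used here because a click only needs to land somewhere on the element. An agent that predicts a box instead of a point could be scored the same way by taking the box's center.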
Taxonomy
Topics: Interactive and Immersive Displays · Virtual Reality Applications and Impacts
