GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Hongxin Li; Yuntao Chen; Zhaoxiang Zhang

arXiv:2604.23941·cs.CV·April 28, 2026

TL;DR

GoClick is a lightweight, 230M-parameter visual grounding model designed for resource-constrained devices. It localizes GUI elements for autonomous agents with accuracy on par with significantly larger models.

Contribution

The paper introduces an encoder-decoder architecture and a data refinement pipeline for building an effective small-scale GUI grounding model, which outperforms models obtained by simply downsizing existing decoder-only VLMs.

Findings

01. GoClick matches larger models in grounding accuracy.

02. The encoder-decoder architecture outperforms decoder-only models at small scales.

03. Data refinement improves training quality and model performance.

Abstract

Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents that interact with GUIs. Deploying this capability directly on resource-constrained devices such as mobile phones is increasingly critical for GUI agents that require low latency. However, current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, which makes them impractical for on-device execution under memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight…
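To make the memory argument concrete, the following rough sketch estimates how much space the weights alone occupy for a 230M-parameter model versus a 2.5B-parameter one. The parameter counts come from the abstract. The precisions (fp32, fp16, int8) and the helper name `weight_memory_mib` are assumptions for illustration, since the precision used for deployment is not stated here. Activations and runtime buffers are ignored, so real footprints would be larger.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in MiB (hypothetical helper)."""
    return num_params * bytes_per_param / (1024 ** 2)


GOCLICK_PARAMS = 230_000_000        # parameter count quoted in the abstract
LARGE_VLM_PARAMS = 2_500_000_000    # lower bound for "large" VLMs in the abstract

# Assumed storage precisions; not taken from the paper.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

for name, nbytes in BYTES_PER_PARAM.items():
    small = weight_memory_mib(GOCLICK_PARAMS, nbytes)
    large = weight_memory_mib(LARGE_VLM_PARAMS, nbytes)
    print(f"{name}: 230M model ~{small:,.0f} MiB, 2.5B model ~{large:,.0f} MiB")
```

At any fixed precision the ratio between the two is simply the ratio of parameter counts, roughly 11x, which is the gap that matters on a memory-limited phone.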

Code & Models

Repositories

zjulihongxin/GoClick (GitHub)

Models
