TL;DR
GoClick is a lightweight, 230M-parameter visual grounding model designed for resource-constrained devices. It localizes GUI elements accurately and efficiently for autonomous agents.
Contribution
The paper introduces an encoder-decoder architecture and a data refinement pipeline that together yield an effective small-scale GUI grounding model, one that outperforms simply downsized decoder-only models.
Findings
With only 230M parameters, GoClick reaches grounding accuracy on par with significantly larger models.
The encoder-decoder architecture outperforms decoder-only models at small scales.
Data refinement improves training quality and model performance.
Abstract
Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight…
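To put the memory constraint mentioned in the abstract into rough numbers, the sketch below estimates the storage needed for model weights alone at the two parameter counts the abstract cites (230M and 2.5B). It assumes 16-bit weights (2 bytes per parameter), which the abstract does not state, and it ignores activations and any runtime overhead. The function name `weight_memory_mb` is hypothetical and used only for illustration.

```python
def weight_memory_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in megabytes (10^6 bytes).

    bytes_per_param=2 is an assumption (16-bit weights), not a figure
    taken from the paper.
    """
    return num_params * bytes_per_param / 1e6


# Parameter counts quoted in the abstract.
goclick_mb = weight_memory_mb(230_000_000)        # GoClick: 230M parameters
large_vlm_mb = weight_memory_mb(2_500_000_000)    # typical prior VLM: more than 2.5B

print(f"GoClick (230M params): about {goclick_mb:.0f} MB of weights")
print(f"2.5B-parameter VLM:    about {large_vlm_mb:.0f} MB of weights")
print(f"Ratio: about {large_vlm_mb / goclick_mb:.1f}x")
```

Under these assumptions the smaller model needs roughly a tenth of the weight storage. Actual on-device footprint depends on quantization, image resolution, and the inference runtime.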
