Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Zhiyuan Jiang; Shenghao Xie; Wenyi Li; Wenqiang Zu; Peihang Li; Jiahao Qiu; Siqi Pei; Lei Ma; Tiejun Huang; Mengdi Wang; Shilong Liu

arXiv:2512.05941·cs.CV·December 8, 2025

TL;DR

This paper introduces ZoomClick, a training-free method leveraging zoom properties to enhance GUI grounding, achieving state-of-the-art results and proposing a new benchmark for zoom adaptability in GUI models.

Contribution

The paper explores zoom as a prior for GUI grounding, proposes a training-free zoom-based method (ZoomClick), and introduces a benchmark for evaluating the zoom adaptability of GUI grounding models.

Findings

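01 ZoomClick significantly improves GUI grounding performance for both general vision-language models and specialized GUI grounding models.

02 The method achieves state-of-the-art results on several mainstream benchmarks.

03 The paper introduces GUIZoom-Bench, a benchmark for evaluating zoom adaptability.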

Abstract

Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1%…
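To make the idea of zooming as a training-free prior more concrete, here is a minimal Python sketch of an iterative zoom-in loop. It is not the authors' algorithm: the abstract names the four properties but does not define them, so the reading of `depth` (number of zoom steps), `shrink` (factor by which each crop side shrinks), and `min_crop` (smallest allowed crop side in pixels) is an assumption of this sketch, and the pre-zoom property is not modeled at all. The function name `zoom_click` and the `predict` callback, which stands in for any grounding model, are hypothetical.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), full-image pixels
Point = Tuple[float, float]


def zoom_click(
    predict: Callable[[Box], Point],
    width: float,
    height: float,
    depth: int = 3,          # assumed meaning: maximum number of zoom steps
    shrink: float = 0.5,     # assumed meaning: per-step scale factor for each crop side
    min_crop: float = 256.0, # assumed meaning: smallest allowed crop side, in pixels
) -> Point:
    """Return a click point in full-image coordinates after iterative zooming.

    `predict(box)` is a hypothetical stand-in for a grounding model: it is shown
    the region `box` and returns a point relative to that region's top-left corner.
    """
    box: Box = (0.0, 0.0, width, height)
    x, y = predict(box)  # first prediction on the whole screenshot
    for _ in range(depth):
        left, top, right, bottom = box
        abs_x, abs_y = left + x, top + y  # map the point back to full-image coordinates
        # Shrink the crop, but never below min_crop and never beyond the image itself.
        new_w = min(max((right - left) * shrink, min_crop), width)
        new_h = min(max((bottom - top) * shrink, min_crop), height)
        if new_w >= right - left and new_h >= bottom - top:
            break  # the crop cannot get any smaller
        # Center the new crop on the current prediction and clamp it inside the image.
        new_left = min(max(abs_x - new_w / 2, 0.0), width - new_w)
        new_top = min(max(abs_y - new_h / 2, 0.0), height - new_h)
        box = (new_left, new_top, new_left + new_w, new_top + new_h)
        x, y = predict(box)  # re-predict on the zoomed-in region
    return (box[0] + x, box[1] + y)
```

In practice, `predict` would crop the screenshot to `box`, optionally upscale it, query the model, and convert the model's output to crop-relative pixels. The sketch keeps every crop inside the screenshot and always maps the final point back to full-image coordinates, so the caller never needs to know how many zoom steps were taken.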


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Interactive and Immersive Displays