UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang; Bofan Chen; Zhengxi Lu; Tongbo Chen; Songqin Nong; Tao Jiang; Wenhao Xu; Weiming Lu; Jun Xiao; Yueting Zhuang; Yongliang Shen

arXiv:2604.14113·cs.CV·April 16, 2026

TL;DR

UI-Zoomer is an adaptive, uncertainty-driven zoom-in framework for GUI grounding that improves localization accuracy without additional training by selectively cropping based on prediction confidence.

Contribution

The paper introduces a training-free, uncertainty-driven adaptive zoom-in method that decides per instance whether to zoom in and how large a crop to use for GUI element localization.

Findings

01. Achieves up to +13.4% improvement on ScreenSpot-Pro

02. Demonstrates consistent gains across multiple datasets and models

03. Operates without additional training, reducing complexity

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box…
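To make the gating idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so everything below is an assumption chosen for illustration. Spatial consensus is taken to be mean pairwise IoU over the sampled boxes, token confidence is taken to be the geometric-mean token probability averaged over samples, and the two are fused by a weighted sum with hypothetical `weight` and `threshold` parameters. The `positional_spread` helper illustrates only the inter-sample term named in the abstract; the intra-sample term is truncated in the text above and is not sketched.

```python
from itertools import combinations
from math import exp
from statistics import mean, pstdev

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: list[Box]) -> float:
    """Assumed measure: mean pairwise IoU among sampled candidate boxes."""
    pairs = list(combinations(boxes, 2))
    return mean([iou(a, b) for a, b in pairs]) if pairs else 1.0


def token_confidence(logprobs: list[list[float]]) -> float:
    """Assumed measure: per-sample geometric-mean token probability, averaged."""
    return mean([exp(mean(lp)) for lp in logprobs])


def should_zoom(boxes: list[Box], logprobs: list[list[float]],
                weight: float = 0.5, threshold: float = 0.7) -> bool:
    """Trigger zoom-in when the fused confidence falls below a threshold.

    The weighted-sum fusion, `weight`, and `threshold` are hypothetical.
    """
    score = (weight * spatial_consensus(boxes)
             + (1.0 - weight) * token_confidence(logprobs))
    return score < threshold


def positional_spread(boxes: list[Box]) -> tuple[float, float]:
    """Inter-sample spread: population std. dev. of box centers in x and y."""
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    cy = [(b[1] + b[3]) / 2 for b in boxes]
    return pstdev(cx), pstdev(cy)


if __name__ == "__main__":
    agree = [(10.0, 10.0, 30.0, 30.0)] * 3
    scattered = [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
    print(should_zoom(agree, [[0.0, 0.0]] * 3))    # samples agree, tokens certain
    print(should_zoom(scattered, [[-2.0], [-2.0]]))  # samples disagree, tokens unsure
    print(positional_spread(scattered))
```

Under these assumptions, tightly clustered samples with high token probabilities skip the zoom-in step, while scattered or low-probability samples trigger it. A crop-sizing rule could then scale the crop with the measured spread, but how the paper combines the two variance terms is not stated in the truncated abstract.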

Code & Models

Repositories

zju-real/UI-Zoomer (GitHub)
