TL;DR
UI-Zoomer is an adaptive, uncertainty-driven zoom-in framework for GUI grounding that improves localization accuracy without additional training by selectively cropping based on prediction confidence.
Contribution
It introduces a training-free, uncertainty-based adaptive zoom-in method that dynamically adjusts crop sizes for better GUI element localization.
Findings
Achieves up to +13.4% improvement in localization accuracy on ScreenSpot-Pro
Demonstrates consistent gains across multiple datasets and models
Operates without additional training, reducing complexity
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box…
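The gate and crop sizing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so every name and constant here is hypothetical. The sketch assumes spatial consensus is the mean pairwise IoU among sampled boxes, that fusion is a weighted average with weight `alpha`, and that zoom-in triggers below a fixed `threshold`. The abstract is cut off before naming the intra-sample term, so the mean box width and height stand in for it purely as an assumption.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean, pstdev


@dataclass
class Candidate:
    """One stochastic prediction (hypothetical container, not from the paper)."""
    box: tuple          # (x1, y1, x2, y2) in pixels
    token_conf: float   # assumed: mean token probability in [0, 1]


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fused_confidence(cands, alpha=0.5):
    """Assumed fusion: weighted average of spatial consensus and token confidence."""
    pairs = list(combinations(cands, 2))
    consensus = mean(iou(p.box, q.box) for p, q in pairs) if pairs else 1.0
    token = mean(c.token_conf for c in cands)
    return alpha * consensus + (1 - alpha) * token


def should_zoom(cands, threshold=0.7):
    """Trigger zoom-in only when the fused confidence is low (threshold is illustrative)."""
    return fused_confidence(cands) < threshold


def crop_box(cands, image_size, k=3.0, min_size=64.0):
    """Size the crop from candidate spread plus a stand-in box-size term."""
    cx = [(c.box[0] + c.box[2]) / 2 for c in cands]
    cy = [(c.box[1] + c.box[3]) / 2 for c in cands]
    # Inter-sample positional spread: standard deviation of box centers.
    spread_x, spread_y = pstdev(cx), pstdev(cy)
    # Stand-in for the truncated intra-sample term: mean box width and height.
    size_x = mean(c.box[2] - c.box[0] for c in cands)
    size_y = mean(c.box[3] - c.box[1] for c in cands)
    width, height = image_size
    w = min(width, max(min_size, size_x + 2 * k * spread_x))
    h = min(height, max(min_size, size_y + 2 * k * spread_y))
    # Center the crop on the mean candidate center, clamped to the image.
    x1 = min(max(mean(cx) - w / 2, 0.0), width - w)
    y1 = min(max(mean(cy) - h / 2, 0.0), height - h)
    return (x1, y1, x1 + w, y1 + h)


agreeing = [Candidate((100, 100, 120, 120), 0.95) for _ in range(3)]
scattered = [
    Candidate((100, 100, 120, 120), 0.5),
    Candidate((300, 300, 320, 320), 0.5),
    Candidate((500, 100, 520, 120), 0.5),
]
tight_crop = crop_box(agreeing, (1920, 1080))
wide_crop = crop_box(scattered, (1920, 1080))
```

In this sketch, candidates that agree keep the gate closed and would yield only the minimum crop, while scattered low-confidence candidates open the gate and widen the crop in proportion to their spread. That mirrors the abstract's idea of tying both the trigger and the scale of zoom-in to uncertainty.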
