AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
Ruilin Yao, Shengwu Xiong, Tianyu Zou, Shili Xiong, Yi Rong

TL;DR
AutoFocus introduces an uncertainty-aware active visual search method for GUI grounding that adaptively refines spatial predictions using token-level perplexity without additional training.
Contribution
It presents a novel, training-free framework that models spatial uncertainty with token perplexity, enabling adaptive zooming and improved GUI grounding performance.
Findings
Consistent performance improvements on ScreenSpot-Pro and ScreenSpot-V2 datasets.
Effective uncertainty modeling enhances localization accuracy.
Outperforms existing zoom-in strategies in GUI grounding tasks.
Abstract
Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and…
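To make the idea of perplexity as a spatial-uncertainty signal concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each sampled coordinate hypothesis comes with the natural-log probabilities of its coordinate tokens, and computes the standard perplexity, exp of the negative mean log-probability. The `crop_side` mapping from perplexity to a zoom window, along with its parameters `min_side`, `max_side`, and `ppl_cap`, is purely hypothetical and only illustrates how higher uncertainty could translate into a wider search region.

```python
import math

def token_perplexity(logprobs):
    """Perplexity of a token sequence given natural-log token probabilities."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))

def crop_side(ppl, min_side=256, max_side=1024, ppl_cap=4.0):
    """Hypothetical mapping (an assumption, not from the paper):
    higher perplexity -> wider crop, clamped to [min_side, max_side]."""
    clamped = min(max(ppl, 1.0), ppl_cap)
    t = (clamped - 1.0) / (ppl_cap - 1.0)
    return round(min_side + t * (max_side - min_side))

# Two made-up coordinate hypotheses with per-token log-probabilities.
hypotheses = [
    {"xy": (1200, 640), "logprobs": [-0.05, -0.10, -0.02, -0.08]},  # confident
    {"xy": (1210, 655), "logprobs": [-0.90, -1.40, -0.70, -1.10]},  # uncertain
]

for h in hypotheses:
    h["ppl"] = token_perplexity(h["logprobs"])
    h["crop"] = crop_side(h["ppl"])
    print(h["xy"], f"ppl={h['ppl']:.3f}", f"crop={h['crop']}px")
```

In this sketch, the confident hypothesis yields a perplexity close to 1 and a crop near the minimum size, while the uncertain one is assigned a larger region to explore.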
