AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

Ruilin Yao; Shengwu Xiong; Tianyu Zou; Shili Xiong; Yi Rong

arXiv:2605.02630 · cs.CV · May 5, 2026

TL;DR

AutoFocus introduces an uncertainty-aware active visual search method for GUI grounding that adaptively refines spatial predictions using token-level perplexity without additional training.

Contribution

The paper presents a novel, training-free framework that models spatial uncertainty with token-level perplexity, enabling adaptive zooming and improved GUI grounding performance.

Findings

01. Consistent performance improvements on ScreenSpot-Pro and ScreenSpot-V2 datasets.

02. Effective uncertainty modeling enhances localization accuracy.

03. Outperforms existing zoom-in strategies in GUI grounding tasks.

Abstract

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and…
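To make the perplexity idea concrete, here is a minimal sketch of how token-level perplexity could be computed for several sampled coordinate hypotheses. It uses only the standard definition of perplexity (the exponential of the negative mean log-probability). The function name `token_perplexity`, the example coordinates, and the log-probability values are illustrative assumptions, not taken from the paper. The abstract is truncated before it describes how AutoFocus uses these scores, so the sketch stops at scoring.

```python
import math


def token_perplexity(logprobs):
    """Perplexity of a token sequence from per-token natural-log probabilities.

    Standard definition: exp(-mean(log p)). A value of 1.0 means the model
    was fully certain about every token; larger values mean more uncertainty.
    """
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


# Hypothetical sampled hypotheses: an (x, y) prediction paired with the
# log-probabilities of the tokens that spelled out those coordinates.
# All numbers below are made up for illustration.
hypotheses = [
    ((512, 300), [-0.05, -0.10, -0.02, -0.08]),  # confident sample
    ((530, 310), [-0.90, -1.20, -0.70, -1.10]),  # uncertain sample
]

# Score each hypothesis by its coordinate-token perplexity.
scored = [(xy, token_perplexity(lp)) for xy, lp in hypotheses]

# Lower perplexity is read here as lower spatial uncertainty (an assumption
# of this sketch, following the abstract's stated insight).
most_confident = min(scored, key=lambda item: item[1])
```

In practice the per-token log-probabilities would come from whatever decoding interface the VLM exposes. Higher perplexity on a hypothesis would then signal a region where further refinement may be worthwhile.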
