TL;DR
This paper presents RegionFocus, a visual test-time scaling method that dynamically zooms into relevant webpage regions to improve GUI agent grounding accuracy, achieving state-of-the-art results on multiple benchmarks.
Contribution
It introduces a novel visual test-time scaling approach with an image-as-map mechanism, significantly enhancing GUI agent performance in complex webpage understanding tasks.
Findings
Achieves over 28% performance gain on ScreenSpot-Pro
Improves WebVoyager benchmark accuracy by over 24%
Sets a new state of the art of 61.6% grounding performance on ScreenSpot-Pro
Abstract
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in…
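The zoom step described in the abstract can be illustrated with a small coordinate sketch. This is not the authors' implementation: the abstract does not specify how regions are chosen or sized, so the fixed `scale` ratio, the clamping rule, and the names `Box`, `zoom_box`, and `to_full_image` are all assumptions made for illustration. It shows only the geometry any zoom-based approach needs: pick a crop window around a point of interest, then map a click predicted on the enlarged crop back to full-screenshot coordinates.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Crop window in full-screenshot pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def zoom_box(center_x: int, center_y: int,
             image_w: int, image_h: int,
             scale: float = 0.5) -> Box:
    """Return a window covering `scale` of each image dimension around a point.

    Assumes 0 < scale <= 1. The window is shifted, not shrunk, when the
    point lies near an edge, so it always stays inside the image.
    """
    crop_w = max(1, round(image_w * scale))
    crop_h = max(1, round(image_h * scale))
    left = min(max(center_x - crop_w // 2, 0), image_w - crop_w)
    top = min(max(center_y - crop_h // 2, 0), image_h - crop_h)
    return Box(left, top, left + crop_w, top + crop_h)


def to_full_image(x_in_crop: float, y_in_crop: float, box: Box,
                  zoomed_w: int, zoomed_h: int) -> Tuple[float, float]:
    """Map a point predicted on the resized crop back to the full screenshot.

    `zoomed_w` and `zoomed_h` are the pixel dimensions of the crop after it
    was enlarged for the model (an assumed preprocessing step).
    """
    full_x = box.left + x_in_crop * (box.right - box.left) / zoomed_w
    full_y = box.top + y_in_crop * (box.bottom - box.top) / zoomed_h
    return full_x, full_y


if __name__ == "__main__":
    # Zoom toward the bottom-right of a 1920x1080 screenshot.
    box = zoom_box(1900, 1000, 1920, 1080, scale=0.5)
    # A click at the centre of the enlarged crop, mapped back to the page.
    print(box, to_full_image(960, 540, box, 1920, 1080))
```

In a full agent loop, the crop would be cut from the screenshot and resized with an image library before being passed back to the model; that part is omitted here to keep the sketch dependency-free.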
