Visual Test-time Scaling for GUI Agent Grounding

Tiange Luo; Lajanugen Logeswaran; Justin Johnson; Honglak Lee

arXiv:2505.00684·cs.CV·July 15, 2025

TL;DR

This paper presents RegionFocus, a visual test-time scaling method that dynamically zooms into relevant webpage regions to improve GUI agent grounding accuracy, achieving state-of-the-art results on multiple benchmarks.

Contribution

It introduces a novel visual test-time scaling approach with an image-as-map mechanism, significantly enhancing GUI agent performance in complex webpage understanding tasks.

Findings

01

Achieves a performance gain of over 28% on ScreenSpot-Pro

02

Improves WebVoyager benchmark accuracy by over 24%

03

Sets new state-of-the-art of 61.6% grounding performance on ScreenSpot-Pro

Abstract

We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to choose effectively among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in…
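
The zoom-in step described in the abstract implies some coordinate bookkeeping: a point predicted on a magnified crop has to be mapped back to the full screenshot before an action can be executed. The sketch below illustrates only that bookkeeping and is not taken from the paper or its repository. The function names `region_around` and `to_full_image`, the fixed `frac` crop size, and the assumption that the crop is resized to a known `zoom_w` by `zoom_h` resolution are all hypothetical choices made for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in full-image pixels


def region_around(cx: float, cy: float, width: int, height: int,
                  frac: float = 0.5) -> Box:
    """Hypothetical helper: a box covering `frac` of each image dimension,
    centered on (cx, cy) where possible and clamped to the image bounds."""
    bw, bh = int(width * frac), int(height * frac)
    left = min(max(int(cx - bw / 2), 0), width - bw)
    top = min(max(int(cy - bh / 2), 0), height - bh)
    return (left, top, left + bw, top + bh)


def to_full_image(px: float, py: float, box: Box,
                  zoom_w: int, zoom_h: int) -> Tuple[float, float]:
    """Hypothetical helper: map a point predicted on the resized crop
    (zoom_w x zoom_h pixels) back to full-screenshot coordinates."""
    left, top, right, bottom = box
    return (left + px * (right - left) / zoom_w,
            top + py * (bottom - top) / zoom_h)


# A 1920x1080 screenshot with a focus point near the top-right corner.
box = region_around(1800, 100, 1920, 1080, frac=0.5)
# The crop is assumed to be resized back to 1920x1080 before the model sees it;
# the model then predicts the center of the zoomed view.
point = to_full_image(960, 540, box, 1920, 1080)
print(box)    # (960, 0, 1920, 540)
print(point)  # (1440.0, 270.0)
```

Clamping keeps the crop inside the screenshot, so a focus point near an edge yields a shifted box rather than a smaller one. How RegionFocus actually selects and sizes regions is not specified in the text above.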

Code & Models

Repositories

tiangeluo/regionfocus
Official