GTA1: GUI Test-time Scaling Agent

Yan Yang; Dongxu Li; Yutong Dai; Yuhao Yang; Ziyang Luo; Zirui Zhao; Zhiyuan Hu; Junzhe Huang; Amrita Saha; Zeyuan Chen; Ran Xu; Liyuan Pan; Silvio Savarese; Caiming Xiong; Junnan Li

arXiv:2507.05791 · cs.AI · October 7, 2025

TL;DR

GTA1 introduces a test-time scaling approach with a judge model for improved decision-making and reinforcement learning for better grounding in GUI agents, achieving state-of-the-art results in task execution and element grounding.

Contribution

The paper presents a novel test-time scaling method with a judge model and reinforcement learning for grounding, advancing GUI agent performance in complex environments.

Findings

01 Achieves state-of-the-art performance on GUI grounding benchmarks.

02 Improves task execution accuracy in GUI environments.

03 Demonstrates effective grounding through reinforcement learning.

Abstract

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., choosing the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates these challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, then evaluated by a judge model, which selects one. It trades off…
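The per-step selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `select_action`, `make_toy_planner`, and `toy_judge` are hypothetical, the planner and judge are toy stand-ins for real models, and the assumption that the judge returns the index of its preferred candidate is ours (the page does not say whether the judge outputs an index, a score, or a rewritten action).

```python
from itertools import cycle
from typing import Callable, List, Sequence

# Hypothetical interfaces, assumed for illustration only.
Planner = Callable[[str], str]               # observation -> one action proposal
Judge = Callable[[str, Sequence[str]], int]  # observation, candidates -> chosen index


def select_action(propose: Planner, judge: Judge, observation: str, k: int = 4) -> str:
    """Sample k candidate proposals for one step and return the judge's pick."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Draw k candidates; a real system could issue these calls concurrently.
    candidates: List[str] = [propose(observation) for _ in range(k)]
    choice = judge(observation, candidates)
    if not 0 <= choice < len(candidates):
        raise ValueError("judge returned an index outside the candidate list")
    return candidates[choice]


def make_toy_planner(actions: Sequence[str]) -> Planner:
    """Toy planner that cycles through fixed actions instead of sampling a model."""
    source = cycle(actions)

    def propose(observation: str) -> str:
        return next(source)

    return propose


def toy_judge(observation: str, candidates: Sequence[str]) -> int:
    """Toy judge: prefers the candidate sharing the most words with the observation."""
    words = observation.split()
    scores = [sum(word in candidate for word in words) for candidate in candidates]
    return scores.index(max(scores))  # ties go to the earliest candidate


planner = make_toy_planner(["click File menu", "type hello", "click Save button"])
print(select_action(planner, toy_judge, "Save the document", k=3))  # click Save button
```

In this sketch the planner and judge are passed in as plain callables, so either could be replaced without touching the selection loop.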

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Strong empirical results across both grounding datasets and interactive environments, reaching SOTA on multiple benchmarks.
- The two-stage (planning+grounding) framework makes it straightforward to switch Planner/Judge modules or upgrade the grounding model.
- The light-weight data cleaning pipeline can be integrated with existing data pipelines.

Weaknesses

- The description of test-time scaling is under-specified. For example, when the judge "picks the best candidate", is the output a score over candidates, or a rewritten action? Some results (e.g., Table 1) could be simplified to improve readability by moving some baselines to the appendix.
- Evaluations of test-time scaling (TTS) focus on ablations against the same agent without TTS. There should at least be equal-compute comparisons (e.g., self-consistency / majority voting / alternative sampli

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- strong empirical results on relevant benchmarks
- simple method
- ablations and discussion on the efficacy of the different reward signals (for thinking, for location)

Weaknesses

- the dataset curation step should be part of the "algorithm", otherwise the comparison with other methods is not
- reduced novelty
- no ablation to understand the improvement brought by each of the two components in the agent task execution scenarios

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper tackles two concrete pain points, plan selection and precise grounding, and pairs them with two simple fixes: test-time scaling for planning and a minimal RL recipe for click grounding. The framing is clear and the approach maps tightly to the problems.
2. Test-time scaling actually pays off. Sampling K proposals per step and using a judge consistently boosts success while also cutting wall-clock time via concurrent sampling.
3. Directly rewarding clicks inside the target element keeps
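The click reward mentioned in the last point above, rewarding clicks that land inside the target element, could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `click_reward`, the `(x_min, y_min, x_max, y_max)` box format, the inclusive edges, and the binary 1.0/0.0 values are ours and are not taken from the paper.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def click_reward(click: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = click
    x_min, y_min, x_max, y_max = target
    # Edges are treated as inside the element (an assumption of this sketch).
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


print(click_reward((50.0, 50.0), (0.0, 0.0, 100.0, 100.0)))   # 1.0
print(click_reward((150.0, 50.0), (0.0, 0.0, 100.0, 100.0)))  # 0.0
```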

Weaknesses

1. Test-time scaling samples K proposals each step and uses a judge, which can raise token and inference costs even if concurrent sampling cuts wall-clock time. The paper shows speedups and success gains but does not report detailed cost curves (tokens, latency) across K and horizons.
2. The paper observes little average gain from “thinking” and attributes sample-wise differences to training instability. “Thinking helps only in dynamic UIs” needs stronger controls. The authors can test differe

Code & Models

Repositories

yan98/gta1
pytorch · Official

Models

Datasets

Salesforce/grounding_dataset
dataset · 481 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI