GTA1: GUI Test-time Scaling Agent
Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li

TL;DR
GTA1 introduces a test-time scaling approach with a judge model for improved decision-making and reinforcement learning for better grounding in GUI agents, achieving state-of-the-art results in task execution and element grounding.
Contribution
The paper presents a novel test-time scaling method with a judge model and reinforcement learning for grounding, advancing GUI agent performance in complex environments.
Findings
Achieves state-of-the-art performance on GUI grounding benchmarks.
Improves task execution accuracy in GUI environments.
Demonstrates effective grounding through reinforcement learning.
Abstract
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., choosing the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial because many valid ones may exist; ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates these challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, then evaluated by a judge model, which selects one. It trades off…
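The per-step selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `select_action`, `propose`, and `judge` are hypothetical names, and having the judge return an index into the candidate list is an assumption, since the summary does not specify the judge's output form.

```python
from typing import Callable, List, Sequence


def select_action(
    propose: Callable[[], str],
    judge: Callable[[Sequence[str]], int],
    k: int = 4,
) -> str:
    """Sample k candidate action proposals and return the one the judge picks.

    `propose` stands in for one sampled planner call (hypothetical).
    `judge` is assumed to return the index of its preferred candidate.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates: List[str] = [propose() for _ in range(k)]
    choice = judge(candidates)
    if not 0 <= choice < len(candidates):
        raise ValueError("judge returned an out-of-range index")
    return candidates[choice]


def toy_judge(candidates: Sequence[str]) -> int:
    """Toy stand-in for a judge model: prefer a proposal mentioning 'Save'."""
    for i, candidate in enumerate(candidates):
        if "Save" in candidate:
            return i
    return 0  # fall back to the first candidate


# Toy stand-in for a planner: yields a fixed sequence of proposals.
_proposals = iter(["click 'Cancel'", "click 'Save'", "type 'notes'"])
chosen = select_action(lambda: next(_proposals), toy_judge, k=3)
print(chosen)
```

In a real agent the `k` planner calls could be issued concurrently, and the selected proposal would then be passed to the grounding model for execution.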
Peer Reviews
Decision: ICLR 2026 Poster
- Strong empirical results across both grounding datasets and interactive environments, reaching SOTA on multiple benchmarks.
- The two-stage (planning+grounding) framework makes it straightforward to switch Planner/Judge modules or upgrade the grounding model.
- The light-weight data cleaning pipeline can be integrated with existing data pipelines.

- The description of test-time scaling is under-specified. For example, when the judge "picks the best candidate", is the output a score over candidates, or a rewritten action? Some results (e.g., Table 1) could be simplified to improve readability by moving some baselines to the appendix.
- Evaluations of test-time scaling (TTS) focus on ablations against the same agent without TTS. At least there should be equal-compute comparisons (e.g., self-consistency / majority voting / alternative sampli

- strong empirical results on relevant benchmarks
- simple method
- ablations and discussion on the efficacy of the different reward signals (for thinking, for location)

- the dataset curation step should be part of the "algorithm", otherwise the comparison with other methods is not
- reduced novelty
- no ablation to understand the improvement brought by each of the two components in the agent task execution scenarios
1. The paper tackles two concrete pain points (plan selection and precise grounding) and pairs them with two simple fixes: test-time scaling for planning and a minimal RL recipe for click grounding. The framing is clear and the approach maps tightly to the problems.
2. Test-time scaling actually pays off. Sampling K proposals per step and using a judge consistently boosts success while also cutting wall-clock time via concurrent sampling.
3. Directly rewarding clicks inside the target element keeps

1. Test-time scaling samples K proposals each step and uses a judge, which can raise token and inference costs even if concurrent sampling cuts wall-clock time. The paper shows speedups and success gains but does not report detailed cost curves (tokens, latency) across K and horizons.
2. The paper observes little average gain from “thinking” and attributes sample-wise differences to training instability. “Thinking helps only in dynamic UIs” needs stronger controls. The authors can test differe
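
The grounding reward that the reviews describe as "directly rewarding clicks inside the target element" might look like the sketch below. This is an illustrative assumption rather than the paper's code: the function name `click_reward`, the `(left, top, right, bottom)` box format, and the binary 1.0/0.0 values are all hypothetical.

```python
from typing import Tuple

# Assumed box format: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[float, float, float, float]


def click_reward(x: float, y: float, box: Box) -> float:
    """Return 1.0 if the predicted click (x, y) lands inside the target box, else 0.0."""
    left, top, right, bottom = box
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0


target = (100.0, 40.0, 180.0, 70.0)  # hypothetical button bounds
print(click_reward(120.0, 55.0, target))  # click inside the button
print(click_reward(90.0, 55.0, target))   # click just left of the button
```

A reward of this shape depends only on whether the click hits the element, not on how close a miss was.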
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI
