Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du; Yuchen Yan; Fei Tang; Zhengxi Lu; Chang Zong; Weiming Lu; Shengpei Jiang; Yongliang Shen

arXiv:2508.05615 · cs.CV · November 14, 2025

TL;DR

This paper introduces GUI-RC and GUI-RCPO, novel test-time methods that improve GUI grounding accuracy by leveraging spatial consensus among multiple predictions, reducing reliance on labeled data.

Contribution

The paper presents the first test-time scaling and test-time reinforcement learning techniques for GUI grounding, improving accuracy without additional labeled training data.

Findings

01. GUI-RC improves accuracy by 2-3% without any training.

02. GUI-RCPO achieves 3-6% accuracy gains using unlabeled data.

03. The methods are effective across various architectures.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot…
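The spatial voting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`vote_grid`, `consensus_region`, `region_center`), the box format `(x1, y1, x2, y2)` with exclusive right/bottom edges, and the choice to return the bounding box of all maximally voted pixels are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, x2/y2 exclusive.
Box = Tuple[int, int, int, int]


def vote_grid(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Count, for every pixel, how many sampled boxes cover it."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the screen so out-of-range predictions are harmless.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        for y in range(y1, y2):
            row = grid[y]
            for x in range(x1, x2):
                row[x] += 1
    return grid


def consensus_region(
    boxes: List[Box], width: int, height: int
) -> Optional[Tuple[Box, int]]:
    """Return (bounding box of the most-voted pixels, vote count), or None.

    Assumption: the consensus region is simplified to the bounding box of
    all pixels that reach the maximum vote count.
    """
    grid = vote_grid(boxes, width, height)
    max_votes = max((max(row) for row in grid if row), default=0)
    if max_votes == 0:
        return None
    xs = [x for row in grid for x, v in enumerate(row) if v == max_votes]
    ys = [y for y, row in enumerate(grid) if max_votes in row]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), max_votes


def region_center(region: Box) -> Tuple[float, float]:
    """Center of a region, usable as the final click coordinate."""
    x1, y1, x2, y2 = region
    return ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    # Three roughly agreeing samples plus one outlier (made-up values).
    samples = [(10, 10, 50, 30), (20, 15, 60, 35), (18, 12, 48, 28),
               (200, 200, 220, 220)]
    region, votes = consensus_region(samples, 300, 300)
    print(region, votes, region_center(region))
```

In this toy setup the outlier box receives only its own vote, so it does not affect the selected region; the output is driven by the area where the three agreeing samples overlap. A per-pixel grid is used here for clarity, and a real system would likely use a vectorized or coarser grid.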

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques