Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen

TL;DR
This paper introduces GUI-RC and GUI-RCPO, novel test-time methods that improve GUI grounding accuracy by leveraging spatial consensus among multiple predictions, reducing reliance on labeled data.
Contribution
It presents the first test-time scaling and reinforcement learning techniques for GUI grounding, enhancing accuracy without additional training data.
Findings
GUI-RC improves accuracy by 2-3% without training.
GUI-RCPO achieves 3-6% accuracy gains using unlabeled data.
The methods are effective across various architectures.
Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
