ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo

TL;DR
ReGUIDE introduces a data-efficient framework for GUI element localization in multimodal models, leveraging reasoning, spatial priors, and test-time strategies to outperform existing methods with minimal training data.
Contribution
ReGUIDE presents a novel approach combining reasoning, spatial criticism, and test-time scaling for efficient GUI grounding in multimodal models.
Findings
Significantly outperforms baselines while training on only 0.2% of the data (about 20k samples versus 10M, per the reviews).
Uses self-generated reasoning and spatial priors for improved localization.
Achieves state-of-the-art results across multiple benchmarks.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce…
Peer Reviews
Decision · Submitted to ICLR 2026
- Data Efficiency. This is the paper's primary contribution and it is highly significant. Achieving strong performance while training on only 20k samples versus 10M is a major step forward for the field, making high-performance grounding more accessible.
- Effective Inference-Time Search. The test-time scaling strategy, which uses KDE for an initial vote, crops, and then votes again, is well-motivated and empirically powerful. The paper shows this is particularly effective for high-resolution images
- Two highly relevant works, Aria-UI (for GUI grounding SFT) and GTA-1 (for GUI RL training with GRPO), are neither discussed nor compared in the paper.
- From Tab. 5, the two proposed test-time scaling strategies play an important role in the model's performance on SS and SS-pro. The questions here would be: 1) since the two strategies can easily be applied to existing models like UGround, will those models benefit from them? 2) since without the two strategies, the proposed model generally performs on par with the baseli
**1. Novel Integrated Data-Efficient Framework:** The paper proposes a highly novel framework, *ReGUIDE*, that tackles the GUI grounding problem with exceptional data efficiency. Its originality lies in the synergistic integration of components across both training and inference: reinforcement learning for self-generated language reasoning, a subsequent training stage enforcing spatial consistency under transformations, and a final test-time spatial search with KDE-based aggregation. This compr
**1. Lack of comparison with other RL-based methods:** For instance, *UI-AGILE-7B*, which also employs reinforcement learning with only 9K examples, achieves 48.7% accuracy on ScreenSpot-Pro when initialized from Qwen2.5-VL, which is comparable to or even surpasses ReGUIDE’s results under a smaller data regime. It would be helpful to discuss this comparison to clearly highlight the differences between the two approaches.
**2. Inference latency:** ReGUIDE’s performance gain largely depends o
1. The paper provides an extensive and comprehensive evaluation of ReGUIDE across multiple benchmarks.
2. The writing and organization are clear, coherent, and well-structured.
3. The proposed training and testing time scaling strategies are effective, and the combination of global–local search with voting demonstrates strong performance.
1. The main concern with this work lies in whether the proposed model truly achieves state-of-the-art grounding performance. In fact, several prior studies have already reported superior results. However, the authors did not include these stronger baselines (e.g., [1–3]) in their benchmark comparisons.
2. The ensuing concern is that the related work does not encompass the latest developments.
3. The results presented in Table 7 seem somewhat counterintuitive. First, it is unclear why five data
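The inference-time search that the reviews describe (a KDE vote over several predictions, a crop around the winner, then a second vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Gaussian kernel, the bandwidth value, the crop clamping, and the `sample_points` callable standing in for repeated model queries are all assumptions made for illustration.

```python
import numpy as np


def kde_vote(points, bandwidth=20.0):
    """Return the candidate point with the highest Gaussian kernel density.

    Assumption: a fixed-bandwidth Gaussian kernel, evaluated only at the candidates.
    """
    pts = np.asarray(points, dtype=float)        # shape (n, 2), pixel (x, y)
    diffs = pts[:, None, :] - pts[None, :, :]    # pairwise coordinate differences
    sq_dist = (diffs ** 2).sum(axis=-1)          # squared pairwise distances
    density = np.exp(-sq_dist / (2.0 * bandwidth ** 2)).sum(axis=1)
    return pts[int(np.argmax(density))]


def crop_box(center, crop_size, image_size):
    """Crop of crop_size around center, shifted so it stays inside the image."""
    cx, cy = center
    cw, ch = crop_size
    w, h = image_size
    left = int(min(max(cx - cw / 2, 0), max(w - cw, 0)))
    top = int(min(max(cy - ch / 2, 0), max(h - ch, 0)))
    return left, top, min(left + cw, w), min(top + ch, h)


def two_stage_search(sample_points, image_size, crop_size, bandwidth=20.0):
    """Vote on the full image, crop around the winner, then vote again.

    sample_points(box) is a hypothetical callable: it queries the model several
    times on region box = (left, top, right, bottom) and returns predictions
    as (x, y) pairs in full-image coordinates.
    """
    w, h = image_size
    coarse = kde_vote(sample_points((0, 0, w, h)), bandwidth)
    box = crop_box(coarse, crop_size, image_size)
    return kde_vote(sample_points(box), bandwidth)
```

Because the vote picks the densest candidate, a single stray prediction far from the cluster does not pull the result toward it, which a plain coordinate average would. The second pass runs on a smaller region, which is one plausible reason the reviews report the strategy helping most on high-resolution images.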
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Augmented Reality Applications · Robotics and Automated Systems
