GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

Weitai Kang; Bin Lei; Gaowen Liu; Caiwen Ding; Yan Yan

arXiv:2508.04389·cs.AI·August 7, 2025

3 Reviews

TL;DR

GuirlVG introduces a reinforcement learning approach for GUI visual grounding that outperforms traditional supervised fine-tuning methods with significantly fewer training samples by systematically analyzing and stabilizing RFT components.

Contribution

This paper presents GuirlVG, a novel RL-based GUI visual grounding method with a systematic empirical study and a stabilization technique, reducing data requirements and improving performance.

Findings

01

GuirlVG outperforms SFT trained on over 10M samples using only 5.2K samples.

02

Achieves 7.7% improvement on ScreenSpot.

03

Attains 91.9% accuracy on ScreenSpotV2.

Abstract

Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we…
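As a rough illustration of what "rule-based" RFT can mean for GUI grounding, the sketch below scores sampled click points against a target bounding box and normalizes the rewards within the sampled group, in the style of GRPO. It is a minimal sketch under stated assumptions, not GuirlVG's method: the binary point-in-box rule, the function names, and the example coordinates are all hypothetical, and the paper's actual reward design and stabilization technique are not reproduced here.

```python
from statistics import mean, pstdev


def point_in_box_reward(point, box):
    """Hypothetical rule-based reward: 1.0 if the predicted click point
    falls inside the target box (x1, y1, x2, y2), else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt
    (GRPO-style): subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: one target element and four sampled predictions.
target_box = (100, 200, 180, 240)
sampled_points = [(120, 210), (50, 50), (179, 239), (300, 400)]

rewards = [point_in_box_reward(p, target_box) for p in sampled_points]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples that hit the target receive a positive advantage and misses a negative one, so no learned reward model or dense annotation is needed beyond the bounding box itself.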

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. This paper introduces a novel framework that incorporates Rule-based Reinforcement Fine-Tuning (RFT) into GUI visual grounding for the first time. Its results outperform other Supervised Fine-Tuning (SFT)-based methods, providing a valuable indication for future research directions in visual grounding.
2. The results presented in Section 4 are highly impressive: using only 2K or 5.2K training samples, the framework achieves superior performance compared to previous SFT-based methods that rel

Weaknesses

1. Although the empirical research approach and writing style are acceptable, the theoretical details of the design and calculation processes need more explicit elaboration. For instance, in Section 3.2, when proposing the “Soft Reward Function,” a specific mathematical formulation would be preferable to purely natural language descriptions. This issue persists in other methodology subsections. Otherwise, this presentation reads more like an application report, which weakens the theoretical no

Reviewer 02·Rating 6·Confidence 3

Strengths

1. It is a well-motivated study.
2. The experimental results are sufficient to demonstrate its effectiveness.
3. The methodology is efficient, clear, and easy to follow.

Weaknesses

1. What about the performance at other steps? Are there any indications from those results?
2. There could be more ablations on the hyperparameters of the configuration.

Reviewer 03·Rating 4·Confidence 3

Strengths

- Comprehensive empirical study: The paper systematically dissects RFT components, including reward design, KL penalty, fine-tuning method, and prompt structure, offering rare empirical clarity in a field often driven by ad hoc innovation.
- Novel stabilization mechanism: The Adversarial KL Factor dynamically scales the KL penalty, effectively mitigating reward over-optimization, a notable technical contribution to GRPO-style RL for multimodal models.
- Strong empirical results: GuirlVG achieve

Weaknesses

The main reason I gave a score of 4 is that the scale of the empirical experiments is not sufficient to support a strong conclusion:
- The experiments are only done on Qwen2.5-VL. While I understand the Qwen-VL series is probably the only modern model architecture choice in the field, more experiments are required to find out if the findings in the paper are universal or model-specific. For example, Finding 5 says "LoRA offers comparable performance to full fine-tuning", but is it the case with Q
