TL;DR
GuirlVG is a reinforcement learning method for GUI visual grounding. By systematically analyzing and stabilizing the components of rule-based reinforcement fine-tuning (RFT), it outperforms supervised fine-tuning (SFT) methods while using far fewer training samples.
Contribution
This paper presents GuirlVG, an RL-based GUI visual grounding method built on a systematic empirical study and a novel stabilization technique. The approach reduces data requirements while improving performance.
Findings
Using only 5.2K training samples, GuirlVG outperforms SFT methods trained on over 10M samples.
Achieves 7.7% improvement on ScreenSpot.
Attains 91.9% accuracy on ScreenSpotV2.
Abstract
Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we…
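To make the idea of a rule-based reward concrete, here is a minimal sketch. It is not GuirlVG's reward design, which this summary does not specify. It assumes the common GUI grounding setup in which a prediction is a click point and counts as correct when it falls inside the target element's bounding box, and it pairs that reward with the group-relative normalization used in GRPO-style training. The names `point_in_box_reward` and `group_relative_advantages` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed conventions (not taken from the paper): pixel coordinates,
# boxes given as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def point_in_box_reward(pred: Point, box: Box) -> float:
    """Binary rule-based reward: 1.0 if the click point lies inside the box."""
    x, y = pred
    x_min, y_min, x_max, y_max = box
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    Each reward is centered on the group mean and divided by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled click points for one instruction and one target element.
target: Box = (100.0, 200.0, 180.0, 240.0)
samples: List[Point] = [(120.0, 220.0), (90.0, 220.0), (150.0, 210.0), (300.0, 50.0)]
rewards = [point_in_box_reward(p, target) for p in samples]
advantages = group_relative_advantages(rewards)
```

A binary reward like this gives no signal when every sample in a group hits or every sample misses, since all advantages then collapse to zero. That is one general reason softer reward shapes are explored in this line of work.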
Peer Reviews
Decision·ICLR 2026 Poster
1. This paper introduces a novel framework that incorporates Rule-based Reinforcement Fine-Tuning (RFT) into GUI visual grounding for the first time. Its results outperform other Supervised Fine-Tuning (SFT)-based methods, providing a valuable indication for future research directions in visual grounding.
2. The results presented in Section 4 are highly impressive: using only 2K or 5.2K training samples, the framework achieves superior performance compared to previous SFT-based methods that rel
1. Although the empirical research approach and writing style are acceptable, the theoretical details of the design and calculation processes need more explicit elaboration. For instance, in Section 3.2, when proposing the “Soft Reward Function,” a specific mathematical formulation would be preferable to purely natural language descriptions. This issue persists in other methodology subsections. Otherwise, this presentation reads more like an application report, which weakens the theoretical no
1. It is a well-motivated study.
2. The experimental results are sufficient to demonstrate its effectiveness.
3. The methodology is efficient, clear, and easy to follow.
1. What is the performance at other steps? Are there any indications from those?
2. There could be more ablations on the hyperparameters of the config.
- Comprehensive empirical study: The paper systematically dissects RFT components, including reward design, KL penalty, fine-tuning method, and prompt structure, offering rare empirical clarity in a field often driven by ad hoc innovation.
- Novel stabilization mechanism: The Adversarial KL Factor dynamically scales the KL penalty, effectively mitigating reward over-optimization, a notable technical contribution to GRPO-style RL for multimodal models.
- Strong empirical results: GuirlVG achieve
The main reason I gave a score of 4 is that the scale of the empirical experiments is not sufficient to support a strong conclusion:
- The experiments are only done on Qwen2.5-VL. While I understand the Qwen-VL series is probably the only modern model architecture choice in the field, more experiments are required to find out if the findings in the paper are universal or model-specific. For example, Finding 5 says "LoRA offers comparable performance to full fine-tuning", but is it the case with Q
