Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, Bo Li

TL;DR
This paper presents a reinforcement learning framework that improves GUI agent grounding accuracy in complex, high-resolution environments using only 3k training samples, with a 7B-parameter model outperforming larger models such as UI-TARS-72B.
Contribution
Introduces an RL-based training method with self-evolutionary finetuning for GUI agents that reduces training-data requirements while improving grounding accuracy.
Findings
Achieves 47.3% accuracy on ScreenSpot-Pro with only 3k samples.
Outperforms larger models like UI-TARS-72B by 24.2%.
Demonstrates effectiveness in high-resolution, complex GUI environments.
Abstract
Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves…
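To illustrate what "continuous feedback based on prediction accuracy" can mean for grounding, here is a minimal Python sketch contrasting a sparse hit-or-miss reward with a dense, distance-based one. This is not the paper's reward: the names `Box`, `sparse_reward`, and `dense_reward`, and the choice to normalize distance by the screen diagonal, are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the target UI element, in pixels (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def sparse_reward(px: float, py: float, box: Box) -> float:
    """Binary signal: 1.0 if the predicted click lands inside the box, else 0.0."""
    inside = box.x1 <= px <= box.x2 and box.y1 <= py <= box.y2
    return 1.0 if inside else 0.0


def dense_reward(px: float, py: float, box: Box, width: float, height: float) -> float:
    """Continuous signal (assumed form): decays linearly with the distance from
    the predicted point to the box center, normalized by the screen diagonal."""
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    dist = math.hypot(px - cx, py - cy)
    diag = math.hypot(width, height)
    return max(0.0, 1.0 - dist / diag)


if __name__ == "__main__":
    target = Box(100, 100, 200, 140)
    near_miss = (210, 120)  # just outside the right edge of the box
    far_miss = (900, 900)   # far corner of a 1000x1000 screen
    for name, (x, y) in [("near", near_miss), ("far", far_miss)]:
        print(name, sparse_reward(x, y, target),
              round(dense_reward(x, y, target, 1000, 1000), 3))
```

Under the sparse reward both misses score 0.0, so the policy gets no signal about which prediction was closer. The dense variant ranks the near miss above the far one, which is the property that makes such feedback useful when correct clicks are rare, as on small targets in high-resolution screens.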
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
Methods: Softmax · Attention Is All You Need
