Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

Xinbin Yuan; Jian Zhang; Kaixin Li; Zhuoxuan Cai; Lujian Yao; Jie Chen; Enguang Wang; Qibin Hou; Jinwei Chen; Peng-Tao Jiang; Bo Li

arXiv:2505.12370·cs.AI·May 27, 2025


TL;DR

This paper presents a reinforcement learning framework that improves GUI agent grounding accuracy in complex, high-resolution environments using only 3k training samples; the resulting 7B-parameter model outperforms larger models such as UI-TARS-72B.

Contribution

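Introduces an RL-based training framework for GUI grounding that combines seed data curation, a dense policy gradient, and self-evolutionary reinforcement finetuning guided by attention maps, reducing the data needed compared with supervised finetuning.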

Findings

01. Achieves 47.3% accuracy on ScreenSpot-Pro with only 3k training samples.

02. Outperforms larger models such as UI-TARS-72B by 24.2%.

03. Demonstrates effectiveness in high-resolution, complex GUI environments.

Abstract

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves…
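
To illustrate what "continuous feedback based on prediction accuracy" could mean for grounding, the sketch below contrasts a binary hit-or-miss reward with a dense one. It is not the paper's formulation, which this listing does not give: the function names, the `(x1, y1, x2, y2)` box format, and the linear decay by distance normalized to the image diagonal are all assumptions chosen for illustration.

```python
import math


def sparse_reward(pred, bbox):
    """Binary reward: 1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def dense_reward(pred, bbox, image_size):
    """Continuous reward in [0, 1] that decays with distance from the box center.

    Hypothetical formulation (an assumption, not taken from the paper):
    1 - distance / image_diagonal, clipped at 0.
    """
    x, y = pred
    x1, y1, x2, y2 = bbox
    width, height = image_size
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dist = math.hypot(x - cx, y - cy)
    diag = math.hypot(width, height)
    return max(0.0, 1.0 - dist / diag)


if __name__ == "__main__":
    box = (40, 40, 60, 60)
    size = (100, 100)
    for click in [(50, 50), (35, 50), (0, 0)]:
        print(click, sparse_reward(click, box), round(dense_reward(click, box, size), 3))
```

Under the sparse reward, a near miss and a far miss both score zero, so the policy gets no signal about which one was closer. The dense variant ranks them, which is the property the abstract attributes to its dense policy gradient.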


Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics

Methods: Softmax · Attention Is All You Need