GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

Yuqi Zhou; Sunhao Dai; Shuai Wang; Kaiwen Zhou; Qinglin Jia; Jun Xu

arXiv:2505.15810·cs.CL·May 23, 2025

TL;DR

This paper analyzes the training pipeline of R1-Zero-like GUI agents, identifies key challenges, and proposes targeted solutions that improve grounding accuracy, setting new state-of-the-art results with a 3B parameter model.

Contribution

It introduces three targeted modifications to the training process (template design, reward function, and RL objective) that improve GUI grounding performance.

Findings

01. Achieved 90.3% accuracy on ScreenSpot

02. Surpassed prior models of similar size

03. Outperformed larger models like UI-TARS-7B

Abstract

Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy…
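The output-evaluation point can be made concrete with a small sketch. The two reward functions below are illustrative assumptions (a center-in-box "hit" signal and a plain IoU score), not the paper's exact definitions, and the names `hit_reward`, `iou_reward`, and the example boxes are hypothetical. The sketch only shows that a hit signal on its own does not constrain box size, which is one way a policy could exploit it.

```python
from typing import Tuple

# Boxes are (x1, y1, x2, y2) in pixels; this layout is an assumption for the sketch.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box, clamped at zero for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def hit_reward(pred: Box, gt: Box) -> float:
    """Assumed hit signal: 1.0 if the predicted box center falls inside the target box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    inside = gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]
    return 1.0 if inside else 0.0


def iou_reward(pred: Box, gt: Box) -> float:
    """Assumed overlap signal: intersection over union of the two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = area((ix1, iy1, ix2, iy2))
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical target element and three predictions sharing the same center.
gt: Box = (100.0, 100.0, 200.0, 140.0)
tiny: Box = (149.0, 119.0, 151.0, 121.0)   # 2x2 box at the target center
huge: Box = (0.0, 0.0, 300.0, 240.0)       # box covering far more than the target
tight: Box = gt                            # perfectly localized box

for name, box in [("tiny", tiny), ("huge", huge), ("tight", tight)]:
    print(name, hit_reward(box, gt), round(iou_reward(box, gt), 4))
```

In this sketch all three predictions receive the same hit reward even though only the tight box localizes the element well, while the IoU score separates them. Which direction a trained policy actually drifts under each signal is an empirical question addressed in the paper, not something this toy example establishes.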

Code & Models

Repositories

yuqi-zhou/gui-g1
PyTorch · Official

Models

🤗 Yuqi-Zhou/GUI-G1-3B-v1
model · 10 dl · ♡ 2

Videos

GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents · SlidesLive

Taxonomy

Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications

Methods: ADaptive gradient method with the OPTimal convergence rate