GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu

TL;DR
This paper analyzes the training pipeline of R1-Zero-like GUI agents, identifies key challenges, and proposes targeted solutions that improve grounding accuracy, setting new state-of-the-art results with a 3B-parameter model.
Contribution
The paper introduces three specific modifications to the training process (template design, reward function, and RL objective) that enhance GUI grounding performance.
Findings
Achieved 90.3% accuracy on ScreenSpot
Surpassed prior models of similar size
Outperformed larger models like UI-TARS-7B
Abstract
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments on three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy…
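The output-evaluation point can be made concrete with a small sketch. It is not the paper's reward implementation: the function names, the (x1, y1, x2, y2) box format, the center-in-target definition of a "hit", and the use of IoU as the area-based reward are all assumptions chosen for illustration. It only shows why a pure hit signal cannot tell a tiny box from a screen-sized one, which is the kind of gap a policy could exploit.

```python
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def hit_reward(pred: Box, target: Box) -> float:
    """Assumed hit signal: 1.0 if the predicted box center lies in the target."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    inside = target[0] <= cx <= target[2] and target[1] <= cy <= target[3]
    return 1.0 if inside else 0.0


def iou_reward(pred: Box, target: Box) -> float:
    """Assumed area-based reward: intersection over union of the two boxes."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


# Illustrative boxes (made up): all three share the same center point.
target = (100.0, 100.0, 200.0, 140.0)
tiny = (149.0, 119.0, 151.0, 121.0)   # 2x2 box at the target center
huge = (0.0, 0.0, 300.0, 240.0)       # box covering far more than the target

for name, box in (("tiny", tiny), ("huge", huge)):
    print(name, hit_reward(box, target), round(iou_reward(box, target), 4))
```

Under these assumptions both degenerate boxes receive the full hit reward even though neither localizes the element well, while the area-based score separates them from a perfect match. Each signal on its own therefore leaves box size partly unconstrained, which is consistent with the reward-hacking concern stated in the abstract.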
Taxonomy
Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications
Methods: ADaptive gradient method with the OPTimal convergence rate
