MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

Liujian Tang; Shaokang Dong; Yijia Huang; Minqi Xiang; Hongtao Ruan; Bin Wang; Shuo Li; Zhiheng Xi; Zhihui Cao; Hailiang Pang; Heng Kong; He Yang; Mingxu Chai; Zhilin Gao; Xingyu Liu; Yingnan Fu; Jiaming Liu; Xuanjing Huang; Yu-Gang Jiang; Tao Gui; Qi Zhang; Kang Wang; Yunke Zhang; Yuran Wang

arXiv:2508.03700 · cs.HC · September 12, 2025

TL;DR

MagicGUI is a mobile GUI agent framework that combines a large, diverse dataset with enhanced perception, planning-oriented reasoning, and reinforcement fine-tuning to improve GUI understanding and interaction in real-world mobile environments.

Contribution

The paper introduces MagicGUI, a mobile GUI agent built on a scalable data pipeline, multimodal grounding, planning-oriented reasoning, and reinforcement fine-tuning, advancing the state of the art in GUI perception and interaction.

Findings

01. Achieved superior performance on the Magic-RICH benchmark.

02. Demonstrated robust generalization in real-world scenarios.

03. Outperformed existing methods on multiple public benchmarks.

Abstract

This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to…

Code & Models

Models

Datasets

GUIAgent/Magic-RICH
dataset · 81 dl