MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users
Wenhao Wang, Mengying Yuan, Zijie Yu, Guangyi Liu, Rui Ye, Tian Jin, Siheng Chen, Yanfeng Wang

TL;DR
MobileA3gent introduces a decentralized, privacy-preserving framework for training mobile GUI agents using self-sourced data from diverse users, significantly reducing costs and improving performance.
Contribution
The paper presents a novel collaborative framework combining auto-annotation and federated learning to train mobile GUI agents without human labeling or centralized data collection.
Findings
Achieves superior performance over traditional methods
Reduces data collection costs to 1% of conventional approaches
Effectively handles non-IID data distributions in federated training
Abstract
The advancement of mobile GUI agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale high-quality data, which is prohibitively expensive when relying on human labor. Given the vast population of global mobile phone users, if automated data collection from them becomes feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting user instructions without human intervention and (2) utilizing distributed user data while preserving privacy. To tackle these challenges, we propose MobileA3gent, a collaborative framework that trains mobile GUI Agents using decentralized self-sourced data from diverse users. The framework comprises two components, each targeting a specific challenge: (1) Auto-Annotation, which enables the…
Peer Reviews
Decision: Submitted to ICLR 2026
1. Interesting topic of mobile GUI agents
2. Good comparison with existing baselines
3. Good quality of figures, use of font sizes etc.
4. Appreciate the open sourcing of the code.
1. Overall the paper was not an easy read. While there are no spelling/grammar issues, it took me a while to understand the motivation of this work, how they collect the data, where the data are processed/annotated through the VLM, the type of data that are collected etc. I strongly recommend that the authors revise the manuscript and provide clear examples of data samples and a system design diagram of the system's pipeline. I would also avoid the inline tables and figures (just a friendly suggesti
The work addresses a pressing issue in the GUI agent community — the cost and scalability of data collection for training agents. By enabling distributed, self-sourced data annotation, the paper proposes a paradigm shift toward user-centric, privacy-preserving model training. The Auto-Annotation mechanism is conceptually novel, integrating low-level (step-wise) and high-level (episode-wise) VLM reasoning to generate human-like task instructions. The Adapted Aggregation in FedVLM-A introduces a
Although the system is motivated by decentralized real-user data, all experiments use public datasets, not actual on-device data collection. Thus, claims about privacy, scalability, and real-world feasibility remain unverified empirically. The gap between simulation and deployment is significant. While FedVLM-A preserves privacy by design, the paper provides no formal privacy analysis (e.g., differential privacy bounds or adversarial leakage evaluation). Table 1’s qualitative comparison is insu
* Practical FL innovation: Introduces an adapted aggregation method that jointly weights episode and step counts, effectively addressing two-level data heterogeneity with a simple, interpretable design.
* The system is technically sound and thoroughly validated across four benchmarks with non-IID splits, ablations, and scaling analyses showing consistent gains over FL baselines.
* Well-written and easy to follow, supported by clear figures and organized methodology.
* Achieves human-annotatio
* No on-device evaluation for either annotation or training, leaving open questions on latency, energy, and memory feasibility.
* Annotation models (e.g., Qwen2-VL-7B) exceed mobile capacity, creating a gap between intended on-device use and evaluated setups.
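The reviews above describe the adapted aggregation as jointly weighting episode and step counts when combining client updates. The following is a minimal sketch of that idea, not the paper's implementation: this summary does not give the actual weighting formula, so the convex mix controlled by the hypothetical parameter `alpha`, the `ClientUpdate` container, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClientUpdate:
    """Hypothetical container for one client's locally trained parameters."""
    params: Dict[str, List[float]]  # parameter name -> flattened values
    num_episodes: int               # number of local episodes (trajectories)
    num_steps: int                  # total number of steps across those episodes


def client_weights(updates: List[ClientUpdate], alpha: float = 0.5) -> List[float]:
    """Mix each client's share of episodes and share of steps.

    ASSUMPTION: a convex combination with `alpha`; the paper's exact rule
    is not stated in the source text. The weights sum to 1 by construction.
    """
    total_episodes = sum(u.num_episodes for u in updates)
    total_steps = sum(u.num_steps for u in updates)
    if total_episodes <= 0 or total_steps <= 0:
        raise ValueError("clients must report at least one episode and one step")
    return [
        alpha * u.num_episodes / total_episodes
        + (1.0 - alpha) * u.num_steps / total_steps
        for u in updates
    ]


def aggregate(updates: List[ClientUpdate], alpha: float = 0.5) -> Dict[str, List[float]]:
    """Weighted average of client parameters using the mixed weights."""
    weights = client_weights(updates, alpha)
    aggregated: Dict[str, List[float]] = {}
    for name, reference in updates[0].params.items():
        aggregated[name] = [
            sum(w * u.params[name][i] for w, u in zip(weights, updates))
            for i in range(len(reference))
        ]
    return aggregated


if __name__ == "__main__":
    # Two clients with equal step counts but different episode counts.
    a = ClientUpdate(params={"w": [0.0, 2.0]}, num_episodes=1, num_steps=10)
    b = ClientUpdate(params={"w": [8.0, 2.0]}, num_episodes=3, num_steps=10)
    print(client_weights([a, b]))  # [0.375, 0.625]
    print(aggregate([a, b]))       # {'w': [5.0, 2.0]}
```

With `alpha = 1.0` this sketch reduces to weighting by episode count alone, and with `alpha = 0.0` to weighting by step count alone, which is the sense in which the two levels of data heterogeneity are handled jointly here.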
Taxonomy
Topics: Mobile Agent-Based Network Management · Peer-to-Peer Network Technologies · Context-Aware Activity Recognition Systems
