DigiData: Training and Evaluating General-Purpose Mobile Control Agents
Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D'Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg

TL;DR
DigiData introduces a large, diverse dataset and evaluation benchmark for training and assessing mobile control agents, addressing current limitations in goal complexity and evaluation reliability to advance human-device interaction.
Contribution
The paper presents DigiData, a novel high-quality dataset, and DigiData-Bench, a comprehensive evaluation benchmark for mobile control agents with improved metrics and protocols.
Findings
DigiData offers greater diversity and goal complexity than existing datasets.
Proposed evaluation methods provide more reliable assessment of agent performance.
Dynamic evaluation protocols assess agent performance more reliably than traditional step-accuracy metrics.
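To make the contrast between step accuracy and dynamic evaluation concrete, the following is a minimal sketch, not the paper's implementation. The `Episode` structure, the string encoding of actions, and the `goal_reached` flag are all hypothetical names chosen for illustration. Storing offline predictions and a live-run outcome in one record is also a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    # Hypothetical record; field names are assumptions, not DigiData's schema.
    predicted: List[str]   # agent's action at each step, given the reference history
    reference: List[str]   # demonstrated action at each step
    goal_reached: bool     # outcome of a live run, as decided by a human or LLM judge


def step_accuracy(episodes: List[Episode]) -> float:
    """Fraction of reference steps where the predicted action matches exactly."""
    matches = sum(
        p == r for ep in episodes for p, r in zip(ep.predicted, ep.reference)
    )
    total = sum(len(ep.reference) for ep in episodes)
    return matches / total if total else 0.0


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the goal was actually achieved."""
    if not episodes:
        return 0.0
    return sum(ep.goal_reached for ep in episodes) / len(episodes)


# Two toy episodes: four of five steps match in each, yet neither run reaches the goal.
episodes = [
    Episode(
        predicted=["open_app", "tap_search", "type_query", "tap_wrong", "tap_save"],
        reference=["open_app", "tap_search", "type_query", "tap_result", "tap_save"],
        goal_reached=False,
    ),
    Episode(
        predicted=["open_app", "tap_menu", "tap_settings", "scroll", "tap_wrong"],
        reference=["open_app", "tap_menu", "tap_settings", "scroll", "tap_toggle"],
        goal_reached=False,
    ),
]

print(f"step accuracy: {step_accuracy(episodes):.2f}")
print(f"success rate:  {success_rate(episodes):.2f}")
```

In this toy data a single wrong action per episode leaves step accuracy high while the task success rate is zero, which is one way the two measures can disagree.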
Abstract
AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate…
Peer Reviews
Decision: Submitted to ICLR 2026
The dataset is large-scale and diversified, facilitating robust training and evaluation of mobile agents.
- In the trajectory verification process, which specific LLM is used as the judge? Does this method incur high costs?
- There is a lack of comparison with other state-of-the-art mobile agent methods, such as UI-TARS[1], UI-Genie[2], and the Mobile-Agent series[3].
- DigiData-Bench is proposed to address the inaccuracies of step accuracy metrics in offline evaluation. However, several online dynamic evaluation benchmarks already exist (e.g., AndroidWorld[4], SPA-Bench[5], A3[6]), and methods li
DigiData provides a well-structured and reproducible dataset with higher goal complexity and diversity than prior work. Experimental results are robust and clearly demonstrate performance gains and scaling effects. The dataset and benchmark together contribute to a valuable foundation for research in mobile control.
While DigiData is carefully engineered, it shows limited methodological novelty. Its contributions mainly concern dataset scale and organization rather than new paradigms. The motivation for creating DigiData is not clearly articulated, as existing datasets like AitW and AndroidControl already support complex mobile control research with similar design goals. The introduction of LLM judges is also incremental rather than innovative, and the paper does not situate its evaluation method within the
1. The paper's appendix provides exhaustive details on data construction and in-depth data analysis.
2. By leveraging skilled annotators who systematically explore application functionalities based on a goal generation protocol, the generated goals cover advanced features of the apps, enhancing the dataset's depth and utility.
3. DigiData is currently the second-largest mobile control dataset. It boasts high quality (100% trajectory verification pass rate) and provides rich information, includin
1. The experimental evaluation covers limited datasets and benchmarks, omitting important comparisons such as the GUIOdyssey [1] dataset, which features longer interaction sequences (average 15.3 steps) than the proposed DigiData.
2. The experimental design lacks consistent evaluation conditions and ablation studies, making it unclear whether DigiData independently enhances generalization. Current results merely show good performance on AitW and DigiData-Bench, without achieving SOTA on AndroidC
Taxonomy
Topics: Social Robot Interaction and HRI · Human-Automation Interaction and Safety · Explainable Artificial Intelligence (XAI)
