DigiData: Training and Evaluating General-Purpose Mobile Control Agents

Yuxuan Sun; Manchen Wang; Shengyi Qian; William R. Wong; Eric Gan; Pierluca D'Oro; Alejandro Castillejo Munoz; Sneha Silwal; Pedro Matias; Nitin Kamra; Satwik Kottur; Nick Raines; Xuanyi Zhao; Joy Chen; Joseph Greer; Andrea Madotto; Allen Bolourchi; James Valori; Kevin Carlberg; Karl Ridgeway; Joseph Tighe

arXiv:2511.07413·cs.AI·November 13, 2025

TL;DR

DigiData introduces a large, diverse dataset and evaluation benchmark for training and assessing mobile control agents, addressing current limitations in goal complexity and evaluation reliability to advance human-device interaction.

Contribution

The paper presents DigiData, a novel high-quality dataset, and DigiData-Bench, a comprehensive evaluation benchmark for mobile control agents with improved metrics and protocols.

Findings

01

DigiData offers greater diversity and goal complexity than existing datasets.

02

Proposed evaluation methods provide more reliable assessment of agent performance.

03

Dynamic evaluation protocols outperform traditional step-accuracy metrics.
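
To make the third finding concrete, the sketch below shows one common way a step-accuracy metric can be computed offline, and why it can penalize an agent that reaches the goal by a different route. It is an illustrative assumption, not the paper's implementation: the `Action` encoding, the `step_accuracy` function, and the example trajectories are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical simplified action encoding: (action type, target element).
Action = Tuple[str, str]


def step_accuracy(predicted: List[Action], reference: List[Action]) -> float:
    """Fraction of reference steps whose predicted action matches exactly."""
    if not reference:
        return 0.0
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


# Reference trajectory recorded by an annotator (made-up example).
reference = [("tap", "settings"), ("tap", "display"), ("toggle", "dark_mode")]

# A different route to the same end state, e.g. via in-app search (made-up example).
alternative = [("tap", "search"), ("type", "dark mode"), ("toggle", "dark_mode")]

# Only the final step matches the reference, so the score is low
# even though the alternative route would also complete the goal.
score = step_accuracy(alternative, reference)
```

Under these assumptions the alternative route matches one of three reference steps, whereas a dynamic protocol that judges the final outcome would count the goal as achieved.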

Abstract

AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The dataset is large-scale and diversified, facilitating robust training and evaluation of mobile agents.

Weaknesses

- In the trajectory verification process, which specific LLM is used as the judge? Does this method incur high costs?
- There is a lack of comparison with other state-of-the-art mobile agent methods, such as UI-TARS[1], UI-Genie[2], and the Mobile-Agent series[3].
- DigiData-Bench is proposed to address the inaccuracies of step accuracy metrics in offline evaluation. However, several online dynamic evaluation benchmarks already exist (e.g., AndroidWorld[4], SPA-Bench[5], A3[6]), and methods li

Reviewer 02 · Rating 4 · Confidence 5

Strengths

DigiData provides a well-structured and reproducible dataset with higher goal complexity and diversity than prior work. Experimental results are robust and clearly demonstrate performance gains and scaling effects. The dataset and benchmark together contribute to a valuable foundation for research in mobile control.

Weaknesses

While DigiData is carefully engineered, it shows limited methodological novelty. Its contributions mainly concern dataset scale and organization rather than new paradigms. The motivation for creating DigiData is not clearly articulated, as existing datasets like AitW and AndroidControl already support complex mobile control research with similar design goals. The introduction of LLM judges is also incremental rather than innovative, and the paper does not situate its evaluation method within the

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper's appendix provides exhaustive details on data construction and in-depth data analysis.
2. By leveraging skilled annotators who systematically explore application functionalities based on a goal generation protocol, the generated goals cover advanced features of the apps, enhancing the dataset's depth and utility.
3. DigiData is currently the second-largest mobile control dataset. It boasts high quality (100% trajectory verification pass rate) and provides rich information, includin

Weaknesses

1. The experimental evaluation covers limited datasets and benchmarks, omitting important comparisons such as the GUIOdyssey [1] dataset, which features longer interaction sequences (average 15.3 steps) than the proposed DigiData.
2. The experimental design lacks consistent evaluation conditions and ablation studies, making it unclear whether DigiData independently enhances generalization. Current results merely show good performance on AitW and DigiData-Bench, without achieving SOTA on AndroidC

Code & Models

Datasets

facebook/DigiData
dataset· 326 dl

Taxonomy

Topics: Social Robot Interaction and HRI · Human-Automation Interaction and Safety · Explainable Artificial Intelligence (XAI)