GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Ping Luo

TL;DR
GUIOdyssey introduces a large, detailed dataset for training and evaluating autonomous agents on complex cross-app GUI navigation tasks on mobile devices, addressing limitations of prior single-app datasets.
Contribution
We present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation, and develop OdysseyAgent, a multimodal agent leveraging historical context for improved navigation performance.
Findings
OdysseyAgent outperforms baselines in cross-app navigation tasks.
Historical context significantly improves agent accuracy.
Dataset covers diverse apps and device scenarios.
Abstract
Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents often trained with datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. Each step is enriched with detailed semantic reasoning annotations, which aid the model in building cognitive processes and enhancing its reasoning abilities for complex cross-app tasks. Building on GUIOdyssey, we develop OdysseyAgent, an exploratory multimodal agent for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsContext-Aware Activity Recognition Systems · Mobile and Web Applications · Interactive and Immersive Displays
