E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

Ke Wang; Tianyu Xia; Zhangxuan Gu; Yi Zhao; Shuheng Shen; Changhua Meng; Weiqiang Wang; Ke Xu

arXiv:2406.14250·cs.CV·July 2, 2024

Open Access

TL;DR

This paper introduces E-ANT, a large-scale Chinese GUI navigation dataset with real human traces and high-quality screenshots, aimed at advancing multimodal large language models' ability to perform efficient mobile GUI navigation.

Contribution

The paper presents E-ANT, the first Chinese GUI navigation dataset with real human behavior data, high-quality annotations, and extensive coverage of over 5000 tinyAPPs, facilitating improved model training and evaluation.

Findings

01

MLLMs show promising performance on E-ANT

02

The dataset enables effective evaluation of GUI navigation models

03

Ablation studies highlight key factors influencing model accuracy

Abstract

Online GUI navigation on mobile devices has attracted considerable attention in recent years because it contributes to many real-world applications. With the rapid development of large language models (LLMs), multimodal large language models (MLLMs) have tremendous potential for this task. However, existing MLLMs need high-quality data to improve their ability to make correct navigation decisions according to human user inputs. In this paper, we develop a novel and highly valuable dataset, named E-ANT, as the first Chinese GUI navigation dataset that contains real human behaviour and high-quality screenshots with annotations, comprising nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and report their experimental results with sufficient ablations. We believe that our proposed dataset will be beneficial for both…

Taxonomy

Topics: Human Pose and Action Recognition · Interactive and Immersive Displays · Context-Aware Activity Recognition Systems

Methods: Softmax · Attention Is All You Need