MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang

TL;DR
MobileVLM is a novel vision-language model specifically designed for mobile UI understanding, incorporating specialized pre-training tasks and a large mobile dataset to improve recognition of UI elements and page transitions.
Contribution
The paper introduces MobileVLM with two new pre-training stages and a large Chinese mobile dataset, enhancing intra- and inter-UI understanding beyond general-domain VLMs.
Findings
MobileVLM outperforms existing VLMs on mobile benchmarks.
The model effectively captures fine-grained UI elements and page transition relationships.
Pre-training on Mobile3M improves UI element recognition and transition understanding.
Abstract
Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically use a VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions, and lack inter-UI understanding. To address these issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We…
Taxonomy
Topics: Mobile and Web Applications · Context-Aware Activity Recognition Systems
Methods: Sparse Evolutionary Training
