MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Qinzhuo Wu; Weikai Xu; Wei Liu; Tao Tan; Jianfeng Liu; Ang Li; Jian Luan; Bin Wang; Shuo Shang

arXiv:2409.14818 · cs.CL · October 4, 2024

TL;DR

MobileVLM is a novel vision-language model specifically designed for mobile UI understanding, incorporating specialized pre-training tasks and a large mobile dataset to improve recognition of UI elements and page transitions.

Contribution

The paper introduces MobileVLM with two new pre-training stages and a large Chinese mobile dataset, enhancing intra- and inter-UI understanding beyond general-domain VLMs.

Findings

01 MobileVLM outperforms existing VLMs on mobile benchmarks.

02 The model effectively captures fine-grained UI elements and page transition relationships.

03 Pre-training on Mobile3M improves UI element recognition and transition understanding.

Abstract

Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically use a VLM as a foundation and fine-tune it on instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often leaves them without fundamental capabilities specific to the mobile domain. As a result, they may struggle to recognize specific UI elements and to understand fine-grained intra-UI information. In addition, current fine-tuning tasks focus on interacting with the element most relevant to the given instruction. The fine-tuned VLMs may therefore still ignore the relationships between UI pages, neglect the roles of elements in page transitions, and lack inter-UI understanding. To address these issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We…

Code & Models

Repositories

xiaomi/mobilevlm
PyTorch · Official

Videos

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding · underline

Taxonomy

Topics: Mobile and Web Applications · Context-Aware Activity Recognition Systems

Methods: Sparse Evolutionary Training