MobileFlow: A Multimodal LLM For Mobile GUI Agent
Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu

TL;DR
MobileFlow is a multimodal large language model designed for mobile GUI agents. It supports variable image resolutions and multilingual interfaces, outperforms Qwen-VL-Max and GPT-4v on GUI task execution, and has been deployed in real-world business applications.
Contribution
The paper introduces MobileFlow, a novel 21-billion-parameter multimodal LLM with hybrid visual encoders and innovative training strategies tailored for mobile GUI understanding and interaction, especially in Chinese.
Findings
Outperforms Qwen-VL-Max and GPT-4v in GUI task execution.
Supports variable image resolutions and multilingual GUIs.
Successfully deployed in real-world business applications.
Abstract
Mobile Graphical User Interfaces (GUIs) are now ubiquitous in most people's daily lives. The ongoing evolution of multimodal large-scale models, such as GPT-4v and Qwen-VL-Max, has significantly bolstered GUI comprehension and user action analysis, showcasing the potential of intelligent GUI assistants. However, current GUI agents often need to access page layout information by calling system APIs, which may pose privacy risks. Fixing a GUI (such as a mobile interface) to a certain low resolution might result in the loss of fine-grained image details. At the same time, the multimodal large models currently built for GUI agents have poor understanding and decision-making abilities for Chinese GUIs, making them difficult to apply to a large number of Chinese apps. This paper introduces MobileFlow, a multimodal large language…
Taxonomy
Topics: Mobile Agent-Based Network Management · Multi-Agent Systems and Negotiation · Speech and dialogue systems
