AppVLM: A Lightweight Vision Language Model for Online App Control
Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao

TL;DR
AppVLM is a lightweight vision-language model designed for efficient online app control, achieving high accuracy and success rates comparable to large models but with significantly reduced computational costs.
Contribution
The paper introduces a novel lightweight VLM that is fine-tuned for app control tasks, offering a practical and efficient alternative to large proprietary models.
Findings
Achieves the highest offline action prediction accuracy on the AndroidControl dataset.
Matches GPT-4o in online task success rate in the AndroidWorld environment.
Up to ten times faster than large models during online deployment.
Abstract
The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the…
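The two-stage procedure described in the abstract (offline fine-tuning, then repeated rounds of environment data collection and further training) can be illustrated with a toy sketch. Everything below is hypothetical and is not the authors' implementation: `ToyPolicy` is a frequency table standing in for a VLM, the tasks are single-step, and keeping only successful interactions is an assumption made for illustration, since the abstract does not say how collected data is filtered.

```python
from collections import Counter, defaultdict


class ToyPolicy:
    """Hypothetical stand-in for a VLM policy: picks the most frequent action seen per observation."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fine_tune(self, examples):
        # "Training" here is just counting (observation, action) pairs.
        for observation, action in examples:
            self.counts[observation][action] += 1

    def act(self, observation, default="wait"):
        seen = self.counts.get(observation)
        return seen.most_common(1)[0][0] if seen else default


def collect_online(policy, tasks):
    """Run the policy on toy one-step tasks and keep the successful pairs.

    Filtering by success is an assumption of this sketch.
    """
    kept = []
    for observation, correct_action in tasks:
        action = policy.act(observation)
        if action == correct_action:
            kept.append((observation, action))
    return kept


def train(offline_data, tasks, iterations=2):
    policy = ToyPolicy()
    policy.fine_tune(offline_data)  # stage 1: offline fine-tuning
    for _ in range(iterations):     # stage 2: collect data, then train again
        policy.fine_tune(collect_online(policy, tasks))
    return policy


offline_data = [("home_screen", "open_app")]
tasks = [("home_screen", "open_app"), ("settings", "scroll")]
policy = train(offline_data, tasks)
```

In this toy setup the policy never succeeds on the `settings` task, so no data for it is collected; the sketch only shows the control flow of the loop, not how a real model would improve on unseen tasks.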
Taxonomy
Topics: Web Data Mining and Analysis · Multimedia Communication and Technology · Mobile and Web Applications
