Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, Jiajun Bu

TL;DR
This paper introduces LAMO, a framework that enhances lightweight multimodal large language models for GUI automation, enabling scalable multi-role orchestration and improved task performance on resource-limited devices.
Contribution
The paper presents a novel training framework combining supervised fine-tuning and reinforcement learning to enable lightweight MLLMs to participate effectively in complex GUI workflows.
Findings
LAMO-3B supports multi-role GUI automation and orchestration.
LAMO-3B achieves improved performance in static and online evaluations.
The framework accepts advanced planners as plug-and-play modules, so the agent keeps benefiting as stronger planners become available.
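To make the plug-and-play idea in the findings above concrete, here is a minimal sketch of how a planner role could be swapped behind a fixed interface. It is not the paper's implementation. Every name in it (`Planner`, `RuleBasedPlanner`, `ExternalPlanner`, `executor`, `run_episode`) is a hypothetical stand-in, and the "models" are plain functions rather than MLLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Planner(Protocol):
    """Hypothetical planner interface: turns a task into ordered sub-goals."""

    def plan(self, task: str) -> List[str]: ...


class RuleBasedPlanner:
    """Stand-in for a built-in planner role; splits a task on ' then '."""

    def plan(self, task: str) -> List[str]:
        return [step.strip() for step in task.split(" then ") if step.strip()]


@dataclass
class ExternalPlanner:
    """Stand-in for a stronger external planner behind the same interface."""

    backend: Callable[[str], List[str]]  # assumed callable, e.g. a remote model

    def plan(self, task: str) -> List[str]:
        return self.backend(task)


def executor(subgoal: str) -> str:
    """Hypothetical executor role: maps a sub-goal to a GUI action string."""
    return f"ACTION({subgoal})"


def run_episode(task: str, planner: Planner) -> List[str]:
    """Orchestrate one episode: plan first, then execute each sub-goal in order."""
    return [executor(goal) for goal in planner.plan(task)]
```

Because `run_episode` depends only on the `plan` method, replacing `RuleBasedPlanner()` with an `ExternalPlanner(...)` changes the plan quality without touching the executor, which is the property the plug-and-play claim describes.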
Abstract
Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role…
