Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

Ziwei Wang; Junjie Zheng; Leyang Yang; Sheng Zhou; Xiaoxuan Tang; Zhouhua Fang; Zhiwei Liu; Dajun Chen; Yong Li; Jiajun Bu

arXiv:2604.13488 · cs.AI · April 16, 2026

TL;DR

This paper introduces LAMO, a framework that enhances lightweight multimodal large language models for GUI automation, enabling scalable multi-role orchestration and improved task performance on resource-limited devices.

Contribution

The paper presents a novel training framework combining supervised fine-tuning and reinforcement learning to enable lightweight MLLMs to participate effectively in complex GUI workflows.

Findings

01. LAMO-3B supports multi-role GUI automation and orchestration.

02. LAMO-3B achieves improved performance in both static and online evaluations.

03. Advanced planners can be attached as plug-and-play modules, so the framework continues to benefit as planners improve.
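The plug-and-play planner idea can be illustrated with a small interface sketch. This is not the paper's implementation: the names `Planner`, `LightweightExecutor`, `Action`, and `run_task` are hypothetical, and both planners are toy stand-ins rather than real models. It only shows how an executor can stay fixed while the planner is swapped.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Planner(Protocol):
    """Hypothetical interface: turns a task into ordered sub-goals."""

    def plan(self, task: str) -> List[str]: ...


@dataclass
class Action:
    """Hypothetical low-level GUI action produced by the executor."""
    kind: str
    target: str


class SingleStepPlanner:
    """Baseline stand-in: treats the whole task as one sub-goal."""

    def plan(self, task: str) -> List[str]:
        return [task]


class SplitOnThenPlanner:
    """Stand-in for a stronger planner: splits the task on ' then '."""

    def plan(self, task: str) -> List[str]:
        return [part.strip() for part in task.split(" then ") if part.strip()]


class LightweightExecutor:
    """Placeholder for a lightweight model acting in the executor role."""

    def act(self, subgoal: str) -> Action:
        # A real executor would ground the sub-goal in a screenshot;
        # here we just echo it back as a tap target.
        return Action(kind="tap", target=subgoal)


def run_task(task: str, planner: Planner, executor: LightweightExecutor) -> List[Action]:
    """Run one task: the planner decomposes it, the executor acts on each sub-goal."""
    return [executor.act(subgoal) for subgoal in planner.plan(task)]


if __name__ == "__main__":
    executor = LightweightExecutor()
    task = "open settings then enable wifi"
    # Same executor, two different planners.
    print(run_task(task, SingleStepPlanner(), executor))
    print(run_task(task, SplitOnThenPlanner(), executor))
```

Because `run_task` depends only on the `plan` method, upgrading the planner requires no change to the executor, which is the property the finding describes.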

Abstract

Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role…
