Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie

TL;DR
Mirage-1 introduces a hierarchical skills framework and a skill-augmented search algorithm to enhance multimodal GUI agents, significantly improving their performance in real-world long-horizon tasks across multiple platforms.
Contribution
The paper proposes a hierarchical multimodal skills module and a skill-augmented Monte Carlo Tree Search algorithm, enabling GUI agents to better generalize knowledge and perform long-horizon tasks online.
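As a rough illustration of the three skill levels named in this summary (execution skills, core skills, meta-skills), the sketch below models them as nested Python dataclasses. The containment relationship (meta-skills hold core skills, which hold execution skills), the class and field names, and the example data are all assumptions made for illustration; the paper's actual skill representation is multimodal and is not described in enough detail here to reproduce.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExecutionSkill:
    """Lowest level: a concrete action sequence abstracted from a trajectory."""
    name: str
    actions: list[str]


@dataclass
class CoreSkill:
    """Middle level: groups execution skills (the grouping rule is an assumption)."""
    name: str
    execution_skills: list[ExecutionSkill] = field(default_factory=list)


@dataclass
class MetaSkill:
    """Top level: groups core skills into a reusable high-level plan."""
    name: str
    core_skills: list[CoreSkill] = field(default_factory=list)

    def flatten(self) -> list[str]:
        """Expand the hierarchy back into an ordered list of primitive actions."""
        return [
            action
            for core in self.core_skills
            for execution in core.execution_skills
            for action in execution.actions
        ]


# Hypothetical example data, not taken from the paper.
open_settings = ExecutionSkill("open_settings", ["tap_home", "tap_settings_icon"])
toggle_wifi = ExecutionSkill("toggle_wifi", ["tap_network", "tap_wifi_switch"])
manage_network = CoreSkill("manage_network", [open_settings, toggle_wifi])
configure_device = MetaSkill("configure_device", [manage_network])

print(configure_device.flatten())
```

Keeping the levels as separate types lets a planner reason at the meta-skill level and expand to primitive actions only when it needs to act.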
Findings
Mirage-1 outperforms previous agents by up to 79% on new benchmarks.
Hierarchical skills improve long-horizon task planning.
SA-MCTS reduces action search space effectively.
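To make the search-space claim concrete, the sketch below shows one way skills could narrow the expansion step of a tree search: when a stored skill applies to the current state, only its suggested actions are expanded, otherwise the full action set is used. This is a simplified illustration, not the paper's SA-MCTS; the skill format (a predicate plus suggested actions), the fallback rule, and all action and state names are hypothetical.

```python
from __future__ import annotations

from typing import Callable


def candidate_actions(
    state: str,
    all_actions: list[str],
    skills: list[tuple[Callable[[str], bool], list[str]]],
) -> list[str]:
    """Return skill-suggested actions for this state, or all actions if none apply."""
    suggested: list[str] = []
    for applies, actions in skills:
        if applies(state):
            for action in actions:
                # Keep only valid actions and avoid duplicates.
                if action in all_actions and action not in suggested:
                    suggested.append(action)
    return suggested or list(all_actions)


def tree_leaves(branching: int, depth: int) -> int:
    """Number of leaves in a full search tree with uniform branching."""
    return branching ** depth


# Hypothetical action set and skill, for illustration only.
ALL_ACTIONS = ["open_app", "tap_settings", "tap_wifi", "scroll_down", "type_text", "go_back"]
SKILLS = [(lambda s: s == "home", ["open_app", "tap_settings"])]

guided = candidate_actions("home", ALL_ACTIONS, SKILLS)
unguided = candidate_actions("unknown_screen", ALL_ACTIONS, SKILLS)

print(guided, tree_leaves(len(guided), 4))
print(len(unguided), tree_leaves(len(unguided), 4))
```

In this toy setting, restricting expansion from six actions to two shrinks a depth-4 tree from 1296 leaves to 16, which is the kind of reduction the finding refers to.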
Abstract
Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
