Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Yuquan Xie; Zaijing Li; Rui Shao; Gongwei Chen; Kaiwen Zhou; Yinchuan Li; Dongmei Jiang; Liqiang Nie

arXiv:2506.10387·cs.AI·June 13, 2025

Open Access

TL;DR

Mirage-1 introduces a hierarchical skills framework and a skill-augmented search algorithm to enhance multimodal GUI agents, significantly improving their performance in real-world long-horizon tasks across multiple platforms.

Contribution

The paper proposes a hierarchical multimodal skills module and a skill-augmented Monte Carlo Tree Search algorithm, enabling GUI agents to better generalize knowledge and perform long-horizon tasks online.
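As a rough illustration of what a three-level skill hierarchy could look like as a data structure, here is a minimal sketch. The class names, fields, and action strings are hypothetical and chosen only for illustration; the paper's actual skill representation is multimodal and is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionSkill:
    """Lowest level: a concrete, replayable action sequence (hypothetical schema)."""
    name: str
    actions: list[str]


@dataclass
class CoreSkill:
    """Middle level: a reusable sub-goal composed of execution skills."""
    name: str
    steps: list[ExecutionSkill] = field(default_factory=list)


@dataclass
class MetaSkill:
    """Top level: an abstract plan template composed of core skills."""
    name: str
    stages: list[CoreSkill] = field(default_factory=list)

    def flatten(self) -> list[str]:
        """Expand the plan top-down into primitive actions."""
        return [a for core in self.stages for ex in core.steps for a in ex.actions]


# Illustrative skills only; none of these come from the paper.
open_app = ExecutionSkill("open_app", ["tap(icon)"])
type_query = ExecutionSkill(
    "type_query", ["tap(search_box)", "type(text)", "press(enter)"]
)
search = CoreSkill("search_in_app", [open_app, type_query])
share = CoreSkill(
    "share_result", [ExecutionSkill("share", ["tap(share)", "tap(contact)"])]
)
plan = MetaSkill("find_and_share", [search, share])
print(plan.flatten())
```

The point of the layering in this sketch is that a planner can reason over a handful of meta-skill stages and only expand them into primitive actions when it needs to act.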

Findings

01. Mirage-1 outperforms previous agents by up to 79% on new benchmarks.

02. Hierarchical skills improve long-horizon task planning.

03. SA-MCTS reduces the action search space during online tree search.

Abstract

Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree…
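One way to picture how stored skills can shrink the action search space is to filter the candidate actions that a tree search considers at each expansion step. The sketch below is an assumption-laden toy, not the SA-MCTS algorithm itself: the function `skill_candidates`, the representation of a skill as a plain list of action strings, and the prefix-matching rule are all hypothetical.

```python
def skill_candidates(available, skills, history):
    """Return the actions that continue some stored skill given the history.

    Falls back to every available action when no skill applies, so the
    search is never left without candidates.
    """
    suggested = []
    for seq in skills:
        # Find the longest suffix of the history that matches a prefix of
        # the skill; k == 0 means "start the skill from its first action".
        for k in range(min(len(history), len(seq) - 1), -1, -1):
            if k == 0 or history[-k:] == seq[:k]:
                nxt = seq[k]
                if nxt in available and nxt not in suggested:
                    suggested.append(nxt)
                break
    return suggested or list(available)


# Illustrative data only; none of these actions come from the paper.
available = ["tap(menu)", "tap(search_box)", "type(text)", "scroll", "back"]
skills = [["tap(search_box)", "type(text)", "press(enter)"]]

print(skill_candidates(available, skills, history=[]))
print(skill_candidates(available, skills, history=["tap(search_box)"]))
print(skill_candidates(available, [], history=[]))
```

In this toy setting the branching factor drops from five actions to one whenever a skill applies, and the fallback keeps the full action set available when no skill matches.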

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems