MMSkills: Towards Multimodal Skills for General Visual Agents

Kangning Zhang; Shuai Shao; Qingyao Li; Jianghao Lin; Lingyue Fu; Shijian Wang; Wenxiang Jiao; Yuan Lu; Weiwen Liu; Weinan Zhang; Yong Yu

arXiv:2605.13527 · cs.AI · May 15, 2026

TL;DR

MMSkills introduces a framework for creating and utilizing multimodal procedural knowledge packages that enhance visual decision-making in agents by integrating visual evidence with textual procedures.

Contribution

The paper formalizes multimodal skill packages, develops methods for their extraction from public data, and demonstrates their effectiveness in improving visual agent performance.

Findings

01. MMSkills improves agent performance on GUI and game benchmarks.

02. Multimodal procedural knowledge complements internal model priors.

03. The framework enables reusable, structured multimodal skills for visual agents.

Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing,…
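To make challenge (I) concrete, the sketch below shows one way a skill package could pair a textual procedure with visual evidence of state, progress, and failure. It is an illustration only and not the MMSkills format: the class names (`VisualCue`, `SkillStep`, `MultimodalSkill`), the role labels, the example skill, and the file paths are all hypothetical. The `text_view` method reflects the concern in challenge (III) by rendering the skill without loading any image, so reference screenshots can be consulted only when needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisualCue:
    """Hypothetical pointer to visual evidence, such as a screenshot crop."""
    image_path: str   # where the reference image lives; never loaded here
    description: str  # short text describing what the image shows
    role: str         # assumed labels: "state", "progress", or "failure"


@dataclass
class SkillStep:
    """One step of a textual procedure with optional visual evidence."""
    instruction: str
    cues: List[VisualCue] = field(default_factory=list)


@dataclass
class MultimodalSkill:
    """Hypothetical skill package: a textual procedure plus visual cues."""
    name: str
    steps: List[SkillStep]

    def text_view(self) -> str:
        """Render the skill as text only, so no image enters the context."""
        lines = [f"Skill: {self.name}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"{i}. {step.instruction}")
            for cue in step.cues:
                lines.append(f"   [{cue.role}] {cue.description}")
        return "\n".join(lines)

    def cues_for(self, step_index: int, role: Optional[str] = None) -> List[VisualCue]:
        """Return the cues of one step (0-based), optionally filtered by role."""
        cues = self.steps[step_index].cues
        return [c for c in cues if role is None or c.role == role]


# Made-up example skill; the paths are placeholders and are not opened.
skill = MultimodalSkill(
    name="export_document_as_pdf",
    steps=[
        SkillStep(
            "Open the File menu.",
            [VisualCue("cues/file_menu.png", "File menu is expanded", "state")],
        ),
        SkillStep(
            "Choose Export as PDF.",
            [
                VisualCue("cues/export_dialog.png", "Export dialog is visible", "progress"),
                VisualCue("cues/error_banner.png", "Error banner appears", "failure"),
            ],
        ),
    ],
)

print(skill.text_view())
```

An agent built on such a structure could read `text_view()` by default and call `cues_for(step, role="failure")` only when a step appears to have gone wrong, which keeps image context small. Whether MMSkills organizes its packages this way is not stated in the abstract.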

Code & Models

Repositories

deepexperience/MMSkills
github

Datasets

zhangkangning/mmskills
dataset · 2.6k downloads
