MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, Yadan Luo

TL;DR
MergeVLA is a vision-language-action (VLA) architecture designed so that experts trained on different skills can be merged into a single model, preserving performance and supporting unsupervised task inference at test time in robotic applications.
Contribution
The paper proposes MergeVLA, a new VLA architecture designed for mergeability, featuring task-specific sparse adapters and a task router, allowing multi-skill integration and generalization.
Findings
MergeVLA achieves performance comparable to or better than that of individual experts.
It enables unsupervised task inference at test time.
It demonstrates robustness across multiple robotic benchmarks.
Abstract
Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability: (1) Fine-tuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify. (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and…
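As a toy illustration of point (1) above, the sketch below merges two rank-1 LoRA updates by uniform averaging and measures how far the merged update drifts from each expert. It is not the paper's method or experimental setup: uniform averaging is only an assumed stand-in for "existing merging methods", the 2x2 matrices are made up, and every function name is hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(B, A):
    """Dense weight update implied by a LoRA pair: delta_W = B @ A."""
    return matmul(B, A)

def average_deltas(deltas):
    """Naive merge: element-wise mean of the experts' weight updates."""
    n = len(deltas)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(d[i][j] for d in deltas) / n for j in range(cols)]
            for i in range(rows)]

def frobenius_norm(m):
    return math.sqrt(sum(x * x for row in m for x in row))

def cosine(m1, m2):
    """Cosine similarity between two matrices treated as flat vectors."""
    dot = sum(x * y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))
    denom = frobenius_norm(m1) * frobenius_norm(m2)
    return dot / denom if denom else 0.0

# Two hypothetical rank-1 experts whose updates point in orthogonal directions.
delta_a = lora_delta([[1.0], [0.0]], [[1.0, 0.0]])  # task A update
delta_b = lora_delta([[0.0], [1.0]], [[0.0, 1.0]])  # task B update

merged = average_deltas([delta_a, delta_b])

print("cos(task A, task B):", cosine(delta_a, delta_b))
print("cos(merged, task A):", cosine(merged, delta_a))
print("merged norm / expert norm:",
      frobenius_norm(merged) / frobenius_norm(delta_a))

# Two experts with directly opposing updates cancel out entirely.
delta_c = [[-x for x in row] for row in delta_a]
print("norm of merged opposing experts:",
      frobenius_norm(average_deltas([delta_a, delta_c])))
```

In this toy case the merged update is both shrunk and rotated away from each expert, and directly opposing updates cancel to zero. That is one intuition for why divergent task-specific directions are hard to unify by simple averaging; the paper's own analysis may rest on different measurements.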
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
