MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

Yuxia Fu; Zhizhen Zhang; Yuqi Zhang; Zijian Wang; Zi Huang; Yadan Luo

arXiv:2511.18810·cs.RO·March 12, 2026

TL;DR

MergeVLA introduces a novel architecture for vision-language-action models that enables effective merging of multiple skills into one model, maintaining high performance and enabling unsupervised task inference in robotic applications.

Contribution

The paper proposes MergeVLA, a new VLA architecture designed for mergeability, featuring task-specific sparse adapters and a task router, allowing multi-skill integration and generalization.

Findings

01 MergeVLA achieves performance comparable to or better than that of individual experts.

02 It enables unsupervised task inference at test time.

03 It demonstrates robustness across multiple robotic benchmarks.

Abstract

Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability: (1) Fine-tuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify. (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and…
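To make the first source of non-mergeability concrete, here is a minimal sketch of what "divergent" LoRA updates look like numerically. It is not the paper's method or code: it assumes uniform averaging as a stand-in for direct merging, and the two rank-1 adapters (`b_task1`, `a_task1`, `b_task2`, `a_task2`) are hypothetical toy values chosen so that the task updates are orthogonal.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b_mat, a_mat, scale=1.0):
    """Dense weight update implied by a LoRA pair: delta_W = scale * B @ A."""
    return [[scale * v for v in row] for row in matmul(b_mat, a_mat)]


def flatten(m):
    """Flatten a matrix into a single list of values."""
    return [v for row in m for v in row]


def cosine(u, v):
    """Cosine similarity between two flat vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_deltas(deltas):
    """Uniform element-wise average of several same-shaped update matrices."""
    n = len(deltas)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*deltas)]


# Hypothetical rank-1 adapters for two tasks (toy values, not from the paper).
b_task1, a_task1 = [[1], [0]], [[1, 0]]
b_task2, a_task2 = [[0], [1]], [[0, 1]]

delta1 = lora_delta(b_task1, a_task1)
delta2 = lora_delta(b_task2, a_task2)

# How aligned are the two task updates? Orthogonal updates give 0.
similarity = cosine(flatten(delta1), flatten(delta2))

# Naive merge: average the two dense updates.
merged = average_deltas([delta1, delta2])

# How much of task 1's update direction survives in the merged update?
retained = cosine(flatten(merged), flatten(delta1))

print(similarity, merged, retained)
```

In this toy case the two updates share no direction, so the averaged update keeps each task's change at only half its original magnitude. Whether real VLA adapters diverge this strongly is an empirical question that the paper addresses; the sketch only illustrates the quantity being measured.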

Code & Models

Models

FYX026/MergeVLA-LIBERO (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning