MAny: Merge Anything for Multimodal Continual Instruction Tuning

Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun, Bo Ding, Yong Dou, Huaimin Wang, and Kele Xu

arXiv:2604.14016·cs.LG·April 16, 2026


TL;DR

MAny introduces a training-free framework for multimodal continual instruction tuning that merges task knowledge to prevent forgetting, improving performance across multiple models and benchmarks.

Contribution

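The paper proposes MAny, a merging-based approach that addresses dual forgetting in MLLMs (perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space) without additional training, using Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM).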

Findings

01. MAny achieves up to an 8.57% accuracy improvement on the UCIT benchmark.

02. MAny effectively prevents catastrophic forgetting in MLLMs.

03. MAny relies on CPU-based algebraic operations, eliminating extra training.

Abstract

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature…
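The abstract is cut off before it specifies how CPM weights its merge, so the following is only an illustrative sketch of what "visual-prototype guidance" could look like, not the authors' algorithm. It assumes one stored prototype vector per task, cosine similarity between the incoming visual feature and each prototype, and a softmax (with a hypothetical `temperature`) that turns similarities into merge weights over per-task projector outputs. All function names and the toy data are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores, temperature=0.1):
    """Numerically stable softmax; `temperature` is an assumed knob."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_guided_merge(visual_feature, prototypes, projected_features,
                           temperature=0.1):
    """Blend per-task projected features, weighted by how close the input's
    visual feature is to each task's prototype (assumed weighting rule)."""
    weights = softmax([cosine(visual_feature, p) for p in prototypes],
                      temperature)
    dim = len(projected_features[0])
    merged = [
        sum(w * feat[i] for w, feat in zip(weights, projected_features))
        for i in range(dim)
    ]
    return weights, merged

# Toy data: two tasks, 2-d visual features, 3-d projected features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
projected = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
weights, merged = prototype_guided_merge([0.9, 0.1], prototypes, projected)
```

In this toy run the input lies close to the first prototype, so the merged feature is dominated by the first task's projection. The merge itself is a handful of arithmetic operations with no gradient step, which is consistent with the page's claim that the method needs only CPU-based algebra.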
