Model Composition for Multimodal Large Language Models
Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu

TL;DR
This paper introduces a model composition paradigm for Multimodal Large Language Models (MLLMs): a versatile model is built by combining existing MLLMs rather than through extensive joint training. The paper demonstrates the approach with a new benchmark and improved performance.
Contribution
The paper proposes model composition as a new paradigm for building MLLMs that avoids resource-intensive joint training, introduces techniques that improve merging performance, and provides a benchmark for evaluation.
Findings
Model composition effectively creates versatile MLLMs.
DAMC improves merging performance by addressing parameter interference and mismatch.
The approach yields significant performance gains on multiple multimodal understanding tasks.
Abstract
Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. In this paper, we propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters. Furthermore, we introduce DAMC to address parameter interference and mismatch issues during the merging process, thereby enhancing the model performance. To facilitate research in this area, we propose MCUB, a…
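The composition step described for NaiveMC (reuse each model's modality encoder, merge the LLM parameters) can be illustrated with a minimal sketch. It assumes the merge is an element-wise average of parameters that share a name, which the abstract does not spell out. The names `merge_llm_params` and `compose` and the dictionary layout are hypothetical and are not taken from the authors' code. DAMC is not covered here.

```python
from typing import Any, Dict, List

# Hypothetical layout: a parameter set maps a parameter name to a flat list of floats.
Params = Dict[str, List[float]]


def merge_llm_params(models: List[Params]) -> Params:
    """Average same-named parameters element-wise (assumed merge rule)."""
    if not models:
        raise ValueError("need at least one parameter set")
    names = set(models[0])
    for params in models[1:]:
        if set(params) != names:
            raise ValueError("parameter names differ between models")
    merged: Params = {}
    for name in models[0]:
        tensors = [params[name] for params in models]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip(*tensors) walks the same position in every model's parameter.
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged


def compose(models: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Keep every modality encoder unchanged and merge only the LLM weights."""
    return {
        "encoders": {modality: m["encoder"] for modality, m in models.items()},
        "llm": merge_llm_params([m["llm"] for m in models.values()]),
    }


if __name__ == "__main__":
    vision = {"encoder": "vision_encoder", "llm": {"w": [1.0, 3.0]}}
    audio = {"encoder": "audio_encoder", "llm": {"w": [3.0, 5.0]}}
    print(compose({"vision": vision, "audio": audio}))
```

In this sketch the encoders are carried over untouched, so only the shared LLM weights are affected by the merge. Averaging requires every source model to use the same LLM architecture, which is why mismatched names or shapes raise an error instead of being merged silently.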
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
