Model Composition for Multimodal Large Language Models

Chi Chen; Yiyang Du; Zheng Fang; Ziyue Wang; Fuwen Luo; Peng Li; Ming Yan; Ji Zhang; Fei Huang; Maosong Sun; Yang Liu

arXiv:2402.12750 · cs.CV · July 29, 2024 · 3 cites

TL;DR

This paper introduces a model composition paradigm for Multimodal Large Language Models (MLLMs): existing MLLMs are combined into a single versatile model without resource-intensive joint training. The approach is evaluated on a new benchmark and improves performance on multimodal understanding tasks.

Contribution

The paper proposes building MLLMs by composing existing models, which avoids resource-intensive joint training. It also introduces DAMC, which improves merging performance, and a benchmark for evaluating composed models.

Findings

01. Model composition effectively creates versatile MLLMs.

02. DAMC improves merging performance by addressing parameter interference.

03. Significant performance gains on multiple multimodal understanding tasks.

Abstract

Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. In this paper, we propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters. Furthermore, we introduce DAMC to address parameter interference and mismatch issues during the merging process, thereby enhancing the model performance. To facilitate research in this area, we propose MCUB, a…
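The abstract describes NaiveMC only as reusing modality encoders and merging LLM parameters. The following is a minimal sketch of that idea, assuming the merge is an element-wise average of identically named LLM weights (an assumption made here for illustration, not something the text above states). The names `merge_llm_params` and `compose`, the optional `weights` argument, and the toy list-of-floats parameters are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Optional

# Toy stand-in for a "parameter name -> tensor" mapping (hypothetical).
StateDict = Dict[str, List[float]]


def merge_llm_params(
    models: List[StateDict], weights: Optional[List[float]] = None
) -> StateDict:
    """Weighted element-wise average of same-named parameters.

    Assumption: every model shares one LLM architecture, so parameter
    names and sizes line up. Uniform weights are used by default.
    """
    if not models:
        raise ValueError("need at least one model to merge")
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")

    merged: StateDict = {}
    for name, reference in models[0].items():
        params = []
        for model in models:
            if name not in model or len(model[name]) != len(reference):
                raise ValueError(f"parameter mismatch for {name!r}")
            params.append(model[name])
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(reference))
        ]
    return merged


def compose(encoders: Dict[str, object], llms: Dict[str, StateDict]) -> dict:
    """Keep every modality encoder as-is and merge the LLM weights."""
    return {
        "encoders": dict(encoders),  # reused without modification
        "llm": merge_llm_params(list(llms.values())),
    }


# Two toy "MLLMs" whose LLM backbones share parameter names.
vision_llm = {"layer.weight": [1.0, 2.0]}
audio_llm = {"layer.weight": [3.0, 4.0]}
composed = compose(
    {"vision": "vision-encoder", "audio": "audio-encoder"},
    {"vision": vision_llm, "audio": audio_llm},
)
```

With real checkpoints the same loop would run over tensors rather than Python lists. Non-uniform `weights` are included only to show where per-model merging coefficients could enter; how DAMC actually handles parameter interference is not specified in the text above.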

Code & Models

Repositories

thunlp-mt/modelcompose (PyTorch, official)

Videos

Model Composition for Multimodal Large Language Models · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems