An Empirical Study of Multimodal Model Merging

Yi-Lin Sung; Linjie Li; Kevin Lin; Zhe Gan; Mohit Bansal; Lijuan Wang

arXiv:2304.14933 · cs.CV · October 12, 2023 · 1 citation

TL;DR

This paper investigates merging transformers trained on different modalities, proposing weight-distance metrics and a training recipe that yield a parameter-efficient, modality-agnostic model with competitive performance across tasks.

Contribution

The paper presents a systematic study of multimodal model merging, proposes two weight-distance metrics, and develops a training recipe whose merged model matches or surpasses baseline performance.

Findings

01 The proposed merging recipe significantly outperforms naive merging.

02 The two proposed weight-distance metrics are effective indicators of whether merging will succeed.

03 The training recipe lets the merged model match baseline performance with fewer parameters.

Abstract

Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has been proven successful in previous studies, where the models are trained on similar tasks and with the same initialization. In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors impacting model performance after merging, including initialization, merging mechanisms, and model architectures. We also propose two metrics that assess the distance between weights to be merged…
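
The two merging mechanisms named in the abstract, interpolation and task arithmetic, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy `Weights` type, and the example values are all hypothetical, and plain Python lists stand in for the tensors a real checkpoint would hold. The abstract is cut off before it names the two distance metrics, so the `l2_distance` helper is only a generic stand-in for "a distance between weights to be merged", not necessarily either of the paper's metrics.

```python
import math

# Hypothetical toy "checkpoints": parameter name -> flat list of weights.
# With real models these would be tensors taken from a framework state dict.
Weights = dict[str, list[float]]


def interpolate(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Element-wise linear interpolation: alpha * a + (1 - alpha) * b."""
    return {
        name: [alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def task_arithmetic(init: Weights, finetuned: list[Weights], lam: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors (finetuned - init) back onto init."""
    merged = {}
    for name, base in init.items():
        merged[name] = [
            w0 + lam * sum(model[name][i] - w0 for model in finetuned)
            for i, w0 in enumerate(base)
        ]
    return merged


def l2_distance(a: Weights, b: Weights) -> float:
    """Euclidean distance over all parameters, treated as one long vector.

    Assumption: a generic stand-in, not necessarily one of the paper's metrics.
    """
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


# Toy example: one shared initialization and two models trained from it.
init = {"layer.weight": [0.0, 0.0]}
vision = {"layer.weight": [1.0, 0.0]}
language = {"layer.weight": [0.0, 2.0]}

avg = interpolate(vision, language, alpha=0.5)
summed = task_arithmetic(init, [vision, language], lam=1.0)
dist = l2_distance(vision, language)
```

Both functions assume the models share parameter names and shapes. Task arithmetic additionally needs the common initialization, which is one reason initialization appears among the factors the abstract lists.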

Code & Models

Repositories

ylsung/vl-merging
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning