PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
Zibo Shao, Baochen Xiong, Xiaoshan Yang, Yaguang Song, Qimeng Zhang, Haifeng Chen, Changsheng Xu

TL;DR
PivotMerge is a post-alignment merging framework for multimodal large language models (MLLMs). It integrates heterogeneously pre-trained models by addressing two problems: cross-domain interference and disparities in layer contribution.
Contribution
The paper proposes PivotMerge, a method for merging the cross-modal projectors of MLLMs after multimodal pre-training, so that alignment capabilities learned from heterogeneous pre-training data can be integrated into a single model.
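For orientation, the simplest form of projector merging is a weighted parameter average. The sketch below is not PivotMerge and is not taken from the paper. It only illustrates the kind of naive baseline that merging methods are usually compared against. The function name `average_projectors`, the flat list-of-floats state dict layout, and the parameter names `proj.weight` and `proj.bias` are hypothetical, chosen for illustration.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical layout: parameter name -> flattened list of values.
StateDict = Dict[str, List[float]]


def average_projectors(
    projectors: Sequence[StateDict],
    weights: Optional[Sequence[float]] = None,
) -> StateDict:
    """Weighted average of projector parameters (naive baseline, NOT PivotMerge)."""
    if not projectors:
        raise ValueError("need at least one projector")
    if weights is None:
        weights = [1.0] * len(projectors)
    if len(weights) != len(projectors):
        raise ValueError("one weight per projector is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise so the merged parameters stay on the same scale as the inputs.
    norm = [w / total for w in weights]

    reference = projectors[0]
    for p in projectors:
        if set(p) != set(reference):
            raise ValueError("projectors must share the same parameter names")

    merged: StateDict = {}
    for name, ref_values in reference.items():
        for p in projectors:
            if len(p[name]) != len(ref_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across all projectors.
        merged[name] = [
            sum(w * p[name][i] for w, p in zip(norm, projectors))
            for i in range(len(ref_values))
        ]
    return merged


# Usage: two toy projectors pre-trained on different (hypothetical) data sources.
projector_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
projector_b = {"proj.weight": [3.0, 4.0], "proj.bias": [2.0]}
merged = average_projectors([projector_a, projector_b])
```

Uniform averaging treats every parameter and every source model identically, so it cannot account for the cross-domain interference or the layer contribution disparities that the paper identifies as the obstacles in this setting.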
Findings
PivotMerge outperforms existing baselines on multiple benchmarks.
The framework effectively disentangles shared alignment patterns from domain-specific variations.
It demonstrates strong generalization across different multimodal scenarios.
Abstract
Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter…
