PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

Zibo Shao; Baochen Xiong; Xiaoshan Yang; Yaguang Song; Qimeng Zhang; Haifeng Chen; Changsheng Xu

arXiv:2604.22823 · cs.CV · April 28, 2026

TL;DR

PivotMerge is a post-alignment merging framework for multimodal large language models (MLLMs). It integrates heterogeneously pre-trained models while addressing two obstacles: cross-domain interference and disparities in how much each layer contributes.

Contribution

The paper proposes PivotMerge, a method for merging the cross-modal projectors of MLLMs after multimodal pre-training, with the aim of better integrating heterogeneously pre-trained models.
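For readers unfamiliar with what merging projectors involves, the sketch below shows plain weighted parameter averaging, a common merging baseline. It is not PivotMerge itself, whose procedure is not described on this page. The `Weights` layout, the function name `average_projectors`, and the parameter name `proj.weight` are assumptions made only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical layout: each projector is a mapping from a parameter name
# to a flat list of floats. Real checkpoints would hold tensors instead.
Weights = Dict[str, List[float]]


def average_projectors(
    experts: List[Weights],
    coeffs: Optional[List[float]] = None,
) -> Weights:
    """Merge projectors by a weighted average of matching parameters.

    This is the simple averaging baseline, shown for orientation only;
    it does not implement PivotMerge.
    """
    if not experts:
        raise ValueError("need at least one expert projector")
    if coeffs is None:
        # Default to a uniform average over all experts.
        coeffs = [1.0 / len(experts)] * len(experts)
    if len(coeffs) != len(experts):
        raise ValueError("one coefficient is required per expert")

    merged: Weights = {}
    for name, reference in experts[0].items():
        size = len(reference)
        # Every expert must expose the same parameter with the same size.
        for weights in experts:
            if name not in weights or len(weights[name]) != size:
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(c * w[name][i] for c, w in zip(coeffs, experts))
            for i in range(size)
        ]
    return merged


if __name__ == "__main__":
    expert_a = {"proj.weight": [1.0, 3.0]}
    expert_b = {"proj.weight": [3.0, 5.0]}
    print(average_projectors([expert_a, expert_b]))
```

Uniform averaging treats every parameter and every layer alike, which is the kind of limitation that the cross-domain interference and layer contribution concerns named above are directed at.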

Findings

01. PivotMerge outperforms existing baselines on multiple benchmarks.

02. The framework effectively disentangles shared alignment patterns from domain-specific variations.

03. It demonstrates strong generalization across different multimodal scenarios.

Abstract

Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter…
