SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif

TL;DR
SSAM is a training-free method that merges independently trained multimodal large language models (MLLMs) by aligning their parameter subspaces, so that a single model can handle multiple modalities.
Contribution
The paper presents a novel Singular Subspace Alignment and Merging (SSAM) framework that merges pretrained MLLMs without additional training, addressing modality differences and parameter interference.
Findings
SSAM outperforms prior training-free merging methods.
SSAM surpasses jointly trained multimodal models on four datasets.
The approach is scalable and resource-efficient.
Abstract
Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether they can be merged into a single MLLM that handles multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination…
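The "interference between their parameter spaces" mentioned in the abstract can be made a little more concrete with a toy example. The sketch below is not the SSAM algorithm, which the truncated abstract does not specify. It is a generic illustration that assumes two specialists were fine-tuned from a shared base weight matrix, and all function names (`top_singular_subspace`, `subspace_overlap`, `low_rank`, `naive_merge`) are hypothetical. It uses SVD to find the dominant singular subspace of each weight update, measures how much the two subspaces overlap, and forms a plain low-rank sum as a baseline merge.

```python
import numpy as np


def top_singular_subspace(delta, k):
    """Return the top-k left singular vectors (as columns) of a weight update."""
    u, _s, _vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k]


def subspace_overlap(u_a, u_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Both inputs must have orthonormal columns.
    0.0 means the subspaces are orthogonal, 1.0 means they coincide.
    """
    k = min(u_a.shape[1], u_b.shape[1])
    return float(np.linalg.norm(u_a.T @ u_b, "fro") ** 2 / k)


def low_rank(delta, k):
    """Best rank-k approximation of a weight update (truncated SVD)."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]


def naive_merge(w_base, w_a, w_b, k):
    """Baseline merge (an assumption, not SSAM): add both rank-k updates to the base."""
    return w_base + low_rank(w_a - w_base, k) + low_rank(w_b - w_base, k)
```

A high `subspace_overlap` between the two updates suggests that summing them would let one specialist overwrite directions the other relies on, which is the kind of interference a subspace-alignment step would aim to reduce.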
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
