SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

Md Kaykobad Reza; Ameya Patil; Edward Ayrapetian; M. Salman Asif

arXiv:2603.21584 · cs.LG · March 24, 2026

TL;DR

SSAM introduces a training-free method to merge independently trained multimodal language models by aligning their parameter subspaces, enabling a single model to handle multiple modalities efficiently.

Contribution

The paper presents Singular Subspace Alignment and Merging (SSAM), a framework that merges pretrained MLLMs without additional training while addressing differences in their learned representations and interference between their parameter spaces.

Findings

01. SSAM outperforms prior training-free merging methods.

02. SSAM surpasses jointly trained multimodal models on four datasets.

03. The approach is scalable and resource-efficient.

Abstract

Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether they can be merged into a single MLLM that handles multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination…
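
The abstract is truncated before it describes the SSAM procedure, so the following is not the paper's algorithm. It is only a toy sketch of the general idea of comparing and combining weight updates through their singular subspaces. Everything here is an assumption made for illustration: the function names, the random matrices standing in for a shared base weight and two specialist updates (`delta_vision`, `delta_audio`), and the naive sum used as the merge rule.

```python
import numpy as np

def top_k_subspace(delta, k):
    """Rank-k truncated SVD of a weight update: returns (U_k, S_k, Vt_k)."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def subspace_overlap(U_a, U_b):
    """Mean squared cosine of the principal angles between two column spaces.

    0 means the subspaces are orthogonal (no interference in this toy sense),
    1 means they coincide.
    """
    k = min(U_a.shape[1], U_b.shape[1])
    return float(np.linalg.norm(U_a.T @ U_b, "fro") ** 2 / k)

def merge_low_rank(base, deltas, k):
    """Naive merge (illustrative only): add each update's rank-k approximation to the base."""
    merged = base.copy()
    for delta in deltas:
        U, S, Vt = top_k_subspace(delta, k)
        merged += (U * S) @ Vt  # scale columns of U by singular values, then project
    return merged

# Hypothetical data: one shared base weight and two exactly rank-k specialist updates.
rng = np.random.default_rng(0)
d, k = 64, 4
base = rng.standard_normal((d, d))
delta_vision = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
delta_audio = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))

U_v, _, _ = top_k_subspace(delta_vision, k)
U_a, _, _ = top_k_subspace(delta_audio, k)
overlap = subspace_overlap(U_v, U_a)
merged = merge_low_rank(base, [delta_vision, delta_audio], k)
print(f"subspace overlap between the two updates: {overlap:.3f}")
```

In this toy setting a low overlap suggests the two updates occupy nearly disjoint directions, so summing them disturbs neither much. A high overlap is one simple way to picture the parameter-space interference the abstract refers to.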

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis