Multimodal LLMs under Pairwise Modalities
Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

TL;DR
This paper proposes a framework for training multimodal large language models (MLLMs) using only pairwise modality data, enabling scalable cross-modal learning without jointly curated multi-way datasets.
Contribution
The paper contributes a theoretical analysis of when representations are identifiable from pairwise data alone, and a two-stage learning framework: latent representation alignment followed by cross-modal recomposition.
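To make the first stage concrete, the following is a minimal sketch of aligning latent representations from pairwise batches only. The summary does not state which objective the authors use, so a generic symmetric contrastive (InfoNCE) loss is swapped in here purely for illustration. The function names, the temperature value, and the batch layout are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE loss; za[i] and zb[i] are embeddings of one paired sample.

    Assumption: a generic contrastive objective, not the paper's stated loss.
    """
    n = len(za)
    logits = [[cosine(za[i], zb[j]) / temperature for j in range(n)]
              for i in range(n)]

    def mean_cross_entropy(rows):
        # Cross-entropy with the matching index i as the target of row i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))

def pairwise_alignment_loss(pair_batches):
    """Sum the loss over every available modality pair.

    pair_batches maps a (modality_a, modality_b) key to a tuple of two
    embedding lists. No batch ever contains all modalities jointly.
    """
    return sum(info_nce(za, zb) for za, zb in pair_batches.values())
```

In this sketch each modality pair contributes its own term, so a new modality could be attached through whichever pairing happens to be available, without a dataset that covers every modality at once.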
Findings
Successfully added 3D point clouds and tactile modalities to pre-trained MLLMs.
Achieved strong cross-modal performance through aligned latent representations.
Demonstrated the feasibility of training MLLMs with only pairwise modality data.
Abstract
Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by leveraging only multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent…
