M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture
Hongyang Lei, Xiaolong Cheng, Qi Qin, Dan Wang, Kun Fan, Huazhen Huang, Qingqing Gu, Yetao Wu, Zhonglin Jiang, Yong Chen, Luo Ji

TL;DR
M3-JEPA introduces a multimodal learning framework that uses a joint-embedding predictive architecture with a multi-gate MoE predictor, achieving state-of-the-art results, better generalization, and efficient training and inference across diverse modalities and tasks.
Contribution
The paper proposes M3-JEPA, a novel multimodal learning framework that uses the Joint-Embedding Predictive Architecture (JEPA) with a multi-gate MoE predictor to improve cross-modal alignment and generalization.
Findings
Achieves state-of-the-art performance on multiple multimodal tasks
Generalizes well to unseen datasets and domains
Is computationally efficient in training and inference
Abstract
Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to integrate with a pretrained language model backbone, but it may result in modality collapse. To alleviate this issue, we apply the Joint-Embedding Predictive Architecture (JEPA) to multimodal tasks: a predictor maps the input embedding into the output embedding space, and cross-modal alignment is then conducted in the latent space. We implement this predictor as a Multi-Gate Mixture of Experts (MMoE) and accordingly name the framework M3-JEPA. The gating function disentangles modality-specific and shared information and derives information-theoretic optimality. The framework is trained with both contrastive and regularization losses, and is solved by alternating gradient descent (AGD) between different multimodal tasks. By thoroughly…
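The predictor described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MultiGateMoEPredictor`, the linear experts, the task names, the embedding size, and the temperature are all assumptions chosen for illustration, and the regularization loss and the gradient update are omitted. It shows only the forward pass of a multi-gate MoE (shared experts, one softmax gate per task), an InfoNCE-style contrastive loss in the latent space, and the alternation between task directions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class MultiGateMoEPredictor:
    """Hypothetical sketch: shared linear experts, one softmax gate per task."""

    def __init__(self, dim, n_experts, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Experts are shared by all tasks: shape (n_experts, dim, dim).
        self.experts = rng.normal(scale=dim ** -0.5, size=(n_experts, dim, dim))
        # Each task owns its own gate matrix: shape (dim, n_experts).
        self.gates = {t: rng.normal(scale=dim ** -0.5, size=(dim, n_experts))
                      for t in tasks}

    def gate_weights(self, x, task):
        return softmax(x @ self.gates[task])  # (batch, n_experts)

    def __call__(self, x, task):
        w = self.gate_weights(x, task)                           # (batch, n_experts)
        expert_out = np.einsum("bd,kde->bke", x, self.experts)   # (batch, n_experts, dim)
        return np.einsum("bk,bke->be", w, expert_out)            # (batch, dim)

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(pred, target, temperature=0.07):
    """Contrastive loss: the i-th prediction should match the i-th target."""
    logits = l2_normalize(pred) @ l2_normalize(target).T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Stand-ins for encoder outputs; real embeddings would come from pretrained encoders.
rng = np.random.default_rng(1)
image_emb = rng.normal(size=(8, 16))
text_emb = rng.normal(size=(8, 16))

tasks = ["image_to_text", "text_to_image"]
predictor = MultiGateMoEPredictor(dim=16, n_experts=4, tasks=tasks)
batches = {"image_to_text": (image_emb, text_emb),
           "text_to_image": (text_emb, image_emb)}

# Alternate between the two task directions, one task per step.
losses = {}
for step in range(4):
    task = tasks[step % 2]
    source, target = batches[task]
    losses[task] = info_nce(predictor(source, task), target)
    # A real training loop would backpropagate this loss and update parameters here.
```

Because the experts are shared while each task has its own gate, the gates are the only task-specific parameters in this sketch; that is the property the abstract attributes to the gating function when it speaks of separating modality-specific from shared information.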
Taxonomy
Topics: Speech and dialogue systems
Methods: Mixture of Experts
