M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Hongyang Lei; Xiaolong Cheng; Qi Qin; Dan Wang; Kun Fan; Huazhen Huang; Qingqing Gu; Yetao Wu; Zhonglin Jiang; Yong Chen; Luo Ji

arXiv:2409.05929 · cs.LG · June 19, 2025

TL;DR

M3-JEPA is a multimodal learning framework built on the Joint-Embedding Predictive Architecture (JEPA) with a multi-gate Mixture-of-Experts (MoE) predictor. The authors report state-of-the-art results, better generalization, and improved efficiency across diverse modalities and tasks.

Contribution

The paper proposes M3-JEPA, a multimodal learning framework that applies the Joint-Embedding Predictive Architecture with a multi-gate MoE predictor to improve cross-modal alignment and generalization.

Findings

01. Achieves state-of-the-art performance on multiple multimodal tasks

02. Generalizes well to unseen datasets and domains

03. Is computationally efficient in training and inference

Abstract

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to integrate with a pretrained language model backbone, but might result in modality collapse. To alleviate such issues, we apply the Joint-Embedding Predictive Architecture (JEPA) to multimodal tasks: a predictor maps the input embedding into the output embedding space, and cross-modal alignment is then conducted in the latent space. We implement this predictor with a Multi-Gate Mixture of Experts (MMoE) and name the framework M3-JEPA accordingly. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is trained with both contrastive and regularization losses, and solved by alternating gradient descent (AGD) between different multimodal tasks. By thoroughly…
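To make the pipeline in the abstract concrete, here is a minimal forward-pass sketch in plain Python. It is not the authors' implementation. It assumes linear experts, one softmax gate per task, and an InfoNCE-style contrastive loss. All names (`MultiGateMoEPredictor`, `info_nce`, the task labels) and dimensions are hypothetical. The regularization loss and the gradient updates are omitted.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def normalize(v):
    """Scale v to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class MultiGateMoEPredictor:
    """Shared linear experts with one softmax gate per task.

    Hypothetical layout for illustration; the paper's predictor may differ.
    """

    def __init__(self, dim, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.experts = [rand_matrix(rng, dim, dim) for _ in range(num_experts)]
        self.gates = {t: rand_matrix(rng, num_experts, dim) for t in tasks}

    def gate_weights(self, x, task):
        # Each task owns a gate, so tasks can mix the shared experts differently.
        return softmax(matvec(self.gates[task], x))

    def predict(self, x, task):
        weights = self.gate_weights(x, task)
        outputs = [matvec(W, x) for W in self.experts]
        # Gate-weighted sum of the expert outputs, one coordinate at a time.
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(len(x))]

def info_nce(preds, targets, temperature=0.1):
    """Contrastive loss: the i-th prediction should match the i-th target."""
    preds = [normalize(p) for p in preds]
    targets = [normalize(t) for t in targets]
    loss = 0.0
    for i, p in enumerate(preds):
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in targets]
        loss -= math.log(softmax(logits)[i])
    return loss / len(preds)

# Toy embeddings standing in for the outputs of frozen modality encoders.
rng = random.Random(1)
dim, batch = 4, 3
image_emb = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]
text_emb = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]

tasks = ["image_to_text", "text_to_image"]
predictor = MultiGateMoEPredictor(dim, num_experts=2, tasks=tasks)

losses = {}
for step in range(4):
    task = tasks[step % len(tasks)]  # alternate between the two directions
    if task == "image_to_text":
        src, tgt = image_emb, text_emb
    else:
        src, tgt = text_emb, image_emb
    preds = [predictor.predict(x, task) for x in src]
    losses[task] = info_nce(preds, tgt)
    # A real training loop would take a gradient step on this task's loss here.
```

The loop only alternates which task's loss is evaluated. Reproducing the alternating gradient descent described in the abstract would additionally require an autodiff framework such as PyTorch, which the official repository uses.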

Code & Models

Repositories

HongyangLL/M3-JEPA
PyTorch · Official

Videos

M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture · SlidesLive

Taxonomy

Topics: Speech and dialogue systems

Methods: Mixture of Experts