MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding
Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, Yanming Guo

TL;DR
MoE3D introduces a Mixture of Experts framework for multi-modal 3D understanding, enhancing fusion performance and efficiency through specialized experts, progressive pre-training, and a novel information aggregation module, achieving state-of-the-art results.
Contribution
The paper presents a novel MoE-based transformer architecture with a progressive pre-training strategy for improved multi-modal 3D understanding.
Findings
Achieves a 6.1 mIoU improvement on Multi3DRefer.
Outperforms previous methods across four 3D understanding tasks.
Demonstrates effective modality-specific expert specialization.
Abstract
Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities and leads to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core idea is to deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is proposed to further enhance the fusion performance. Top-1 gating is employed so that each feature is processed by a single expert within the expert groups, ensuring high efficiency. We further propose a…
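The top-1 gating mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `Top1MoE`, the linear gate, the single-matrix experts, and the choice to scale the expert output by its gate probability (as is common in sparse MoE layers) are all assumptions made for illustration. The abstract does not specify how the expert groups are structured, so they are not modeled here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x dim) matrix by a length-dim vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class Top1MoE:
    """Hypothetical top-1 routed layer: one expert is evaluated per token."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Gate: one row of weights per expert (assumed linear gate).
        self.gate = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each expert is a single dim x dim linear map (assumption).
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return the index and gate probability of the best-scoring expert."""
        probs = softmax(matvec(self.gate, token))
        idx = max(range(len(probs)), key=probs.__getitem__)
        return idx, probs[idx]

    def forward(self, token):
        """Run only the selected expert and scale its output by the gate."""
        idx, prob = self.route(token)
        out = matvec(self.experts[idx], token)
        return [prob * y for y in out], idx


moe = Top1MoE(dim=4, num_experts=3)
output, chosen = moe.forward([1.0, 0.5, -0.2, 0.3])
```

Because only the selected expert's weights are touched, the per-token cost stays close to that of a single expert no matter how many experts the layer holds, which is the efficiency argument the abstract makes.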
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
