MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding

Yu Li; Yuenan Hou; Yingmei Wei; Xinge Zhu; Yuexin Ma; Wenqi Shao; Yanming Guo

arXiv:2511.22103·cs.CV·December 1, 2025

Open Access

TL;DR

MoE3D introduces a Mixture of Experts framework for multi-modal 3D understanding, enhancing fusion performance and efficiency through specialized experts, progressive pre-training, and a novel information aggregation module, achieving state-of-the-art results.

Contribution

The paper presents a novel MoE-based transformer architecture with a progressive pre-training strategy for improved multi-modal 3D understanding.

Findings

01

Achieves a 6.1 mIoU improvement on Multi3DRefer.

02

Outperforms previous methods across four 3D understanding tasks.

03

Demonstrates effective modality-specific expert specialization.

Abstract

Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities and leads to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core idea is to deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is put forward to further enhance the fusion performance. Top-1 gating is employed so that only one expert processes each feature within the expert groups, ensuring high efficiency. We further propose a…
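The top-1 gating mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the router weights `gate_weights`, the toy `experts`, and the helper names `top1_route` and `moe_layer` are all hypothetical, and the paper's expert groups are not modeled here. The sketch only shows the general idea that each token is sent to the single highest-scoring expert, so the per-token compute stays at one expert regardless of how many experts exist.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, gate_weights):
    """Return (expert_index, gate_probability) for one token.

    gate_weights holds one weight vector per expert (a hypothetical
    linear router, assumed for illustration only).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(tokens, gate_weights, experts):
    """Run exactly one expert per token and scale its output by the gate probability."""
    outputs = []
    for token in tokens:
        idx, prob = top1_route(token, gate_weights)
        expert_out = experts[idx](token)
        outputs.append([prob * v for v in expert_out])
    return outputs


# Toy experts standing in for modality-specific networks (assumption).
experts = [
    lambda t: [2.0 * v for v in t],   # "expert 0": doubles each feature
    lambda t: [v + 1.0 for v in t],   # "expert 1": shifts each feature by one
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # expert 0 prefers dim 0, expert 1 prefers dim 1
tokens = [[3.0, 0.0], [0.0, 3.0]]

routed = [top1_route(t, gate_weights)[0] for t in tokens]
outputs = moe_layer(tokens, gate_weights, experts)
```

In this toy setup the first token is routed to expert 0 and the second to expert 1. A dense fusion network would instead run every token through the same full set of parameters.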

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning