Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts

Yue Zhang; Yingzhao Jian; Hehe Fan; Yi Yang; Roger Zimmermann

arXiv:2505.21079 · cs.CV · May 28, 2025

Open Access

TL;DR

Uni3D-MoE introduces a scalable, multimodal 3D scene understanding framework using a sparse Mixture-of-Experts model that adaptively fuses diverse 3D modalities for improved interpretive accuracy.

Contribution

The paper presents a sparse MoE-based 3D multimodal large language model (MLLM) that adaptively processes multiple 3D modalities at the token level to improve 3D scene understanding.

Findings

01. Outperforms existing methods on standard 3D scene benchmarks.

02. Effectively integrates multiple 3D modalities including RGB, depth, BEV, point clouds, and voxels.

03. Demonstrates flexible, task-specific modality processing through learnable routing.

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a…
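The abstract is cut off before it describes the routing mechanism, so the following is only a generic sketch of token-level top-k routing in a sparse Mixture-of-Experts layer, not the paper's actual design. Everything in it is an assumption made for illustration: the names `route_token`, `router_W`, and `experts`, the choice of `top_k=2`, the use of plain linear maps as experts, and the random toy features standing in for one token per modality.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def route_token(token, router_W, experts, top_k=2):
    """Send one token to its top_k experts and mix their outputs.

    Returns (output vector, indices of the chosen experts).
    Hypothetical helper: a generic sparse-MoE step, not the paper's layer.
    """
    logits = matvec(router_W, token)   # one routing logit per expert
    probs = softmax(logits)
    # Keep only the top_k most probable experts (the "sparse" part).
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm         # renormalised gate weight
        y = matvec(experts[i], token)  # expert i applied to the token
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


# Toy setup: all sizes and weights below are made up for the example.
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 4
router_W = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]

# One stand-in token per modality named in the abstract (random features).
modalities = ["rgb", "depth", "bev", "point_cloud", "voxel"]
tokens = {name: [rng.gauss(0, 1) for _ in range(DIM)] for name in modalities}

for name, tok in tokens.items():
    _, chosen = route_token(tok, router_W, experts)
    print(f"{name:12s} -> experts {chosen}")
```

Because the router scores each token independently, tokens from different modalities can end up at different experts, which is the general behaviour the abstract motivates when it notes that different queries depend on different modalities. In a trained model the router weights would be learned jointly with the experts rather than drawn at random.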

Taxonomy

Topics: Human Pose and Action Recognition · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques