Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts
Yue Zhang, Yingzhao Jian, Hehe Fan, Yi Yang, Roger Zimmermann

TL;DR
Uni3D-MoE introduces a scalable, multimodal 3D scene understanding framework using a sparse Mixture-of-Experts model that adaptively fuses diverse 3D modalities for improved interpretive accuracy.
Contribution
It presents a novel sparse MoE-based large language model that adaptively processes multiple 3D modalities at the token level, enhancing 3D scene understanding.
Findings
Outperforms existing methods on standard 3D scene benchmarks.
Effectively integrates multiple 3D modalities including RGB, depth, BEV, point clouds, and voxels.
Demonstrates flexible, task-specific modality processing through learnable routing.
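As a rough illustration of the token-level learnable routing mentioned above, the following is a minimal sketch of generic top-k sparse MoE routing in NumPy. It is not the authors' implementation: the summary does not specify the router, so the names `top_k_route`, `w_gate`, `experts`, and `k`, the linear gate, and the softmax over the selected experts are all assumptions chosen for illustration.

```python
import numpy as np


def top_k_route(tokens, w_gate, experts, k=2):
    """Send each token to its k highest-scoring experts and mix their outputs.

    tokens:  (n_tokens, d) array of token embeddings (any modality).
    w_gate:  (d, n_experts) gate weights (hypothetical linear router).
    experts: list of callables, each mapping (m, d) -> (m, d).
    """
    logits = tokens @ w_gate                          # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=1)[:, :k]      # indices of the k best experts
    top_logits = np.take_along_axis(logits, top_idx, axis=1)

    # Softmax over the selected experts only, so each token's gates sum to 1.
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros_like(tokens, dtype=float)
    for e, expert in enumerate(experts):
        # Tokens that picked expert e, and the top-k slot in which they picked it.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert received no tokens in this batch
        weights = gates[token_ids, slot][:, None]
        out[token_ids] += weights * expert(tokens[token_ids])
    return out, top_idx, gates


# Toy usage: 6 tokens of width 8, 4 experts that are random linear maps.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
tokens = rng.normal(size=(6, d))
w_gate = rng.normal(size=(d, n_experts))
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, m=m: x @ m for m in mats]

mixed, chosen, gates = top_k_route(tokens, w_gate, experts, k=2)
```

Because only `k` experts run per token, compute grows with `k` rather than with the total number of experts, which is the usual motivation for sparse routing. In a trained model the gate weights would be learned jointly with the experts, typically with a load-balancing term; that part is omitted here.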
Abstract
Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a…
Taxonomy
Topics: Human Pose and Action Recognition · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
