Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation
Xuweiyi Chen, Wentao Zhou, Aruni RoyChowdhury, Zezhou Cheng

TL;DR
This paper introduces Point-MoE, a scalable Mixture-of-Experts model for 3D semantic segmentation trained on multiple datasets without dataset labels, improving performance and generalization across diverse 3D point cloud data.
Contribution
The paper proposes Point-MoE, a novel sparse Mixture-of-Experts architecture that enables large-scale multi-dataset training for 3D segmentation without dataset supervision.
Findings
Outperforms prior methods on seen datasets
Effective in zero-shot generalization to new datasets
Demonstrates scalable training on heterogeneous 3D data
Abstract
While massively scaling both data and models has become central in NLP and 2D vision, the benefits for 3D point cloud understanding remain limited. We study the initial step of scaling 3D point cloud understanding under a realistic regime: large-scale multi-dataset joint training for 3D semantic segmentation, with no dataset labels available at training or inference time. Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard models. Therefore, we introduce Point-MoE, a Mixture-of-Experts design that expands model capacity through sparsely activated expert MLPs and a lightweight top-k router, allowing tokens to select specialized experts without requiring dataset supervision. Trained jointly on a…
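The routing idea in the abstract (a lightweight top-k router that sends each token to a few sparsely activated expert MLPs) can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the two-layer ReLU experts, the softmax renormalization over the selected experts, the sizes `DIM`, `HIDDEN`, `NUM_EXPERTS`, and the helper names `make_expert` and `moe_forward` are all assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_expert(dim, hidden, rng):
    """Hypothetical expert: a two-layer MLP with ReLU (an assumption)."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)

    return expert


def moe_forward(token, router_W, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    logits = matvec(router_W, token)  # one router logit per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumed gating: softmax restricted to the k selected experts.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, gates


rng = random.Random(0)
DIM, HIDDEN, NUM_EXPERTS = 8, 16, 4  # illustrative sizes only
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(NUM_EXPERTS)]
router_W = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]
out, chosen, gates = moe_forward(token, router_W, experts, k=2)
```

Note that the router sees only the token features, so no dataset label enters the computation; this mirrors the setting described in the abstract, where expert selection happens without dataset supervision.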
Peer Reviews
Decision: ICLR 2026 Poster
The authors present evidence that mixed-data training enhances performance on each constituent dataset. The MoE framework appears effective at isolating domain-specific patterns while suppressing cross-dataset interference, thereby amplifying the positive effects of data scaling. This insight is valuable.
1. The shift from dense to sparse Transformers is a significant architectural choice; however, the paper would benefit from a more thorough analysis of its practical implications on point cloud data — particularly regarding training/inference latency overhead and training stability. These aspects are critical for assessing whether the reported accuracy gains come with hidden trade-offs (e.g., convergence difficulty, increased wall-clock time, or sensitivity to hyperparameters). 2. The experiment
1. Overall, the paper is well written, which makes it easy to understand how the method works and what the significance of the findings is. 2. The idea of applying an MoE block in 3D segmentation models to facilitate multi-dataset training is original and well-motivated. By allowing tokens to be dynamically routed to expert MLPs, the model can learn to adaptively handle different types of data, which is required for multi-dataset 3D segmentation due to the heterogeneity of 3D point cloud dataset
1. While Point-MoE obtains impressive results in Tab. 1, the performance of its baselines (PTv3 and PPT) is lower than reported in the original papers. For PTv3 single-dataset training, this paper reports scores of 75.0 mIoU on ScanNet val and 67.6 mIoU on S3DIS val (should this be Area5 instead of val?), while the original paper [a] reports scores of 77.5 mIoU on ScanNet val and 73.4 mIoU on S3DIS Area5. Additionally, for PPT-S multi-dataset training, this paper reports 74.7 mIoU on ScanNet and
This paper clearly defines a practical setting—multi-dataset 3D semantic segmentation without dataset labels at inference—and proposes Point-MoE, which integrates sparse Mixture-of-Experts into Point Transformer V3. Each attention output projection is replaced with lightweight expert routing (top-k), enabling automatic expert specialization across heterogeneous 3D domains. The method achieves consistent gains over PTv3 and PPT on both indoor and mixed indoor–outdoor benchmarks, while maintaining
The innovation lies mainly in integrating MoE routing layers into Point Transformer V3 for multi-dataset 3D training. The interaction with CLIP-based language supervision (Section 3.1, Language-guided classification) is somewhat unclear to me, and the MoE router is under-explored; for example, it is ambiguous how semantic alignment influences expert selection. Performance improvements, while steady (~2.5–3.6 mIoU in the large joint setup), are moderate relative to the engineering compl
Taxonomy
Topics: Medical Image Segmentation Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
