Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation
Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie

TL;DR
This paper introduces a multimodal approach to few-shot 3D point cloud segmentation, leveraging textual and 2D image data to improve generalization and performance on standard datasets.
Contribution
The paper proposes MM-FSS, a multimodal FS-PCS framework with modules for correlation fusion (MCF) and semantic fusion (MSF), along with a test-time calibration technique (TACC), in contrast to prior methods that rely on point cloud input alone.
Findings
Significant performance improvements on S3DIS and ScanNet datasets.
Effective utilization of textual and 2D image modalities for 3D segmentation.
Demonstrated benefits of multimodal information in few-shot learning scenarios.
Abstract
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module…
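The data flow described in the abstract (a shared backbone, two heads, and text embeddings for the class names) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the layer sizes, the random untrained weights, the helper names (`linear_relu`, `class_prototypes`, `cosine_correlation`), the random vectors standing in for pretrained text-encoder outputs, and the premise that intermodal features share an embedding space with the text embeddings. The final averaging step is only a placeholder and is not the paper's MCF module, whose details are not given in the abstract above.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def linear_relu(x, w):
    """Toy stand-in for a learned network layer (hypothetical, untrained)."""
    return np.maximum(x @ w, 0.0)

def class_prototypes(feats, fg_mask):
    """Mean feature of background (row 0) and foreground (row 1) support points."""
    bg = feats[~fg_mask].mean(axis=0)
    fg = feats[fg_mask].mean(axis=0)
    return np.stack([bg, fg])

def cosine_correlation(query_feats, refs):
    """Cosine similarity of every query point to every reference vector."""
    return l2_normalize(query_feats) @ l2_normalize(refs).T

# Hypothetical sizes: 6 input channels (xyz + rgb), 32-d backbone, 16-d heads.
w_backbone = rng.normal(size=(6, 32))
w_unimodal = rng.normal(size=(32, 16))    # head for unimodal visual features
w_intermodal = rng.normal(size=(32, 16))  # head for intermodal features

support = rng.normal(size=(200, 6))       # one annotated support point cloud
support_fg = np.arange(200) < 60          # its binary mask for the novel class
query = rng.normal(size=(150, 6))         # unlabeled query point cloud

# Random stand-in for text-encoder outputs for [background, novel class].
text_emb = rng.normal(size=(2, 16))

def extract(points):
    """Shared backbone followed by the two heads."""
    shared = linear_relu(points, w_backbone)
    return linear_relu(shared, w_unimodal), linear_relu(shared, w_intermodal)

s_uni, s_inter = extract(support)
q_uni, q_inter = extract(query)

# Query-to-support correlations from each head, plus query-to-text correlations.
corr_uni = cosine_correlation(q_uni, class_prototypes(s_uni, support_fg))
corr_inter = cosine_correlation(q_inter, class_prototypes(s_inter, support_fg))
corr_text = cosine_correlation(q_inter, text_emb)

# Placeholder fusion (plain average); NOT the paper's MCF module.
fused = (corr_uni + corr_inter + corr_text) / 3.0
pred = fused.argmax(axis=1)               # 0 = background, 1 = novel class
print(pred.shape, fused.shape)
```

With untrained weights the predictions are meaningless; the sketch only shows which features are compared with which references in a one-shot, one-class episode.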
Peer Reviews
Decision: ICLR 2025 Spotlight
+ The paper is well-written and easy to follow, with clear motivations for each proposed design.
+ It introduces the first multimodal few-shot 3D segmentation setting, which is cost-free and doesn’t increase labeling effort, consistent with standard few-shot tasks. The idea of leveraging available free modalities to help few-shot learning could provide valuable insights for the field.
+ The proposed model is novel and well-justified. The MCF and MSF modules use different modalities by exploiting
Certain paragraphs could be more concise. For example, the paragraph (lines 235-240) explains the two training steps. Then the paragraph (lines 259-266) writes about these two steps again. These two parts could be merged to improve conciseness.
- The writing is good, making this paper easy to follow.
- The performance is good.
- The experiments are adequate.
- The introduction of the multimodality information is useful.
- How does the design of the MSF module improve the correlations? Please explain further.
1. It makes sense to use complementary information from multimodal inputs for point cloud segmentation.
2. The ablation studies on ScanNet prove the effectiveness of the proposed models, including MCF, MSF, and TACC.
3. The features from 2D images are learned during the pre-training step. In this way, the proposed method does not rely on 2D inputs during inference.
4. The proposed MM-FSS achieves promising performance on two benchmark datasets compared to the state-of-the-art methods.
1. The idea of using multimodal information for 3D point cloud segmentation is not new. As noticed by the authors, such an idea has been explored for fully supervised or more challenging tasks like open-vocabulary segmentation. It is unclear why the prior few-shot methods only use unimodal information. It would be better to provide more discussions on the difficulties or challenges when using multimodal features for few-shot segmentation.
2. The claim about “cost-free” is ambiguous as “cost” ma
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications · 3D Shape Modeling and Analysis
Methods: Batch Normalization · Max Pooling · Convolution · Kaiming Initialization · Focus · Softmax · SegNet
