Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding
Toshihiko Nishimura, Hirofumi Abe, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida

TL;DR
This paper introduces a label-free 3D segmentation technique for large outdoor scenes that leverages 2D foundation models and multi-view aggregation, enabling open-vocabulary recognition without 3D annotations.
Contribution
It proposes a novel approach that uses 2D foundation models guided by natural language to perform 3D segmentation without any annotated 3D data.
Findings
Outperforms existing training-free methods in 3D segmentation accuracy.
Achieves segmentation accuracy comparable to supervised methods.
Supports open-vocabulary object detection in large-scale scenes.
Abstract
This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a 2D foundation model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
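The two geometric steps named in the abstract, projecting points through a virtual camera and fusing per-view predictions by weighted voting, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the pinhole camera model, the function names `project_points` and `weighted_vote`, the `-1` convention for points not visible in a view, and the per-point weights are all assumptions made here. The abstract does not say how the weights are computed or how occlusion is handled.

```python
import numpy as np


def project_points(points, K, R, t):
    """Project (N, 3) world points into a virtual pinhole camera (assumed model).

    K is the 3x3 intrinsic matrix, R and t map world to camera coordinates.
    Returns pixel coordinates (N, 2) and a mask of points in front of the camera.
    """
    cam = points @ R.T + t                      # world -> camera frame
    in_front = cam[:, 2] > 1e-6                 # keep points with positive depth
    z = np.where(in_front, cam[:, 2], 1.0)      # avoid dividing by zero depth
    uv = (cam @ K.T)[:, :2] / z[:, None]        # perspective division
    return uv, in_front


def weighted_vote(view_labels, view_weights, num_classes):
    """Fuse per-view 2D predictions into one label per 3D point.

    view_labels:  (V, N) int array, class id per point per view, -1 if the
                  point is not visible in that view (hypothetical convention).
    view_weights: (V, N) float array, vote weight per point per view
                  (hypothetical; e.g. a confidence score).
    Returns (N,) labels, with -1 for points that received no votes.
    """
    num_views, num_points = view_labels.shape
    scores = np.zeros((num_points, num_classes))
    for v in range(num_views):
        idx = np.nonzero(view_labels[v] >= 0)[0]
        # Accumulate this view's weighted votes into the per-class scores.
        np.add.at(scores, (idx, view_labels[v, idx]), view_weights[v, idx])
    labels = scores.argmax(axis=1)
    labels[scores.sum(axis=1) == 0] = -1        # never observed in any view
    return labels
```

In a full pipeline, the pixel coordinates from `project_points` would be used to render an image for the 2D model and then to look up each point's predicted class in that view. Choosing which points are actually visible (for example with a depth buffer) is left out of this sketch.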
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Remote Sensing and LiDAR Applications
