Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant
Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi

TL;DR
This paper introduces a vocabulary-free 3D instance segmentation method that uses a vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories without any predefined vocabulary. It is reported to outperform existing methods on ScanNet200 and Replica.
Contribution
It presents the first approach to 3D instance segmentation without any vocabulary prior, using spectral clustering for superpoint merging based on semantic and mask coherence.
Findings
Outperforms existing methods on ScanNet200 and Replica datasets.
Effective in both vocabulary-free and open-vocabulary settings.
Utilizes vision-language models for semantic grounding in 3D segmentation.
Abstract
Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for…
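The superpoint merging step described in the abstract can be illustrated with a small sketch. The abstract is cut off before it lists the terms the merging accounts for, so everything specific here is an assumption made for illustration rather than the authors' formulation: the affinity is a weighted sum of a cosine similarity between per-superpoint semantic embeddings and a Jaccard overlap of the 2D masks each superpoint falls into, and the names `merge_superpoints`, `sem_feats`, `mask_votes`, and `alpha` are hypothetical. The number of instances is also passed in by hand, which the actual method may not require.

```python
import numpy as np


def _kmeans(x, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [x[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def merge_superpoints(sem_feats, mask_votes, n_instances, alpha=0.5):
    """Group superpoints by spectral clustering on a combined affinity.

    sem_feats:  (S, D) array, one semantic embedding per superpoint (assumed input).
    mask_votes: (S, M) binary array, 1 where a superpoint falls inside 2D mask m.
    alpha:      assumed weight between the semantic and the mask term.
    """
    sem_feats = np.asarray(sem_feats, dtype=float)
    mask_votes = np.asarray(mask_votes, dtype=float)

    # Semantic coherence: cosine similarity, clipped to [0, 1].
    f = sem_feats / (np.linalg.norm(sem_feats, axis=1, keepdims=True) + 1e-8)
    sem = np.clip(f @ f.T, 0.0, 1.0)

    # Mask coherence: Jaccard overlap of the sets of masks covering each superpoint.
    inter = mask_votes @ mask_votes.T
    counts = mask_votes.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    mask = inter / np.maximum(union, 1e-8)

    affinity = alpha * sem + (1.0 - alpha) * mask
    np.fill_diagonal(affinity, 0.0)

    # Symmetric normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-8))
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; keep the smallest n_instances.
    _, vecs = np.linalg.eigh(lap)
    emb = vecs[:, :n_instances]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    return _kmeans(emb, n_instances)
```

Called as `merge_superpoints(sem_feats, mask_votes, n_instances=2)`, the function returns one integer label per superpoint; superpoints that share a label would be merged into the same 3D instance mask. The weighted sum is the simplest way to combine two cues and is used here only to keep the sketch short.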
Taxonomy
Topics: Advanced Neural Network Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
