Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Guofeng Mei; Luigi Riz; Yiming Wang; Fabio Poiesi

arXiv:2408.10652·cs.CV·March 31, 2025

Open Access

TL;DR

This paper introduces a novel vocabulary-free 3D instance segmentation method that leverages vision-language models and 2D segmenters to discover and ground semantic categories without prior vocabulary, outperforming existing methods.

Contribution

It presents the first approach to 3D instance segmentation without any vocabulary prior, using spectral clustering for superpoint merging based on semantic and mask coherence.
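
As a rough illustration of the merging idea, the sketch below groups superpoints by normalized spectral clustering on a pairwise affinity matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `spectral_merge`, the equal-weight average of a semantic similarity matrix and a mask coherence matrix, the toy values, and the fixed number of groups are all hypothetical choices made for the example.

```python
import numpy as np


def spectral_merge(affinity, n_groups, n_iter=20):
    """Assign each superpoint a group label via normalized spectral clustering.

    `affinity` is a symmetric (N, N) matrix of pairwise superpoint affinities.
    How the paper builds this matrix is not specified here; this is an assumption.
    """
    W = np.array(affinity, dtype=float)
    W = 0.5 * (W + W.T)          # enforce symmetry
    np.fill_diagonal(W, 0.0)     # ignore self-affinity

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Embed each superpoint using the eigenvectors of the smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_groups]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Deterministic farthest-point initialisation, then plain k-means.
    centers = [U[0]]
    for _ in range(1, n_groups):
        d = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.stack(centers)

    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(n_groups):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(axis=0)
    return labels


# Toy example: five superpoints, the first three belong to one object.
groups = np.array([0, 0, 0, 1, 1])
same = groups[:, None] == groups[None, :]
semantic_sim = np.where(same, 1.0, 0.1)    # hypothetical semantic similarity
mask_coherence = np.where(same, 0.8, 0.0)  # hypothetical 2D mask agreement
affinity = 0.5 * (semantic_sim + mask_coherence)  # assumed equal weighting

labels = spectral_merge(affinity, n_groups=2)
print(labels)
```

Superpoints that share a label would be merged into one 3D instance mask. In practice the number of groups is not known in advance, so a real pipeline needs some rule for choosing it; the fixed `n_groups` here is only for illustration.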

Findings

01. Outperforms existing methods on ScanNet200 and Replica datasets.

02. Effective in both vocabulary-free and open-vocabulary settings.

03. Utilizes vision-language models for semantic grounding in 3D segmentation.

Abstract

Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for…

Taxonomy

Topics: Advanced Neural Network Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training