All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer

TL;DR
This paper introduces VDG-Uni3DSeg, a novel 3D point cloud segmentation framework that leverages pre-trained vision-language and large language models to incorporate multimodal cues, significantly improving fine-grained semantic and instance segmentation.
Contribution
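It proposes VDG-Uni3DSeg, a framework that combines pre-trained vision-language models (e.g., CLIP) and LLMs with new loss functions, including a Semantic-Visual Contrastive Loss, to address the limited supervision and lack of diverse multimodal cues that hamper existing 3D point cloud segmentation methods.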
Findings
Achieves state-of-the-art results in semantic, instance, and panoptic segmentation.
Effectively incorporates multimodal knowledge from internet data.
Demonstrates scalable and practical 3D understanding solutions.
Abstract
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point…
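The abstract is cut off before it defines the Semantic-Visual Contrastive Loss, so the paper's exact formulation is not available here. As a rough illustration of the general idea of aligning point features with class-level multimodal embeddings, the sketch below implements a generic InfoNCE-style loss. It is not the authors' loss. The function names, the temperature value, and the assumption that each class is represented by a single embedding vector (for example, an average of CLIP text and image embeddings) are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def contrastive_alignment_loss(point_feats, labels, class_embeds, temperature=0.07):
    """Generic InfoNCE-style loss (an assumption, not the paper's definition).

    point_feats  : list of per-point feature vectors
    labels       : list of integer class ids, one per point
    class_embeds : list of per-class embedding vectors (hypothetically built
                   from LLM descriptions and reference images)
    """
    if not point_feats:
        raise ValueError("point_feats must not be empty")

    class_unit = [l2_normalize(c) for c in class_embeds]
    total = 0.0
    for feat, label in zip(point_feats, labels):
        f = l2_normalize(feat)
        # Cosine similarity to every class embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(f, c)) / temperature for c in class_unit]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy: pull the point toward its own class, away from others.
        total += log_z - logits[label]
    return total / len(point_feats)
```

With this form, points whose features lie close to their own class embedding incur a smaller loss than points that lie closer to another class, which is the behavior a contrastive alignment objective is meant to encourage.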
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
