All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han; Mohamed El Amine Boudjoghra; Jiahua Dong; Jinhong Wang; Rao Muhammad Anwer

arXiv:2507.05211 · cs.CV · July 28, 2025

TL;DR

This paper introduces VDG-Uni3DSeg, a novel 3D point cloud segmentation framework that leverages pre-trained vision-language and large language models to incorporate multimodal cues, significantly improving fine-grained semantic and instance segmentation.

Contribution

It proposes a new framework integrating multimodal models and novel loss functions for enhanced 3D point cloud segmentation, addressing limitations of existing methods.

Findings

1. Achieves state-of-the-art results in semantic, instance, and panoptic segmentation.
2. Effectively incorporates multimodal knowledge from internet data.
3. Demonstrates scalable and practical 3D understanding solutions.

Abstract

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by the sparse structure of the data, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point…
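
The abstract is cut off before it defines the Semantic-Visual Contrastive Loss, so the sketch below is not the paper's formulation. It only illustrates the general idea of aligning per-point features with class-level embeddings (such as those a text or image encoder might produce) using a generic InfoNCE-style loss. The function names, the temperature value of 0.07, and the random toy data are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def point_to_class_contrastive_loss(point_feats, class_embeds, labels, temperature=0.07):
    """Generic InfoNCE-style loss (hypothetical helper, not the paper's loss).

    point_feats:  (N, D) per-point features from some 3D backbone.
    class_embeds: (C, D) one embedding per class, e.g. from class descriptions
                  or reference images (assumed to share the same dimension D).
    labels:       (N,) integer class index for each point.
    """
    p = l2_normalize(point_feats)
    c = l2_normalize(class_embeds)
    logits = p @ c.T / temperature                  # (N, C) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy: pull each point toward its own class embedding,
    # push it away from the embeddings of all other classes.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Toy data: 4 classes, 16-dimensional embeddings, 6 points.
rng = np.random.default_rng(0)
class_embeds = rng.normal(size=(4, 16))
labels = np.array([0, 1, 2, 3, 0, 1])

# Points that sit close to their own class embedding.
aligned = class_embeds[labels] + 0.01 * rng.normal(size=(6, 16))
# Points that sit on the embedding of the wrong class.
misaligned = class_embeds[(labels + 1) % 4]

loss_aligned = point_to_class_contrastive_loss(aligned, class_embeds, labels)
loss_misaligned = point_to_class_contrastive_loss(misaligned, class_embeds, labels)
```

In this toy setup the loss for `aligned` comes out lower than for `misaligned`, which is the behavior a contrastive alignment objective is meant to reward. A real pipeline would replace the random arrays with backbone features and encoder embeddings.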

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications