UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision

Yuru Wang, Pei Liu, Songtao Wang, Zehan Zhang, Xinyan Lu, Changwei Cai, Hao Li, Fu Liu, Peng Jia, and Xianpeng Lang

arXiv:2412.18131 · cs.CV · September 18, 2025

TL;DR

UniPLV is a unified multimodal framework that uses images, text, and point clouds to improve open-world 3D scene understanding without extensive manual annotations, with reported improvements of 15.6% and 14.8% in semantic segmentation tasks.

Contribution

The paper proposes UniPLV, a novel framework that unifies point clouds, images, and text for efficient open-world 3D scene understanding, eliminating the need for point cloud-text pair construction.

Findings

01

Achieves 15.6% and 14.8% improvements in semantic segmentation tasks.

02

Effectively aligns multimodal data through innovative distillation and matching modules.

03

Surpasses state-of-the-art methods in open-world 3D scene understanding.

Abstract

Open-world 3D scene understanding is a critical challenge that involves recognizing and distinguishing diverse objects and categories from 3D data, such as point clouds, without relying on manual annotations. Traditional methods struggle with this open-world task, especially due to the limitations of constructing extensive point cloud-text pairs and handling multimodal data effectively. In response to these challenges, we present UniPLV, a robust framework that unifies point clouds, images, and text within a single learning paradigm for comprehensive 3D scene understanding. UniPLV leverages images as a bridge to co-embed 3D points with pre-aligned images and text in a shared feature space, eliminating the need for labor-intensive point cloud-text pair crafting. Our framework achieves precise multimodal alignment through two innovative strategies: (i) Logit and feature distillation…
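The idea of co-embedding 3D points with pre-aligned image and text features can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes per-point features already paired with per-pixel image features, uses a cosine-distance loss as a stand-in for feature distillation, and classifies points by similarity to text embeddings. The function names, array shapes, and random data are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def feature_distillation_loss(point_feats, image_feats):
    """Mean (1 - cosine similarity) over paired point/pixel features.

    Assumption: row i of both arrays describes the same 3D point and
    the pixel it projects to. This is a generic distillation loss,
    not necessarily the one used in the paper.
    """
    p = l2_normalize(point_feats)
    q = l2_normalize(image_feats)
    return float(np.mean(1.0 - np.sum(p * q, axis=-1)))

def classify_points(point_feats, text_embeds):
    """Assign each point the category with the most similar text embedding."""
    logits = l2_normalize(point_feats) @ l2_normalize(text_embeds).T
    return logits.argmax(axis=-1)

# Toy data: 3 hypothetical categories in a 16-dimensional shared space.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 16))
labels = np.array([0, 2, 1, 2])

# Image features are modelled as slightly noisy copies of the text
# embeddings, standing in for a pre-aligned image-text encoder.
image_feats = text_embeds[labels] + 0.01 * rng.normal(size=(4, 16))

# A perfectly distilled point encoder would reproduce the image features.
point_feats = image_feats.copy()

loss = feature_distillation_loss(point_feats, image_feats)
pred = classify_points(point_feats, text_embeds)
```

Because the point features are trained to match image features that already live in the same space as the text embeddings, category names can be matched to points directly, and no point cloud-text pairs are needed.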

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate