PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum

Shiqi Zhang; Sha Zhang; Jiajun Deng; Yedong Shen; Mingxiao MA; Yanyong Zhang

arXiv:2506.23607·cs.CV·July 1, 2025

TL;DR

PGOV3D is a two-stage curriculum learning framework that uses multi-view images, large language models, and auxiliary modules to improve open-vocabulary 3D semantic segmentation, with competitive results on ScanNet, ScanNet200, and S3DIS.

Contribution

The paper proposes a novel Partial-to-Global curriculum with a two-stage training strategy and multi-modal supervision for open-vocabulary 3D segmentation.

Findings

01 Effective partial-scene pre-training with dense semantic labels.

02 Improved segmentation accuracy on ScanNet, ScanNet200, and S3DIS.

03 Robust enforcement of cross-view feature consistency.

Abstract

Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary…
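
The abstract says the partial point clouds come from multi-view RGB-D inputs via pixel-wise depth projection. The sketch below illustrates that general operation for a single view under a standard pinhole camera model. It is not the authors' code: the function name `backproject_depth` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions made for illustration, and the toy depth map is invented.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to camera-frame 3D points (pinhole model).

    Hypothetical helper for illustration; parameter names are assumptions.
    Returns the valid points and a flat boolean mask over the H*W pixels.
    """
    h, w = depth.shape
    # Pixel grids: u indexes columns, v indexes rows, both of shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Pixels with zero depth carry no measurement and are dropped.
    valid = z.reshape(-1) > 0
    return points[valid], valid

# Toy 2x2 depth map with one missing measurement (depth 0).
depth = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
points, valid = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Because every valid pixel yields exactly one point, the same `valid` mask can be used to carry per-pixel attributes (for example, 2D semantic labels or features) onto the partial point cloud, which is one way a partial scene can end up with dense per-point semantics.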

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications