PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum
Shiqi Zhang, Sha Zhang, Jiajun Deng, Yedong Shen, Mingxiao Ma, Yanyong Zhang

TL;DR
PGOV3D introduces a two-stage curriculum learning framework that leverages multi-view images, large language models, and auxiliary modules to improve open-vocabulary 3D semantic segmentation, achieving competitive results on standard benchmarks.
Contribution
The paper proposes a novel Partial-to-Global curriculum with a two-stage training strategy and multi-modal supervision for open-vocabulary 3D segmentation.
Findings
Effective partial scene pre-training with dense semantic labels.
Improved segmentation accuracy on ScanNet, ScanNet200, and S3DIS.
Robust cross-view feature consistency enforcement.
Abstract
Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary…
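The pixel-wise depth projection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera model with intrinsics fx, fy, cx, cy and a 4x4 camera-to-world pose; the function names `backproject_depth` and `to_world` are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift each pixel with valid depth to a 3D point in the camera frame.

    Assumes a pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):       # v: pixel row index
        for u, z in enumerate(row):       # u: pixel column index
            if z <= 0:                    # skip pixels with missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def to_world(
    points: Sequence[Point],
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Apply a 4x4 row-major camera-to-world transform to each point."""
    out: List[Point] = []
    for x, y, z in points:
        # Only the first three rows are needed for the transformed x, y, z.
        wx, wy, wz = (
            r[0] * x + r[1] * y + r[2] * z + r[3] for r in cam_to_world[:3]
        )
        out.append((wx, wy, wz))
    return out
```

Because every valid pixel yields exactly one point, per-pixel 2D labels or features can be carried over to the resulting partial point cloud by index, which is one plausible reading of why such partial scenes offer dense semantic information.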
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
