Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training
Runnan Chen, Xinge Zhu, Nenglun Chen, Dawei Wang, Wei Li, Yuexin Ma, Ruigang Yang, Tongliang Liu, Wenping Wang

TL;DR
Model2Scene introduces a novel pre-training paradigm for 3D scene understanding that leverages CAD models and language, addressing domain gaps with data augmentation and a new feature regularization, enabling effective zero-shot and label-efficient tasks.
Contribution
It proposes a contrastive learning framework using CAD models and language for 3D scene representation, with a novel Deep Convex-hull Regularization to reduce domain gaps.
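The summary above does not say how Deep Convex-hull Regularization is computed. As a rough illustration of the general idea of constraining a feature to a convex hull, the sketch below rewrites a feature as a softmax-weighted combination of a few anchor vectors. The anchors, the dot-product similarity, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_convex_hull(feature, anchors):
    """Re-express `feature` as a convex combination of `anchors`.

    Hypothetical helper: the dot-product similarity and the softmax
    weighting are illustrative choices, not taken from the paper.
    """
    # Similarity between the feature and every anchor.
    scores = [sum(f * a for f, a in zip(feature, anchor)) for anchor in anchors]
    # Softmax weights are >= 0 and sum to 1, so the weighted sum of
    # anchors is guaranteed to lie inside their convex hull.
    weights = softmax(scores)
    dim = len(feature)
    projected = [
        sum(w * anchor[d] for w, anchor in zip(weights, anchors))
        for d in range(dim)
    ]
    return projected, weights


# Three toy anchors forming a triangle in 2D (made-up values).
anchors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
projected, weights = project_to_convex_hull([2.0, -1.0], anchors)
```

Because every projected feature lands inside the same hull regardless of whether it came from a CAD model or a real scan, features from both domains share one bounded space, which is the intuition behind using such a projection to narrow a domain gap.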
Findings
Achieves 46.08% mAP on ScanNet for label-free object detection.
Enables zero-shot 3D semantic segmentation.
Improves label-efficient 3D scene perception.
Abstract
Current successful methods for 3D scene perception rely on large-scale annotated point clouds, which are tedious and expensive to acquire. In this paper, we propose Model2Scene, a novel paradigm that learns label-free 3D scene representations from Computer-Aided Design (CAD) models and language. The main challenges are the domain gaps between CAD models and objects in real scenes, namely model-to-scene (from a single model to a full scene) and synthetic-to-real (from a synthetic model to a real scene's object). To handle these challenges, Model2Scene first simulates a crowded scene by mixing data-augmented CAD models. Next, we propose a novel feature regularization operation, termed Deep Convex-hull Regularization (DCR), to project point features into a unified convex hull space, reducing the domain gap. Finally, we impose a contrastive loss on the language embeddings and the point…
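The first step in the abstract, simulating a crowded scene by mixing data-augmented CAD models, might look roughly like the sketch below. The specific augmentations (random rotation about the vertical axis, uniform scaling), the random floor placement, and all names and parameter values are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random


def augment(points, rng):
    """Randomly rotate a model about the z-axis and scale it uniformly.

    The rotation and the scale range are illustrative assumptions.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    scale = rng.uniform(0.8, 1.2)
    c, s = math.cos(angle), math.sin(angle)
    return [
        (scale * (c * x - s * y), scale * (s * x + c * y), scale * z)
        for x, y, z in points
    ]


def simulate_scene(models, room_size=4.0, seed=0):
    """Mix augmented CAD models into one point cloud with per-point labels.

    `models` is a list of (label, points) pairs; each model is dropped at
    a random position on a square floor of side `room_size`.
    """
    rng = random.Random(seed)
    scene_points, scene_labels = [], []
    for label, points in models:
        dx = rng.uniform(0.0, room_size)
        dy = rng.uniform(0.0, room_size)
        for x, y, z in augment(points, rng):
            scene_points.append((x + dx, y + dy, z))
            scene_labels.append(label)
    return scene_points, scene_labels


# Two toy "CAD models" with made-up coordinates.
models = [
    ("chair", [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]),
    ("table", [(0.0, 0.0, 0.0), (0.0, 1.0, 0.5), (1.0, 1.0, 0.5)]),
]
scene_points, scene_labels = simulate_scene(models, room_size=4.0, seed=0)
```

Since each point keeps the category name of the model it came from, a scene built this way carries per-point labels at no annotation cost, and those names are what a language encoder could embed for the contrastive step.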
Taxonomy
Topics: 3D Shape Modeling and Analysis · 3D Surveying and Cultural Heritage · Visual Attention and Saliency Detection
