Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li

TL;DR
This paper introduces Bridge3D, a novel approach that leverages foundation models to pre-train 3D scene understanding models, effectively bridging the domain gap and significantly improving 3D detection and segmentation performance.
Contribution
Bridge3D is the first method to utilize foundation model features, semantic masks, and captions for pre-training 3D models, enhancing scene understanding tasks.
Findings
Achieves a 6.3% improvement on ScanNet for 3D detection.
Outperforms existing state-of-the-art methods in 3D segmentation.
Effectively bridges the domain gap using multi-level knowledge distillation.
Abstract
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an…
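The abstract says that semantic masks guide which parts of the scene the masked autoencoder masks and reconstructs, so that foreground regions receive more attention. The sketch below is one possible reading of that idea, not the authors' implementation: it samples the patches to mask without replacement, giving foreground patches a higher chance of selection. The function name `foreground_guided_mask`, the parameters `mask_ratio` and `fg_weight`, and the weighted-sampling scheme are all assumptions made for illustration.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=0):
    """Choose which point patches to mask, favouring foreground patches.

    is_foreground: list of booleans, one per patch (hypothetical labels,
    e.g. obtained by projecting 2D semantic masks onto the point cloud).
    mask_ratio, fg_weight: illustrative values, not taken from the paper.
    Returns a list of booleans, True where the patch is masked.
    """
    rng = random.Random(seed)
    num_patches = len(is_foreground)
    num_masked = round(mask_ratio * num_patches)

    # Weighted sampling without replacement: draw u in [0, 1) per patch,
    # use u ** (1 / weight) as its key, and keep the largest keys.
    # A larger weight pushes the key toward 1, so foreground is favoured.
    keys = []
    for idx, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        keys.append((rng.random() ** (1.0 / weight), idx))

    chosen = {idx for _, idx in sorted(keys, reverse=True)[:num_masked]}
    return [idx in chosen for idx in range(num_patches)]


if __name__ == "__main__":
    # Four foreground patches followed by six background patches.
    labels = [True] * 4 + [False] * 6
    print(foreground_guided_mask(labels, mask_ratio=0.5))
```

In a full pipeline the masked patches would be hidden from the encoder and reconstructed by the decoder; that part is omitted here because the truncated abstract does not describe it.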
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Knowledge Distillation
