Learning from 2D: Contrastive Pixel-to-Point Knowledge Transfer for 3D Pretraining
Yueh-Cheng Liu, Yu-Kai Huang, Hung-Yueh Chiang, Hung-Ting Su, Zhe-Yu Liu, Chin-Tang Chen, Ching-Yu Tseng, Winston H. Hsu

TL;DR
This paper introduces a novel 3D pretraining method that leverages 2D networks trained on large datasets by transferring pixel-level features to 3D point representations, improving 3D model performance without additional labeled data.
Contribution
It proposes the first method to use 2D pretrained weights for 3D network pretraining via contrastive pixel-to-point knowledge transfer, with feature alignment and resolution enhancement techniques.
Findings
3D models pretrained with this method outperform the same models trained from scratch on multiple tasks.
Because pretraining needs no additional labels, the method reduces reliance on expensive labeled 3D data.
The performance gains hold across various 3D downstream applications.
Abstract
Most 3D neural networks are trained from scratch owing to the lack of large-scale labeled 3D datasets. In this paper, we present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets. We propose the contrastive pixel-to-point knowledge transfer to effectively utilize the 2D information by mapping the pixel-level and point-level features into the same embedding space. Due to the heterogeneous nature between 2D and 3D networks, we introduce the back-projection function to align the features between 2D and 3D to make the transfer possible. Additionally, we devise an upsampling feature projection layer to increase the spatial resolution of high-level 2D feature maps, which enables learning fine-grained 3D representations. With a pretrained 2D network, the proposed pretraining process requires no additional 2D or 3D labeled data, further alleviating the…
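The two ingredients named in the abstract, back-projection and a contrastive pixel-to-point objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a pinhole camera with intrinsics matrix K for the back-projection and a standard InfoNCE loss for the contrastive term, and the function names (back_project, info_nce) and the temperature value are hypothetical choices made only for illustration.

```python
import numpy as np


def back_project(depth, K):
    """Lift every pixel of a depth map to a 3D point in the camera frame.

    Assumption: a pinhole camera model with a 3x3 intrinsics matrix K.
    Row i of the result corresponds to pixel (v, u) with i = v * width + u,
    which gives the pixel-to-point correspondence used to pair features.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def info_nce(pixel_feats, point_feats, temperature=0.07):
    """Generic InfoNCE loss over N matched pixel/point feature pairs.

    Row i of pixel_feats and row i of point_feats form the positive pair;
    all other rows act as negatives. The temperature is an assumed value.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    q = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    logits = q @ p.T / temperature                   # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 1.5], [0.0, 0.0, 1.0]])
    points = back_project(np.full((3, 4), 2.0), K)   # 12 points, one per pixel
    rng = np.random.default_rng(0)
    pixel_feats = rng.normal(size=(12, 16))          # stand-in for 2D network features
    point_feats = pixel_feats + 0.1 * rng.normal(size=(12, 16))  # stand-in for 3D features
    print(points.shape, info_nce(pixel_feats, point_feats))
```

In a real pipeline the pixel features would come from the frozen pretrained 2D network and the point features from the 3D network being pretrained, so that minimizing the loss pulls each point's embedding toward the embedding of the pixel it back-projects from.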
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
