PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation
Qijian Zhang, Junhui Hou

TL;DR
PointVST is a self-supervised pre-training method for 3D point clouds that trains the feature extractor to translate a point cloud into view-specific 2D rendered images, improving downstream task performance and domain transfer.
Contribution
The paper proposes a cross-modal translation pretext task for 3D point cloud pre-training, in which 3D point clouds are translated into their corresponding 2D rendered images.
Findings
Outperforms state-of-the-art methods on various tasks
Demonstrates strong domain transfer ability
Shows consistent performance improvements
Abstract
The past few years have witnessed the great success and prevalence of self-supervised representation learning within the language and 2D vision communities. However, such advancements have not been fully migrated to the field of 3D point cloud learning. Different from existing pre-training paradigms designed for deep point cloud feature extractors that fall into the scope of generative modeling or contrastive learning, this paper proposes a translative pre-training framework, namely PointVST, driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images. More specifically, we begin with deducing view-conditioned point-wise embeddings through the insertion of the viewpoint indicator, and then adaptively aggregate a view-specific global codeword, which can be further fed into subsequent 2D…
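The two steps named in the abstract (view-conditioned point-wise embeddings, then an adaptively aggregated view-specific global codeword) can be illustrated with a small sketch. This is not the authors' implementation: the truncated abstract does not say how the viewpoint indicator is inserted or how the aggregation is done. The sketch assumes the indicator is concatenated to every point-wise feature and that aggregation is a softmax attention pooling over points. All names (`view_specific_codeword`, `w_embed`, `b_embed`, `w_score`) are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def view_specific_codeword(point_feats, view_indicator, w_embed, b_embed, w_score):
    """Return (codeword, attention_weights) for one point cloud and one view.

    point_feats:    N point-wise feature vectors from a backbone (lists of length C)
    view_indicator: viewpoint encoding of length V (its format is an assumption)
    w_embed:        D x (C + V) weight matrix, b_embed: bias of length D
    w_score:        scoring vector of length D used for attention pooling
    """
    # Step 1 (assumed form of "insertion"): append the viewpoint indicator to
    # every point-wise feature, then apply a shared linear layer and ReLU.
    conditioned = []
    for feat in point_feats:
        hidden = linear(feat + view_indicator, w_embed, b_embed)
        conditioned.append([max(0.0, h) for h in hidden])

    # Step 2 (assumed form of "adaptive aggregation"): score each point,
    # normalize the scores with softmax, and take the weighted sum.
    scores = [sum(w * h for w, h in zip(w_score, emb)) for emb in conditioned]
    weights = softmax(scores)
    dim = len(conditioned[0])
    codeword = [sum(a * emb[d] for a, emb in zip(weights, conditioned)) for d in range(dim)]
    return codeword, weights


# Toy usage: the same two points encoded under two different viewpoint indicators.
points = [[1.0, 0.0], [0.0, 1.0]]
w_embed = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # last column mixes in the view indicator
b_embed = [0.0, 0.0]
w_score = [1.0, 0.0]
code_view_a, _ = view_specific_codeword(points, [1.0], w_embed, b_embed, w_score)
code_view_b, _ = view_specific_codeword(points, [0.0], w_embed, b_embed, w_score)
```

Because the indicator enters before aggregation, the same point cloud produces a different codeword for each viewpoint. In the pipeline the abstract describes, that codeword is what gets passed on to the subsequent 2D stage.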
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
