TL;DR
MaskClu is an unsupervised pre-training method for vision transformers on 3D point clouds. It combines masked point modeling, clustering, and contrastive learning to capture dense semantic features, and reports gains on part segmentation, semantic segmentation, and 3D object detection.
Contribution
The paper introduces MaskClu, which pairs reconstruction of cluster assignments and cluster centers from masked point clouds with a global contrastive objective over different masked views, so that point cloud ViTs learn dense semantic features.
Findings
Outperforms existing methods on part and semantic segmentation
Achieves state-of-the-art results in 3D object detection
Enhances semantic feature richness in point cloud representations
Abstract
Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction, and…
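The objectives named in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how cluster targets are produced or how the terms are weighted, so the use of k-means for targets, the InfoNCE form of the contrastive term, the loss weights, and every function name here (`random_mask`, `kmeans`, `info_nce`, `pretraining_loss`) are assumptions made for the example.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def nearest_center(points, centers):
    """Index of the closest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_center(points, centers)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return nearest_center(points, centers), centers


def cross_entropy(logits, targets):
    """Mean cross-entropy of integer targets under a row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def info_nce(z1, z2, temperature=0.1):
    """Contrast global embeddings of two masked views across a batch.

    Row i of z1 and row i of z2 come from the same point cloud (positives);
    all other rows in the batch act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    return cross_entropy(logits, np.arange(len(z1)))


def pretraining_loss(assign_logits, pred_centers, z1, z2,
                     target_assign, target_centers, mask,
                     w_center=1.0, w_contrast=1.0):
    """Sum of the three illustrative terms; the weights are assumptions."""
    # Cluster-assignment reconstruction, scored on masked patches only.
    l_assign = cross_entropy(assign_logits[mask], target_assign[mask])
    # Cluster-center reconstruction as a mean squared error.
    l_center = np.mean((pred_centers - target_centers) ** 2)
    # Instance-level contrast between two masked views.
    l_contrast = info_nce(z1, z2)
    return l_assign + w_center * l_center + w_contrast * l_contrast
```

In a real pipeline, `assign_logits`, `pred_centers`, `z1`, and `z2` would come from a ViT encoder and decoder applied to the visible patches of each view. Here they are left as plain arrays so the loss terms can be inspected in isolation.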