CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
Runjian Chen, Hang Zhang, Avinash Ravichandran, Hyoungseob Park, Wenqi Shao, Alex Wong, Ping Luo

TL;DR
CLAP introduces a joint unsupervised pre-training approach for images and point clouds that leverages curvature sampling and prototype learning to enhance 3D perception, significantly outperforming previous methods.
Contribution
The paper proposes a novel differentiable-rendering-based pre-training method that jointly learns from images and point clouds using curvature sampling and learnable prototypes.
Findings
Reports up to 100% more performance gain than previous SOTA pre-training methods.
Effectively exploits the complementarity of image semantics and 3D structure.
Demonstrates strong results on NuScenes and Waymo datasets.
Abstract
Unsupervised 3D representation learning reduces the burden of labeling multimodal 3D data for fusion perception tasks. Among different pre-training paradigms, differentiable-rendering-based methods have shown the most promise. However, existing works conduct pre-training separately for each modality due to the computational cost of processing large point clouds together with images. As such, the mutual benefit of high-level semantics (from images) and 3D structure (from point clouds) has not been exploited. To address this gap, we propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP, short for Curvature sampLing and leArnable Prototype. Specifically, our method overcomes the computational hurdle by using Curvature Sampling to select the more informative points/pixels for pre-training. To uncover the performance benefits brought by their…
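As a rough illustration of the Curvature Sampling idea in the abstract, the sketch below scores each point with a common curvature proxy (surface variation from a local PCA) and then draws points with probability proportional to that score. This is an assumption for illustration only: the abstract does not say how CLAP estimates curvature, and the function names `surface_variation` and `curvature_sample`, the neighbourhood size `k`, and the brute-force neighbour search are all hypothetical choices rather than the authors' implementation.

```python
import numpy as np


def surface_variation(points, k=8):
    """Curvature proxy per point: smallest local-covariance eigenvalue / eigenvalue sum.

    NOTE: an assumed stand-in for the paper's curvature estimate.
    """
    # Brute-force pairwise squared distances; only suitable for small clouds.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # k nearest neighbours of every point (the point itself is included).
    nn = np.argsort(d2, axis=1)[:, : k + 1]
    neigh = points[nn]                                   # (N, k+1, 3)
    centered = neigh - neigh.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / (k + 1)
    eig = np.linalg.eigvalsh(cov)                        # ascending, (N, 3)
    smallest = np.clip(eig[:, 0], 0.0, None)             # guard tiny negatives
    return smallest / (eig.sum(axis=1) + 1e-12)


def curvature_sample(points, num_samples, k=8, seed=0):
    """Draw point indices without replacement, favouring high-curvature points."""
    rng = np.random.default_rng(seed)
    curv = surface_variation(points, k)
    prob = curv + 1e-8                                   # keep flat points reachable
    prob = prob / prob.sum()
    idx = rng.choice(len(points), size=num_samples, replace=False, p=prob)
    return idx, curv
```

Under these assumptions, points on flat surfaces such as the ground receive a score near zero and are rarely drawn, so a fixed sampling budget concentrates on edges and curved objects.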
Peer Reviews
Decision · ICLR 2026 Poster
The manuscript is well-written, and the motivation for the work is clearly established. The authors design a curvature sampling strategy to identify informative points and pixels for sampling. Learnable prototypes are utilized to establish a common feature space, and an Expectation-Maximization approach is employed to optimize these prototypes, enabling them to represent distinct parts of the 3D scene.
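To make the Expectation-Maximization description above concrete, here is a minimal sketch of a generic soft-assignment EM loop over a set of prototypes. It is an assumption, not the paper's training objective: CLAP's prototypes are learned jointly with the network, whereas this example fits prototypes to fixed features. The helper names, the cosine-similarity softmax, and the `temperature` value are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def e_step(features, prototypes, temperature=0.1):
    """E-step: soft-assign each feature to prototypes via softmax over cosine similarity."""
    logits = l2_normalize(features) @ l2_normalize(prototypes).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    return resp / resp.sum(axis=1, keepdims=True)        # (N, K), rows sum to 1


def m_step(features, resp):
    """M-step: re-estimate prototypes as responsibility-weighted feature means."""
    weighted = resp.T @ l2_normalize(features)           # (K, D)
    return l2_normalize(weighted)


def fit_prototypes(features, num_prototypes, iters=20, seed=0):
    """Alternate E and M steps, starting from randomly chosen features (assumed init)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), size=num_prototypes, replace=False)
    protos = l2_normalize(features[init])
    for _ in range(iters):
        resp = e_step(features, protos)
        protos = m_step(features, resp)
    return protos, e_step(features, protos)
```

In a joint image and point cloud setting, `features` would hold embeddings from both modalities, so that each prototype comes to summarize one part of the scene in a shared feature space.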
The coverage of related work and compared methods lacks comprehensiveness. Notably, several recent methods from the "3DTrans" GitHub repository, which reportedly achieve strong performance, are neither discussed nor included in the experimental comparisons. When compared to other outdoor self-supervised learning (SSL) methods that also employ differentiable rendering, the primary novelty of the proposed CLAP framework appears to be the Curvature Sampling strategy. The use of prototype learning
1. Strong scalability of the method: In few-shot fine-tuning scenarios, CLAP’s performance improvement increases as the amount of training data decreases, demonstrating strong scalability potential.
2. Practical engineering significance: CLAP reduces the computational cost of joint unsupervised pre-training for image and point cloud modalities, facilitating information interaction between different modalities.
3. Experimental validation of effectiveness: On benchmarks (e.g., NuScenes, Waymo), CLA
1. Insufficient analysis of results: Further analysis is needed to determine whether CLAP’s advantages lie in handling complex scenarios or in achieving broad accuracy gains. Notably, its inferior performance on specific classes (vs. UniPAD/PPKT in Table 1) warrants an investigation of the underlying causes.
2. Memory efficiency requires further proof: UniPAD pre-trains the image and point cloud encoders separately due to GPU memory constraints. Although curvature sampling is introduced to mitigate this issue, the ab
- Clear motivation
- Curvature Sampling is intuitive; it prioritizes complex regions (e.g., vehicles, edges) while maintaining low computational overhead in theory.
- Prototype Learning sounds crucial to align and interact between LiDAR and image modalities, improving cross-modal understanding.
- Experiments include both NuScenes (5%) and Waymo (1%), with detailed scaling analyses (0.5–5%) showing gains.
- Ablation results show contributions of each component.
- The authors still focus mainly on few-shot downstream training (NuScenes 5%, Waymo 1%), which, while useful for showing sample efficiency, is non-standard in representation learning. Full-data fine-tuning results would better demonstrate scalability and practical performance benefits.
- Although Curvature Sampling is motivated as memory-efficient, the paper does not explicitly quantify memory or runtime savings compared to UniPAD’s “Memory-friendly Ray Sampling.”
- The prototype mechanism is
Taxonomy
Topics: Medical Imaging and Analysis · 3D Shape Modeling and Analysis · Domain Adaptation and Few-Shot Learning
