Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models
Yifan Zhang, Junhui Hou

TL;DR
This paper introduces an image-to-LiDAR contrastive distillation method that uses Visual Foundation Models (VFMs) and a structured feature space to improve 3D representation learning. It targets the self-conflict in which contrastive losses push apart unmatched points and pixels that belong to the same semantic class.
Contribution
It proposes using VFMs for semantic labeling, von Mises-Fisher distributions for feature structuring, and adaptive sampling to enhance image-to-LiDAR contrastive learning.
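To make the von Mises-Fisher (vMF) idea concrete, here is a minimal sketch that fits one vMF mean direction per class and assigns a new feature to the class with the highest log-density. It is an illustration under stated assumptions, not the authors' implementation: the features are 3-D so the normalizer has the closed form kappa / (4*pi*sinh(kappa)) (higher-dimensional features need a modified Bessel function instead), the concentration kappa is a fixed shared value, and all function names and the toy data are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_direction(vectors):
    """Maximum-likelihood vMF mean direction: the normalized sum of unit vectors."""
    units = [unit(v) for v in vectors]
    summed = [sum(col) for col in zip(*units)]
    return unit(summed)

def vmf_log_density_3d(x, mu, kappa):
    """log f(x; mu, kappa) on the unit sphere in R^3, with
    f = kappa / (4*pi*sinh(kappa)) * exp(kappa * <mu, x>)."""
    x, mu = unit(x), unit(mu)
    cos_sim = sum(a * b for a, b in zip(x, mu))
    # log(sinh(k)) = k + log(1 - exp(-2k)) - log(2), which stays stable for large k
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    return math.log(kappa) - math.log(4.0 * math.pi) - log_sinh + kappa * cos_sim

def classify(x, class_means, kappa=10.0):
    """Pick the class whose vMF (shared kappa, assumed) gives x the highest log-density."""
    return max(class_means, key=lambda c: vmf_log_density_3d(x, class_means[c], kappa))

# Toy features (made up for illustration): two classes, two samples each.
car_feats = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
road_feats = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
class_means = {"car": mean_direction(car_feats), "road": mean_direction(road_feats)}

predicted = classify([0.8, 0.2, 0.0], class_means)
print(predicted)
```

With a shared kappa the normalizer is the same for every class, so the decision reduces to cosine similarity with each class mean direction. That is one way to read "structuring the feature space": features of a class are encouraged to concentrate around a single direction on the unit sphere.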
Findings
Outperforms existing methods in downstream tasks
Mitigates semantic feature conflicts in contrastive learning
Provides a scalable framework for 3D representation enhancement
Abstract
Contrastive image-to-LiDAR knowledge transfer, commonly used for learning 3D representations with synchronized images and point clouds, often faces a self-conflict dilemma. This issue arises as contrastive losses unintentionally dissociate features of unmatched points and pixels that share semantic labels, compromising the integrity of learned representations. To overcome this, we harness Visual Foundation Models (VFMs), which have revolutionized the acquisition of pixel-level semantics, to enhance 3D representation learning. Specifically, we utilize off-the-shelf VFMs to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. Additionally, we employ von Mises-Fisher distributions to structure the feature space, ensuring semantic embeddings within the same class remain consistent across varying inputs. Furthermore, we adapt sampling probabilities of…
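The self-conflict described in the abstract can be seen in a small numeric sketch. The code below compares a plain InfoNCE-style point-to-pixel loss, where only the paired pixel is a positive, with a label-aware variant in which every pixel sharing the point's semantic label counts as a positive. The label-aware form follows the generic supervised-contrastive pattern and is an assumption for illustration; the paper's exact loss, the temperature value, and the function and variable names here are not taken from the source.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_to_pixel_loss(points, pixels, labels=None, tau=0.07):
    """Mean contrastive loss over points; point i is paired with pixel i.

    labels=None  -> plain InfoNCE: only the paired pixel is a positive.
    labels=[...] -> label-aware variant: all pixels with the same label are positives.
    """
    points = [normalize(p) for p in points]
    pixels = [normalize(q) for q in pixels]
    total = 0.0
    for i, p in enumerate(points):
        logits = [dot(p, q) / tau for q in pixels]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        if labels is None:
            positives = [i]
        else:
            positives = [j for j in range(len(pixels)) if labels[j] == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(points)

labels = ["car", "car", "road"]  # hypothetical VFM-derived labels, one per pair

# Layout A: the two "car" features coincide (semantically consistent).
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Layout B: the two "car" features point in opposite directions.
split = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

plain_a = point_to_pixel_loss(collapsed, collapsed)
plain_b = point_to_pixel_loss(split, split)
sem_a = point_to_pixel_loss(collapsed, collapsed, labels)
sem_b = point_to_pixel_loss(split, split, labels)

# Plain InfoNCE scores the split layout lower, i.e. it rewards separating same-class features.
print(plain_b < plain_a)
# The label-aware variant scores the consistent layout lower.
print(sem_a < sem_b)
```

In this toy setup the plain loss prefers the layout that tears the two "car" features apart, which is the unintended dissociation the abstract describes, while the label-aware loss prefers the layout that keeps them together.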
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
