LidarCLIP or: How I Learned to Talk to Point Clouds
Georg Hess, Adam Tonderski, Christoffer Petersson, Kalle Åström, Lennart Svensson

TL;DR
LidarCLIP introduces a novel method to relate lidar point clouds to text and images by mapping them into a shared CLIP embedding space, enabling zero-shot classification, retrieval, and cross-modal applications in autonomous driving.
Contribution
The paper presents LidarCLIP, the first model to connect lidar data with CLIP embeddings, facilitating cross-modal retrieval and zero-shot tasks without additional training.
Findings
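Lidar-based retrieval with LidarCLIP is generally on par with image-based retrieval, with complementary strengths and weaknesses.
Combining lidar and image features improves on both single-modality methods and enables targeted search for challenging scenarios.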
LidarCLIP significantly outperforms previous CLIP-based methods for point cloud classification.
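To make the combined-retrieval finding concrete, here is a minimal sketch of text-driven scene retrieval in a shared embedding space. It assumes the lidar and image embeddings are fused by averaging their normalized vectors; that fusion rule, the function names, and the toy 2-D embeddings are illustrative assumptions, not details taken from the summary above.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(lidar_emb, image_emb):
    """Assumed fusion rule: average the normalized embeddings, then renormalize."""
    l, i = normalize(lidar_emb), normalize(image_emb)
    return normalize([(a + b) / 2.0 for a, b in zip(l, i)])

def rank_scenes(text_emb, scene_embs):
    """Return scene indices sorted by descending cosine similarity to the text query."""
    t = normalize(text_emb)
    scores = [sum(a * b for a, b in zip(t, normalize(s))) for s in scene_embs]
    return sorted(range(len(scene_embs)), key=lambda k: scores[k], reverse=True)

# Toy embeddings (made up for illustration only).
text_query = [1.0, 0.0]
lidar_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
image_embs = [[0.0, 1.0], [0.8, 0.6], [0.0, 1.0]]

lidar_ranking = rank_scenes(text_query, lidar_embs)
fused_ranking = rank_scenes(
    text_query, [fuse(l, i) for l, i in zip(lidar_embs, image_embs)]
)
```

In this toy setup, scene 0 matches the query in lidar but not in the image, so fusing the two modalities moves scene 1, where both modalities agree, to the top of the ranking.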
Abstract
Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging…
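The supervision described in the abstract can be sketched as an embedding-matching loss: for each image-lidar pair, the lidar encoder's output is pulled toward the fixed CLIP embedding of the paired image. The sketch below assumes a cosine-similarity loss, which is a choice made here for illustration because the abstract does not name the loss; the encoders themselves are omitted, and the toy vectors stand in for their outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distillation_loss(lidar_embeddings, image_embeddings):
    """Mean (1 - cosine similarity) over paired lidar and image embeddings.

    The image embeddings act as fixed targets (assumed to come from a frozen
    CLIP image encoder); in training, only the lidar encoder would be updated.
    """
    losses = [
        1.0 - cosine_similarity(lidar, image)
        for lidar, image in zip(lidar_embeddings, image_embeddings)
    ]
    return sum(losses) / len(losses)

# Toy 3-D embeddings (made up); real CLIP embeddings are much larger.
image_targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_lidar = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]     # same directions as targets
orthogonal_lidar = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # unrelated directions

loss_aligned = distillation_loss(aligned_lidar, image_targets)
loss_orthogonal = distillation_loss(orthogonal_lidar, image_targets)
```

Because the loss depends only on direction, the aligned lidar embeddings reach zero loss despite their different magnitudes. Once lidar embeddings sit in the image embedding space, they can be compared against CLIP text embeddings without any lidar-text pairs.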
Videos
LidarCLIP or: How I Learned To Talk to Point Clouds · YouTube
Taxonomy
Topics: Advanced Neural Network Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Diffusion
