Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection
Mehar Khurana, Neehar Peri, James Hays, Deva Ramanan

TL;DR
This paper introduces a shelf-supervised pre-training method for 3D object detection. It uses image-based foundation models to generate pseudo-labels from paired RGB and LiDAR data, which improves detection accuracy, especially when labeled data is limited.
Contribution
The paper proposes a shelf-supervised approach: off-the-shelf image foundation models generate zero-shot 3D bounding boxes from paired RGB and LiDAR data, and these pseudo-labels are used for pre-training, which improves semi-supervised detection performance.
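One way to picture how a 2D prediction from an image model could become a 3D pseudo-label is sketched below. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes the LiDAR points are already expressed in the camera frame, that a pinhole intrinsics matrix `K` is available, and that some image foundation model has produced a boolean instance mask. The function name `lift_mask_to_3d_box` and the axis-aligned box fit are hypothetical simplifications.

```python
import numpy as np


def lift_mask_to_3d_box(points_cam, K, mask):
    """Fit a rough 3D box to the LiDAR points that project into a 2D mask.

    points_cam: (N, 3) LiDAR points, assumed already in the camera frame.
    K:          (3, 3) pinhole camera intrinsics (assumed known).
    mask:       (H, W) boolean instance mask from an image model (assumed given).
    Returns (center, size) of an axis-aligned box, or None if no point hits the mask.
    """
    height, width = mask.shape

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 0]

    # Pinhole projection to pixel coordinates.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that fall outside the image.
    in_image = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pts, u, v = pts[in_image], u[in_image], v[in_image]

    # Select the points whose pixel lies inside the instance mask.
    obj = pts[mask[v, u]]
    if len(obj) == 0:
        return None

    # Axis-aligned extent of the selected points (a deliberate simplification;
    # a real system would also estimate orientation and reject outliers).
    lo, hi = obj.min(axis=0), obj.max(axis=0)
    return (lo + hi) / 2, hi - lo
```

Boxes produced this way would be noisy, since background points behind the object can leak into the mask, so they are better suited to pre-training than to serving as final labels.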
Findings
Significantly improves detection accuracy over prior self-supervised methods.
Effective for LiDAR-only, RGB-only, and multi-modal detectors.
Demonstrates superior results on nuScenes and the Waymo Open Dataset (WOD) in limited-data settings.
Abstract
State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such 3D data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using…
Peer Reviews
Decision: CoRL 2024
Taxonomy
Topics: Advanced Neural Network Applications
