Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R. Qi, Xinchen Yan, Scott Ettinger, Dragomir Anguelov

TL;DR
This paper introduces a multi-modal auto labeling pipeline for 3D perception in autonomous driving, enabling open-set, unsupervised detection and classification of static and moving objects using vision-language distillation.
Contribution
It presents a novel pipeline that combines motion cues and vision-language knowledge distillation to generate open-vocabulary 3D labels without human annotations.
Findings
Outperforms prior unsupervised 3D perception methods on the Waymo Open Dataset
Handles both static and moving objects in an unsupervised manner
Provides open-vocabulary semantic labels for traffic participants
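The summary above does not say how the open-vocabulary labels are assigned. One common way to turn vision-language features into labels is to compare an object's feature vector against text embeddings of candidate category names and pick the closest match. The sketch below illustrates only that general idea and is not the authors' implementation: the function names, the three-dimensional toy vectors, and the category list are all hypothetical, and a real system would use embeddings produced by a pretrained vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_open_vocab_label(object_feature, text_embeddings):
    """Return the category whose text embedding is most similar to the object feature.

    object_feature: hypothetical feature for one 3D object (for example, an
        average of per-point features distilled from a 2D vision-language model).
    text_embeddings: dict mapping a category name to its text embedding.
    """
    scores = {
        name: cosine_similarity(object_feature, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for real embeddings (assumption, for illustration only).
text_embeddings = {
    "vehicle": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "cyclist": [0.0, 0.0, 1.0],
}
object_feature = [0.1, 0.9, 0.2]

label, scores = assign_open_vocab_label(object_feature, text_embeddings)
print(label, round(scores[label], 3))
```

Because the category list is just a dictionary of text embeddings, new object types can be queried at inference time by adding entries, without retraining. This is the property that "open-vocabulary" refers to.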
Abstract
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety-critical applications such as autonomous driving, where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with freely available 2D image-text pairs to identify and track all traffic participants. Compared to recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language…
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
