Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

Mahyar Najibi; Jingwei Ji; Yin Zhou; Charles R. Qi; Xinchen Yan; Scott Ettinger; Dragomir Anguelov

arXiv:2309.14491·cs.CV·September 27, 2023

TL;DR

This paper introduces a multi-modal auto labeling pipeline for 3D perception in autonomous driving, enabling open-set, unsupervised detection and classification of static and moving objects using vision-language distillation.

Contribution

It presents a novel pipeline that combines motion cues and vision-language knowledge distillation to generate open-vocabulary 3D labels without human annotations.

Findings

01 Outperforms prior unsupervised 3D perception methods on the Waymo dataset

02 Handles both static and moving objects in an unsupervised manner

03 Provides open-vocabulary semantic labels for traffic participants

Abstract

Closed-set 3D perception models trained only on a pre-defined set of object categories can be inadequate for safety-critical applications such as autonomous driving, where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with freely available 2D image-text pairs to identify and track all traffic participants. Compared to recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language…
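
The abstract is cut off above, so the details of the distillation step are not visible here. As a rough illustration only, open-vocabulary labeling is commonly done by comparing a feature vector extracted for each object with text embeddings of candidate category names and picking the closest match. The sketch below assumes that setup; it is not the authors' implementation, and the function names and toy 3-dimensional embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_label(box_feature, text_embeddings):
    """Return the category whose text embedding is closest to the box feature.

    box_feature: feature vector for one 3D box (hypothetical; in practice it
        would come from a model trained to mimic a vision-language encoder).
    text_embeddings: dict mapping category name -> text embedding.
    """
    scores = {name: cosine(box_feature, emb)
              for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings, made up for illustration; real ones have hundreds of dims.
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "cyclist": [0.0, 0.0, 1.0],
}
box_feature = [0.1, 0.9, 0.2]

label, score = assign_label(box_feature, text_embeddings)
print(label, round(score, 3))
```

Because the category list is only a dictionary of text embeddings, new categories can be queried at inference time by adding entries, without retraining on 3D labels for them.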

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition