3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving
Boyi Sun, Yuhang Liu, Xingxia Wang, Bin Tian, Long Chen, Fei-Yue Wang

TL;DR
This paper introduces AFOV, a novel annotation-free 3D segmentation framework for autonomous driving that leverages 2D open-vocabulary models and cross-modal distillation, achieving state-of-the-art results without manual annotations.
Contribution
The paper proposes a new annotation-free learning method using 2D open-vocabulary models and a novel cross-modal distillation approach for 3D point cloud segmentation.
Findings
Achieved 47.73% mIoU on nuScenes without annotations.
Surpassed previous models by 3.13% mIoU in 3D segmentation.
Performed well with minimal labeled data, reaching 51.75% mIoU with 1% of the labeled data.
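For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed from per-point predictions. The function name `mean_iou`, the `ignore_index` parameter, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Average per-class IoU over classes present in pred or target.

    pred, target: equal-length sequences of integer class ids, one per point.
    ignore_index: optional ground-truth id to skip (hypothetical convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled point: excluded from the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2. Benchmark toolkits typically accumulate the counts over the whole validation set before dividing, rather than averaging per scene.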
Abstract
Point cloud labeling is a time-consuming and expensive task in autonomous driving, whereas annotation-free learning avoids it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D Annotation-Free framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages. In the first stage, we integrate high-quality textual and image features of 2D open-vocabulary models and propose Tri-Modal contrastive Pre-training (TMP). In the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. In addition, we introduce Approximate Flat Interaction (AFI) to address noise during alignment and label confusion. To validate the superiority of AFOV, extensive…
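The second stage relies on a spatial mapping between point clouds and images to produce pseudo-labels. A minimal sketch of that general idea follows, assuming a simple pinhole camera model and points already transformed into the camera frame. The names `project_point` and `pseudo_labels`, the intrinsics `fx, fy, cx, cy`, and the `-1` ignore value are hypothetical choices for illustration. AFOV's actual pipeline, including how AFI handles alignment noise, is not reproduced here.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of one 3D point (camera frame) to pixel coordinates.

    Returns None for points at or behind the image plane.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def pseudo_labels(points_cam, mask, fx, fy, cx, cy, ignore=-1):
    """Copy the 2D class id under each projected point.

    mask: 2D list (rows x cols) of class ids, e.g. from a 2D segmentation model.
    Points that do not land inside the image receive the `ignore` label.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for point in points_cam:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            labels.append(ignore)
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= col < width and 0 <= row < height:
            labels.append(mask[row][col])
        else:
            labels.append(ignore)
    return labels


# Toy 4x4 label mask and five points; the last two fall outside the view.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
points = [(0.0, 0.0, 1.0), (-0.5, -0.5, 1.0), (0.5, -0.5, 1.0),
          (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
print(pseudo_labels(points, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

Labels obtained this way can then supervise a 3D network with a loss that skips the ignore value. In practice, calibration error and occlusion make some of the copied labels wrong, which is the kind of noise the abstract says AFI is meant to address.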
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
