Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
Shenao Zhao, Pengpeng Liang, Zhoufan Yang

TL;DR
This paper introduces MMAssist, a multi-modal approach that uses image and text features to improve unsupervised domain adaptation for LiDAR-based 3D object detection, reporting improved performance on three domain adaptation tasks.
Contribution
The paper proposes a multi-modal framework that aligns 3D features between the source and target domains, using image features from a pre-trained vision backbone and text features from a large vision-language model (LVLM) as bridges.
Findings
Achieves improved performance on three domain adaptation tasks
Effectively fuses image, text, and point cloud features
Outperforms state-of-the-art methods in 3D object detection
Abstract
Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text…
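The projection step mentioned in the abstract (3D labels or pseudo labels mapped onto the image to obtain 2D boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: the box parameterization (center, size, yaw about the vertical axis), the 3x4 LiDAR-to-image matrix `P`, and the function names `box3d_corners` and `project_to_2d_box` are all assumptions made for illustration.

```python
import math

def box3d_corners(cx, cy, cz, l, w, h, yaw):
    """Return the 8 corners of a 3D box (assumed format: center, size, yaw about z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-l / 2, l / 2):
        for dy in (-w / 2, w / 2):
            for dz in (-h / 2, h / 2):
                # Rotate the offset in the x-y plane, then translate to the center.
                x = cx + c * dx - s * dy
                y = cy + s * dx + c * dy
                corners.append((x, y, cz + dz))
    return corners

def project_to_2d_box(corners, P, img_w, img_h):
    """Project 3D corners with a 3x4 matrix P (hypothetical LiDAR-to-image
    calibration) and return the enclosing 2D box (x1, y1, x2, y2), or None.

    Simplification: corners behind the camera are skipped rather than clipped.
    """
    us, vs = [], []
    for x, y, z in corners:
        u = P[0][0] * x + P[0][1] * y + P[0][2] * z + P[0][3]
        v = P[1][0] * x + P[1][1] * y + P[1][2] * z + P[1][3]
        d = P[2][0] * x + P[2][1] * y + P[2][2] * z + P[2][3]
        if d <= 0:
            continue  # point is behind the image plane
        us.append(u / d)
        vs.append(v / d)
    if not us:
        return None
    # Clamp the enclosing rectangle to the image bounds.
    x1, y1 = max(0.0, min(us)), max(0.0, min(vs))
    x2, y2 = min(img_w - 1.0, max(us)), min(img_h - 1.0, max(vs))
    if x2 <= x1 or y2 <= y1:
        return None  # box falls outside the image
    return (x1, y1, x2, y2)

# Toy calibration (assumed): LiDAR x forward, y left, z up; focal length 100,
# principal point (50, 50), 100x100 image.
P = [[50.0, -100.0, 0.0, 0.0],
     [50.0, 0.0, -100.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]
box2d = project_to_2d_box(box3d_corners(10, 0, 0, 2, 2, 2, 0.0), P, 100, 100)
```

In a pipeline like the one the abstract describes, each resulting 2D box would then be used to crop or pool image features; how that is done in MMAssist is not specified in the text available here.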
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
