Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection

Shenao Zhao; Pengpeng Liang; Zhoufan Yang

arXiv:2511.07966·cs.CV·November 12, 2025

TL;DR

This paper introduces MMAssist, a multi-modal approach that leverages image and text features to improve unsupervised domain adaptation in LiDAR-based 3D object detection, achieving better performance across multiple datasets.

Contribution

The paper proposes a multi-modal framework that aligns 3D features between the source and target domains, using image features from a pre-trained vision backbone and text features from a large vision-language model as bridges, to improve domain adaptation for 3D detection.

Findings

01 Achieves improved performance on three domain adaptation tasks

02 Effectively fuses image, text, and point cloud features

03 Outperforms state-of-the-art methods in 3D object detection

Abstract

Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text…
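The abstract's projection step (3D labels or pseudo labels mapped to 2D boxes on the image) can be illustrated with a minimal sketch. This is not the authors' code: it assumes a pinhole camera described by a 3x4 projection matrix `P`, boxes already expressed in the camera frame, and yaw about the vertical axis. The function names `box_corners` and `project_box_to_image` are hypothetical.

```python
from itertools import product
from math import cos, sin


def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box as (x, y, z) tuples.

    Assumption: camera-frame coordinates (x right, y down, z forward),
    size = (length along x, height along y, width along z),
    yaw is a rotation about the vertical y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        dx, dy, dz = sx * length, sy * height, sz * width
        # Rotate the offset about the y axis, then translate to the center.
        rx = cos(yaw) * dx + sin(yaw) * dz
        rz = -sin(yaw) * dx + cos(yaw) * dz
        corners.append((cx + rx, cy + dy, cz + rz))
    return corners


def project_box_to_image(corners, P, image_size):
    """Project 3D corners with a 3x4 matrix P and return the enclosing
    2D box (x1, y1, x2, y2) clipped to the image, or None if not visible.

    Simplification: corners behind the camera are dropped rather than
    clipping box edges against a near plane.
    """
    img_w, img_h = image_size
    us, vs = [], []
    for x, y, z in corners:
        u, v, d = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in P)
        if d <= 0:
            continue  # point lies behind the camera
        us.append(u / d)
        vs.append(v / d)
    if not us:
        return None
    x1, y1 = max(min(us), 0.0), max(min(vs), 0.0)
    x2, y2 = min(max(us), img_w - 1.0), min(max(vs), img_h - 1.0)
    if x2 <= x1 or y2 <= y1:
        return None  # box falls entirely outside the image
    return (x1, y1, x2, y2)


# Hypothetical camera: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
box2d = project_box_to_image(box_corners((0, 0, 10), (2, 2, 2), 0.0), P, (100, 100))
```

The resulting 2D box is the kind of region from which an image feature could then be cropped and pooled. With real datasets, the LiDAR-to-camera extrinsics would have to be applied before this step.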

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications