Heterogeneous-Modal Unsupervised Domain Adaptation via Latent Space Bridging
Jiawen Yang, Shuhao Chen, Yucong Duan, Ke Tang, Yu Zhang

TL;DR
This paper introduces a novel framework called Latent Space Bridging for unsupervised domain adaptation across different modalities, enabling effective knowledge transfer by leveraging a bridge domain with unlabeled data, and demonstrates state-of-the-art results in semantic segmentation.
Contribution
The paper proposes a new setting HMUDA and a specialized framework LSB for semantic segmentation, addressing the challenge of cross-modal domain adaptation.
Findings
LSB achieves state-of-the-art performance on six benchmark datasets.
Latent Space Bridging effectively aligns representations across heterogeneous modalities.
The dual-branch architecture with feature and domain alignment losses improves adaptation quality.
Abstract
Unsupervised domain adaptation (UDA) methods effectively bridge domain gaps but struggle when the source and target domains belong to entirely distinct modalities. To address this limitation, we propose a novel setting called Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA), which enables knowledge transfer between completely different modalities by leveraging a bridge domain containing unlabeled samples from both modalities. To learn under the HMUDA setting, we propose Latent Space Bridging (LSB), a specialized framework designed for the semantic segmentation task. Specifically, LSB utilizes a dual-branch architecture, incorporating a feature consistency loss to align representations across modalities and a domain alignment loss to reduce discrepancies between class centroids across domains. Extensive experiments conducted on six benchmark datasets demonstrate that…
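The two losses named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the squared-Euclidean distance, the function names, the use of target pseudo-labels, and the rule of skipping classes absent from either domain are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def feature_consistency_loss(feat_a, feat_b):
    """Mean squared distance between paired bridge-domain features.

    feat_a, feat_b: (N, D) arrays, one row per paired bridge sample,
    produced by the two modality branches (assumed pairing).
    """
    return float(np.mean(np.sum((feat_a - feat_b) ** 2, axis=1)))


def class_centroids(features, labels, num_classes):
    """Per-class mean feature; rows for absent classes stay NaN."""
    centroids = np.full((num_classes, features.shape[1]), np.nan)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            centroids[c] = features[mask].mean(axis=0)
    return centroids


def domain_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Mean squared distance between source and target class centroids.

    Target labels are assumed to be pseudo-labels, since the target
    domain is unlabeled. Classes missing from either domain are skipped.
    """
    c_src = class_centroids(src_feats, src_labels, num_classes)
    c_tgt = class_centroids(tgt_feats, tgt_pseudo, num_classes)
    shared = ~np.isnan(c_src).any(axis=1) & ~np.isnan(c_tgt).any(axis=1)
    if not shared.any():
        return 0.0
    return float(np.mean(np.sum((c_src[shared] - c_tgt[shared]) ** 2, axis=1)))
```

In a training loop, terms like these would typically be added to a supervised segmentation loss on the labeled source domain with weighting coefficients. The summary does not state how LSB weights or schedules them.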
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- First, the paper is well-written and easy to follow from beginning to end.
- It addresses an interesting problem in the domain adaptation literature, which is not resolved, and shows good results compared to other competitors.
- In addition, the authors provide a theoretical analysis of the convergence of their method.
- The authors claim to be the first to study heterogeneous unsupervised domain adaptation, but other methods have tackled this problem before, such as xMUDA, which the paper compares against. A clearer positioning of the paper in the literature is needed to enhance the contribution's clarity.
- Although this experimental setup has not been explored before, I see a limitation in the requirement for a third domain containing both modalities. It is not evident that such a setup would be easier to
- The proposed framework has the advantage of being applicable not only to vision-based modalities such as images and 3D point clouds but also to other modalities across different domains through the same bridging mechanism.
- Unlike the xMUDA family of methods, the paper effectively addresses scenarios where the target domain contains only unlabeled single-modality data.
- The proposed method appears to rely heavily on the quantity and accuracy of the bridge-domain data. Compared to prior approaches that utilize pretrained VLMs such as CLIP for UDA/SFDA, the practical advantages and distinctions of this method should be more clearly emphasized.
- Although proposing the HMUDA method is meaningful, it seems that the approach is not universally applicable to all multi-modal UDA tasks. Rather, it is limited to cases where an explicit bridge domain exists between two
- This paper introduces an interesting task: Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA).
- The authors provide detailed ablation studies to analyze the effectiveness of the proposed components.
- The method is simple and transductive, and the described algorithmic process is intuitive.
- In terms of technical innovation, the two components proposed in this paper, namely a feature consistency loss and a domain alignment loss, are common techniques. They are similar to the cross-modal distillation loss and class prototype loss in prior work, respectively, and have been used multiple times.
- The concept of the “Bridge domain” is not clear. The source of the image and point cloud data for the Bridge domain is not clearly described in the paper. The authors need to provide further
+ In real-world scenarios, labeling 3D data is significantly more difficult and costly than labeling 2D images. From this perspective, 2D–3D UDA represents a practical and meaningful task.
+ The authors conduct thorough ablation studies to validate the effectiveness of the proposed modules and the chosen hyperparameters.
+ The paper is well written and easy to follow.
- The paper claims that 2D–3D UDA with a bridge domain is a new setting. However, the proposed bridge domain introduces a strict limitation, as it requires data from both modalities corresponding to the same input. Such an assumption makes data collection expensive and impractical, thereby weakening the claimed novelty and limiting the setting’s applicability to generalized UDA tasks.
- In the experimental section, all results are limited to 2D–3D modalities. The work would be more meaningful
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Graph Neural Networks
