Unified modality separation: A vision-language framework for unsupervised domain adaptation
Xinyao Li, Jingjing Li, Zhekai Du, Lei Zhu, Heng Tao Shen

TL;DR
This paper introduces a unified modality separation framework for unsupervised domain adaptation with vision-language models. It disentangles modality-specific and modality-invariant features to improve target domain performance.
Contribution
It proposes a modality separation approach that handles modality-specific and modality-invariant features separately, with adaptive weighting and a new discrepancy metric for better domain adaptation.
Findings
Achieves up to a 9% performance gain.
Runs with 9 times the computational efficiency.
Demonstrates effectiveness across multiple datasets and backbones.
Abstract
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, a phenomenon known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately…
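To make the modality gap and the idea of separating components concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's method: the gap is taken to be the difference between the centroids of L2-normalized image and text embeddings, and the "modality-specific" part of a feature is assumed to be its projection onto that gap direction. The function names (`modality_gap`, `split_along_gap`) and the synthetic embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is common for VLM embeddings."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def modality_gap(image_emb, text_emb):
    """Assumed gap definition: difference between the image and text centroids."""
    return l2_normalize(image_emb).mean(axis=0) - l2_normalize(text_emb).mean(axis=0)

def split_along_gap(features, gap):
    """Illustrative split (an assumption, not the authors' procedure).

    The projection onto the gap direction is treated as modality-specific,
    and the orthogonal remainder as modality-invariant.
    """
    direction = gap / np.linalg.norm(gap)
    specific = (features @ direction)[:, None] * direction[None, :]
    invariant = features - specific
    return specific, invariant

# Synthetic stand-ins for VLM embeddings: two clusters shifted apart on axis 0.
rng = np.random.default_rng(0)
dim = 8
offset = np.zeros(dim)
offset[0] = 3.0
image_emb = rng.normal(size=(100, dim)) + offset
text_emb = rng.normal(size=(10, dim)) - offset

gap = modality_gap(image_emb, text_emb)
image_unit = l2_normalize(image_emb)
specific, invariant = split_along_gap(image_unit, gap)
print("gap norm:", np.linalg.norm(gap))
```

By construction the two parts sum back to the original feature, and the invariant part is orthogonal to the gap direction. A learned disentanglement, as the abstract describes, would replace this fixed projection.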
