Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization
De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang, Xinbo Gao

TL;DR
This paper introduces a novel framework for domain generalization that leverages language-guided disentanglement of visual prompts and representation alignment, improving model robustness across unseen domains.
Contribution
It proposes a text feature-guided visual prompt tuning framework combined with Worst Explicit Representation Alignment (WERA) to enhance domain-invariant features in visual models.
Findings
Outperforms state-of-the-art DG methods on multiple datasets
Effective disentanglement of text and visual features improves generalization
Incorporating stylized augmentations enhances domain diversity and robustness
Abstract
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model…
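The text-guided idea described in the abstract can be illustrated with a generic CLIP-style objective. The sketch below is not the paper's implementation. It assumes that "guidance" means pulling each image feature toward the text feature of its class through a temperature-scaled cosine-similarity cross-entropy. The function names, the temperature of 0.07, and the toy one-hot features are all hypothetical. The LLM-based prompt disentanglement and WERA are not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_guided_alignment_loss(image_feats, text_feats, labels, temperature=0.07):
    """Cross-entropy over image-to-text cosine similarities (CLIP-style).

    image_feats: (batch, dim) visual features, e.g. from a prompted image encoder.
    text_feats:  (num_classes, dim) text features, one per class (assumed here
                 to stand in for the domain-invariant text descriptions).
    labels:      (batch,) integer class index for each image.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature              # (batch, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Toy data (hypothetical): three classes with orthogonal text features.
text_feats = np.eye(3, 8)
labels = np.array([0, 1, 2, 0])
aligned = text_feats[labels]                # image features match their class text
misaligned = text_feats[(labels + 1) % 3]   # image features match a wrong class

loss_aligned = text_guided_alignment_loss(aligned, text_feats, labels)
loss_misaligned = text_guided_alignment_loss(misaligned, text_feats, labels)
print(loss_aligned < loss_misaligned)
```

In this toy setup the loss is near zero when image features coincide with their class text features and large when they coincide with another class, which is the behavior a text-guided visual prompt would be trained to exploit.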
Taxonomy
Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training
