Enhancing Vision-Language Models Generalization via Diversity-Driven Novel Feature Synthesis
Siyuan Yan, Cheng Luo, Zhen Yu, Zongyuan Ge

TL;DR
This paper introduces LDFS, a plug-and-play feature synthesis method that enhances vision-language models' ability to generalize to unseen domains by generating diverse, high-quality features guided by language without needing additional data.
Contribution
LDFS is a novel, language-guided feature synthesis approach that improves CLIP's domain generalization through diversity promotion and feature coherence regularization.
Findings
LDFS significantly improves CLIP's zero-shot generalization on unseen domains.
LDFS outperforms existing fine-tuning strategies in domain adaptation tasks.
The method effectively synthesizes diverse features without additional data collection.
Abstract
Vision-language foundation models like CLIP have shown impressive zero-shot generalization, but fine-tuning on downstream datasets can cause overfitting and a loss of their generalization ability on unseen domains. Although collecting additional data from new domains of interest is possible, this is often impractical because annotated data is hard to obtain. To address this, we propose a plug-and-play feature synthesis method called LDFS (Language-Guided Diverse Feature Synthesis) to synthesize new domain features and improve existing CLIP fine-tuning strategies. LDFS has three main contributions: 1) To synthesize novel domain features and promote diversity, we propose an instance-conditional feature augmentation strategy based on a text-guided feature augmentation loss. 2) To maintain feature quality after augmenting, we introduce a pairwise regularizer to preserve augmented…
Taxonomy
Topics: Text and Document Classification Technologies
Methods: Contrastive Language-Image Pre-training
