Enhancing Vision-Language Models Generalization via Diversity-Driven Novel Feature Synthesis

Siyuan Yan; Cheng Luo; Zhen Yu; Zongyuan Ge

arXiv:2405.02586·cs.CV·August 14, 2024

Open Access

TL;DR

This paper introduces LDFS, a plug-and-play feature synthesis method that enhances vision-language models' ability to generalize to unseen domains by generating diverse, high-quality features guided by language without needing additional data.

Contribution

LDFS is a novel, language-guided feature synthesis approach that improves CLIP's domain generalization through diversity promotion and feature coherence regularization.

Findings

1. LDFS significantly improves CLIP's zero-shot generalization on unseen domains.

2. LDFS outperforms existing fine-tuning strategies in domain adaptation tasks.

3. The method effectively synthesizes diverse features without additional data collection.

Abstract

Vision-language foundation models like CLIP have shown impressive zero-shot generalization, but fine-tuning them on downstream datasets can cause overfitting and a loss of generalization to unseen domains. Although collecting additional data from new domains of interest is possible, doing so is often impractical because annotated data are hard to obtain. To address this, we propose a plug-and-play feature synthesis method called LDFS (Language-Guided Diverse Feature Synthesis) to synthesize new domain features and improve existing CLIP fine-tuning strategies. LDFS has three main contributions: 1) To synthesize novel domain features and promote diversity, we propose an instance-conditional feature augmentation strategy based on a text-guided feature augmentation loss. 2) To maintain feature quality after augmentation, we introduce a pairwise regularizer to preserve augmented…
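
The first contribution, an instance-conditional augmentation trained with a text-guided loss, can be illustrated with a minimal NumPy sketch. The abstract does not state the loss itself, so everything below is an assumption made for illustration: the directional cosine form of the loss, the hypothetical `augment` function with its `shift_weights` parameter, and the use of random vectors in place of real CLIP embeddings. It is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def augment(features, shift_weights):
    """Hypothetical instance-conditional augmenter (an assumption).

    The shift is computed from each feature itself, so every instance
    receives its own perturbation rather than one shared offset.
    """
    shift = np.tanh(features @ shift_weights)
    return l2_normalize(features + shift)


def text_guided_direction_loss(features, augmented, src_text, tgt_text):
    """Assumed directional loss: 1 - cos(image shift, text shift).

    The loss is near 0 when the feature moves in the same direction as
    the text embedding does when going from the source-domain prompt to
    the target-domain prompt, and near 2 when it moves the opposite way.
    """
    img_dir = l2_normalize(augmented - features)
    txt_dir = l2_normalize(tgt_text - src_text)
    cos = np.sum(img_dir * txt_dir, axis=-1)
    return float(np.mean(1.0 - cos))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = l2_normalize(rng.normal(size=(4, 8)))   # stand-in image features
    src = l2_normalize(rng.normal(size=8))          # stand-in source prompt
    tgt = l2_normalize(rng.normal(size=8))          # stand-in target prompt
    weights = 0.1 * rng.normal(size=(8, 8))         # untrained augmenter
    aug = augment(feats, weights)
    print(text_guided_direction_loss(feats, aug, src, tgt))
```

In a real pipeline the stand-in vectors would be replaced by frozen CLIP image and text embeddings, and `shift_weights` would be trained by minimizing the loss together with whatever regularizers the paper specifies.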

Taxonomy

Topics: Text and Document Classification Technologies

Methods: Contrastive Language-Image Pre-training