Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models
Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

TL;DR
Style-Pro introduces a style-guided prompt learning framework for vision-language models like CLIP, effectively reducing overfitting and enhancing generalization across unseen domains and classes by synthesizing diverse style shifts.
Contribution
It proposes a novel style-guided prompt learning method with style bases and consistency constraints to improve the adaptability of pre-trained VL models.
Findings
Outperforms state-of-the-art methods on 11 benchmark datasets.
Enhances zero-shot and domain generalization capabilities.
Effectively mitigates overfitting in prompt learning.
Abstract
Pre-trained vision-language (VL) models, such as CLIP, have shown significant generalization ability on downstream tasks, even with minimal fine-tuning. While prompt learning has emerged as an effective strategy for adapting pre-trained VL models to downstream tasks, current approaches frequently overfit severely to specific downstream data distributions. This overfitting limits the models' original ability to generalize to new domains or unseen classes, posing a critical challenge for improving their adaptability and generalization. To address this limitation, we propose Style-Pro, a novel style-guided prompt learning framework that mitigates overfitting and preserves the zero-shot generalization capabilities of CLIP. Style-Pro employs learnable style bases to synthesize diverse distribution shifts, guided by two specialized loss functions that ensure…
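The abstract mentions style bases that synthesize distribution shifts but, as excerpted here, does not say how they are applied. One common way to shift "style" in feature space is to replace a feature's mean and standard deviation, as in AdaIN-style normalization. The sketch below illustrates only that generic idea and is not the paper's implementation: the function names, the fixed (mean, std) pairs standing in for learnable style bases, and the softmax mixing weights are all assumptions made for illustration.

```python
import math
import random

def mean_std(x, eps=1e-6):
    """Return the mean and (eps-stabilized) standard deviation of a feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var + eps)

def softmax(logits):
    """Turn arbitrary mixing logits into convex weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_style(feature, style_bases, mix_logits):
    """Re-style `feature` with a convex mix of (mean, std) style bases.

    Assumption: each style base is a (mean, std) pair, and a new style is a
    weighted average of the bases. In a trainable setting the bases would be
    parameters; here they are fixed numbers.
    """
    weights = softmax(mix_logits)
    new_mu = sum(w * mu for w, (mu, _) in zip(weights, style_bases))
    new_sigma = sum(w * sd for w, (_, sd) in zip(weights, style_bases))
    mu, sigma = mean_std(feature)
    # Normalize away the original statistics, then apply the mixed ones.
    return [new_sigma * (v - mu) / sigma + new_mu for v in feature]

random.seed(0)
feature = [random.gauss(0.0, 1.0) for _ in range(8)]   # toy image feature
style_bases = [(0.5, 2.0), (-1.0, 0.5), (0.0, 1.0)]    # hypothetical (mean, std) bases
styled = apply_style(feature, style_bases, mix_logits=[0.2, 1.0, -0.5])
```

The styled feature keeps the normalized "content" of the input while taking on the mixed statistics, which is the sense in which such an operation simulates a shift in data distribution.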
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
