Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
Xiaojie Yin, Qilong Wang, Qinghua Hu

TL;DR
This paper introduces a novel constrained prompt enhancement method that constructs comprehensive textual prompts and compact visual prompts to improve zero-shot generalization of vision-language models by better aligning visual and textual information.
Contribution
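The paper proposes Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS) to generate semantically rich textual prompts and noise-reduced visual prompts, respectively, improving visual-textual alignment in VLMs.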
Findings
Improved zero-shot performance on benchmark datasets.
Effective filtering of visual noise with CADRS.
Enhanced semantic coverage in textual prompts with TGSSG.
Abstract
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the…
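As a rough illustration of the baseline setting the abstract describes (several textual descriptions per class matched against several cropped image regions), here is a minimal sketch. It is not TGSSG or CADRS: it simply averages cosine similarity over all crop/prompt pairs. The function names and the toy 3-dimensional embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def zero_shot_scores(crop_embs, class_prompt_embs):
    """Average crop-to-prompt cosine similarity for each class."""
    scores = {}
    for name, prompts in class_prompt_embs.items():
        sims = [cosine(c, p) for c in crop_embs for p in prompts]
        scores[name] = sum(sims) / len(sims)
    return scores

def predict(crop_embs, class_prompt_embs):
    """Return the class whose prompts best match the image crops."""
    scores = zero_shot_scores(crop_embs, class_prompt_embs)
    return max(scores, key=scores.get)

# Toy embeddings, made up for illustration only.
crops = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]   # visual prompts (image crops)
prompts = {                                   # textual prompts per class
    "cat": [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
}
scores = zero_shot_scores(crops, prompts)
label = predict(crops, prompts)
```

In this plain averaging scheme, every crop and every description contributes equally, so an uninformative crop or a missing description directly shifts the class score. That is the kind of sensitivity to noisy visual prompts and incomplete textual prompts that the abstract identifies.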
