Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models

Xiaojie Yin; Qilong Wang; Qinghua Hu

arXiv:2508.17417 · cs.CV · August 26, 2025

TL;DR

This paper introduces a constrained prompt enhancement (CPE) method that builds comprehensive textual prompts and compact visual prompts, improving the zero-shot generalization of vision-language models (VLMs) through better visual-textual alignment.

Contribution

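The paper proposes two components: Topology-Guided Synonymous Semantic Generation (TGSSG), which generates semantically rich textual prompts, and Category-Agnostic Discriminative Region Selection (CADRS), which produces noise-reduced visual prompts. Together they improve visual-textual alignment in VLMs.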

Findings

01 Improved zero-shot performance on benchmark datasets.

02 Effective filtering of visual noise with CADRS.

03 Enhanced semantic coverage in textual prompts with TGSSG.

Abstract

Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the…
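To make the setting concrete, here is a minimal sketch of the kind of baseline the abstract attributes to existing approaches: several cropped-region embeddings (visual prompts) are scored against several class-description embeddings (textual prompts), and the class with the highest mean cosine similarity is predicted. This is not the paper's TGSSG or CADRS procedure, whose details are not given in the text above. The function names, the mean-over-pairs scoring rule, and the toy 3-D vectors are all assumptions for illustration; a real pipeline would take embeddings from a VLM's image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def class_score(visual_prompts, textual_prompts):
    """Mean cosine similarity over all (crop, description) pairs.

    Averaging over pairs is an illustrative choice, not the paper's rule.
    """
    sims = [cosine(v, t) for v in visual_prompts for t in textual_prompts]
    return sum(sims) / len(sims)


def zero_shot_predict(visual_prompts, prompts_by_class):
    """Return the best-scoring class name and the per-class scores."""
    scores = {
        name: class_score(visual_prompts, texts)
        for name, texts in prompts_by_class.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings standing in for encoder outputs.
crops = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
prompts = {
    "cat": [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
}
label, scores = zero_shot_predict(crops, prompts)
print(label, scores)
```

In this framing, the two issues the abstract names map onto the two inputs: an incomplete set of descriptions per class limits what `textual_prompts` can match, and uninformative crops in `visual_prompts` pull the averaged score toward noise.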
