SemPT: Semantic Prompt Tuning for Vision-Language Models
Xiao Shi, Yangjun Ou, Zhenzhong Chen

TL;DR
SemPT introduces a semantic prompt tuning framework that leverages shared attribute-level knowledge and a two-step prompting strategy to improve transferability and generalization of vision-language models to unseen categories.
Contribution
The paper proposes a novel semantic prompt tuning method that enhances transferability by extracting shared attributes and aligning image and text embeddings for better unseen category recognition.
Findings
Achieves state-of-the-art results on 15 benchmark datasets.
Improves generalization to unseen categories in zero-shot and few-shot settings.
Effectively balances discrimination and transferability through attribute-enhanced embeddings.
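One way to picture how attribute-enhanced embeddings could trade off discrimination and transferability is the toy sketch below. It is not the paper's implementation: the blending rule, the `weight` parameter, the function names, and the 3-dimensional vectors are all assumptions made for illustration. The sketch blends a category-label embedding with the mean of its attribute-description embeddings, then scores an image embedding by cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def attribute_enhanced_embedding(label_emb, attribute_embs, weight=0.5):
    """Blend a label embedding with the mean attribute embedding.

    `weight` is a hypothetical knob: 0 keeps only the category label,
    1 keeps only the attribute descriptions.
    """
    dim = len(label_emb)
    mean_attr = [sum(e[i] for e in attribute_embs) / len(attribute_embs)
                 for i in range(dim)]
    label_n = l2_normalize(label_emb)
    attr_n = l2_normalize(mean_attr)
    mixed = [(1 - weight) * l + weight * a for l, a in zip(label_n, attr_n)]
    return l2_normalize(mixed)


def classify(image_emb, class_embs):
    """Return the best-scoring class name and all cosine scores."""
    scores = {name: cosine(image_emb, emb) for name, emb in class_embs.items()}
    return max(scores, key=scores.get), scores


# Toy data: both labels share the same embedding, so only the attribute
# embeddings ("striped" vs. "plain coat") can separate the two classes.
class_embs = {
    "zebra": attribute_enhanced_embedding([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]]),
    "horse": attribute_enhanced_embedding([1.0, 0.0, 0.0], [[0.0, 0.0, 1.0]]),
}
prediction, scores = classify([0.5, 0.8, 0.1], class_embs)
print(prediction, scores)
```

In a real system the vectors would come from the image and text encoders of a VLM; the toy vectors here only show the mechanics of the blend.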
Abstract
Visual transfer learning for unseen categories is an active yet challenging research topic, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide an LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic…
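The two-step prompting strategy named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' prompts or pipeline: the prompt wording, the function names, the `num_attributes` parameter, and the `fake_llm` stub (which stands in for a real LLM call) are all hypothetical. Step 1 asks for attributes shared across all categories; step 2 asks for one description per shared attribute for each category.

```python
from typing import Callable, Dict, List


def build_attribute_prompt(categories: List[str], num_attributes: int) -> str:
    """Step 1 prompt: ask for visual attributes shared across categories."""
    names = ", ".join(categories)
    return (f"Consider these visual categories: {names}. "
            f"List {num_attributes} visual attributes shared across them, "
            "one per line.")


def build_description_prompt(category: str, attributes: List[str]) -> str:
    """Step 2 prompt: ask for one description of the category per attribute."""
    attrs = ", ".join(attributes)
    return (f"Describe how a {category} looks in terms of each attribute: "
            f"{attrs}. Write one short description per line.")


def parse_lines(text: str) -> List[str]:
    """Split an LLM reply into non-empty lines, dropping list markers."""
    cleaned = (line.strip("-* \t") for line in text.splitlines())
    return [line for line in cleaned if line]


def two_step_prompting(categories: List[str],
                       llm: Callable[[str], str],
                       num_attributes: int = 5) -> Dict[str, List[str]]:
    """Run both steps and return attribute-level descriptions per category."""
    reply = llm(build_attribute_prompt(categories, num_attributes))
    attributes = parse_lines(reply)[:num_attributes]
    return {c: parse_lines(llm(build_description_prompt(c, attributes)))
            for c in categories}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for an LLM (assumption: a real model call goes here)."""
    if "shared across" in prompt:
        return "shape\ncolor"
    return "- round shape\n- red color"


descriptions = two_step_prompting(["apple", "cherry"], fake_llm, num_attributes=2)
print(descriptions)
```

Because every category is described along the same attribute list, the resulting descriptions are comparable across categories, which is the property the abstract contrasts with disparate per-category descriptions.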
