Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng

TL;DR
This paper introduces a Knowledge-Aware Prompt Tuning framework for vision-language models that incorporates external knowledge to improve generalization to unseen classes, especially in few-shot image classification tasks.
Contribution
The paper proposes a novel knowledge-aware prompt tuning method that leverages external knowledge and visual cues to enhance model generalization to unseen categories.
Findings
Significant improvement in unseen class generalization
Achieves 3.22% absolute gain over state-of-the-art on new classes
Effective across 11 benchmark datasets
Abstract
Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning. Recently, learnable prompts achieve state-of-the-art performance, which however are prone to overfit to seen classes, failing to generalize to unseen classes. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts the key information from descriptions of an object category, and the learned continuous prompt captures overall contexts. We further…
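The two prompt types described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the stopword list, the frequency-based keyword picker (a toy stand-in for whatever extraction method the authors use), the "a photo of a ..." template, and the temperature value. The continuous prompt is shown only as learnable vectors prepended to token embeddings, and the final step is a CLIP-style cosine-similarity softmax over per-class text features.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "and", "with", "is", "in", "to", "that", "its", "on", "for", "or"}


def extract_keywords(description, top_k=3):
    """Toy keyword picker: most frequent non-stopword tokens, ties broken by first appearance."""
    tokens = [t for t in re.findall(r"[a-z]+", description.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    first_seen = {}
    for i, tok in enumerate(tokens):
        first_seen.setdefault(tok, i)
    ranked = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    return ranked[:top_k]


def build_discrete_prompt(class_name, description, top_k=3):
    """Discrete prompt: a hand-written template plus keywords from the class description."""
    keywords = [k for k in extract_keywords(description, top_k) if k != class_name.lower()]
    prompt = f"a photo of a {class_name}"
    if keywords:
        prompt += ", " + ", ".join(keywords)
    return prompt


def prepend_context(context_vectors, token_embeddings):
    """Continuous prompt: learnable context vectors placed in front of the token embeddings."""
    return list(context_vectors) + list(token_embeddings)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_feature, text_features, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (temperature is illustrative)."""
    logits = [cosine(image_feature, t) / temperature for t in text_features]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    desc = "The tabby cat is a cat with a striped coat and striped tail"
    print(build_discrete_prompt("cat", desc))
    print(classify([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In a real system the text features would come from the text encoder applied to the composed prompts, and the context vectors would be trained on the seen classes; the sketch only shows how the pieces fit together.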
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
