Can Better Text Semantics in Prompt Tuning Improve VLM Generalization?
Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N Balasubramanian

TL;DR
This paper proposes a prompt-tuning method that uses class descriptions generated by Large Language Models (LLMs) to improve the generalization of vision-language models (VLMs), especially in low-shot settings and large class spaces.
Contribution
It introduces a novel prompt-tuning approach that leverages LLM-generated class descriptions to enhance semantic alignment and generalization in VLMs.
Findings
Outperforms existing methods on 11 benchmark datasets.
Significant improvements in low-shot settings and large class spaces.
Enhanced alignment of image and text features through class descriptions.
Abstract
Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite their potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability, and yielding weaker performance on newer classes or datasets; (ii) prompt-tuning's efficacy heavily relies on the label space, with decreased performance in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge image and text modalities. Our approach constructs part-level…
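The abstract is cut off before the architecture is described, so the following is not the authors' method. It is a toy sketch, under stated assumptions, of the two ingredients the abstract names. The first is a frozen vision-language backbone that scores an image against each class by cosine similarity, where each class's text feature is built from several LLM-generated description embeddings. The second is prompt tuning, in which only a small learnable context vector is trained on a few labeled shots. Everything here is hypothetical: the 3-dimensional vectors stand in for CLIP-style encoder outputs, the text encoder is replaced by a linear stand-in (mean of the description embeddings plus a context vector `ctx`, then L2 normalization), and the names `class_logits`, `loss_and_grad`, `tune_context`, `predict`, the temperature `tau`, and the toy data are illustrative choices.

```python
import math

# Hypothetical toy setup: 3-d vectors stand in for outputs of frozen
# CLIP-style encoders. Each class is represented by embeddings of several
# LLM-generated descriptions. The "text encoder" is an assumed linear stand-in:
# class feature = normalize(mean(description embeddings) + ctx).
DESCS = [
    [[1.0, 0.2, 0.0], [0.8, 0.0, 0.3]],  # class 0 description embeddings
    [[0.0, 1.0, 0.1], [0.2, 0.9, 0.0]],  # class 1 description embeddings
]
SHOTS = [([0.9, 0.3, 0.2], 0), ([0.1, 0.8, 0.4], 1)]  # few-shot (image, label)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit L2 norm, so dot products become cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def class_logits(image_emb, descs, ctx, tau=10.0):
    """Temperature-scaled cosine similarity between the image and each class."""
    x = normalize(image_emb)
    logits, feats = [], []
    for class_descs in descs:
        v = [m + c for m, c in zip(mean(class_descs), ctx)]
        norm = math.sqrt(dot(v, v))
        u = [vi / norm for vi in v]
        logits.append(tau * dot(x, u))
        feats.append((norm, u))  # kept for the gradient computation
    return x, logits, feats


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(image_emb, label, descs, ctx, tau=10.0):
    """Cross-entropy loss and its gradient with respect to ctx only."""
    x, logits, feats = class_logits(image_emb, descs, ctx, tau)
    p = softmax(logits)
    loss = -math.log(p[label])
    grad = [0.0] * len(ctx)
    for c, (norm, u) in enumerate(feats):
        # dL/dlogit_c = p_c - [c == label]; for u = v / |v|,
        # d(x . u)/dv = (x - (x . u) u) / |v|, and dv/dctx is the identity.
        coeff = (p[c] - (1.0 if c == label else 0.0)) * tau / norm
        xu = dot(x, u)
        for i in range(len(ctx)):
            grad[i] += coeff * (x[i] - xu * u[i])
    return loss, grad


def tune_context(shots, descs, dim, lr=0.005, steps=200, tau=10.0):
    """Few-shot prompt tuning: gradient descent on ctx; encoders stay frozen."""
    ctx = [0.0] * dim
    for _ in range(steps):
        total = [0.0] * dim
        for image_emb, label in shots:
            _, g = loss_and_grad(image_emb, label, descs, ctx, tau)
            total = [t + gi for t, gi in zip(total, g)]
        ctx = [c - lr * t / len(shots) for c, t in zip(ctx, total)]
    return ctx


def predict(image_emb, descs, ctx, tau=10.0):
    """Index of the class whose description-based feature best matches the image."""
    _, logits, _ = class_logits(image_emb, descs, ctx, tau)
    return max(range(len(logits)), key=lambda c: logits[c])


if __name__ == "__main__":
    ctx = tune_context(SHOTS, DESCS, dim=3)
    for image_emb, label in SHOTS:
        print(label, predict(image_emb, DESCS, ctx))
```

With `ctx` at zero, the model reduces to zero-shot classification against averaged description embeddings (prompt-ensemble style). Tuning then updates only the shared context vector, which is what makes prompt tuning cheap compared with fine-tuning the encoders. A real implementation would instead prepend learnable context tokens in the text encoder's input space and backpropagate through a transformer rather than use the hand-derived gradient here.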
