CoPL: Contextual Prompt Learning for Vision-Language Understanding
Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, and Balaji Vasan Srinivasan

TL;DR
CoPL introduces a novel prompt learning framework that leverages local image features and dynamic weighting to improve vision-language understanding, especially in out-of-distribution and few-shot scenarios.
Contribution
The paper proposes Contextual Prompt Learning (CoPL), which aligns prompts with local image features and learns to reweight prompts based on image semantics, enhancing model generalization.
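The reweighting idea can be illustrated with a small sketch. This is not the authors' implementation: the function name `reweight_prompts`, the dot-product similarity, the softmax normalization, and the averaging over patches are all assumptions chosen for illustration, and the toy vectors are made up. It only shows how local (patch-level) features could yield per-prompt weights in place of equal weighting.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reweight_prompts(local_feats, prompts):
    """Hypothetical helper: weight prompts by their similarity to local features.

    local_feats: list of patch feature vectors (one per image region).
    prompts:     list of prompt vectors with the same dimensionality.
    Returns (weights, context), where weights sum to 1 and context is the
    weighted combination of the prompt vectors.
    """
    weights = [0.0] * len(prompts)
    for feat in local_feats:
        # Attention of this patch over all prompts (assumed: dot product + softmax).
        attn = softmax([dot(feat, p) for p in prompts])
        for i, a in enumerate(attn):
            weights[i] += a / len(local_feats)  # average over patches
    dim = len(prompts[0])
    context = [sum(w * p[d] for w, p in zip(weights, prompts)) for d in range(dim)]
    return weights, context

# Toy data: both patches resemble the first prompt more than the second.
local_feats = [[1.0, 0.0], [0.9, 0.1]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
weights, context = reweight_prompts(local_feats, prompts)
```

In this toy case the first prompt receives the larger weight because both patches align with it more closely; with equal weighting, each prompt would contribute 0.5 regardless of image content.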
Findings
Significantly outperforms existing methods on standard datasets.
Improves few-shot learning performance.
Enhances out-of-distribution generalization.
Abstract
Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained based on global image features, which is limiting in two respects: First, by using global features, these prompts could be focusing less on the discriminative foreground image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally whereas, intuitively, prompts should be reweighted according to the semantics of the image. We address these as part of our proposed Contextual Prompt Learning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
