Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak; Muhammad Ferjad Naeem; Muzammal Naseer; Luc Van Gool; Federico Tombari

arXiv:2401.02418 · cs.CV · January 5, 2024 · 1 citation

TL;DR

This paper introduces a novel method for learning generalized prompts for vision-language models using only text data from large language models, enabling zero-shot transfer and reducing prompt engineering costs.

Contribution

It proposes the first approach to learn prompts solely from text data, enhancing transferability and reducing reliance on labeled images for vision-language models.

Findings

01. Outperforms prior ensembling methods on 4 benchmarks

02. Achieves competitive results without using labeled images

03. Enables zero-shot transfer of prompts to new classes and datasets

Abstract

Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods by generating class descriptions from large language models (LLMs) and performing prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher costs because LLM descriptions must be generated for each class separately. In this work, we propose to combine…
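
The training-free prompt ensembling that the abstract mentions as prior work can be illustrated with a minimal sketch. It is not the method proposed in the paper. It only shows the general idea of averaging the text embeddings of several descriptions of a class and classifying an image by cosine similarity. The function names and the toy three-dimensional vectors are hypothetical stand-ins for the outputs of a CLIP-style text and image encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_class_embedding(description_embeddings):
    """Average the unit-normalized embeddings of several descriptions of one class."""
    normalized = [l2_normalize(e) for e in description_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify(image_embedding, class_embeddings):
    """Return the best class and the cosine similarity to every class."""
    image = l2_normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Assumption: toy vectors stand in for text-encoder outputs of
# LLM-generated descriptions (two descriptions per class).
toy_text_embeddings = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}

class_embeddings = {
    name: ensemble_class_embedding(embs)
    for name, embs in toy_text_embeddings.items()
}

# Assumption: toy vector standing in for an image-encoder output.
toy_image_embedding = [0.7, 0.3, 0.05]
predicted, scores = classify(toy_image_embedding, class_embeddings)
print(predicted, scores)
```

Because each class needs its own set of descriptions in this scheme, adding a new class means generating and embedding new descriptions, which is the per-class cost the abstract points out.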

Code & Models

Repositories

muzairkhattak/protext (PyTorch, official)

Videos

Learning to Prompt with Text Only Supervision for Vision-Language Models · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training