Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

Cheng-En Wu; Yu Tian; Haichao Yu; Heng Wang; Pedro Morgado; Yu Hen Hu; Linjie Yang

arXiv:2307.11978 · cs.CV · July 25, 2023 · 1 citation

TL;DR

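Prompt tuning for vision-language models such as CLIP is highly robust to noisy labels for two reasons: the fixed classname tokens strongly regularize the optimization, and the pre-trained image-text embedding supplies strong prior knowledge. This robustness lets CLIP tune its own prompt using its noisy zero-shot predictions.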

Contribution

This paper identifies why prompt tuning is robust to noisy labels and shows that CLIP's own noisy zero-shot predictions can be used to tune its prompt, improving accuracy without supervision.

Findings

01 Prompt tuning is highly robust to label noise.

02 Fixed class tokens act as regularizers, reducing the impact of gradients from noisy samples.

03 Pre-trained embeddings provide strong prior knowledge for classification.
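
The idea of fixed class tokens can be illustrated with a minimal sketch. It assumes a prompt made of learnable context vectors followed by classname token embeddings that are never updated. The names PromptToken, build_prompt, and sgd_step, as well as all vectors and gradients, are hypothetical and chosen for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptToken:
    vector: List[float]
    trainable: bool  # True for learnable context, False for fixed classname tokens


def build_prompt(context: List[List[float]],
                 classname: List[List[float]]) -> List[PromptToken]:
    """Concatenate learnable context vectors with fixed classname embeddings."""
    learnable = [PromptToken(list(v), True) for v in context]
    fixed = [PromptToken(list(v), False) for v in classname]
    return learnable + fixed


def sgd_step(prompt: List[PromptToken],
             grads: List[List[float]], lr: float) -> None:
    """Apply one gradient step in place, skipping the fixed tokens."""
    for token, grad in zip(prompt, grads):
        if not token.trainable:
            continue  # classname tokens stay exactly as initialized
        token.vector = [w - lr * g for w, g in zip(token.vector, grad)]


# Toy values (assumptions): two context tokens, one classname token.
context = [[0.1, 0.2], [0.3, 0.4]]
classname = [[1.0, -1.0]]
prompt = build_prompt(context, classname)

# Made-up gradients; the large one on the classname token is ignored.
grads = [[1.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
sgd_step(prompt, grads, lr=0.1)
```

In this toy setup, a gradient caused by a mislabeled sample can only move the context vectors; the classname embedding keeps anchoring the prompt to its class.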

Abstract

Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noise. This motivates us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conduct extensive experiments to explore this property and find that the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding, learned from diverse and generic web data, provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly…
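
As a rough illustration of how noisy zero-shot predictions could serve as training labels, the sketch below assigns each image the class whose text feature has the highest cosine similarity. The feature values are made-up stand-ins for CLIP image and text embeddings, and the function names are hypothetical; the exact procedure used in the paper is not shown in this excerpt.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_pseudo_labels(image_features: Sequence[Sequence[float]],
                            text_features: Sequence[Sequence[float]]) -> List[int]:
    """Label each image with the class of its most similar text feature."""
    texts = [l2_normalize(t) for t in text_features]
    labels = []
    for img in image_features:
        img_n = l2_normalize(img)
        # Cosine similarity between the image and every class text feature.
        sims = [sum(a * b for a, b in zip(img_n, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy features (assumptions): three images, two classes.
image_features = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
text_features = [[1.0, 0.0], [0.0, 1.0]]
pseudo_labels = zero_shot_pseudo_labels(image_features, text_features)
```

The resulting labels may be wrong for ambiguous images such as the third one here, which is the kind of label noise that prompt tuning is reported to tolerate.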

Code & Models

Repositories

cewu/ptnl (PyTorch, official)

Videos

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training