Revisiting Prompt Pretraining of Vision-Language Models

Zhenyuan Chen; Lingfeng Yang; Shuo Chen; Zhaowei Chen; Jiajun Liang; Xiang Li

arXiv:2409.06166·cs.CV·September 11, 2024

Open Access

TL;DR

This paper introduces Revisiting Prompt Pretraining (RPP), a framework that enhances prompt learning for vision-language models by improving fitting capacity and generalization through unshared prompt structures and soft label supervision, achieving state-of-the-art results.

Contribution

The paper proposes a novel RPP framework that improves prompt pretraining by unsharing prompt components and leveraging soft labels from a CLIP teacher, enhancing transferability and performance.
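One way to picture "unsharing prompt components" is the toy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `shared_qkv`, `unshared_qkv`), the 2-dimensional vectors, and the choice to pass each prompt through its own projection matrix are all hypothetical. In the shared case a single learnable prompt vector feeds the query, key, and value projections; in the unshared case each role gets its own learnable vector, which triples the prompt parameters for that token.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def shared_qkv(prompt, w_q, w_k, w_v):
    """Common practice (assumed): one prompt vector yields q, k, and v."""
    return matvec(w_q, prompt), matvec(w_k, prompt), matvec(w_v, prompt)


def unshared_qkv(prompt_q, prompt_k, prompt_v, w_q, w_k, w_v):
    """Unshared variant (assumed): a separate learnable prompt per role."""
    return matvec(w_q, prompt_q), matvec(w_k, prompt_k), matvec(w_v, prompt_v)


# Hypothetical frozen projection matrices of a single attention layer.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

# Shared: 2 learnable numbers drive all three vectors.
prompt = [0.5, -1.0]
q, k, v = shared_qkv(prompt, W_Q, W_K, W_V)

# Unshared: 6 learnable numbers, so q, k, and v can move independently.
q_u, k_u, v_u = unshared_qkv([0.5, -1.0], [0.1, 0.2], [-0.3, 0.4], W_Q, W_K, W_V)
```

When the three unshared prompts are set to the same vector, the two functions agree, so in this sketch the shared design is a special case of the unshared one. That is consistent with the claim that unsharing can only add fitting capacity.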

Findings

01. RPP achieves SOTA performance across various benchmarks.

02. Unshared prompt structures increase model fitting capacity.

03. Soft label supervision improves generalization.
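The soft label supervision named in the findings can be sketched as a distillation-style loss. This is a minimal sketch under assumptions: the summary only says that soft labels come from a CLIP teacher, so the KL-divergence form, the `temperature` parameter, and the function names here are illustrative choices rather than the paper's confirmed formulation. The hard-label cross-entropy is included only for contrast.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student): penalizes deviation from the teacher's
    full class distribution rather than from a single one-hot label."""
    p = softmax(teacher_logits, temperature)  # soft labels from the teacher
    q = softmax(student_logits, temperature)  # prompt-tuned student prediction
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def hard_label_loss(student_logits, label):
    """Standard cross-entropy against a one-hot label, for comparison."""
    return -math.log(softmax(student_logits)[label])


# Hypothetical logits over three classes.
teacher = [2.0, 1.5, -1.0]  # teacher sees classes 0 and 1 as similar
student = [3.0, 0.0, 0.0]   # student is overconfident in class 0

print("soft-label loss:", soft_label_loss(student, teacher))
print("hard-label loss:", hard_label_loss(student, 0))
```

The soft target keeps the teacher's information about which classes resemble each other, whereas the one-hot target discards it. Preserving that similarity structure is one plausible reason soft supervision could help generalization.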

Abstract

Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks, and it involves tuning only a small number of parameters in the input prompt tokens. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, on revisiting this practice, we observe that the limited number of learnable prompts risks underfitting the extensive image data seen during prompt pretraining, which in turn leads to poor generalization. To address these issues, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where query, key, and value vectors are derived from the shared…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques