CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Yuan Yao; Ao Zhang; Zhengyan Zhang; Zhiyuan Liu; Tat-Seng Chua; Maosong Sun

arXiv:2109.11797 · cs.CV · May 23, 2022 · 86 cites

TL;DR

CPT is a prompt tuning method for pre-trained vision-language models that reformulates visual grounding as a fill-in-the-blank task, using color-based markers shared between image and text. This enables strong few-shot and even zero-shot visual grounding.

Contribution

The paper proposes Cross-modal Prompt Tuning (CPT), a new paradigm for tuning VL-PTMs that narrows the gap between the objective forms of pre-training and fine-tuning by placing color-based co-referential markers in both image and text.

Findings

01 Outperforms fine-tuning methods with large accuracy gains.

02 Enables strong few-shot and zero-shot visual grounding capabilities.

03 Reduces standard deviation in performance, indicating more stable results.

Abstract

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that the prompt-tuned…
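The fill-in-the-blank reformulation described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: the palette `COLOR_NAMES`, the template wording in `build_prompt`, and the `score_color` callback, which stands in for a VL-PTM that has been given the image with each candidate region painted in its assigned color and that returns a score for a color word at the masked position.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) region proposal

# Hypothetical palette: one color word per candidate region.
COLOR_NAMES: List[str] = ["red", "green", "blue", "yellow", "purple"]


def assign_colors(boxes: List[Box]) -> Dict[str, Box]:
    """Pair each region proposal with a distinct color word.

    Painting the regions onto the image itself is omitted here; the
    scorer is assumed to see the colored image.
    """
    if len(boxes) > len(COLOR_NAMES):
        raise ValueError("more regions than colors; split into batches")
    return dict(zip(COLOR_NAMES, boxes))


def build_prompt(query: str, mask_token: str = "[MASK]") -> str:
    """Turn a referring expression into a fill-in-the-blank sentence
    (assumed template wording)."""
    return f"{query} is in {mask_token} color"


def ground(
    query: str,
    boxes: List[Box],
    score_color: Callable[[str, str], float],
) -> Box:
    """Return the region whose color word the scorer prefers for the blank."""
    color_to_box = assign_colors(boxes)
    prompt = build_prompt(query)
    scores = {color: score_color(prompt, color) for color in color_to_box}
    best_color = max(scores, key=scores.get)
    return color_to_box[best_color]
```

Because the grounding decision reduces to choosing a color word for the blank, the same routine can be called with a frozen pre-trained scorer (zero-shot) or with one tuned on a handful of examples (few-shot); only `score_color` changes.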

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling