Tuning Multi-mode Token-level Prompt Alignment across Modalities

Dongsheng Wang; Miaoge Li; Xinyang Liu; MingSheng Xu; Bo Chen; Hanwang Zhang

arXiv:2309.13847·cs.CV·October 27, 2023·2 cites

Open Access · 1 Video

TL;DR

This paper introduces a multi-mode token-level prompt tuning framework for vision-language models, leveraging optimal transportation to improve semantic alignment and diversity across modalities, resulting in better generalization and few-shot learning.

Contribution

It proposes a novel multi-mode token-level prompt tuning method that captures diverse semantic representations and fine-grained alignment using optimal transportation, surpassing prior single-mode approaches.

Findings

01. Outperforms existing methods on image recognition benchmarks.

02. Enhances few-shot learning capabilities.

03. Learns prompt tokens that capture diverse visual concepts.

Abstract

Advances in prompt tuning of vision-language models have underscored their potential to enhance open-world visual concept comprehension. However, prior works focus primarily on single-mode (only one prompt for each modality) and holistic-level (image or sentence) semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery. To address this limitation, we propose a multi-mode token-level tuning framework that leverages optimal transportation to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets.…
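
The hierarchical transportation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes entropy-regularised optimal transport solved with Sinkhorn iterations, a cosine cost between token embeddings, and uniform weights over tokens and over prompts. The function names (sinkhorn, token_level_ot, hierarchical_ot), the regularisation strength eps, and the random embeddings are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between histograms a and b."""
    K = np.exp(-cost / eps)                 # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                   # scale columns towards marginal b
        u = a / (K @ v)                     # scale rows towards marginal a
    return u[:, None] * K * v[None, :]

def cosine_cost(x, y):
    """Pairwise cost 1 - cos(x_i, y_j) between two sets of embeddings."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def token_level_ot(x_tokens, y_tokens, eps=0.1):
    """Inner problem: transport cost between two token sets (assumed uniform weights)."""
    cost = cosine_cost(x_tokens, y_tokens)
    a = np.full(len(x_tokens), 1.0 / len(x_tokens))
    b = np.full(len(y_tokens), 1.0 / len(y_tokens))
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum())

def hierarchical_ot(visual_sets, text_sets, eps=0.1):
    """Outer problem: transport between prompts, with token-level OT as the cost."""
    cost = np.array([[token_level_ot(v, t, eps) for t in text_sets]
                     for v in visual_sets])
    a = np.full(len(visual_sets), 1.0 / len(visual_sets))
    b = np.full(len(text_sets), 1.0 / len(text_sets))
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum()), plan

# Hypothetical data: 3 visual prompts with 5 tokens each, 2 textual prompts
# with 4 tokens each, all embedded in 8 dimensions.
rng = np.random.default_rng(0)
visual_sets = [rng.normal(size=(5, 8)) for _ in range(3)]
text_sets = [rng.normal(size=(4, 8)) for _ in range(2)]
distance, plan = hierarchical_ot(visual_sets, text_sets)
```

In this sketch the inner problem compares individual tokens (the fine-grained, token-level alignment) and the outer problem matches whole prompts (the multi-mode part). A lower distance indicates that the visual and textual prompt sets are better aligned, so a similarity score could be derived from its negative.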

Videos

Tuning Multi-mode Token-level Prompt Alignment across Modalities · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques