M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Taowen Wang; Yiyang Liu; James Chenhao Liang; Junhan Zhao; Yiming Cui; Yuning Mao; Shaoliang Nie; Jiahao Liu; Fuli Feng; Zenglin Xu; Cheng Han; Lifu Huang; Qifan Wang; Dongfang Liu

arXiv:2409.15657 · cs.AI · October 31, 2024

TL;DR

This paper introduces M$^2$PT, a multimodal prompt tuning method that enhances zero-shot instruction learning in large multimodal models by efficiently integrating visual and textual prompts during finetuning.

Contribution

The paper proposes a novel multimodal prompt tuning approach that effectively incorporates prompts into vision and language components, improving zero-shot generalization in MLLMs.

Findings

01. Outperforms state-of-the-art baselines on multiple multimodal datasets.

02. Demonstrates the efficiency of prompt-based finetuning over traditional methods.

03. Ablation studies confirm the effectiveness of the prompt design.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during…
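The abstract says that visual and textual prompts are integrated into the vision encoder and the language processor, but the visible text does not say how. As a rough illustration only, the sketch below shows the generic prompt-tuning idea of prepending a small set of trainable vectors to an otherwise unchanged input sequence, once per modality. The helper names (`make_prompts`, `prepend_prompts`), the prompt counts, and the embedding sizes are all hypothetical and are not taken from the paper.

```python
import random


def make_prompts(num_prompts, dim, seed=0):
    """Create a small table of prompt vectors (hypothetical random init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]


def prepend_prompts(prompts, tokens):
    """Return a new sequence with the prompt vectors placed before the tokens."""
    if prompts and tokens and len(prompts[0]) != len(tokens[0]):
        raise ValueError("prompt and token dimensions must match")
    return [list(p) for p in prompts] + [list(t) for t in tokens]


# Toy sizes, chosen only for illustration (assumptions, not the paper's settings).
visual_prompts = make_prompts(num_prompts=4, dim=8, seed=1)
textual_prompts = make_prompts(num_prompts=2, dim=16, seed=2)

# Stand-ins for the frozen model's patch and token embeddings.
image_patches = [[0.0] * 8 for _ in range(9)]
text_tokens = [[0.0] * 16 for _ in range(5)]

vision_input = prepend_prompts(visual_prompts, image_patches)
language_input = prepend_prompts(textual_prompts, text_tokens)

# In prompt tuning, only the prompt vectors would be updated during finetuning.
trainable = sum(len(p) * len(p[0]) for p in (visual_prompts, textual_prompts))
print(len(vision_input), len(language_input), trainable)
```

In this toy setup the only trainable values are the entries of the two prompt tables, which is what makes prompt-based finetuning parameter-efficient compared with updating the full model.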

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training · Focus