M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
Taowen Wang, Yiyang Liu, James Chenhao Liang, Junhan Zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu

TL;DR
This paper introduces M$^2$PT, a multimodal prompt tuning method that enhances zero-shot instruction learning in large multimodal models by efficiently integrating visual and textual prompts during finetuning.
Contribution
The paper proposes a novel multimodal prompt tuning approach that effectively incorporates prompts into vision and language components, improving zero-shot generalization in MLLMs.
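As a rough illustration of what prompt tuning means here, the sketch below prepends a small set of learnable prompt vectors to the inputs of a vision branch and a language branch while the underlying embeddings stay frozen. It is a generic prompt-tuning toy, not the paper's architecture: the helper names (`make_prompts`, `prepend_prompts`), the dimensions, and the prompt counts are all assumptions chosen for readability.

```python
import random


def make_prompts(num_prompts, dim, seed=0):
    """Hypothetical helper: small random init for learnable prompt vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]


def prepend_prompts(prompts, tokens):
    """Place the prompt vectors in front of the input token embeddings."""
    if prompts and tokens and len(prompts[0]) != len(tokens[0]):
        raise ValueError("prompt and token embedding widths must match")
    return prompts + tokens


# Assumed toy sizes, not taken from the paper.
VISION_DIM, TEXT_DIM = 8, 6

# The only trainable parameters in this sketch are the two prompt sets.
visual_prompts = make_prompts(num_prompts=4, dim=VISION_DIM, seed=1)
textual_prompts = make_prompts(num_prompts=3, dim=TEXT_DIM, seed=2)

# Stand-ins for frozen image-patch and word embeddings.
patch_embeddings = [[0.0] * VISION_DIM for _ in range(16)]
word_embeddings = [[0.0] * TEXT_DIM for _ in range(10)]

vision_input = prepend_prompts(visual_prompts, patch_embeddings)
text_input = prepend_prompts(textual_prompts, word_embeddings)

# Count of trainable scalars: 4 * 8 + 3 * 6
trainable = len(visual_prompts) * VISION_DIM + len(textual_prompts) * TEXT_DIM
print(len(vision_input), len(text_input), trainable)  # prints: 20 13 50
```

In this toy setup only 50 scalars would be updated during finetuning, which is the general reason prompt tuning counts as parameter-efficient; how the paper's method places and connects its prompts is not shown here.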
Findings
Outperforms state-of-the-art baselines on multiple multimodal datasets.
Demonstrates the efficiency of prompt-based finetuning over traditional methods.
Ablation studies confirm the effectiveness of the prompt design.
Abstract
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training · Focus
