Modular and Parameter-Efficient Multimodal Fusion with Prompting

Sheng Liang; Mengjie Zhao; Hinrich Schütze

arXiv:2203.08055·cs.CL·March 16, 2022

Open Access

TL;DR

This paper introduces a prompt-based multimodal fusion method that is modular and parameter-efficient, and that performs comparably to several other fusion methods in low-resource settings, offering an alternative to finetuning large-scale models.

Contribution

It proposes using prompt vectors to align modalities, giving a flexible fusion approach that is modular and parameter-efficient for tasks involving two or more data modalities.

Findings

01. Achieves performance comparable to existing methods in low-resource scenarios

02. Demonstrates modularity and parameter efficiency for multimodal tasks

03. Aligns different data modalities effectively using prompt vectors

Abstract

Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than finetuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings. We further show that our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
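
The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of prompt-based fusion, under stated assumptions: a small set of trainable prompt vectors is concatenated with features from a frozen encoder for another modality and with the text embeddings of a frozen language model. The names `HIDDEN`, `NUM_PROMPTS`, and `fuse`, the concatenation order, and the random stand-in features are all hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16       # hypothetical embedding width of the frozen language model
NUM_PROMPTS = 4   # hypothetical number of trainable prompt vectors

# In this sketch the prompt vectors are the only trainable parameters.
prompt_vectors = rng.normal(size=(NUM_PROMPTS, HIDDEN))


def fuse(prompts, modality_features, text_embeddings):
    """Concatenate prompts, other-modality features, and text embeddings
    along the sequence axis (the ordering is an assumption)."""
    return np.concatenate([prompts, modality_features, text_embeddings], axis=0)


# Random stand-ins for the outputs of frozen, pre-trained components.
image_features = rng.normal(size=(3, HIDDEN))   # e.g. 3 image feature vectors
text_embeddings = rng.normal(size=(5, HIDDEN))  # e.g. 5 token embeddings

fused = fuse(prompt_vectors, image_features, text_embeddings)
trainable_params = prompt_vectors.size

print(fused.shape)        # sequence length x hidden size
print(trainable_params)   # number of trainable values in this sketch
```

In this toy setup the number of trainable values depends only on the number of prompt vectors and the hidden size, not on the size of the frozen models, which is one way to read the abstract's claim of parameter efficiency. Swapping in features from a different encoder leaves `fuse` unchanged, which illustrates the modularity claim in the same loose sense.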

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and dialogue systems