Modular and Parameter-Efficient Multimodal Fusion with Prompting
Sheng Liang, Mengjie Zhao, Hinrich Schütze

TL;DR
This paper introduces a prompt-based multimodal fusion method that is modular and parameter-efficient, performs comparably to several other fusion methods in low-resource settings, and offers an alternative to finetuning large-scale models.
Contribution
The paper proposes using prompt vectors to align modalities, yielding a fusion approach that is flexible, modular, and parameter-efficient.
Findings
The method achieves performance comparable to several other multimodal fusion methods in low-resource settings.
The method is modular and parameter-efficient for tasks involving two or more data modalities.
Prompt vectors are effective for aligning different data modalities.
Abstract
Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than finetuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings. We further show that our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
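The abstract does not spell out the architecture, so the following is only a minimal sketch of what prompt-based fusion could look like, not the authors' implementation. It assumes that a small set of trainable prompt vectors is concatenated with the outputs of a frozen modality encoder and the token embeddings of a frozen language model, and that only the prompt vectors are updated during training. The function names, the ordering of the sequence, and all sizes are hypothetical.

```python
import random


def make_prompt(num_prompts, dim, seed=0):
    """Create randomly initialised prompt vectors.

    Assumption: these are the only trainable parameters in this sketch.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]


def fuse(prompt, modality_features, text_embeddings):
    """Concatenate along the sequence axis: [prompt; modality; text]."""
    dim = len(prompt[0])
    for row in modality_features + text_embeddings:
        if len(row) != dim:
            raise ValueError("all vectors must share the same embedding dimension")
    return prompt + modality_features + text_embeddings


def count_params(vectors):
    """Number of scalar parameters in a list of vectors."""
    return sum(len(v) for v in vectors)


# Hypothetical sizes, chosen only for illustration.
DIM = 8
prompt = make_prompt(num_prompts=4, dim=DIM)
image_features = [[0.0] * DIM for _ in range(3)]   # stand-in for frozen image-encoder output
text_embeddings = [[1.0] * DIM for _ in range(5)]  # stand-in for frozen LM token embeddings

fused = fuse(prompt, image_features, text_embeddings)
print(len(fused), count_params(prompt))
```

Under these assumptions, supporting a further modality would only mean passing another block of encoder outputs into the sequence, and the trainable parameter count stays at the number of prompt vectors times the embedding dimension, independent of the size of the frozen models.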
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and dialogue systems
