Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Haixin Wang; Xinlong Yang; Jianlong Chang; Dian Jin; Jinan Sun; Shikun Zhang; Xiao Luo; Qi Tian

arXiv:2305.08381 · cs.CV · March 1, 2024 · 6 citations

TL;DR

This paper introduces Aurora, a lightweight prompt-tuning framework for large-scale multimodal models. It trains only about 0.1M parameters (0.04% of the pre-trained model's) while improving modality alignment, and it outperforms full fine-tuning on multiple benchmarks.

Contribution

The paper proposes Aurora, a low-parameter multimodal prompt-tuning method with novel modules that strengthen modality alignment, addressing both efficiency and effectiveness in cross-modal tasks.

Findings

01. Outperforms state-of-the-art methods on six benchmarks.

02. Trains parameters amounting to only 0.04% of the pre-trained model's parameter count.

03. Surpasses full fine-tuning results in experiments.

Abstract

Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. The core idea is to adapt the model to downstream tasks with only a small set of parameters. Recently, researchers have leveraged such proven techniques in multimodal tasks and achieved promising results. However, two critical issues remain unresolved: how to further reduce complexity with a lightweight design, and how to boost alignment between modalities with an extremely small number of parameters. In this paper, we propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize mode approximation to generate 0.1M trainable parameters to implement multimodal prompt tuning, which explores the low intrinsic dimension…
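
The abstract's mention of mode approximation and a low intrinsic dimension can be made concrete with a small sketch. The abstract is cut off above, so this is not the authors' implementation. It assumes a generic CP (canonical polyadic) factorization of a stack of per-layer weight updates into three factor matrices, which is one common reading of "mode approximation". The function names, tensor shape, and rank are all hypothetical and chosen only to show how the trainable parameter count shrinks.

```python
# Illustrative sketch only: a generic CP-style ("mode") factorization of a
# stack of weight updates. All names and sizes are hypothetical, not Aurora's.

def dense_param_count(n_layers, d_out, d_in):
    """Values needed to store one full d_out x d_in update per layer."""
    return n_layers * d_out * d_in

def cp_param_count(n_layers, d_out, d_in, rank):
    """Values in the three factor matrices of a rank-`rank` CP factorization."""
    return rank * (n_layers + d_out + d_in)

def cp_layer_update(U, V, P, layer):
    """Rebuild one layer's update: sum_r P[layer][r] * U[i][r] * V[j][r]."""
    rank = len(P[layer])
    return [[sum(P[layer][r] * U[i][r] * V[j][r] for r in range(rank))
             for j in range(len(V))]
            for i in range(len(U))]

# Hypothetical sizes chosen for illustration (not taken from the paper).
n_layers, d_out, d_in, rank = 48, 768, 768, 64
dense = dense_param_count(n_layers, d_out, d_in)
factored = cp_param_count(n_layers, d_out, d_in, rank)
print(f"dense: {dense:,}  factored: {factored:,}  ratio: {factored / dense:.4%}")

# Tiny rank-1 example: U is 2x1, V is 2x1, P holds one coefficient per layer.
U = [[1.0], [2.0]]
V = [[3.0], [4.0]]
P = [[2.0]]
tiny = cp_layer_update(U, V, P, layer=0)
```

In this kind of parameterization only the factor matrices are trained, so the cost grows with the sum of the tensor's dimensions rather than their product. Whether Aurora shares factors across layers or modalities in this way cannot be determined from the truncated abstract.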

Videos

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques