MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin

TL;DR
MaPPER is a multimodal prior-guided, parameter-efficient tuning framework for referring expression comprehension (REC). It reaches high accuracy while updating only a small fraction of the backbone parameters, using specialized adapters and prior-guided modules.
Contribution
The paper proposes MaPPER, a novel Parameter-Efficient Transfer Learning (PETL) framework that adds dynamic prior adapters and local convolution modules tailored to REC, improving both efficiency and accuracy over existing methods.
Findings
MaPPER outperforms full fine-tuning and other PETL methods on three benchmarks.
Achieves the best accuracy while tuning only 1.41% of the backbone parameters.
Demonstrates effective cross-modal alignment with prior-guided modules.
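To make the tunable-parameter figure above concrete, here is a minimal sketch of how such a percentage could be computed. It assumes a generic bottleneck adapter (a down-projection and an up-projection, each with a bias), not MaPPER's actual modules. The function names, layer sizes, and backbone size are hypothetical values chosen for illustration and are not taken from the paper.

```python
def adapter_param_count(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters in one generic bottleneck adapter (assumed design, not MaPPER's)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # down-projection weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # up-projection weights + bias
    return down + up


def tunable_percentage(tunable_params: int, backbone_params: int) -> float:
    """Tunable parameters expressed as a percentage of the frozen backbone size."""
    return 100.0 * tunable_params / backbone_params


if __name__ == "__main__":
    # Hypothetical configuration: 12 layers, hidden size 768, bottleneck 64,
    # and a backbone with 86M frozen parameters.
    per_layer = adapter_param_count(hidden_dim=768, bottleneck_dim=64)
    total_tunable = 12 * per_layer
    pct = tunable_percentage(total_tunable, 86_000_000)
    print(f"tunable parameters: {total_tunable}, share of backbone: {pct:.2f}%")
```

Under these assumed sizes the adapters add roughly one percent of the backbone's parameter count, which is the order of magnitude the finding refers to. The paper's own modules and accounting may differ.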
Abstract
Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, fully fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the domain-specific abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER.…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Convolution
