Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang

TL;DR
This paper introduces MemVP, a visual prompting method that injects visual knowledge into language models by concatenating visual prompts with the weights of the feed-forward network (FFN) instead of adding them to the input sequence, improving both efficiency and performance on vision-language tasks.
Contribution
MemVP presents a new approach to visual prompting by integrating visual prompts into FFN weights, reducing training time and inference latency while outperforming previous PEFT methods.
Findings
MemVP reduces training time and inference latency.
MemVP surpasses previous PEFT methods in performance.
Treating visual prompts as injected knowledge, rather than as extra input tokens, improves efficiency on VL tasks.
Abstract
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders to the input space of pre-trained language models as visual prompts; and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm still exhibits inefficiency since it significantly increases the input length of the language models. In this paper, in contrast to integrating visual prompts into inputs, we regard visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information. Motivated by the finding that the Feed-Forward Network (FFN) of language models acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with…
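The key-value memory view described in the abstract can be illustrated with a minimal sketch. It assumes an FFN of the form act(x · keys) · values and appends visual prompts as extra memory slots, so the token sequence itself does not grow. All function names, the `scale` parameter, and the choice of GELU are assumptions for illustration, not the authors' implementation; the real method may include a projector, position embeddings, or a different scaling.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(v):
    """Tanh approximation of GELU; any elementwise activation would do here."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def ffn(x, keys, values):
    """FFN in its key-value memory form: act(x @ keys) @ values.

    x: n_tokens x d_model, keys: d_model x n_slots, values: n_slots x d_model.
    """
    hidden = [[gelu(h) for h in row] for row in matmul(x, keys)]
    return matmul(hidden, values)


def ffn_with_visual_memory(x, keys, values, visual_prompts, scale=1.0):
    """Append visual prompts to the FFN weights as extra memory slots.

    visual_prompts: n_visual x d_model, assumed already projected to d_model.
    `scale` is a hypothetical knob, not taken from the source.
    """
    scaled = [[scale * v for v in row] for row in visual_prompts]
    extra_keys = [list(col) for col in zip(*scaled)]  # d_model x n_visual
    # Concatenate along the slot axis: new columns for keys, new rows for values.
    ext_keys = [k_row + e_row for k_row, e_row in zip(keys, extra_keys)]
    ext_values = values + scaled
    return ffn(x, ext_keys, ext_values)


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, n_tokens, n_visual = 4, 8, 3, 2
x = random_matrix(n_tokens, d_model, rng)
keys = random_matrix(d_model, d_ff, rng)
values = random_matrix(d_ff, d_model, rng)
visual_prompts = random_matrix(n_visual, d_model, rng)

out = ffn_with_visual_memory(x, keys, values, visual_prompts)
# The text sequence length is unchanged: one output row per input token.
print(len(out), len(out[0]))
```

Because the activation is applied elementwise, the extended FFN output in this sketch equals the original FFN output plus a term contributed only by the visual slots. The input stays at `n_tokens` rows, whereas input-space prompting would lengthen it to `n_tokens + n_visual`.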
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications
