Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning

Shibo Jie; Yehui Tang; Ning Ding; Zhi-Hong Deng; Kai Han; Yunhe Wang

arXiv:2405.05615 · cs.CV · May 10, 2024 · 1 citation

TL;DR

This paper introduces MemVP, a visual prompting method that injects visual knowledge into language models by concatenating visual prompts with the FFN weights instead of the input tokens, which avoids lengthening the input and improves both efficiency and performance on vision-language tasks.

Contribution

MemVP presents a new approach to visual prompting by integrating visual prompts into FFN weights, reducing training time and inference latency while outperforming previous PEFT methods.

Findings

1. MemVP reduces training time and inference latency.

2. MemVP surpasses previous PEFT methods in performance.

3. Treating visual prompts as injected knowledge, rather than as extra input, improves efficiency on VL tasks.

Abstract

Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: first projecting the output of pre-trained vision encoders into the input space of pre-trained language models as visual prompts, and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm is still inefficient because it significantly increases the input length of the language models. In this paper, instead of integrating visual prompts into the inputs, we regard visual prompts as additional knowledge that helps language models address tasks associated with visual information. Motivated by the finding that the feed-forward network (FFN) of language models acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with…
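The key-value memory view of the FFN can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the ReLU activation, the two random projections `proj_k` and `proj_v`, and all dimensions are hypothetical choices made only to show how extra key/value entries derived from visual features could be concatenated with the FFN weights while the text sequence length stays unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn(x, w_in, w_out):
    """Plain FFN. Each column of w_in acts as a key, each row of w_out as a value."""
    return relu(x @ w_in) @ w_out

def ffn_with_visual_memory(x, w_in, w_out, vis_keys, vis_values):
    """FFN whose key-value memory is extended with visual entries.

    vis_keys, vis_values: arrays of shape (n_vis, d_model).
    The input x keeps its original length; only the weights grow.
    """
    w_in_ext = np.concatenate([w_in, vis_keys.T], axis=1)    # (d_model, d_ff + n_vis)
    w_out_ext = np.concatenate([w_out, vis_values], axis=0)  # (d_ff + n_vis, d_model)
    return relu(x @ w_in_ext) @ w_out_ext

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, n_text, n_vis, d_vis = 8, 32, 5, 3, 6

x = rng.normal(size=(n_text, d_model))    # text token hidden states
w_in = rng.normal(size=(d_model, d_ff))   # FFN keys
w_out = rng.normal(size=(d_ff, d_model))  # FFN values

# Assumption: visual features are mapped to keys and values by two linear projections.
vis_feats = rng.normal(size=(n_vis, d_vis))
proj_k = rng.normal(size=(d_vis, d_model))
proj_v = rng.normal(size=(d_vis, d_model))
vis_keys = vis_feats @ proj_k
vis_values = vis_feats @ proj_v

out = ffn_with_visual_memory(x, w_in, w_out, vis_keys, vis_values)
print(out.shape)  # same shape as the text-only FFN output
```

Because the activation is applied elementwise, the extended FFN output equals the original FFN output plus a term contributed by the visual entries alone, which is one way to read "visual knowledge injection" in this sketch.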

Code & Models

Repositories

jieshibo/memvp (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications