ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Zhiwei Hao; Jianyuan Guo; Li Shen; Yong Luo; Han Hu; Yonggang Wen

arXiv:2410.17779·cs.CV·October 24, 2024

TL;DR

ADEM-VL introduces an efficient, parameter-reduced vision-language fusion method that enhances multimodal task performance while significantly decreasing computational costs and training time.

Contribution

The paper presents a novel adaptive, parameter-free cross-attention fusion approach that embeds vision features into language models, improving efficiency and effectiveness in multimodal tasks.
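To make the idea of parameter-free cross-attention concrete, here is a minimal pure-Python sketch. It assumes scaled dot-product similarity with softmax normalization and a residual add back into the text states; these choices, along with the function names, are illustrative assumptions and may differ from the formulation used in the paper. The point it shows is that fusion can be computed from the text and vision vectors alone, with no learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parameter_free_cross_attention(text_states, vision_feats):
    """Fuse vision features into text hidden states without learned weights.

    text_states:  list of text token vectors (used directly as queries)
    vision_feats: list of vision vectors (used directly as keys and values)
    Both are assumed to already live in the same d-dimensional space.
    """
    d = len(text_states[0])
    fused = []
    for q in text_states:
        # Similarity of this text token to every vision feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in vision_feats
        ]
        weights = softmax(scores)
        # Weighted sum of vision features (the values are the features themselves).
        context = [
            sum(w * v[j] for w, v in zip(weights, vision_feats))
            for j in range(d)
        ]
        # Residual add: the text token keeps its own content plus visual context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Hypothetical toy inputs: 2 text tokens, 3 vision features, d = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = parameter_free_cross_attention(text, vision)
```

The output has the same shape as `text_states`, so the language model's sequence length is unchanged by the fusion step.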

Findings

01 Outperforms existing methods in visual question answering and image captioning.

02 Achieves 0.77% higher average accuracy than existing methods on the ScienceQA dataset.

03 Reduces training and inference latency significantly.

Abstract

Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, where efficiency is restricted by two key factors: the extended input sequence of the language model with vision features demands more computational operations, and a large number of additional learnable parameters increase memory complexity. These challenges significantly restrict the broader applicability of such models. To bridge this gap, we propose ADEM-VL, an efficient vision-language method that tunes VL models based on pretrained large language models (LLMs) by adopting a parameter-free cross-attention mechanism for similarity measurements in multimodal fusion. This approach only requires embedding…
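The first bottleneck named in the abstract, the longer input sequence, can be illustrated with a back-of-envelope count of query-key pairs. The sketch below is an assumption-laden illustration, not a measurement from the paper: the token counts are hypothetical, and it compares only the number of attention score computations per layer, ignoring everything else in the model.

```python
def self_attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def cross_attention_pairs(num_queries, num_keys):
    """Query-key pairs scored when num_queries tokens attend to num_keys features."""
    return num_queries * num_keys


# Hypothetical sizes, chosen only for illustration.
num_text = 128
num_vision = 256

# Option A: append vision features to the input sequence, then self-attend.
concat_cost = self_attention_pairs(num_text + num_vision)

# Option B: keep the sequence text-only and fuse vision via cross-attention.
fused_cost = self_attention_pairs(num_text) + cross_attention_pairs(num_text, num_vision)

print(concat_cost, fused_cost, concat_cost / fused_cost)
```

With these hypothetical sizes, extending the sequence scores three times as many pairs per layer as keeping the sequence text-only and fusing through cross-attention.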

Code & Models

Repositories

hao840/adem-vl (PyTorch, official)

Taxonomy

Topics: Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Softmax · Attention Is All You Need