ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, and Yonggang Wen

TL;DR
ADEM-VL introduces an efficient, parameter-reduced vision-language fusion method that enhances multimodal task performance while significantly decreasing computational costs and training time.
Contribution
The paper presents a novel adaptive, parameter-free cross-attention fusion approach that embeds vision features into language models, improving efficiency and effectiveness in multimodal tasks.
Findings
Outperforms existing methods in visual question answering and image captioning.
Achieves 0.77% higher average accuracy than existing approaches on the ScienceQA dataset.
Reduces training and inference latency significantly.
Abstract
Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, where efficiency is restricted by two key factors: the extended input sequence of the language model with vision features demands more computational operations, and a large number of additional learnable parameters increase memory complexity. These challenges significantly restrict the broader applicability of such models. To bridge this gap, we propose ADEM-VL, an efficient vision-language method that tunes VL models based on pretrained large language models (LLMs) by adopting a parameter-free cross-attention mechanism for similarity measurements in multimodal fusion. This approach only requires embedding…
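As a rough illustration of the parameter-free cross-attention idea described above, the sketch below fuses vision features into text hidden states using only similarity scores, with no learned projection matrices. It is a minimal sketch under stated assumptions, not the authors' implementation: the scaled dot-product similarity, the softmax normalization, the residual addition, and the function names are all illustrative choices that the truncated abstract does not confirm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parameter_free_cross_attention(text_states, vision_feats):
    """Fuse vision features into text states without learned weights.

    text_states:  list of n_text vectors (lists of d floats), used as queries.
    vision_feats: list of n_vis vectors (lists of d floats), used directly as
                  both keys and values (assumption: same dimension d as text).
    Returns n_text fused vectors, so the text sequence length is unchanged.
    """
    d = len(text_states[0])
    scale = 1.0 / math.sqrt(d)  # assumed scaling, as in standard attention
    fused = []
    for q in text_states:
        # Similarity between this text token and every vision feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in vision_feats]
        weights = softmax(scores)
        # Weighted average of the vision features (values = raw features).
        attended = [sum(w * v[j] for w, v in zip(weights, vision_feats))
                    for j in range(d)]
        # Assumed residual fusion: add the attended vision signal to the token.
        fused.append([qj + aj for qj, aj in zip(q, attended)])
    return fused


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    vision = [[2.0, 4.0], [0.5, 0.5], [1.0, -1.0]]
    for row in parameter_free_cross_attention(text, vision):
        print(row)
```

In this sketch the vision features never enter the input sequence, so the number of text tokens stays fixed, and the fusion step itself adds no trainable parameters.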
Taxonomy
Topics: Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Softmax · Attention Is All You Need
