A-VL: Adaptive Attention for Large Vision-Language Models
Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li

TL;DR
A-VL introduces a novel adaptive attention mechanism for large vision-language models that dynamically manages attention across modalities, significantly reducing resource consumption while maintaining high performance.
Contribution
The paper presents A-VL, a tailored adaptive attention method for LVLMs that manages visual and language attention separately, improving efficiency during inference.
Findings
Reduces memory and computation in LVLMs
Maintains performance across multiple tasks and datasets
Outperforms existing adaptive attention methods
Abstract
The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local…
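The per-modality idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the top-k rule for image tokens, the fixed recent-token window for text, and all parameter values are assumptions chosen only to make the idea concrete. The full cache is kept, but the attention step is computed over only a selected subset of positions.

```python
import math

def select_positions(modalities, prev_scores, vision_top_k, text_window):
    """Pick cache positions to attend to for one decoding step (hypothetical helper).

    modalities:  list of "image" or "text", one entry per cached token.
    prev_scores: attention weights observed at an earlier step, same length.
    """
    image_idx = [i for i, m in enumerate(modalities) if m == "image"]
    text_idx = [i for i, m in enumerate(modalities) if m == "text"]
    # Vision (assumed rule): compute only on the highest-scoring image tokens.
    top_image = sorted(image_idx, key=lambda i: prev_scores[i], reverse=True)[:vision_top_k]
    # Text (assumed rule): keep only the most recent tokens, i.e. a local window.
    recent_text = text_idx[-text_window:] if text_window > 0 else []
    return sorted(top_image + recent_text)

def sparse_attention(query, keys, values, positions):
    """Scaled dot-product softmax attention for one query, restricted to `positions`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return output, weights

# Toy cache: four image tokens followed by three text tokens.
modalities = ["image"] * 4 + ["text"] * 3
prev_scores = [0.05, 0.30, 0.10, 0.25, 0.10, 0.10, 0.10]
positions = select_positions(modalities, prev_scores, vision_top_k=2, text_window=2)

keys = [[float(i), 1.0] for i in range(7)]
values = [[1.0, float(i)] for i in range(7)]
output, weights = sparse_attention([0.1, 0.2], keys, values, positions)
```

In this toy setup the attention step touches four of the seven cached tokens. How the critical image tokens are actually identified and refreshed in A-VL is described in the paper itself, not in this sketch.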
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
