FlexAttention for Efficient High-Resolution Vision-Language Models

Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, and Chuang Gan

arXiv:2407.20228·cs.CV·July 30, 2024

TL;DR

FlexAttention introduces a hierarchical, selective attention mechanism that efficiently encodes high-resolution images in vision-language models, reducing computational costs while improving performance on multimodal benchmarks.

Contribution

The paper proposes FlexAttention, a novel hierarchical attention method that selectively processes high-resolution image tokens to enhance efficiency and accuracy in vision-language models.

Findings

01. Outperforms existing high-resolution VLMs by ~9% on V* Bench
02. Achieves ~7% improvement on TextVQA
03. Reduces computational cost by nearly 40%

Abstract

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module which could retrieve tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated to the low-resolution tokens and text tokens, and input to a…
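The selection-and-concatenation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `select_high_res_tokens` and `flex_attention_sketch`, the top-k selection rule, and the choice to draw queries only from the low-resolution and text tokens are all assumptions made for illustration. The abstract is truncated before it says where the concatenated tokens go, so the plain scaled dot-product attention used here (with no learned projections) is also an assumption. The attention map is assumed to be already resized to the high-resolution token grid.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_high_res_tokens(attn_map, high_res_tokens, k):
    """Pick the k high-resolution tokens with the largest attention scores.

    attn_map:        (H, W) scores over the high-resolution token grid
                     (assumed to be already resized to that grid).
    high_res_tokens: (H * W, d) token features in row-major grid order.
    Returns the selected tokens (k, d) and their flat indices (k,).
    """
    flat = attn_map.reshape(-1)
    top_k = np.argsort(flat)[::-1][:k]   # indices of the k largest scores
    top_k = np.sort(top_k)               # restore spatial (row-major) order
    return high_res_tokens[top_k], top_k


def flex_attention_sketch(hidden, high_res_tokens, attn_map, k):
    """Attend from low-res/text tokens to themselves plus k selected high-res tokens.

    hidden: (n, d) low-resolution image tokens and text tokens.
    Assumption: queries come from `hidden` only; keys and values are the
    concatenation of `hidden` and the selected high-resolution tokens.
    """
    selected, idx = select_high_res_tokens(attn_map, high_res_tokens, k)
    keys_values = np.concatenate([hidden, selected], axis=0)   # (n + k, d)
    scores = hidden @ keys_values.T / np.sqrt(hidden.shape[-1])
    weights = softmax(scores, axis=-1)                         # (n, n + k)
    return weights @ keys_values, idx


# Hypothetical usage with random data: 5 low-res/text tokens, a 4x4 high-res grid.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
high_res = rng.normal(size=(16, 8))
attn_map = rng.random((4, 4))
out, chosen = flex_attention_sketch(hidden, high_res, attn_map, k=4)
print(out.shape)  # (5, 8)
```

In this sketch each query attends over n + k keys instead of n + H·W, which is where the savings come from when k is much smaller than the number of high-resolution tokens.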

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need