LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

Tzu-Tao Chang; Shivaram Venkataraman

arXiv:2502.02406 · cs.CV · May 29, 2025

Open Access

TL;DR

LV-XAttn is a distributed cross-attention mechanism designed for large visual inputs in multimodal large language models, significantly reducing communication overhead and improving training and inference efficiency.

Contribution

We introduce LV-XAttn, a novel distributed cross-attention method that minimizes communication overhead by keeping large key-value blocks local and exchanging smaller query blocks, enabling efficient processing of large visual inputs.
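The idea of keeping key-value blocks local and moving query blocks can be illustrated with a small single-process simulation. This is a minimal sketch, not the authors' implementation: the ring schedule, the function names (`full_attention`, `local_update`, `query_passing_attention`), and the choice to carry a running maximum and normalizer with each query (an online-softmax accumulation) are all assumptions made for illustration. The sketch only shows that visiting every local key-value block in turn can reproduce exact attention.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V computed in one place."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out


def local_update(state, K_blk, V_blk):
    """A worker folds its local K/V block into a travelling query state.

    state = (query, running max, running normalizer, unnormalized output).
    Carrying these statistics is an assumption of this sketch.
    """
    q, m, z, acc = state
    d = len(q)
    for k, v in zip(K_blk, V_blk):
        s = dot(q, k) / math.sqrt(d)
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescale what was accumulated so far
        w = math.exp(s - m_new)
        z = z * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return (q, m, z, acc)


def query_passing_attention(Q_blocks, K_blocks, V_blocks):
    """Simulate n workers: K/V blocks never move, query states rotate."""
    n = len(K_blocks)
    dv = len(V_blocks[0][0])
    states = [[(q, -math.inf, 0.0, [0.0] * dv) for q in Qb] for Qb in Q_blocks]
    for step in range(n):
        for origin in range(n):
            worker = (origin + step) % n  # worker holding this query block now
            states[origin] = [
                local_update(s, K_blocks[worker], V_blocks[worker])
                for s in states[origin]
            ]
    return [[[a / z for a in acc] for (_, _, z, acc) in blk] for blk in states]


# Hypothetical toy sizes: 3 workers, 2 query rows and 5 key-value rows each.
random.seed(0)
rand = lambda rows, cols: [[random.gauss(0, 1) for _ in range(cols)]
                           for _ in range(rows)]
Q_blocks = [rand(2, 4) for _ in range(3)]
K_blocks = [rand(5, 4) for _ in range(3)]
V_blocks = [rand(5, 4) for _ in range(3)]

flat = lambda blocks: [row for blk in blocks for row in blk]
reference = full_attention(flat(Q_blocks), flat(K_blocks), flat(V_blocks))
simulated = flat(query_passing_attention(Q_blocks, K_blocks, V_blocks))
max_diff = max(abs(a - b) for r, s in zip(reference, simulated)
               for a, b in zip(r, s))
```

In this toy setup `max_diff` stays at floating-point rounding level, which is what "exact" means here: the block-wise result matches single-device attention rather than approximating it.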

Findings

01. Achieves up to 10.62× speedup over existing methods.

02. Reduces communication overhead in distributed cross-attention.

03. Supports longer visual context with activation recomputation.

Abstract

Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks…
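The observation that query blocks are much smaller than key-value blocks can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the token counts, hidden size, and 2-byte element width are hypothetical, and the assumption that a travelling query block is accompanied by one equally sized partial output (with small softmax statistics ignored) is not taken from the abstract.

```python
def kv_passing_bytes(kv_tokens, hidden, bytes_per_elem=2):
    """Bytes sent per step if each worker forwards its key and value blocks."""
    return 2 * kv_tokens * hidden * bytes_per_elem


def query_passing_bytes(q_tokens, hidden, bytes_per_elem=2):
    """Bytes sent per step if each worker forwards a query block instead.

    Assumes one equally sized partial-output block travels with the queries.
    """
    return 2 * q_tokens * hidden * bytes_per_elem


# Hypothetical sizes: 512 text tokens and 65,536 visual tokens per worker.
kv_bytes = kv_passing_bytes(kv_tokens=65536, hidden=4096)
q_bytes = query_passing_bytes(q_tokens=512, hidden=4096)
ratio = kv_bytes / q_bytes
```

Under these assumed sizes the per-step traffic differs by the ratio of key-value tokens to query tokens, so the saving grows as the visual input gets longer relative to the text.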


Videos

LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · LLaMA