LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang, Shivaram Venkataraman

TL;DR
LV-XAttn is a distributed cross-attention mechanism designed for large visual inputs in multimodal large language models, significantly reducing communication overhead and improving training and inference efficiency.
Contribution
We introduce LV-XAttn, a novel distributed cross-attention method that minimizes communication overhead by keeping large key-value blocks local and exchanging smaller query blocks, enabling efficient processing of large visual inputs.
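The exchange pattern described above can be sketched as a single-process simulation. This is an illustrative sketch, not the authors' implementation: it assumes each query block travels around a ring of workers together with its partial output and running softmax statistics (the fields `o`, `m`, `l` below), and that those are merged with an online-softmax update. All function names, block sizes, and the ring schedule are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(Q, K, V):
    """Exact softmax cross-attention computed in one place, for comparison."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def partial_update(block, K_local, V_local):
    """Fold one worker's local key-value block into a visiting query block."""
    scale = 1.0 / math.sqrt(len(K_local[0]))
    for row in block:
        scores = [dot(row["q"], k) * scale for k in K_local]
        m_new = max(row["m"], max(scores))
        # Rescale what was accumulated so far to the new running maximum.
        correction = math.exp(row["m"] - m_new)
        weights = [math.exp(s - m_new) for s in scores]
        row["l"] = row["l"] * correction + sum(weights)
        row["o"] = [o * correction + sum(w * v[j] for w, v in zip(weights, V_local))
                    for j, o in enumerate(row["o"])]
        row["m"] = m_new


def ring_cross_attention(Q_blocks, K_blocks, V_blocks):
    """Simulate n workers: key-value blocks stay put, query blocks rotate."""
    n = len(K_blocks)
    d_v = len(V_blocks[0][0])
    # held[w] is the query block (plus its travelling state) held by worker w.
    held = [[{"q": q, "o": [0.0] * d_v, "m": float("-inf"), "l": 0.0}
             for q in block] for block in Q_blocks]
    for _ in range(n):
        for w in range(n):
            partial_update(held[w], K_blocks[w], V_blocks[w])
        # Send each query block to the next worker; K and V never move.
        held = [held[(w - 1) % n] for w in range(n)]
    # After n rotations every block is back on its home worker; normalize.
    return [[[o / row["l"] for o in row["o"]] for row in block] for block in held]


def flatten(blocks):
    return [row for block in blocks for row in block]


random.seed(0)
n_workers, dim = 3, 4


def rand_block(rows):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


# Small query blocks (2 tokens each) and larger key-value blocks (5 tokens each).
Q_blocks = [rand_block(2) for _ in range(n_workers)]
K_blocks = [rand_block(5) for _ in range(n_workers)]
V_blocks = [rand_block(5) for _ in range(n_workers)]

distributed = flatten(ring_cross_attention(Q_blocks, K_blocks, V_blocks))
exact = reference_attention(flatten(Q_blocks), flatten(K_blocks), flatten(V_blocks))
max_err = max(abs(a - b) for ra, rb in zip(distributed, exact) for a, b in zip(ra, rb))
print("max abs difference vs. single-device attention:", max_err)
```

In this sketch only the query rows and their small per-row state cross worker boundaries, and the result matches single-device attention up to floating-point rounding.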
Findings
Achieves up to 10.62× speedup over existing distributed attention mechanisms.
Reduces communication overhead in distributed cross-attention.
Supports longer visual context with activation recomputation.
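A rough back-of-the-envelope estimate shows why exchanging query blocks can cut communication. The sketch below counts elements sent per worker per ring step under two assumptions that this page does not state: a key-value-exchanging scheme sends one key block and one value block, while a query-exchanging scheme sends a query block, a partial output block, and two softmax statistics per query row. The token counts, hidden size, and worker count are hypothetical, and the estimate ignores attention heads, data types, and overlap with computation, so it does not reproduce the reported speedup.

```python
def kv_exchange_elements(kv_tokens, workers, dim):
    """Elements sent per worker per step when key and value blocks move."""
    return 2 * (kv_tokens // workers) * dim


def query_exchange_elements(q_tokens, workers, dim):
    """Elements sent per worker per step when query blocks move.

    Assumes the query block, its partial output, and two per-row softmax
    statistics (running max and running sum) travel together.
    """
    return (q_tokens // workers) * (2 * dim + 2)


# Hypothetical sizes: many visual tokens, few text tokens.
kv_tokens, q_tokens, workers, dim = 1_000_000, 2048, 8, 4096

kv_cost = kv_exchange_elements(kv_tokens, workers, dim)
q_cost = query_exchange_elements(q_tokens, workers, dim)
print("key-value exchange:", kv_cost, "elements per step")
print("query exchange:    ", q_cost, "elements per step")
print("ratio:", round(kv_cost / q_cost, 1))
```

The ratio grows with the number of visual tokens relative to text tokens, which is the regime the paper targets.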
Abstract
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · LLaMA
