D-Attn: Decomposed Attention for Large Vision-and-Language Models
Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

TL;DR
This paper introduces D-Attn, a flexible attention architecture for large vision-and-language models that improves visual understanding and reduces computational costs by decomposing and optimizing attention mechanisms.
Contribution
We propose Decomposed Attention (D-Attn), a novel architecture that separates visual and textual attention, enabling targeted improvements and efficiency gains in LVLMs without affecting pre-trained language capabilities.
Findings
Significant performance improvements on multiple image benchmarks.
Reduction of visual attention computation from quadratic to linear complexity.
Up to 5x faster processing.
Abstract
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived…
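The abstract above is cut off before it names how the decomposed outputs are merged, so the following is only a minimal sketch of one standard identity that makes such a decomposition possible, not necessarily the authors' formulation. For a single textual query, softmax attention over the concatenated visual and textual keys equals a weighted average of the textual-to-visual and textual-to-textual outputs, with weights given by each part's softmax normalizer. The helper names `attend` and `merge` and the toy vectors are hypothetical and chosen for illustration.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product softmax attention for one query vector.

    Returns the output vector and the log of the softmax normalizer
    (the log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, m + math.log(z)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs, weighting each by its normalizer."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


# Toy data (assumed): one textual query, three visual tokens, and two textual
# tokens at or before the query position, so the causal mask hides nothing.
q = [0.5, -1.0, 0.25, 2.0]
vis_k = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.4], [-1.0, 0.5, 0.0, 1.0]]
vis_v = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
txt_k = [[0.1, -0.2, 0.3, 0.4], [0.9, 0.8, -0.7, 0.6]]
txt_v = [[-2.0, 1.0], [0.5, 0.5]]

# Reference: ordinary attention over the concatenated sequence.
full, _ = attend(q, vis_k + txt_k, vis_v + txt_v)

# Decomposed: textual-to-visual and textual-to-textual computed separately.
t2v, lse_v = attend(q, vis_k, vis_v)
t2t, lse_t = attend(q, txt_k, txt_v)
merged = merge(t2v, lse_v, t2t, lse_t)

print(max(abs(a - b) for a, b in zip(full, merged)))
```

Because the two branches only meet in `merge`, the textual-to-visual branch could in principle be modified on its own while the textual-to-textual computation stays as it was, which is the kind of flexibility the abstract describes.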
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
