D-Attn: Decomposed Attention for Large Vision-and-Language Models

Chia-Wen Kuo; Sijie Zhu; Fan Chen; Xiaohui Shen; Longyin Wen

arXiv:2502.01906 · cs.CV · August 19, 2025

Open Access

TL;DR

This paper introduces D-Attn, a flexible attention architecture for large vision-and-language models. It decomposes causal self-attention into visual-to-visual, textual-to-visual, and textual-to-textual components, so that visual token processing can be changed to improve visual understanding and reduce computational cost.

Contribution

We propose Decomposed Attention (D-Attn), a novel architecture that separates visual and textual attention, enabling targeted improvements and efficiency in LVLMs without affecting pre-trained language capabilities.

Findings

01. Significant performance improvements on multiple image benchmarks.

02. Visual attention computation reduced from quadratic to linear complexity.

03. Up to 5x faster processing.

Abstract

Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived…
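The abstract is cut off before it describes how the decomposed outputs are merged, so the following is only a minimal sketch of one standard identity, not necessarily the paper's merging strategy. It shows that, for a single textual query, softmax attention over the concatenated visual and textual keys equals a weighted sum of a separate textual-to-visual attention and a separate textual-to-textual attention, with weights given by each group's softmax normalizer. The function names (`attend`, `merge`), the toy dimensions, and the random data are all assumptions made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Softmax attention of one query over one key set.

    Returns the output vector and the log-sum-exp of the scaled scores,
    which is the log of the softmax normalizer for this key set.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[i] for e, v in zip(exps, values)) / z for i in range(dim)]
    return out, m + math.log(z)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two attention outputs, weighting each by its normalizer."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    alpha = w_a / (w_a + w_b)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(out_a, out_b)]

# Toy data (hypothetical sizes): 6 visual tokens, 3 textual tokens, dimension 4.
random.seed(0)
dim = 4

def rand_vecs(n):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

q = rand_vecs(1)[0]                      # one textual query
vis_k, vis_v = rand_vecs(6), rand_vecs(6)
txt_k, txt_v = rand_vecs(3), rand_vecs(3)

# Reference: one softmax over the concatenated visual and textual keys.
joint, _ = attend(q, vis_k + txt_k, vis_v + txt_v)

# Decomposed: textual-to-visual and textual-to-textual computed separately.
t2v, lse_v = attend(q, vis_k, vis_v)
t2t, lse_t = attend(q, txt_k, txt_v)
merged = merge(t2v, lse_v, t2t, lse_t)

max_diff = max(abs(a - b) for a, b in zip(joint, merged))
print(max_diff)
```

Because the two branches only meet in `merge`, the textual-to-visual branch could in principle be replaced by a different operation without touching the textual-to-textual computation, which is the flexibility the abstract argues for.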

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training