Large Vision-Language Models Get Lost in Attention

Gongli Xi; Ye Tian; Mengyu Yang; Huahui Yi; Liang Lin; Xiaoshuai Hao; Kun Wang; Wendong Wang

arXiv:2605.05668·cs.AI·May 8, 2026

TL;DR

This paper introduces a theoretical framework for analyzing large vision-language models. It shows that attention and feedforward networks serve distinct functions, and it suggests that current models may use attention inefficiently.

Contribution

It provides a unified information-theoretic and geometric analysis of residual modules, highlighting functional decoupling and inefficiencies in current LVLM architectures.

Findings

01. Attention acts as a subspace-preserving reconfiguration operator.

02. Feedforward networks serve as subspace-expanding semantic drivers.

03. Replacing learned attention with predefined values maintains or improves performance.
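
As a rough illustration of the third finding, the sketch below contrasts standard causal softmax attention with a fixed weighting that ignores queries and keys. This page does not say which predefined values the authors use. The uniform average over the visible prefix, the function names, and the toy shapes are all assumptions made for illustration, not the paper's method.

```python
import numpy as np

def causal_softmax_attention(q, k, v):
    """Single-head causal attention: weights are computed from q and k."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def predefined_causal_attention(v):
    """Hypothetical stand-in: fixed uniform weights over the visible prefix.

    The choice of a uniform average is an assumption, not taken from the paper.
    """
    t = v.shape[0]
    weights = np.tril(np.ones((t, t)))                   # each row sees positions 0..i
    weights /= weights.sum(axis=-1, keepdims=True)       # row i becomes 1 / (i + 1)
    return weights @ v, weights

# Toy inputs: 5 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 5, 8))
learned_out, learned_w = causal_softmax_attention(q, k, v)
fixed_out, fixed_w = predefined_causal_attention(v)
```

In the second function the weights depend only on sequence length, so no query or key projections are needed to compute them.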

Abstract

Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further…
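
One way to make the subspace-preserving versus subspace-expanding distinction concrete is to measure how much of a residual update lies inside the span of the current token representations. The sketch below is a toy under stated assumptions: the metric `in_subspace_ratio`, the choice of the row space of the hidden states as the reference subspace, and the synthetic updates are all hypothetical and are not the framework defined in the paper.

```python
import numpy as np

def in_subspace_ratio(hidden, update, tol=1e-10):
    """Fraction of the update's squared norm inside the row space of `hidden`.

    A value near 1 means the update stays in the span of the existing token
    vectors; a smaller value means it adds directions outside that span.
    """
    # Right singular vectors with non-negligible singular values span the row space.
    _, s, vt = np.linalg.svd(hidden, full_matrices=False)
    basis = vt[s > tol * s.max()]               # (rank, d), orthonormal rows
    projected = update @ basis.T @ basis        # component inside the subspace
    return float(np.sum(projected**2) / np.sum(update**2))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))               # 4 tokens, 16-dim toy hidden states

# Toy "reconfiguration": every output row is a mixture of existing token vectors.
mixing_update = rng.normal(size=(4, 4)) @ hidden

# Toy "expansion": a per-token nonlinear map that can leave the original span.
nonlinear_update = np.maximum(hidden @ rng.normal(size=(16, 16)), 0.0)

mixing_ratio = in_subspace_ratio(hidden, mixing_update)
nonlinear_ratio = in_subspace_ratio(hidden, nonlinear_update)
print(f"mixing update:    {mixing_ratio:.3f}")
print(f"nonlinear update: {nonlinear_ratio:.3f}")
```

A pure token-mixing update scores 1 up to floating-point error by construction, since every row is a linear combination of the rows of `hidden`. Real attention layers also apply value and output projections, so this toy only illustrates the geometric idea, not measured behavior of any model.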
