Large Vision-Language Models Get Lost in Attention
Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

TL;DR
This paper introduces a theoretical framework for analyzing large vision-language models, revealing that attention and feedforward networks serve distinct functions and that current models may use their attention mechanisms inefficiently.
Contribution
It provides a unified information-theoretic and geometric analysis of residual modules, highlighting functional decoupling and inefficiencies in current LVLM architectures.
Findings
Attention acts as a subspace-preserving reconfiguration operator.
Feedforward networks serve as subspace-expanding semantic drivers.
Replacing learned attention with predefined values maintains or improves performance.
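To make the subspace-preserving versus subspace-expanding distinction in the findings above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's metric: the function name `in_subspace_fraction`, the toy shapes, and the random weights are all hypothetical. The attention-like update also omits the value and output projections, so it is a pure row-mixing of existing token states and stays in their span by construction.

```python
import numpy as np

def in_subspace_fraction(H, delta, rank_tol=1e-8):
    """Fraction of the update's squared norm that lies in the span of H's rows.

    H:     (n_tokens, d) hidden states before a residual update.
    delta: (n_tokens, d) residual update added to H.
    Hypothetical helper for illustration; not taken from the paper.
    """
    # Orthonormal basis of the subspace spanned by the token representations.
    _, s, Vt = np.linalg.svd(H, full_matrices=False)
    basis = Vt[s > rank_tol * s.max()]           # (rank, d)
    # Project the update onto that subspace and compare energies.
    projected = delta @ basis.T @ basis
    return float(np.sum(projected**2) / np.sum(delta**2))

rng = np.random.default_rng(0)
n_tokens, d, d_ff = 8, 64, 256                   # toy sizes (assumed)
H = rng.standard_normal((n_tokens, d))

# Attention-like update: each token becomes a convex mix of existing tokens.
A = rng.random((n_tokens, n_tokens))
A /= A.sum(axis=1, keepdims=True)                # rows sum to 1
attn_update = A @ H

# FFN-like update: a random two-layer ReLU map applied per token.
W1 = rng.standard_normal((d, d_ff)) / np.sqrt(d)
W2 = rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
ffn_update = np.maximum(H @ W1, 0.0) @ W2

attn_frac = in_subspace_fraction(H, attn_update)
ffn_frac = in_subspace_fraction(H, ffn_update)
print(f"attention-like in-subspace fraction: {attn_frac:.3f}")
print(f"FFN-like in-subspace fraction:       {ffn_frac:.3f}")
```

In this toy setting the row-mixing update stays essentially entirely inside the existing token subspace, while the per-token nonlinear map puts most of its energy outside it. Whether trained LVLM layers behave this way is the paper's empirical claim, which this sketch does not test.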
Abstract
Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further…
