From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
Xiaofeng Zhang, Yihao Quan, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye

TL;DR
This paper analyzes how information flows through large vision-language models during reasoning tasks, revealing layer-wise convergence patterns and the impact of visual features on model performance.
Contribution
It introduces an analysis method that integrates attention scores, computed during forward propagation, with LLaVA-CAM, which captures gradient changes during backward propagation, to study visual information flow in LVLMs.
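To make the gradient-based side of this kind of analysis concrete, here is a minimal sketch of a generic Grad-CAM-style relevance score computed over a sequence of image tokens. It is an illustration under stated assumptions, not the paper's LLaVA-CAM formulation: the function name, the channel-by-token layout of the inputs, and the toy numbers are all hypothetical, and in practice the activations and gradients would come from a chosen layer of the model.

```python
def grad_cam_token_relevance(activations, gradients):
    """Generic Grad-CAM over image tokens (hypothetical helper, not LLaVA-CAM itself).

    activations[k][i]: forward-pass value of feature channel k at image token i.
    gradients[k][i]:   gradient of the target score w.r.t. that activation.
    Returns one non-negative relevance score per image token.
    """
    num_tokens = len(activations[0])
    # Channel weight: the gradient averaged over all token positions.
    weights = [sum(g_row) / num_tokens for g_row in gradients]
    relevance = []
    for i in range(num_tokens):
        # Weighted sum of channel activations at token i.
        score = sum(w * a_row[i] for w, a_row in zip(weights, activations))
        # ReLU keeps only tokens that push the target score up.
        relevance.append(max(score, 0.0))
    return relevance


# Toy values (made up): 2 feature channels, 3 image tokens.
activations = [[1.0, 0.0, 2.0],
               [0.5, 3.0, 0.0]]
gradients = [[0.2, 0.4, 0.6],
             [-0.3, -0.1, -0.2]]

cam = grad_cam_token_relevance(activations, gradients)
print(cam)
```

With these toy inputs the second token gets zero relevance because its only active channel has a negative average gradient, while the third token scores highest. Attention scores from the forward pass would be read out separately; how the two signals are combined is not specified in the text here, so the sketch leaves that step out.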
Findings
Information flow converges in shallow layers
Deeper layers show diversified information processing
Flow patterns vary with context and task
Abstract
Large Vision Language Models (LVLMs) achieve great performance on visual-language reasoning tasks; however, the black-box nature of LVLMs hinders in-depth research on the reasoning mechanism. Because all images must be converted into image tokens to fit the input format of large language models (LLMs) along with natural language prompts, sequential visual representation is essential to the performance of LVLMs, and information flow analysis can be an effective tool for determining the interactions between these representations. In this paper, we propose integrating attention analysis with LLaVA-CAM. Concretely, attention scores highlight relevant regions during forward propagation, while LLaVA-CAM captures gradient changes through backward propagation, revealing key image features. By exploring the information flow from the perspective of visual representation contribution, we…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
