Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
Shezheng Song, Shasha Li, Jie Yu

TL;DR
This paper investigates how multimodal large language models fuse visual and textual information internally, revealing layer-specific fusion patterns and proposing a contrastive attention method to enhance multimodal reasoning.
Contribution
The study provides a systematic layer-wise analysis of visual-text fusion in MLLMs and introduces a training-free contrastive attention framework to improve their reasoning capabilities.
Findings
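Visual-text fusion emerges at several specific layers rather than being distributed uniformly across the network.
Certain models show a late-stage "review" phenomenon in which visual signals are reactivated before output generation.
The proposed training-free contrastive attention method improves multimodal reasoning performance.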
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and that certain models exhibit a late-stage "review" phenomenon in which visual signals are reactivated before output generation. We further analyze how attention evolves across layers and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the…
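The layer-wise masking analysis mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention stack, the random weights, and the names `run_toy_stack` and `layerwise_masking_effect` are all assumptions made for illustration. The idea shown is only the general one: block text tokens from attending to visual tokens from a given layer onward, then measure how much the final representation of the last text token changes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def run_toy_stack(x, weights, n_visual, mask_from_layer=None):
    """Run a toy single-head attention stack (hypothetical, no MLP or norm).

    Rows of x are ordered [visual tokens | text tokens]. If mask_from_layer
    is set, text tokens cannot attend to visual tokens in that layer and
    every later layer.
    """
    h = x.copy()
    d = h.shape[1]
    for layer, (wq, wk, wv) in enumerate(weights):
        scores = (h @ wq) @ (h @ wk).T / np.sqrt(d)
        if mask_from_layer is not None and layer >= mask_from_layer:
            # Cut the text -> visual attention path for this layer.
            scores[n_visual:, :n_visual] = -np.inf
        h = h + softmax(scores) @ (h @ wv)  # residual connection
    return h


def layerwise_masking_effect(x, weights, n_visual):
    """For each layer L, mask visual tokens from L onward and return the L2
    change in the last text token's final hidden state."""
    baseline = run_toy_stack(x, weights, n_visual)[-1]
    effects = []
    for layer in range(len(weights)):
        masked = run_toy_stack(x, weights, n_visual, mask_from_layer=layer)[-1]
        effects.append(float(np.linalg.norm(masked - baseline)))
    return effects


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_visual, n_text, d, n_layers = 6, 4, 8, 4
    x = rng.normal(size=(n_visual + n_text, d))
    weights = [tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
               for _ in range(n_layers)]
    for layer, effect in enumerate(layerwise_masking_effect(x, weights, n_visual)):
        print(f"mask from layer {layer}: output change = {effect:.4f}")
```

In a real MLLM, the same probe would presumably be applied through the model's attention masks, and the layers where masking stops mattering would indicate where visual information has already been absorbed into the text tokens. With random weights, as here, the numbers carry no such meaning.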
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Domain Adaptation and Few-Shot Learning
