Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention

Shezheng Song; Shasha Li; Jie Yu

arXiv:2601.08151 · cs.CV · January 14, 2026

Open Access

TL;DR

This paper investigates how multimodal large language models fuse visual and textual information internally, revealing layer-specific fusion patterns and proposing a contrastive attention method to enhance multimodal reasoning.

Contribution

The study provides a systematic layer-wise analysis of visual-text fusion in MLLMs and introduces a training-free contrastive attention framework to improve their reasoning capabilities.

Findings

01. Fusion occurs at specific layers rather than uniformly.

02. Certain models show late-stage visual signal reactivation.

03. The proposed method improves multimodal reasoning performance.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and that certain models exhibit a late-stage "review" phenomenon in which visual signals are reactivated before output generation. We further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the…
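
To make the idea of layer-wise masking concrete, here is a minimal toy sketch in plain Python. It is not the authors' protocol, which this listing does not describe. It assumes only that visual tokens can be blocked at a given layer by setting their attention logits to negative infinity, and that a useful per-layer quantity is the share of attention a query places on visual tokens. The function names, the toy logits, and the visual token positions are all hypothetical.

```python
import math

NEG_INF = float("-inf")

def softmax(logits):
    """Numerically stable softmax; entries equal to -inf receive weight 0.

    Assumes at least one finite logit (a fully masked row is not handled).
    """
    m = max(x for x in logits if x != NEG_INF)
    exps = [0.0 if x == NEG_INF else math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visual_attention_mass(logits, visual_positions):
    """Fraction of one query's attention that lands on visual-token keys."""
    weights = softmax(logits)
    return sum(weights[i] for i in visual_positions)

def mask_visual_keys(logits, visual_positions):
    """Block attention to visual tokens by setting their logits to -inf."""
    blocked = set(visual_positions)
    return [NEG_INF if i in blocked else x for i, x in enumerate(logits)]

# Hypothetical toy data: one text query's attention logits at three layers,
# over 4 keys, where positions 0 and 1 are assumed to be visual tokens.
layer_logits = [
    [0.0, 0.0, 0.0, 0.0],  # uniform attention
    [2.0, 2.0, 0.0, 0.0],  # attention concentrated on visual tokens
    [0.0, 0.0, 1.0, 1.0],  # attention concentrated on text tokens
]
visual_positions = [0, 1]

for layer, logits in enumerate(layer_logits):
    mass = visual_attention_mass(logits, visual_positions)
    masked = softmax(mask_visual_keys(logits, visual_positions))
    print(f"layer {layer}: visual mass = {mass:.3f}, masked weights = {masked}")
```

In a real MLLM the masking would be applied inside the attention computation of a chosen layer, and the effect would be measured on downstream task accuracy rather than on a single attention row. The sketch only shows the masking mechanics.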

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Domain Adaptation and Few-Shot Learning