Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Jing Bi; Junjia Guo; Yunlong Tang; Lianggong Bruce Wen; Zhang Liu; Chenliang Xu

arXiv:2412.18108 · cs.CV · November 12, 2025

Open Access

TL;DR

This paper systematically investigates how multimodal large language models interpret visual content. It identifies specialized attention heads that concentrate on visual tokens, which helps explain how LLMs process multimodal input.

Contribution

The paper uncovers a class of attention heads dedicated to visual content in LLMs and analyzes their behavior across 4 model families and 4 model scales.

Findings

01. Identification of attention heads focusing on visual tokens
02. Strong correlation between attention weights and visual content
03. Insights into LLMs' adaptation to multimodal tasks

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the…
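To make the idea of "attention heads that concentrate on visual tokens" concrete, the following is a minimal sketch of one way such a concentration could be measured. It is not the authors' method: the function names `visual_attention_share` and `rank_heads`, the assumed tensor layout `(num_heads, query_len, key_len)`, and the boolean mask marking which key positions are visual tokens are all assumptions made for illustration.

```python
import numpy as np


def visual_attention_share(attn, visual_mask):
    """Average fraction of each head's attention that lands on visual tokens.

    attn:        array of shape (num_heads, query_len, key_len); each row along
                 the last axis is assumed to be a softmax output (sums to 1).
    visual_mask: boolean array of shape (key_len,), True where the key position
                 is a visual token (hypothetical labeling, supplied by caller).
    Returns an array of shape (num_heads,).
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    if attn.ndim != 3 or visual_mask.shape != (attn.shape[-1],):
        raise ValueError("expected attn (heads, q_len, k_len) and mask (k_len,)")

    # Attention mass each query position places on visual keys: (heads, q_len)
    mass_on_visual = attn[:, :, visual_mask].sum(axis=-1)
    # Average over query positions to get one score per head
    return mass_on_visual.mean(axis=-1)


def rank_heads(attn, visual_mask, top_k=1):
    """Return the top_k (head_index, share) pairs, highest share first."""
    share = visual_attention_share(attn, visual_mask)
    order = np.argsort(-share)[:top_k]
    return [(int(h), float(share[h])) for h in order]
```

In this sketch a head with a share close to 1 spends nearly all of its attention on visual tokens, while a share close to the fraction of visual tokens in the input indicates no particular preference. Averaging over query positions is one design choice among several; restricting the average to text-token queries would be an equally reasonable variant.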

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Focus