Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

Yanbei Jiang; Xueqi Ma; Shu Liu; Sarah Monazam Erfani; Tongliang Liu; James Bailey; Jey Han Lau; Krista A. Ehinger

arXiv:2512.10300 · cs.AI · December 12, 2025

Open Access

TL;DR

This paper introduces CogVision, a dataset and interpretability framework to analyze attention heads in vision-language models, revealing their specialized roles in multimodal reasoning and their importance for model performance.

Contribution

It presents a novel methodology to identify and characterize functional attention heads in VLMs, linking them to reasoning processes and demonstrating their impact through intervention experiments.
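One way to picture the identification step is to score each attention head by how well a simple probe predicts a subquestion's function label from that head's activations, then rank heads by score. The sketch below is illustrative only: the abstract says the method is probing-based but gives no details, so the nearest-centroid probe, the `(layer, head)` indexing, the toy activations, and every function name here are assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]
HeadId = Tuple[int, int]  # (layer, head): hypothetical indexing scheme


def centroid(rows: List[Vector]) -> Vector:
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train_x: List[Vector], train_y: List[str],
                   test_x: List[Vector], test_y: List[str]) -> float:
    """Nearest-centroid probe (a stand-in for whatever probe the paper uses)."""
    labels = sorted(set(train_y))
    cents = {c: centroid([x for x, y in zip(train_x, train_y) if y == c])
             for c in labels}
    hits = sum(min(labels, key=lambda c: sq_dist(x, cents[c])) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def rank_heads(acts: Dict[HeadId, List[Vector]], labels: List[str],
               n_train: int) -> List[Tuple[HeadId, float]]:
    """Score every head on a held-out split and sort by descending accuracy."""
    scores = {
        head: probe_accuracy(xs[:n_train], labels[:n_train],
                             xs[n_train:], labels[n_train:])
        for head, xs in acts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (fabricated for illustration): one activation per subquestion.
labels = ["vision", "inference", "vision", "inference", "vision", "inference"]
acts = {
    # Head whose activations separate the two functions.
    (0, 0): [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]],
    # Head whose activations carry no usable signal for this split.
    (0, 1): [[0.1], [0.2], [0.3], [0.4], [0.9], [0.0]],
}
ranking = rank_heads(acts, labels, n_train=4)
```

Under this sketch, heads at the top of `ranking` would be the candidates for "functional heads" for the probed functions; a real analysis would use model activations, a proper train/test protocol, and a selection threshold, none of which are specified on this page.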

Findings

1. Functional heads are sparse and vary across models.

2. Removing functional heads degrades multimodal reasoning performance.

3. Emphasizing functional heads improves accuracy.
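The removal and emphasis findings can both be read as rescaling a head's contribution before the heads are concatenated. The following is a minimal sketch under that assumption: this page does not say how the paper removes or emphasizes heads, so the multiplicative scale, the function names, and the toy vectors are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def scale_heads(head_outputs: List[Vector], scales: Dict[int, float]) -> List[Vector]:
    """Multiply each head's output vector by its scale factor.

    head_outputs: per-head output vectors for one token in one layer.
    scales: head index -> factor; heads not listed keep factor 1.0.
    A factor of 0.0 ablates a head; a factor above 1.0 emphasizes it.
    """
    return [[scales.get(h, 1.0) * x for x in vec]
            for h, vec in enumerate(head_outputs)]


def concat_heads(head_outputs: List[Vector]) -> Vector:
    """Concatenate per-head outputs, as done before the output projection."""
    return [x for vec in head_outputs for x in vec]


# Toy outputs for three heads (fabricated for illustration).
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

ablated = concat_heads(scale_heads(heads, {1: 0.0}))     # head 1 removed
emphasized = concat_heads(scale_heads(heads, {1: 1.5}))  # head 1 amplified
```

In a real model the same idea would be applied inside the attention layer, for example through a forward hook, and its effect measured on task accuracy.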

Abstract

Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism