Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, and James Bailey

TL;DR
This paper investigates how attention heads in Vision-Language Models contribute to spatial reasoning, introducing a new dataset and framework to analyze and improve the models' understanding of space.
Contribution
It presents CogVSR, a dataset for decomposing spatial reasoning, and a probing framework to identify and enhance spatially specialized attention heads in VLMs.
Findings
Spatially specialized heads are fewer than heads specialized for other functions.
Removing functional heads degrades spatial reasoning performance.
Activating spatial heads improves model accuracy.
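The last two findings describe interventions on individual attention heads: removing a head and strengthening a head. This summary does not say how the paper implements either one, so the following is only a minimal, generic sketch that assumes an intervention can be modelled as a per-head scale factor applied before the heads are concatenated. The names `attention_head`, `combine_heads`, and `head_scale` are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of equal-length vectors (lists of floats).
    Returns one output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def combine_heads(head_outputs, head_scale):
    """Concatenate per-head outputs after scaling each head (assumed intervention).

    head_scale[h] == 0.0 ablates head h, 1.0 leaves it unchanged,
    and a value above 1.0 amplifies it.
    """
    n_tokens = len(head_outputs[0])
    combined = []
    for t in range(n_tokens):
        row = []
        for h, out in enumerate(head_outputs):
            row.extend(head_scale[h] * x for x in out[t])
        combined.append(row)
    return combined
```

With two heads, `combine_heads(outputs, [1.0, 0.0])` zeroes the second head's contribution, while `combine_heads(outputs, [1.0, 2.0])` doubles it. Comparing task accuracy under such settings is one way the effect of a given head could be measured.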
Abstract
Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, spatially…
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Language, Metaphor, and Cognition
