The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham

TL;DR
This paper reveals that vision-language models primarily rely on vision encoder representations for spatial reasoning, with language layers playing a secondary role, and demonstrates that enhancing these visual spatial signals improves model performance.
Contribution
The paper identifies two concurrent mechanisms behind spatial reasoning in VLMs and shows that vision encoder representations, rather than the language model layers, are the dominant source of spatial information.
Findings
Vision encoders encode object layout, and this spatial signal is distributed across visual tokens beyond the object regions.
Enhancing visual spatial signals improves reasoning accuracy.
Intermediate language model layers represent content-independent spatial relations, but play only a secondary role in model predictions.
Abstract
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions…
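To make the idea of a globally distributed spatial signal concrete, here is a minimal sketch using a linear probe on synthetic data. It is not the authors' method, and it uses no real VLM features. The data generator, the function names (`make_synthetic_tokens`, `fit_linear_probe`, `probe_accuracy`, `run_demo`), and all parameter values are assumptions made for illustration. The toy data is built so that a weak left/right layout signal is present in every token, so the sketch only shows what such a property would look like operationally; it does not test the paper's claim.

```python
import numpy as np


def make_synthetic_tokens(n_images=600, n_tokens=16, dim=32, seed=0):
    """Toy stand-in for vision-encoder outputs (hypothetical, not real VLM features)."""
    rng = np.random.default_rng(seed)
    # Binary layout label per image: 0 = "object A left of B", 1 = "A right of B".
    labels = rng.integers(0, 2, size=n_images)
    # A single random unit direction carries the layout signal.
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    tokens = rng.normal(size=(n_images, n_tokens, dim))
    # Assumption: the layout signal is added weakly to EVERY token,
    # mimicking a signal that is not confined to object regions.
    signs = 2.0 * labels - 1.0
    tokens += 0.5 * signs[:, None, None] * direction[None, None, :]
    return tokens, labels


def fit_linear_probe(features, labels, l2=1e-2):
    """Fit a ridge-regression probe to +/-1 targets; returns weights incl. bias."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    y = 2.0 * labels - 1.0
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def probe_accuracy(weights, features, labels):
    """Classify by the sign of the probe output and compare with the labels."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    preds = (X @ weights > 0).astype(int)
    return float((preds == labels).mean())


def run_demo(split=400):
    tokens, labels = make_synthetic_tokens()
    pooled = tokens.mean(axis=1)   # average over all tokens in the image
    single = tokens[:, 0, :]       # one arbitrary token only
    w_pooled = fit_linear_probe(pooled[:split], labels[:split])
    w_single = fit_linear_probe(single[:split], labels[:split])
    return (
        probe_accuracy(w_pooled, pooled[split:], labels[split:]),
        probe_accuracy(w_single, single[split:], labels[split:]),
    )


if __name__ == "__main__":
    pooled_acc, single_acc = run_demo()
    print(f"probe on pooled tokens: {pooled_acc:.2f}")
    print(f"probe on a single token: {single_acc:.2f}")
```

In this toy setup the probe on pooled tokens should be noticeably more accurate than the probe on a single token, because averaging suppresses per-token noise while the shared layout signal survives. With real models, the features would come from the vision encoder and the labels from annotated spatial relations.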
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling
