The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Kelly Cui; Nikhil Prakash; Ayush Raina; David Bau; Antonio Torralba; Tamar Rott Shaham

arXiv:2603.22278 · cs.CV · March 24, 2026

TL;DR

This paper shows that vision-language models rely primarily on vision encoder representations for spatial reasoning, with the language model layers playing a secondary role, and that enhancing these visual spatial signals improves model performance.

Contribution

The paper identifies two concurrent mechanisms for spatial reasoning in VLMs, one in the language model backbone and one in the vision encoder, and shows that the vision encoder's representations are the dominant source of spatial information.

Findings

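01. Vision encoder representations encode the layout of objects, and this spatial signal extends beyond object regions.

02. Enhancing visual spatial signals improves reasoning accuracy.

03. Intermediate layers of the language model backbone represent content-independent spatial relations, but play only a secondary role in model predictions.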

Abstract

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling