Enhancing Spatial Reasoning through Visual and Textual Thinking

Xun Liang; Xin Guo; Zhongming Jin; Weihang Pan; Penghui Shang; Deng Cai; Binbin Lin; Jieping Ye

arXiv:2507.20529 · cs.CV · July 29, 2025

TL;DR

This paper presents SpatialVTS, a method that enhances spatial reasoning in vision language models by integrating visual and textual thinking processes, leading to significant improvements without extra data modalities.

Contribution

The paper introduces a novel approach combining visual and textual reasoning phases to improve spatial understanding in VLMs, with dataset corrections and logical reasoning enhancements.

Findings

01 Significant improvement in spatial understanding tasks

02 Effective integration of visual and textual reasoning

03 No additional data modalities required

Abstract

The spatial reasoning task aims to reason about spatial relationships in 2D and 3D space, a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they still struggle with spatial reasoning. In this paper, we introduce a method that enhances Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to automatically generate specific location-related tokens for the essential targets. It addresses not only the objects mentioned in the question but also other objects potentially relevant to the reasoning. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning…
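The two phases described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `<obj>…</obj><box>…</box>` token format, the `Target` class, and the functions `parse_targets` and `textual_thinking` are all hypothetical names chosen for illustration, and the hand-written rule below stands in for what the paper describes as a trained model. The sketch only shows the general idea of first emitting location tokens for relevant objects and then reasoning over them step by step in text.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: a made-up location-token format; the abstract does not
# specify how SpatialVTS actually encodes locations.
TOKEN_RE = re.compile(r"<obj>(.+?)</obj><box>(\d+),(\d+),(\d+),(\d+)</box>")


@dataclass
class Target:
    """One object located during the (hypothetical) visual thinking phase."""
    name: str
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def center_x(self) -> float:
        # Horizontal center of the bounding box, in pixels.
        return (self.x1 + self.x2) / 2


def parse_targets(visual_thinking: str) -> list[Target]:
    """Extract targets from a string of hypothetical location tokens."""
    return [
        Target(name, int(x1), int(y1), int(x2), int(y2))
        for name, x1, y1, x2, y2 in TOKEN_RE.findall(visual_thinking)
    ]


def textual_thinking(targets: list[Target], a: str, b: str) -> list[str]:
    """Return step-by-step reasoning lines about the left/right relation of a and b.

    A hand-written rule stands in for the model's learned reasoning.
    """
    by_name = {t.name: t for t in targets}
    ta, tb = by_name[a], by_name[b]
    steps = [
        f"{a} center x = {ta.center_x}",
        f"{b} center x = {tb.center_x}",
    ]
    if ta.center_x < tb.center_x:
        relation = "left of"
    elif ta.center_x > tb.center_x:
        relation = "right of"
    else:
        relation = "horizontally aligned with"
    steps.append(f"{a} is {relation} {b}")
    return steps


if __name__ == "__main__":
    # Hypothetical output of the visual thinking phase for one image.
    visual = (
        "<obj>cup</obj><box>10,40,50,90</box> "
        "<obj>laptop</obj><box>120,30,300,160</box>"
    )
    for line in textual_thinking(parse_targets(visual), "cup", "laptop"):
        print(line)
```

Keeping the located targets as explicit structured data between the two steps mirrors the abstract's point that textual reasoning is grounded in visual cues, without requiring any extra input modality such as depth.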
