Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

Binbin Ji, Siddharth Agrawal, Qiance Tang, and Yvonne Wu

arXiv:2507.13362 · cs.CV · July 21, 2025

Open Access

TL;DR

This paper explores how structured multi-stage prompting and reinforcement learning enhance the spatial reasoning abilities of vision-language models, leading to better accuracy and robustness, especially under out-of-distribution conditions.

Contribution

The paper introduces SceneGraph CoT prompting and applies Group Relative Policy Optimization (GRPO) to improve spatial reasoning and generalization in vision-language models.
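As a rough illustration of what a structured, multi-stage scene-graph prompt might look like, the sketch below assembles a three-stage text prompt. The stage wording, the triple format, and the function name `build_scene_graph_prompt` are all hypothetical and are not taken from the paper; the paper's actual SceneGraph CoT template may differ.

```python
def build_scene_graph_prompt(question: str) -> str:
    """Assemble a multi-stage prompt (hypothetical wording, not the paper's)."""
    stages = [
        # Stage 1: ground the reasoning in concrete objects.
        "Step 1: List the objects in the image that are relevant to the question.",
        # Stage 2: make spatial relations explicit as a small scene graph.
        "Step 2: Describe the spatial relationships between those objects "
        "as (subject, relation, object) triples.",
        # Stage 3: answer from the scene graph rather than from free-form text.
        "Step 3: Using the triples above, answer the question.",
    ]
    return "\n".join(stages + [f"Question: {question}", "Answer:"])


prompt = build_scene_graph_prompt("Is the mug to the left of the laptop?")
print(prompt)
```

The point of the staged layout is that the model commits to an explicit object and relation list before answering, in contrast to a single free-form reasoning step.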

Findings

01. Structured scene graph prompting improves spatial reasoning accuracy.

02. Reinforcement learning with GRPO outperforms supervised fine-tuning.

03. Models trained with GRPO show better robustness to phrasing variations.

Abstract

This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In…
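For readers unfamiliar with GRPO, its central step is scoring each sampled answer relative to the other answers drawn for the same question, rather than against a learned value function. The sketch below shows only that normalization step. The binary correctness reward, the group size of four, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; reward is 1.0 if correct (assumed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average receive positive advantages and the rest receive negative ones; when all answers in a group earn the same reward, every advantage is zero and that group contributes no learning signal.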

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning