Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
Binbin Ji, Siddharth Agrawal, Qiance Tang, and Yvonne Wu

TL;DR
This paper explores how structured multi-stage prompting and reinforcement learning enhance the spatial reasoning abilities of vision-language models, leading to better accuracy and robustness, especially under out-of-distribution conditions.
Contribution
It introduces SceneGraph CoT prompting and applies Group Relative Policy Optimization to improve spatial reasoning and generalization in vision-language models.
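The abstract describes SceneGraph CoT as structured multi-stage prompting based on scene graphs, but this page does not give the prompt text. The following is a minimal two-stage sketch under that assumption: the function names, the prompt wording, and the triple format are all hypothetical and are not taken from the paper.

```python
# Hypothetical two-stage prompt builder; wording and names are illustrative only.

def scene_graph_prompt(question: str) -> str:
    """Stage 1: ask the model to describe the scene as objects and relations."""
    return (
        "List the objects in the image that are relevant to the question, "
        "then list their spatial relations as (subject, relation, object) triples.\n"
        f"Question: {question}"
    )


def answer_prompt(question: str, scene_graph: str) -> str:
    """Stage 2: ask for the final answer, conditioned on the stage-1 output."""
    return (
        "Use the scene graph below to answer the question.\n"
        f"Scene graph:\n{scene_graph}\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "Is the mug to the left of the laptop?"
stage_one = scene_graph_prompt(question)
# In practice the scene graph would be the model's reply to stage_one;
# a hand-written stand-in is used here so the sketch runs on its own.
stage_two = answer_prompt(question, "(mug, left of, laptop)")
```

Splitting the prompt this way makes the model commit to an explicit scene description before it answers, which is the general idea the abstract contrasts with a single free-form reasoning step.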
Findings
Structured scene graph prompting improves spatial reasoning accuracy.
Reinforcement learning with GRPO outperforms supervised fine-tuning.
Models trained with GRPO show better robustness to phrasing variations.
Abstract
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In…
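As a rough illustration of the "group relative" part of GRPO: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The sketch below assumes a binary correctness reward and uses the population standard deviation with a small `eps`; these are assumptions made for illustration, and the paper's actual reward design and training setup are not described on this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, scored 1.0 if correct and 0.0 otherwise
# (a hypothetical reward, chosen only to keep the example small).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and those below receive negative ones, so no separate learned value model is needed to provide a baseline.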
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
