SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization
Peiyao Wang, Haibin Ling

TL;DR
SVQA-R1 is a reinforcement learning framework that strengthens spatial reasoning in vision-language models on spatial VQA tasks by optimizing view-consistent rewards, improving both accuracy and the interpretability of the model's reasoning.
Contribution
We extend the R1 paradigm to spatial VQA with a new group-wise RL strategy called Spatial-GRPO, promoting grounded spatial understanding without supervised fine-tuning.
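As a rough illustration of what "group-wise" means in GRPO-style training, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. This shows the general group-relative advantage idea only, not the paper's Spatial-GRPO; the function name, the use of the population standard deviation, and the `eps` term are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to its own sampled group.

    rewards: one scalar reward per response sampled for the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one question, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed to estimate it. A group in which every response earns the same reward yields zero advantage for all of them.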
Findings
Significant accuracy improvements on spatial VQA benchmarks.
Model exhibits interpretable reasoning paths.
Effective across multiple spatial reasoning tasks.
Abstract
Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable…
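To make the mirror-flipping idea concrete, here is a minimal sketch of one way a view-consistent reward could be scored: the answer for the horizontally flipped image should be the original answer with "left" and "right" swapped. The abstract does not give the actual reward rule, so everything here is an assumption for illustration, including the function names, the exact-match comparison, and the binary 0/1 reward.

```python
import re

# Horizontal mirroring swaps these two relations; others (e.g. "behind") are unchanged.
_SWAP = {"left": "right", "right": "left"}


def mirror_answer(answer: str) -> str:
    """Return the lower-cased answer expected after a horizontal flip of the image."""
    return re.sub(r"\b(left|right)\b", lambda m: _SWAP[m.group(1)], answer.lower())


def view_consistent_reward(pred_original: str, pred_flipped: str, gold_original: str) -> float:
    """Hypothetical rule: reward 1.0 only when both views are answered correctly.

    pred_original: model answer for the original image
    pred_flipped:  model answer for the mirrored image
    gold_original: reference answer for the original image
    """
    gold = gold_original.strip().lower()
    correct = pred_original.strip().lower() == gold
    consistent = pred_flipped.strip().lower() == mirror_answer(gold)
    return 1.0 if (correct and consistent) else 0.0


# The model says "left" for the original and "right" for the mirrored image.
reward = view_consistent_reward("left", "right", "left")
```

Under this assumed rule, a model that answers "left" for both the original and the mirrored image gets no reward, even though its first answer is correct, which penalizes answers that ignore the image.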
Taxonomy
Topics: Semantic Web and Ontologies · Constraint Satisfaction and Optimization · Logic, Reasoning, and Knowledge
