TL;DR
This paper introduces SpatialReasoner-R1, a vision-language reasoning model that improves fine-grained spatial reasoning by training on long chain-of-thought trajectories generated with a Multi-Model Monte Carlo Tree Search (M3CTS) and by applying fine-grained Direct Preference Optimization (fDPO), achieving state-of-the-art results.
Contribution
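The paper presents SpatialReasoner-R1 and its training framework, which has two parts: a Multi-Model Monte Carlo Tree Search (M3CTS) that generates diverse, logically consistent reasoning trajectories for supervision, and fDPO, a preference optimization technique that applies segment-specific preference granularity to descriptive grounding and logical reasoning.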
Findings
fDPO improves spatial reasoning performance by up to 9.0%.
SpatialReasoner-R1 outperforms baselines on SpatialRGPT-Bench.
The methods maintain strong performance on general vision-language tasks.
Abstract
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCOT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves…
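The abstract describes fDPO only at a high level, so the following is a minimal sketch of what "segment-specific preference granularity" could look like, not the authors' implementation. It assumes a response is split into two segments, here called "description" and "reasoning", that each segment has its own summed log-probability under the policy and a frozen reference model, and that each segment gets its own DPO temperature. The function names, the segment names, the beta values, and the choice to sum the per-segment losses are all hypothetical. The per-segment term is the standard DPO loss.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_w: float, ref_w: float,
             policy_l: float, ref_l: float, beta: float) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)


def segment_dpo_loss(chosen: dict, rejected: dict, betas: dict) -> float:
    """Hypothetical segment-wise variant: one DPO term per segment.

    chosen / rejected map a segment name to (policy_logp, ref_logp).
    betas maps a segment name to its own temperature (an assumption here).
    """
    total = 0.0
    for segment, beta in betas.items():
        policy_w, ref_w = chosen[segment]
        policy_l, ref_l = rejected[segment]
        total += dpo_loss(policy_w, ref_w, policy_l, ref_l, beta)
    return total


# Illustrative numbers only; they are not taken from the paper.
chosen = {"description": (-12.0, -13.0), "reasoning": (-30.0, -33.0)}
rejected = {"description": (-14.0, -13.5), "reasoning": (-35.0, -32.0)}
betas = {"description": 0.1, "reasoning": 0.2}
loss = segment_dpo_loss(chosen, rejected, betas)
```

With a single segment and one beta, this reduces to ordinary DPO. The separate temperatures are just one simple way to let the description and reasoning segments contribute with different strength.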
