Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen; Yuanzhe Liu; Jingyuan Zhu; Xu Cao; Xiaofeng Zhang; Yixiao He; Wenming Ye; James Matthew Rehg; Ismini Lourentzou

arXiv:2506.21656·cs.CV·January 6, 2026

TL;DR

This paper introduces SpatialReasoner-R1, a vision-language reasoning model that improves fine-grained spatial reasoning by combining Long Chain-of-Thought training data generated with Multi-Model Monte Carlo Tree Search (M3CTS) and fine-grained Direct Preference Optimization (fDPO). It outperforms baselines on SpatialRGPT-Bench.

Contribution

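The paper contributes SpatialReasoner-R1 and its training framework, which has two parts: a Multi-Model Monte Carlo Tree Search (M3CTS) that generates diverse, logically consistent reasoning trajectories for supervision, and a fine-grained Direct Preference Optimization (fDPO) method that applies segment-specific preferences to descriptive grounding and logical reasoning.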

Findings

01. fDPO improves spatial reasoning performance by up to 9.0%.

02. SpatialReasoner-R1 outperforms baselines on SpatialRGPT-Bench.

03. The methods maintain strong performance on general vision-language tasks.

Abstract

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves…
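
To make the idea of segment-specific preference granularity concrete, the following is a minimal sketch of a DPO-style loss in which each response segment carries its own weight. It is an illustration under stated assumptions, not the paper's implementation: the function names, the two segment labels ("description" and "reasoning"), and the per-segment beta values are all hypothetical, and the paper's actual fDPO objective may differ. With equal betas the sketch reduces to the standard DPO loss on the summed log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_dpo_loss(policy_logp, ref_logp, betas):
    """DPO-style loss with a separate beta per response segment.

    policy_logp, ref_logp: dict mapping segment name to a pair
        (chosen_logp, rejected_logp) of summed token log-probabilities.
    betas: dict mapping segment name to its weight (hypothetical values).
    """
    margin = 0.0
    for segment, beta in betas.items():
        policy_chosen, policy_rejected = policy_logp[segment]
        ref_chosen, ref_rejected = ref_logp[segment]
        # Implicit reward difference for this segment, as in standard DPO.
        chosen_gain = policy_chosen - ref_chosen
        rejected_gain = policy_rejected - ref_rejected
        margin += beta * (chosen_gain - rejected_gain)
    return -log_sigmoid(margin)


# Illustrative numbers only; they are not taken from the paper.
policy = {"description": (-10.0, -12.0), "reasoning": (-20.0, -25.0)}
reference = {"description": (-11.0, -11.0), "reasoning": (-22.0, -22.0)}
betas = {"description": 0.1, "reasoning": 0.2}
loss = segment_dpo_loss(policy, reference, betas)
```

Giving the reasoning segment a larger beta than the description segment, as in the example values, makes preference violations in the reasoning part contribute more to the margin. How the weights should actually be chosen is left open here.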

Code & Models

Models

PLAN-Lab/SpatialReasoner-R1 (Hugging Face model, 155 downloads)
