SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

TL;DR
SpatialThinker is a novel multimodal language model that enhances 3D spatial reasoning by integrating structured spatial grounding and multi-step reasoning through reinforcement learning, leading to improved performance on spatial understanding tasks.
Contribution
It introduces a new 3D-aware MLLM with a spatial reasoning framework, a high-quality spatial VQA dataset, and an RL training method with dense spatial rewards, advancing spatial understanding in multimodal models.
Findings
Outperforms supervised fine-tuning and sparse RL baselines on spatial tasks.
Nearly doubles the gain over the base model compared with sparse RL.
Surpasses GPT-4o on real-world VQA benchmarks.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial…
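The reviews on this page describe the reward as combining format, count, accuracy, and CIoU-based spatial terms under a lexicographic (gated) priority. The Python sketch below shows one way such gating could be wired. It is an illustration under stated assumptions, not the authors' implementation: the weights, the linear count penalty, and the exact gating order (format gates everything, the spatial term only counts when the answer is correct) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Illustrative weights only; the values are assumptions for this sketch.
    fmt: float = 0.1
    count: float = 0.2
    accuracy: float = 0.5
    spatial: float = 0.2


def count_reward(n_pred: int, n_gt: int) -> float:
    """Assumed count term: 1.0 on an exact match, decaying linearly with relative error."""
    if n_gt == 0:
        return 1.0 if n_pred == 0 else 0.0
    return max(0.0, 1.0 - abs(n_pred - n_gt) / n_gt)


def gated_reward(
    format_ok: bool,
    n_pred: int,
    n_gt: int,
    answer_correct: bool,
    spatial_score: float,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Combine the four terms so that lower-priority terms cannot be farmed alone.

    spatial_score is assumed to be a localization score already scaled to [0, 1].
    """
    # Gate 1: a malformed response earns nothing at all.
    if not format_ok:
        return 0.0
    total = w.fmt + w.count * count_reward(n_pred, n_gt)
    # Gate 2: localization is only rewarded once the final answer is right,
    # so good boxes attached to a wrong answer are not reinforced.
    if answer_correct:
        total += w.accuracy
        total += w.spatial * min(1.0, max(0.0, spatial_score))
    return total
```

With these assumed weights, a well-formatted wrong answer with perfect boxes scores lower than a correct answer with no localization credit, which is the ordering a lexicographic design is meant to guarantee.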
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The paper is well written and easy to follow. * The experimental results are strong, showing consistent improvements over supervised fine-tuning and vanilla GRPO across multiple spatial reasoning benchmarks. * The method works well under limited data, which is practical and valuable for real-world settings.
* Although the results are good, the overall contribution feels somewhat limited. It is not surprising to see RL fine-tuning outperform supervised fine-tuning when data is limited. The grounding and counting rewards also seem broadly useful for visual understanding, not specifically tied to 3D reasoning. * The data synthesis pipeline mainly relies on VG annotations and converting them into QA format through an LLM. This limits the scalability to the scope and patterns of VG and may not cover mo
1. The lexicographically ordered multi-objective reward formulation is clearly defined and well-justified. The authors explicitly specify the gating logic, count penalty, and CIoU-based spatial localization terms, demonstrating a careful design that mitigates reward hacking and promotes interpretable policy learning. 2. The method achieves strong data efficiency, training on only ~7K samples while surpassing both supervised fine-tuning and sparse-reward RL baselines across six spatial reasoning
1. The evaluation omits several strong and relevant baselines. Notably, it does not compare against recent closed-source models such as Gemini 2.5 or contemporary open-source spatial reasoning models such as SpatialReasoner and Space-Qwen, limiting the strength of the empirical claims. 2. Limited novelty in principle. While the integration of scene-graph grounding with gated multi-objective rewards is technically coherent, the individual components—scene-graph grounding and reinforcement learni
1. The introduction of a dense, lexicographically gated reward combining format, count, accuracy, and spatial objectives is both elegant and effective. It provides continuous feedback for intermediate reasoning steps instead of only rewarding final answers, which is a key step toward process-level RL in multimodal settings. 2. Integrating question-focused subgraphs into the reasoning process is a well-motivated and interpretable approach. It encourages localized spatial grounding and generates
1. Overall, the paper mainly contributes by introducing a reward-based framework that supervises dense spatial information through a scene-level reward, including counting and localization components. Although the dataset is generated using existing methods, the overall approach is relatively straightforward. The counting reward is a reasonable design, but the CIoU term is essentially a variant of the IoU metric, which has already been widely used in vision–language models trained with reinforce
(1) The proposed reward function, which includes format, count, accuracy, and CIoU-based spatial rewards with lexicographic priority, provides a rich and structured learning signal that enables more stable and focused RL training. (2) SPATIALTHINKER is trained with only 7K samples but achieves strong generalization across 12 benchmarks, outperforming both supervised fine-tuning and sparse RL baselines. (3) The paper is the first to combine scene graph-based spatial grounding with online policy
(1) While the model introduces a dense multi-objective reward design, it remains unclear how each reward component individually contributes to training, as no detailed ablation is provided. (2) STVQA-7K focuses on single-image spatial reasoning, while most real-world spatial tasks require multi-view or video-based reasoning. It is unclear whether SPATIALTHINKER can handle such settings. Additionally, the scarcity of densely annotated spatial VQA data may limit its broader applicability. (3) The
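Several of the reviews above refer to the CIoU-based localization term and note that it is a variant of IoU. For reference, the sketch below computes the standard Complete IoU between two axis-aligned boxes: plain IoU minus a normalized center-distance penalty and an aspect-ratio penalty. The `(x1, y1, x2, y2)` box format and the function name are assumptions for illustration; how the paper matches predicted objects to ground-truth objects or rescales the score is not stated on this page.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU for two boxes given as (x1, y1, x2, y2); ranges roughly in (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU from intersection and union areas.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((bx2 - bx1) / (by2 - by1 + eps)) - math.atan((ax2 - ax1) / (ay2 - ay1 + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, this score still distinguishes a near miss from a far miss, which is why it can serve as a denser training signal.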
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
