STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang

TL;DR
This paper introduces STAR-R1, a reinforcement learning framework that significantly improves spatial reasoning in multimodal large language models, enabling efficient exploration during training and accurate identification of object transformations across views.
Contribution
STAR-R1 integrates a single-stage RL paradigm with a fine-grained reward mechanism, advancing spatial reasoning capabilities in multimodal models beyond traditional fine-tuning methods.
Findings
STAR-R1 outperforms supervised fine-tuning by 23% in cross-view spatial reasoning.
Achieves state-of-the-art results across 11 evaluation metrics.
Exhibits human-like reasoning behaviors and object comparison abilities.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise…
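The reward idea described in the abstract (credit for partial correctness, penalties for excessive enumeration and for passive inaction) can be illustrated with a minimal sketch. This is not the authors' implementation: the `(object, attribute, value)` triple format, the function name `fine_grained_reward`, and all weights are assumptions chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical answer format: (object_id, attribute, new_value).
Transformation = Tuple[str, str, str]


def fine_grained_reward(
    predicted: List[Transformation],
    target: List[Transformation],
    w_object: float = 0.5,     # assumed credit: right object only
    w_attribute: float = 1.0,  # assumed credit: right object and attribute
    w_full: float = 5.0,       # assumed credit: fully correct triple
    penalty: float = 1.0,      # assumed cost per extra guess or missed change
) -> float:
    """Score a predicted list of transformations against the ground truth."""
    # Passive inaction: answering nothing when changes exist is penalized.
    if not predicted and target:
        return -penalty * len(target)

    reward = 0.0
    remaining = list(target)  # each ground-truth item can be credited once
    for obj, attr, val in predicted:
        best, best_score = None, 0.0
        for t in remaining:
            if t[0] != obj:
                continue
            score = w_object
            if t[1] == attr:
                score = w_full if t[2] == val else w_attribute
            if score > best_score:
                best, best_score = t, score
        if best is not None:
            remaining.remove(best)
            reward += best_score

    # Excessive enumeration: guesses beyond the true count cost reward.
    extra = max(0, len(predicted) - len(target))
    return reward - penalty * extra


if __name__ == "__main__":
    truth = [("0", "color", "red"), ("1", "size", "large")]
    print(fine_grained_reward(truth, truth))                     # exact answer
    print(fine_grained_reward([("0", "color", "blue")], truth))  # partial credit
    print(fine_grained_reward([], truth))                        # inaction
```

Under these assumed weights, a partially correct answer still earns a positive signal, while listing many spurious transformations or returning an empty answer scores lower than an honest attempt. That is the dense-reward behavior the abstract contrasts with sparse-reward RL.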
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
Methods: Shrink and Fine-Tune
