STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Zongzhao Li; Zongyang Ma; Mingze Li; Songyou Li; Yu Rong; Tingyang Xu; Ziqi Zhang; Deli Zhao; Wenbing Huang

arXiv:2505.15804 · cs.CV · July 11, 2025

Open Access

TL;DR

This paper introduces STAR-R1, a reinforcement learning framework that improves spatial reasoning in multimodal large language models. Its fine-grained reward lets the model explore efficiently and identify object transformations accurately across viewpoints.

Contribution

STAR-R1 integrates a single-stage RL paradigm with a fine-grained reward mechanism, advancing spatial reasoning capabilities in multimodal models beyond traditional fine-tuning methods.

Findings

01. STAR-R1 outperforms supervised fine-tuning by 23% in cross-view spatial reasoning.

02. STAR-R1 achieves state-of-the-art results across 11 evaluation metrics.

03. STAR-R1 exhibits human-like reasoning behaviors and object comparison abilities.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise…
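
The reward idea described in the abstract (credit for partial correctness, penalties for excessive enumeration and for passive inaction) can be illustrated with a minimal sketch. This is not the authors' reward function: the triple representation `(object, attribute, value)`, the function name `dense_reward`, and every numeric weight below are assumptions chosen only to make the idea concrete.

```python
from typing import List, Tuple

# Hypothetical representation: one transformation = (object_id, attribute, new_value).
Transformation = Tuple[str, str, str]


def dense_reward(
    predicted: List[Transformation],
    ground_truth: List[Transformation],
    partial_obj: float = 0.25,      # assumed credit: only the object is right
    partial_attr: float = 0.5,      # assumed credit: object and attribute are right
    full: float = 1.0,              # assumed credit: the whole triple is right
    extra_penalty: float = 0.5,     # assumed cost per unmatched (enumerated) guess
    inaction_penalty: float = 1.0,  # assumed cost for answering with nothing
) -> float:
    """Illustrative fine-grained reward; all weights are placeholders."""
    # Passive inaction: the model proposes no transformation at all.
    if not predicted:
        return -inaction_penalty if ground_truth else 0.0

    remaining = list(ground_truth)
    reward = 0.0
    for pred in predicted:
        best_score, best_idx = 0.0, None
        for i, gt in enumerate(remaining):
            if pred == gt:
                score = full
            elif pred[:2] == gt[:2]:
                score = partial_attr
            elif pred[0] == gt[0]:
                score = partial_obj
            else:
                score = 0.0
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:
            # Guess matches nothing still unclaimed: discourages listing everything.
            reward -= extra_penalty
        else:
            reward += best_score
            remaining.pop(best_idx)  # each ground-truth item is credited once
    return reward


gt = [("obj1", "color", "red"), ("obj2", "size", "large")]
print(dense_reward(gt, gt))                           # exact answer
print(dense_reward([("obj1", "color", "blue")], gt))  # partially correct
print(dense_reward([], gt))                           # passive inaction
```

Under a sparse reward, the partially correct answer and the empty answer would both score zero; a graded signal of this kind separates them, which is the exploration benefit the abstract attributes to the fine-grained design.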

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling

Methods: Shrink and Fine-Tune