Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu; Beichen Zhang; Yuhang Zang; Yuhang Cao; Long Xing; Xiaoyi Dong; Haodong Duan; Dahua Lin; Jiaqi Wang

arXiv:2510.27606 · cs.CV · November 26, 2025

Open Access · 3 Models · 1 Dataset

TL;DR

Spatial-SSRL introduces a self-supervised reinforcement learning approach that enhances spatial reasoning in large vision-language models by leveraging automatically generated, verifiable pretext tasks from ordinary images, improving performance on spatial understanding benchmarks.

Contribution

The paper presents a self-supervised RL paradigm that formulates five verifiable spatial pretext tasks from ordinary images, removing the need for costly supervision and improving spatial reasoning in LVLMs.

Findings

01. Achieved average accuracy gains of 4.63% and 3.89% across seven spatial benchmarks.

02. Improved spatial reasoning without sacrificing general visual capabilities.

03. Validated effectiveness across multiple image and video spatial understanding tasks.

Abstract

Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial…
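To make the idea of a self-verifying pretext task concrete, here is a minimal sketch of shuffled patch reordering, the first of the five tasks named above. It is an illustration under stated assumptions, not the authors' implementation: the function names (`split_into_patches`, `make_reordering_task`, `verify`), the 2x2 grid, the answer encoding, and the binary exact-match reward are all hypothetical choices made for this example. The point it shows is that the ground truth falls out of the construction itself, so no human or LVLM annotation is needed.

```python
import random


def split_into_patches(image, grid):
    """Split a 2D image (a list of rows) into grid x grid patches, row-major."""
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def make_reordering_task(image, grid=2, seed=0):
    """Build one task instance: shuffled patches plus the ground-truth answer.

    Assumed answer encoding: answer[i] is the index, within the shuffled
    list, of the patch that belongs at original position i.
    """
    rng = random.Random(seed)
    patches = split_into_patches(image, grid)
    order = list(range(len(patches)))
    rng.shuffle(order)  # shuffled[k] holds original patch order[k]
    shuffled = [patches[i] for i in order]
    answer = [order.index(i) for i in range(len(patches))]
    return shuffled, answer


def verify(predicted, answer):
    """Assumed binary reward: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if list(predicted) == list(answer) else 0.0


if __name__ == "__main__":
    # A toy 4x4 "image" whose pixel values are 0..15.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    shuffled, answer = make_reordering_task(image, grid=2, seed=0)
    restored = [shuffled[j] for j in answer]
    print(restored == split_into_patches(image, 2))
    print(verify(answer, answer))
```

Because the permutation is drawn by the task generator, checking a model's response reduces to a list comparison. That is the property that lets such a task serve as a verifiable reward signal in RL. A real pipeline would also need to render the shuffled patches into a prompt and parse the model's output, which this sketch omits.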

Code & Models

Models

Datasets

internlm/Spatial-SSRL-81k
dataset · 455 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications