Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang; Meng Cao; Ruyang Liu; Xiaoxi Liang; Linglong Li; Ge Li; Xiaodan Liang

arXiv:2511.13190 · cs.CV · November 18, 2025

TL;DR

This paper introduces Object-Centric 3D Rollout (OCR), a training strategy that improves video spatial reasoning in multi-modal large language models by applying structured perturbations to the 3D geometry of selected objects during training, reaching 47.5% accuracy on VSI-Bench.

Contribution

The paper presents OCR, a novel training method that improves holistic scene understanding in vision-language models through 3D geometric perturbations and a joint training pipeline.

Findings

01 Achieves 47.5% accuracy on VSI-Bench, surpassing larger baselines.

02 Demonstrates OCR's effectiveness over prior rollout methods.

03 Validates the importance of holistic reasoning in video understanding.

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels…
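The perturb-then-project idea described in the abstract can be illustrated with a toy sketch. The abstract does not say which perturbation OCR applies or how objects are selected, so everything here is an assumption made for illustration: the Gaussian jitter, the pinhole camera model, and the names `perturb_object_points`, `project_pinhole`, `sigma`, and `selected_ids` are hypothetical and are not taken from the paper.

```python
import random

def perturb_object_points(points, object_ids, selected_ids, sigma=0.05, seed=0):
    """Jitter the 3D points of the selected objects; leave all others untouched.

    Gaussian noise is an assumed stand-in for the paper's unspecified
    "structured perturbation".
    """
    rng = random.Random(seed)
    out = []
    for (x, y, z), oid in zip(points, object_ids):
        if oid in selected_ids:
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            z += rng.gauss(0.0, sigma)
        out.append((x, y, z))
    return out

def project_pinhole(points, fx, fy, cx, cy):
    """Project camera-frame 3D points to pixels with an assumed pinhole model.

    Points at or behind the camera plane (z <= 0) map to None.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels

# Toy scene: two points on object 0, one point on object 1 (camera frame, metres).
points = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (-0.5, 0.2, 4.0)]
object_ids = [0, 0, 1]

# Degrade only object 0, then render both versions into 2D.
perturbed = perturb_object_points(points, object_ids, selected_ids={0})
pixels_before = project_pinhole(points, 500.0, 500.0, 320.0, 240.0)
pixels_after = project_pinhole(perturbed, 500.0, 500.0, 320.0, 240.0)
```

In this sketch only the pixels belonging to the selected object move, while the rest of the scene projects exactly as before. That mirrors, in a very reduced form, the abstract's description of degrading object-specific cues so that the remaining context has to carry the answer.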

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning