Video Spatial Reasoning with Object-Centric 3D Rollout
Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang

TL;DR
This paper introduces Object-Centric 3D Rollout (OCR), a training strategy that enhances video spatial reasoning in large models by applying structured perturbations to 3D object geometry, leading to state-of-the-art results.
Contribution
The paper presents OCR, a novel training method that improves holistic scene understanding in vision-language models through 3D geometric perturbations and a joint training pipeline.
Findings
Achieves 47.5% accuracy on VSI-Bench, surpassing larger baselines.
Demonstrates OCR's effectiveness over prior rollout methods.
Validates the importance of holistic reasoning in video understanding.
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels…
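The perturb-then-project idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian jitter, the pinhole camera model, and the names `perturb_object_points` and `project_pinhole` are all assumptions chosen for illustration, since the abstract does not specify the form of the perturbation or the projection.

```python
import random


def perturb_object_points(points, object_ids, selected_ids, sigma=0.05, seed=0):
    """Jitter the 3D points of selected objects; leave all other points untouched.

    Gaussian noise is an assumed stand-in for the paper's "structured
    perturbations", which the abstract does not describe in detail.
    """
    rng = random.Random(seed)
    perturbed = []
    for (x, y, z), oid in zip(points, object_ids):
        if oid in selected_ids:
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            z += rng.gauss(0.0, sigma)
        perturbed.append((x, y, z))
    return perturbed


def project_pinhole(points, fx, fy, cx, cy):
    """Project camera-frame 3D points to pixel coordinates (assumed pinhole model).

    Points at or behind the camera plane (z <= 0) map to None.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


# Hypothetical scene: two points on object 0, one point on object 1.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
object_ids = [0, 0, 1]

# Degrade only object 1, then render the altered geometry into 2D.
perturbed = perturb_object_points(points, object_ids, selected_ids={1})
pixels = project_pinhole(perturbed, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In this sketch only the points belonging to the selected object move, so the 2D projection of every other object is unchanged. That mirrors the abstract's description of degrading object-specific cues while leaving the surrounding context intact.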
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
