Thinking with Spatial Code for Physical-World Video Reasoning
Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille

TL;DR
This paper presents a framework that converts RGB videos into explicit 3D spatial representations, enabling LLMs to perform physical-world video reasoning with improved accuracy and interpretability.
Contribution
The introduction of a spatial encoder that unifies 6D object parsing with geometric prediction and the finetuning of LLMs with a spatial rubric reward for perspective-aware reasoning.
Findings
Outperforms proprietary vision-language models on VSI-Bench
Achieves state-of-the-art results in physical-world video reasoning
Demonstrates effective parsing of videos into structured 3D spatial codes
Abstract
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose the spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetuning LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
