Thinking with Spatial Code for Physical-World Video Reasoning

Jieneng Chen; Wenxin Ma; Ruisheng Yuan; Yunzhi Zhang; Jiajun Wu; Alan Yuille

arXiv:2603.05591·cs.CV·March 9, 2026

Open Access

TL;DR

This paper presents a framework that converts RGB videos into explicit 3D spatial representations, enabling LLMs to perform physical-world video reasoning with improved accuracy and interpretability.

Contribution

The paper introduces a spatial encoder that unifies 6D object parsing with geometric prediction, and finetunes LLMs with reinforcement learning using a spatial rubric reward for perspective-aware reasoning.

Findings

01 Outperforms proprietary vision-language models on VSI-Bench

02 Achieves state-of-the-art results in physical-world video reasoning

03 Demonstrates effective parsing of videos into structured 3D spatial codes

Abstract

We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art. Code is…
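To make the idea of reasoning over "explicit spatial variables" concrete, the following is a minimal sketch of what a spatial code entry and a perspective-aware query over it might look like. It is not the authors' format or implementation: the `OrientedBox3D` class, its field names, the text serialization, the z-up world frame, and the reduction of orientation to a single yaw angle are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OrientedBox3D:
    """Hypothetical spatial-code entry: a labeled, oriented 3D box."""
    label: str
    center: Tuple[float, float, float]  # assumed: meters, world frame, z up
    size: Tuple[float, float, float]    # assumed: width, depth, height
    yaw: float                          # assumed: radians about the z axis


def to_spatial_code(objects: List[OrientedBox3D]) -> str:
    """Serialize boxes to plain text an LLM could read (format is illustrative)."""
    lines = []
    for o in objects:
        c = ", ".join(f"{v:.2f}" for v in o.center)
        s = ", ".join(f"{v:.2f}" for v in o.size)
        lines.append(f"{o.label}: center=({c}) size=({s}) yaw={o.yaw:.2f}")
    return "\n".join(lines)


def center_distance(a: OrientedBox3D, b: OrientedBox3D) -> float:
    """Euclidean distance between box centers."""
    return math.dist(a.center, b.center)


def relative_direction(observer: OrientedBox3D,
                       facing: OrientedBox3D,
                       target: OrientedBox3D) -> str:
    """Standing at `observer` and facing `facing`, is `target` left or right?

    Uses the sign of the 2D cross product on the ground plane (x, y).
    With z up, a positive cross product means the target lies
    counter-clockwise from the forward direction, i.e. to the left.
    """
    fx = facing.center[0] - observer.center[0]
    fy = facing.center[1] - observer.center[1]
    tx = target.center[0] - observer.center[0]
    ty = target.center[1] - observer.center[1]
    cross = fx * ty - fy * tx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "ahead-or-behind"
```

Once the scene is expressed this way, a question such as "standing at the sofa and facing the TV, is the lamp on the left or the right?" reduces to arithmetic on box centers rather than inference from pixels, which is the kind of geometrically grounded reasoning the abstract describes.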

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning