Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Shengchao Zhou; Yuxin Chen; Yuying Ge; Wei Huang; Jiehong Lin; Ying Shan; Xiaojuan Qi

arXiv:2512.20557·cs.CV·December 24, 2025

TL;DR

This paper introduces DSR Suite, a comprehensive framework including a new dataset, benchmark, and model enhancements for improving vision-language models' ability to perform dynamic spatial reasoning in 4D from in-the-wild videos.

Contribution

The paper presents a scalable pipeline for generating 4D-aware training data, a new benchmark for dynamic spatial reasoning, and a lightweight module to incorporate geometric priors into vision-language models.

Findings

01. Enhanced dynamic spatial reasoning in VLMs with DSR-Train and GSM.

02. Maintained performance on general video understanding tasks.

03. Significant improvement in 4D spatial reasoning capabilities.

Abstract

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across dataset, benchmark, and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources,…

Code & Models

Models

TencentARC/DSR_Suite-Model
model · 30 dl · ♡ 4

Datasets

TencentARC/DSR_Suite-Data
dataset · 113 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Robot Manipulation and Learning