Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi

TL;DR
This paper introduces DSR Suite, which combines a new training dataset (DSR-Train), a human-refined benchmark (DSR-Bench), and a lightweight model enhancement to improve vision-language models' dynamic spatial reasoning in 4D from in-the-wild videos.
Contribution
The paper presents a scalable pipeline for generating 4D-aware training data, a new benchmark for dynamic spatial reasoning, and a lightweight module to incorporate geometric priors into vision-language models.
Findings
DSR-Train and GSM enhance dynamic spatial reasoning in VLMs.
Performance on general video understanding tasks is maintained.
4D spatial reasoning capabilities improve significantly.
Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across dataset, benchmark, and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources,…
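To make the idea of turning geometric cues into multiple-choice question-answer pairs concrete, here is a minimal Python sketch. It is not the authors' pipeline: it assumes per-frame 3D object positions have already been extracted in a shared world frame, and the function names (distance_trend, make_question), the question template, and the tolerance value are all hypothetical.

```python
import math
import random


def distance_trend(traj_a, traj_b, tol=0.1):
    """Classify how the distance between two objects changes over a clip.

    traj_a, traj_b: lists of (x, y, z) positions, one per frame, assumed to
    be expressed in the same world frame. tol is an assumed absolute
    threshold (same units as the positions) below which the change is
    treated as negligible.
    """
    d_start = math.dist(traj_a[0], traj_b[0])
    d_end = math.dist(traj_a[-1], traj_b[-1])
    change = d_end - d_start
    if abs(change) <= tol:
        return "stays roughly the same"
    return "increases" if change > 0 else "decreases"


def make_question(name_a, traj_a, name_b, traj_b, seed=0):
    """Build one multiple-choice QA item from two 3D trajectories."""
    answer = distance_trend(traj_a, traj_b)
    options = ["increases", "decreases", "stays roughly the same"]
    # Shuffle with a fixed seed so the answer position varies across items
    # while each item remains reproducible.
    random.Random(seed).shuffle(options)
    letters = "ABC"
    return {
        "question": (
            f"From the first to the last frame, how does the distance "
            f"between the {name_a} and the {name_b} change?"
        ),
        "options": {letters[i]: opt for i, opt in enumerate(options)},
        "answer": letters[options.index(answer)],
    }


# Toy trajectories (hypothetical values): the car moves toward a static person.
car = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (2.0, 0.0, 5.0)]
person = [(4.0, 0.0, 5.0), (4.0, 0.0, 5.0), (4.0, 0.0, 5.0)]
qa = make_question("car", car, "person", person)
```

In this toy case the distance shrinks from 4 to 2 units, so the correct option is the one reading "decreases". A real pipeline would also need the camera poses, masks, and orientations listed in the abstract to answer viewpoint-dependent questions; this sketch covers only a distance-based template.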
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Robot Manipulation and Learning
