4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

TL;DR
4DThinker is a framework that lets vision-language models reason about dynamic scenes by internally simulating how they evolve in 4D, targeting complex spatial reasoning tasks on video.
Contribution
The paper presents the first framework that lets VLMs "think with 4D" through dynamic latent imagery, combining annotation-free data synthesis, Dynamic-Imagery Fine-Tuning (DIFT), and reinforcement learning.
Findings
Outperforms strong baselines on multiple dynamic spatial reasoning benchmarks.
Introduces a scalable, annotation-free data generation pipeline for 4D reasoning.
Demonstrates the effectiveness of 4D latent simulation in complex video understanding.
Abstract
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents…
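The abstract is cut off before it specifies how DIFT combines its two supervision signals, so the following is only a minimal sketch of what jointly supervising textual tokens and latents could look like. It assumes a cross-entropy term on text tokens plus a mean-squared-error term on latent vectors, mixed by a hypothetical `latent_weight`; none of these choices, names, or the weighting scheme come from the paper.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def latent_mse(pred, target):
    """Mean squared error between two equal-length latent vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(text_logits, text_targets, latent_preds, latent_targets,
               latent_weight=1.0):
    """Hypothetical joint objective: text loss + latent_weight * latent loss.

    text_logits    : list of per-token logit lists
    text_targets   : list of target token ids
    latent_preds   : list of predicted latent vectors
    latent_targets : list of reference latent vectors
    latent_weight  : assumed mixing coefficient (not from the paper)
    """
    text_term = sum(
        token_cross_entropy(l, t) for l, t in zip(text_logits, text_targets)
    ) / len(text_targets)
    latent_term = sum(
        latent_mse(p, t) for p, t in zip(latent_preds, latent_targets)
    ) / len(latent_targets)
    return text_term + latent_weight * latent_term

# Toy usage: one text token over a 2-word vocabulary, one 2-d latent.
loss = joint_loss([[0.0, 0.0]], [0], [[0.0, 0.0]], [[1.0, 1.0]],
                  latent_weight=0.5)
```

In this toy setup the text term is the loss of a uniform guess over two tokens and the latent term is the squared distance per dimension, so raising `latent_weight` shifts the objective toward matching the latents.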
