Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Ning Yu, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

TL;DR
Video4Spatial shows that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks, such as scene navigation and object grounding, using only visual data.
Contribution
The paper introduces Video4Spatial, a framework showing that video generative models can exhibit visuospatial reasoning from video-only inputs, without auxiliary modalities such as depth or poses.
Findings
Successfully performs scene navigation and object grounding tasks.
Maintains spatial consistency and generalizes to new environments.
Demonstrates strong spatial understanding from video context.
Abstract
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
