4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation
Haonan Wang, Hanyu Zhou, Haoyue Liu, Luxin Yan

TL;DR
4D-VGGT introduces a versatile foundation model for dynamic scene geometry estimation, effectively integrating multi-view and temporal data through a divide-and-conquer approach, and demonstrating superior performance across multiple benchmarks.
Contribution
The paper presents a novel spatiotemporal representation framework with multi-setting input, multi-level fusion, and multi-task prediction for dynamic scene geometry estimation.
Findings
Effective multi-view and temporal feature integration.
Improved accuracy on dynamic scene geometry benchmarks.
Versatile application across various tasks.
Abstract
We investigate the challenging task of dynamic scene geometry estimation, which requires representing both spatial and temporal features. Existing methods typically align the two kinds of features in a unified latent space to model scene geometry. However, this unified paradigm can produce mismatched representations because spatial and temporal features are heterogeneous. In this work, we propose 4D-VGGT, a general foundation model with a divide-and-conquer spatiotemporal representation for dynamic scene geometry. Our model has three components: 1) Multi-setting input. We design an adaptive visual grid that supports input sequences with arbitrary numbers of views and time steps. 2) Multi-level representation. We propose a cross-view global fusion for spatial representation and a cross-time local fusion for temporal representation. 3) Multi-task prediction. We…
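The divide-and-conquer idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract only names the two fusion stages, so the token layout `(T, V, N, D)` (time steps, views, patches per frame, channels), the single-head residual attention, the `window` parameter, and the function names `cross_view_global_fusion` and `cross_time_local_fusion` are all assumptions made for illustration. The sketch treats "global" as attention over every token from every view at one time step, and "local" as attention restricted to a small temporal neighbourhood.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Plain scaled dot-product attention over the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def cross_view_global_fusion(tokens):
    # Assumed spatial stage: at each time step, every token attends to
    # all tokens from all views (global attention over V * N tokens).
    T, V, N, D = tokens.shape
    x = tokens.reshape(T, V * N, D)
    x = x + attention(x, x, x)          # residual connection
    return x.reshape(T, V, N, D)

def cross_time_local_fusion(tokens, window=1):
    # Assumed temporal stage: each (view, patch) token attends only to
    # the same token at time steps within +/- `window` frames.
    T, V, N, D = tokens.shape
    x = tokens.transpose(1, 2, 0, 3)    # (V, N, T, D)
    t = np.arange(T)
    mask = np.abs(t[:, None] - t[None, :]) <= window   # (T, T) band mask
    x = x + attention(x, x, x, mask)    # residual connection
    return x.transpose(2, 0, 1, 3)      # back to (T, V, N, D)

# Usage: the same functions accept any number of views and time steps.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 3, 5, 8))   # T=4, V=3, N=5, D=8
fused = cross_time_local_fusion(cross_view_global_fusion(tokens), window=1)
```

Keeping the two stages separate means the spatial stage never mixes information across time and the temporal stage never mixes across views or patch positions, which is one simple way to read the "divide-and-conquer" framing. Because nothing in the sketch depends on fixed values of `T` or `V`, it also loosely mirrors the "arbitrary numbers of views and time steps" property, although the paper's adaptive visual grid is not modelled here.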
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Human Pose and Action Recognition
