4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation
Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu

TL;DR
This paper introduces a novel, training-free framework for 4D scene reconstruction from monocular videos that effectively disentangles dynamic and static elements, leading to improved geometric accuracy.
Contribution
The proposed method offers a new decoupling approach with three components that stabilize camera pose, decompose depth manifolds, and adaptively fuse predictions without fine-tuning.
Findings
Achieves consistent improvements on 4D reconstruction benchmarks.
Performs competitively without requiring fine-tuning.
Demonstrates effective dynamic-static disentanglement in complex scenes.
Abstract
Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological…
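The idea behind component (1), keeping dynamic regions out of pose estimation, can be illustrated with a small toy example. The sketch below is not the paper's module: it swaps in a classical masked rigid alignment (the Kabsch algorithm) between two 3D point sets, and the names `masked_rigid_align` and `demo`, the synthetic data, and the given static mask are all assumptions made for illustration. It only shows why excluding moving points tends to give a cleaner camera-motion estimate.

```python
import numpy as np


def masked_rigid_align(src, dst, static_mask):
    """Hypothetical helper (not from the paper): estimate R, t such that
    dst ~= src @ R.T + t, using only the points flagged as static."""
    P = src[static_mask]
    Q = dst[static_mask]
    if P.shape[0] < 3:
        raise ValueError("need at least 3 static points")
    mu_p = P.mean(axis=0)
    mu_q = Q.mean(axis=0)
    # Cross-covariance of the centered static points (Kabsch algorithm).
    H = (P - mu_p).T @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t


def demo(seed=0):
    """Synthetic scene: a rigid camera motion plus an independently moving object."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(200, 3))
    angle = 0.3
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.5, -0.2, 0.1])
    moved = pts @ R_true.T + t_true

    # The first 60 points belong to a "dynamic object" with extra motion.
    static_mask = np.ones(200, dtype=bool)
    static_mask[:60] = False
    moved[:60] += np.array([2.0, 0.0, 0.0])

    all_points = np.ones(200, dtype=bool)
    R_all, t_all = masked_rigid_align(pts, moved, all_points)   # no masking
    R_st, t_st = masked_rigid_align(pts, moved, static_mask)    # static only
    return R_true, t_true, (R_all, t_all), (R_st, t_st)


if __name__ == "__main__":
    R_true, t_true, (R_all, t_all), (R_st, t_st) = demo()
    print("translation error, all points: ", np.linalg.norm(t_all - t_true))
    print("translation error, static only:", np.linalg.norm(t_st - t_true))
```

In this toy setting the static-only estimate recovers the true motion, while the unmasked estimate is pulled toward the moving object. How the paper obtains its dynamic mask and how it applies it inside the transformer are not covered by this sketch.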
