4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

Ying Zang; Xuanyi Liu; Yidong Han; Deyi Ji; Chaotao Ding; Yuanqi Hu; Qi Zhu; Xuanfu Li; Jin Ma; Lingyun Sun; Tianrun Chen; Lanyun Zhu

arXiv:2605.12027 · cs.CV · May 13, 2026

TL;DR

This paper introduces a novel, training-free framework for 4D scene reconstruction from monocular videos that effectively disentangles dynamic and static elements, leading to improved geometric accuracy.

Contribution

The proposed method is a decoupling approach built from three components, which in turn stabilize the camera pose, decompose depth manifolds, and adaptively fuse predictions. None of the three requires fine-tuning.

Findings

1. Achieves consistent improvements on 4D reconstruction benchmarks.
2. Performs competitively without requiring fine-tuning.
3. Demonstrates effective dynamic-static disentanglement in complex scenes.

Abstract

Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological…
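The abstract attributes the degradation to camera and object motion being mixed inside global attention, and says component (1) uses a dynamic mask to shield pose estimation from that interference. The sketch below illustrates one generic way such shielding can work: dropping tokens flagged as dynamic from the attention keys. It is not the authors' implementation. The function name `masked_attention`, the single-head layout, and the assumption that a boolean per-token `dynamic_mask` is already available are all hypothetical choices made for illustration.

```python
import numpy as np


def masked_attention(q, k, v, dynamic_mask):
    """Single-head scaled dot-product attention that ignores dynamic keys.

    q, k, v: float arrays of shape (n_tokens, dim).
    dynamic_mask: bool array of shape (n_tokens,), True where a token is
        assumed to lie on a moving object (hypothetical input).
    Returns (output, weights).
    """
    dynamic_mask = np.asarray(dynamic_mask, dtype=bool)
    if dynamic_mask.all():
        raise ValueError("at least one static token is required")

    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Dynamic keys get -inf, so they receive exactly zero weight after softmax.
    scores = np.where(dynamic_mask[None, :], -np.inf, scores)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 4 tokens, of which tokens 1 and 3 are flagged as dynamic.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
mask = np.array([False, True, False, True])
out, weights = masked_attention(q, k, v, mask)
```

With this masking, every token still produces an output, but that output is built only from static tokens, so changing the values of dynamic tokens leaves the result unchanged. Whether the paper masks keys, queries, or both, and at which layers, is not stated in the abstract.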
