VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

Yu Hu; Chong Cheng; Sicheng Yu; Xiaoyang Guo; Hao Wang

arXiv:2511.19971 · cs.CV · November 26, 2025

Open Access

TL;DR

VGGT4D is a training-free framework that enhances 4D scene reconstruction by mining and amplifying dynamic cues from a 3D foundation model, enabling robust segmentation and pose estimation in dynamic scenes.

Contribution

VGGT4D introduces a method that mines dynamic cues from VGGT's global attention layers and integrates them into inference, eliminating the need for external priors or fine-tuning.

Findings

01. Outperforms existing methods in dynamic segmentation and reconstruction

02. Supports single-pass inference on sequences over 500 frames

03. Effectively disentangles static and dynamic scene elements

Abstract

Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors or heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via Gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven…
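
The idea of scoring tokens by Gram similarity and aggregating over a temporal window can be illustrated with a toy sketch. This is not the authors' implementation: the function name `gram_dynamic_score`, the `(T, N, D)` feature layout, the best-match rule, and the use of plain cosine similarity are all assumptions made for illustration, and the abstract does not say which attention features or layers the method actually uses.

```python
import numpy as np

def gram_dynamic_score(feats, window=2):
    """Toy dynamic-cue score (hypothetical, not the paper's method).

    feats: array of shape (T, N, D), i.e. T frames, N tokens, D channels.
    Returns an array of shape (T, N); larger values mean a token has no
    close match in the neighbouring frames of its temporal window.
    """
    T, N, _ = feats.shape
    # L2-normalise so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    f = feats / np.maximum(norms, 1e-8)

    scores = np.zeros((T, N))
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        neighbours = [s for s in range(lo, hi) if s != t]
        if not neighbours:
            continue  # a single-frame input carries no temporal evidence
        best = []
        for s in neighbours:
            gram = f[t] @ f[s].T            # (N, N) cross-frame Gram matrix
            best.append(gram.max(axis=1))   # best match for each token of frame t
        # Aggregate across the window, then turn similarity into a score.
        scores[t] = np.maximum(0.0, 1.0 - np.mean(best, axis=0))
    return scores

# Usage: six tokens repeat across frames, two tokens change every frame.
rng = np.random.default_rng(0)
feats = np.repeat(rng.normal(size=(1, 8, 16)), 5, axis=0)
feats[:, :2] = rng.normal(size=(5, 2, 16))
mask = gram_dynamic_score(feats) > 0.1      # the threshold is an arbitrary choice
```

In this toy setting the tokens that repeat across frames score close to zero, while the tokens that change score higher. A real pipeline would still need the boundary refinement that the abstract mentions.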

Taxonomy

Topics: Advanced Vision and Imaging · Robot Manipulation and Learning · 3D Shape Modeling and Analysis