VGGT-$\Omega$
Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Sch\"onberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht

TL;DR
VGGT-$\Omega$ introduces architectural and training innovations that significantly enhance 3D scene reconstruction accuracy, efficiency, and scalability, enabling effective learning from vast unlabeled video data.
Contribution
The paper presents VGGT-$\Omega$, a scalable, efficient model with novel architectural modifications and training protocols for improved static and dynamic scene reconstruction.
Findings
VGGT-$\Omega$ improves camera estimation accuracy on Sintel by 77%.
The model uses only 30% of the GPU memory of the previous model, enabling larger-scale training.
Registers and register attention enhance scene understanding and model efficiency.
Abstract
Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$\Omega$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene…
