SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes
Jungho Lee, Minhyeok Lee, Sunghun Yang, Minseok Kang, Sangyoun Lee

TL;DR
SwiftVGGT is a training-free, scalable 3D reconstruction method that significantly reduces inference time while maintaining high quality in large-scale scenes. It performs loop closure without an external Visual Place Recognition (VPR) model and aligns neighboring chunks using a simple point sampling method.
Contribution
The paper introduces SwiftVGGT, an approach that eliminates the need for external models and iteratively reweighted least squares (IRLS) optimization, enabling fast, high-quality large-scale 3D reconstruction.
Findings
Achieves state-of-the-art reconstruction quality.
Requires only 33% of the inference time of recent methods.
Effectively maintains global consistency in kilometer-scale environments.
Abstract
3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference times. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value…
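The abstract mentions aligning neighboring chunks with a Sim(3) transform obtained from a singular value decomposition. The sketch below illustrates the general idea using the standard closed-form (Umeyama-style) similarity alignment between two sets of corresponding 3D points. It is not the authors' implementation: the function name `estimate_sim3`, the synthetic data, and the assumption that point correspondences between chunks are already given are all hypothetical, and the paper's point sampling step is not reproduced here.

```python
import numpy as np


def estimate_sim3(src, dst):
    """Closed-form similarity alignment (hypothetical helper, not from the paper).

    Finds scale s, rotation R, translation t minimizing
    sum_i || dst_i - (s * R @ src_i + t) ||^2 for corresponding (N, 3) points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # 3x3 cross-covariance between target and source.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


# Usage on synthetic data: recover a known similarity transform.
rng = np.random.default_rng(0)
points_a = rng.normal(size=(50, 3))  # stand-in for points in one chunk
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
s_true, t_true = 2.5, np.array([1.0, -2.0, 0.5])
points_b = s_true * points_a @ R_true.T + t_true  # same points in the other chunk

s_est, R_est, t_est = estimate_sim3(points_a, points_b)
```

With noise-free correspondences, as in this synthetic case, the estimate matches the true transform up to floating-point error. On real overlapping chunks the result would depend on which points are sampled, which is the part the paper addresses.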
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
