SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

Jungho Lee; Minhyeok Lee; Sunghun Yang; Minseok Kang; Sangyoun Lee

arXiv:2511.18290·cs.CV·November 25, 2025

TL;DR

SwiftVGGT is a training-free, scalable 3D reconstruction method that significantly reduces inference time while maintaining high quality in large-scale scenes. It performs loop closure without an external Visual Place Recognition (VPR) model and uses a point sampling method to align neighboring chunks.

Contribution

The paper introduces SwiftVGGT, an approach that eliminates the need for external models and iteratively reweighted least squares (IRLS) optimization, enabling fast, high-quality large-scale 3D reconstruction.

Findings

1. Achieves state-of-the-art reconstruction quality.
2. Requires only 33% of the inference time of recent methods.
3. Effectively maintains global consistency in kilometer-scale environments.

Abstract

3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value…
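
As background for the chunk alignment mentioned in the abstract, the following is a minimal sketch of a closed-form Sim(3) (scale, rotation, translation) fit between two sets of corresponding 3D points using one SVD, following the standard Umeyama formulation. It is not taken from the paper: the function name `umeyama_sim3`, the assumption that point correspondences are already given, and the absence of any point sampling or weighting are all illustrative assumptions.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares fit of dst ~ s * R @ src + t (Umeyama).

    Hypothetical helper for illustration; src and dst are (N, 3) arrays
    of corresponding points, assumed to be already matched.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets on their centroids.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered sets, then a single SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n          # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_src    # optimal uniform scale
    t = mu_dst - s * R @ mu_src               # optimal translation
    return s, R, t
```

Given overlapping points from two neighboring chunks, `s, R, t = umeyama_sim3(points_b, points_a)` would return the transform that maps chunk B into chunk A's frame via `s * points_b @ R.T + t`. A plain least-squares fit like this is sensitive to outliers, which is why the choice of which points to feed in matters.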

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization