TL;DR
This paper introduces GPA-VGGT, a self-supervised learning framework for VGGT models that improves large-scale camera localization without requiring labeled data, by leveraging geometric and physical constraints.
Contribution
It extends VGGT with a self-supervised training method using sequence-wise geometric constraints and physical photometric consistency, enabling effective large-scale localization.
Findings
Model converges within hundreds of iterations.
Achieves significant improvements in large-scale localization.
Effectively captures multi-view geometry through joint optimization.
Abstract
Transformer-based general visual geometry frameworks, and Visual Geometry Grounded Transformer (VGGT) models in particular, have shown promising performance in camera pose estimation, 3D reconstruction, and 3D scene understanding. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We…
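The sequence-wise projection idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes a pinhole camera with intrinsics `K`, per-frame depth maps and camera-to-world poses (which a VGGT-style model would predict), grayscale images, nearest-neighbor sampling, and a plain L1 photometric error. The function names `warp_source_to_target` and `sequence_photometric_loss` are hypothetical, and a real training loop would use a differentiable sampler in an autodiff framework.

```python
import numpy as np

def warp_source_to_target(src_img, tgt_depth, K, T_tgt_to_src):
    """Synthesize the target view by sampling src_img at reprojected pixels.

    src_img, tgt_depth: (H, W) arrays. K: (3, 3) intrinsics (assumed pinhole).
    T_tgt_to_src: (4, 4) transform from target camera frame to source camera frame.
    Returns the warped image and a boolean mask of pixels with a valid sample.
    """
    H, W = tgt_depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates of the target frame, shape (3, H*W).
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)
    # Back-project target pixels to 3D using the target depth.
    pts_tgt = (np.linalg.inv(K) @ pix) * tgt_depth.ravel()
    # Express the 3D points in the source camera frame.
    pts_src = T_tgt_to_src[:3, :3] @ pts_tgt + T_tgt_to_src[:3, 3:4]
    z = pts_src[2]
    valid = z > 1e-6  # keep only points in front of the source camera
    proj = K @ pts_src
    us = np.full(H * W, -1.0)
    vs = np.full(H * W, -1.0)
    us[valid] = proj[0, valid] / z[valid]
    vs[valid] = proj[1, valid] / z[valid]
    # Nearest-neighbor lookup (a simplification of bilinear sampling).
    ui = np.rint(us).astype(int)
    vi = np.rint(vs).astype(int)
    valid &= (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    warped = np.zeros(H * W)
    warped[valid] = src_img[vi[valid], ui[valid]]
    return warped.reshape(H, W), valid.reshape(H, W)

def sequence_photometric_loss(images, depths, poses, K, pairs):
    """Mean L1 photometric error over (source, target) index pairs in a sequence.

    poses[i] is assumed to be a 4x4 camera-to-world matrix for frame i.
    """
    losses = []
    for s, t in pairs:
        # Target camera -> world -> source camera.
        T_tgt_to_src = np.linalg.inv(poses[s]) @ poses[t]
        warped, valid = warp_source_to_target(images[s], depths[t], K, T_tgt_to_src)
        if valid.any():
            losses.append(np.abs(warped - images[t])[valid].mean())
    return float(np.mean(losses)) if losses else 0.0
```

Passing several `(source, target)` pairs drawn from the same sequence, rather than a single adjacent pair, is what makes the constraint sequence-wise in this sketch. How the paper samples pairs and weights the terms is not stated in the truncated abstract.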
