Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer

Tianchen Deng; Wenhua Wu; Kunzhen Wu; Guangming Wang; Siting Zhu; Shenghai Yuan; Xun Chen; Guole Shen; Zhe Liu; Hesheng Wang

arXiv:2512.21883 · cs.CV · December 29, 2025

Open Access

TL;DR

Reloc-VGGT introduces a multi-view spatial integration framework with a geometry-grounded transformer for real-time, robust visual re-localization in diverse environments, outperforming traditional pair-wise methods.

Contribution

The authors present Reloc-VGGT as the first visual localization framework to perform multi-view spatial integration through early fusion, using a geometry-grounded transformer.

Findings

01 Achieves high accuracy in diverse environments.

02 Operates in real-time with reduced computational cost.

03 Demonstrates strong generalization across datasets.

Abstract

Visual localization has traditionally been formulated as a pair-wise pose regression problem. Existing approaches mainly estimate relative poses between two images and employ a late-fusion strategy to obtain absolute pose estimates. However, late motion averaging is often insufficient for effectively integrating spatial information, and its accuracy degrades in complex environments. In this paper, we present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism, enabling robust operation in both structured and unstructured environments. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry, and we introduce a pose tokenizer and projection module to more effectively exploit spatial relationships from multiple database views. Furthermore, we propose a novel sparse mask attention strategy that…
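
For context, the pair-wise late-fusion baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration, not the authors' code or any specific prior method. It assumes each database view has a known world-from-database pose and a predicted database-from-query relative pose, both stored as a unit quaternion `(w, x, y, z)` plus a translation. The function names (`compose`, `late_fusion`) and the sign-aligned quaternion mean are illustrative choices. Each pair yields one absolute pose candidate, and the candidates are averaged only at the end, so no spatial information is shared across views before that step.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), conj)
    return (rx, ry, rz)

def compose(pose_ab, pose_bc):
    """Chain rigid transforms: a-from-b followed by b-from-c gives a-from-c."""
    q_ab, t_ab = pose_ab
    q_bc, t_bc = pose_bc
    rotated = quat_rotate(q_ab, t_bc)
    return quat_mul(q_ab, q_bc), tuple(p + r for p, r in zip(t_ab, rotated))

def late_fusion(db_poses, rel_poses):
    """Average per-pair absolute pose candidates (assumed late-fusion baseline).

    db_poses:  world-from-database poses, one per database view.
    rel_poses: predicted database-from-query poses, one per pair.
    The quaternion mean is an approximation that suits tightly clustered
    rotations; it is a simple stand-in for a full motion-averaging solver.
    """
    candidates = [compose(d, r) for d, r in zip(db_poses, rel_poses)]
    ref_q = candidates[0][0]
    q_sum = [0.0] * 4
    t_sum = [0.0] * 3
    for q, t in candidates:
        # q and -q encode the same rotation, so align signs before summing.
        sign = 1.0 if sum(a * b for a, b in zip(q, ref_q)) >= 0 else -1.0
        q_sum = [s + sign * c for s, c in zip(q_sum, q)]
        t_sum = [s + c for s, c in zip(t_sum, t)]
    norm = math.sqrt(sum(c * c for c in q_sum))
    n = len(candidates)
    return tuple(c / norm for c in q_sum), tuple(c / n for c in t_sum)
```

An early-fusion design, by contrast, would let all database views interact inside the network before any pose is produced. That part depends on details the truncated abstract does not give, so it is not sketched here.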

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging