UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer
Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang

TL;DR
UniPR-3D introduces a novel multi-view visual place recognition architecture that leverages 3D and 2D features through a geometry-grounded transformer, significantly improving generalization and state-of-the-art performance.
Contribution
It is the first VPR method to effectively integrate multi-view 3D representations with dedicated feature aggregation modules for enhanced recognition.
Findings
Sets new state-of-the-art performance on VPR benchmarks.
Outperforms single- and multi-view baselines.
Demonstrates strong generalization across diverse environments.
Abstract
Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
