EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition
Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari

TL;DR
This paper introduces EffoVPR, a method that uses a foundation model such as DINOv2 for Visual Place Recognition (VPR). It outperforms previous zero-shot approaches and, in the supervised setting, achieves state-of-the-art results with robust, compact features.
Contribution
It demonstrates that features from self-attention layers can effectively re-rank and recognize places without fine-tuning, and introduces a single-stage global feature extraction approach for VPR.
Findings
Outperforms previous zero-shot VPR methods
Achieves state-of-the-art performance with 128D features
Demonstrates robustness across challenging conditions
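The 128D finding refers to compact global descriptors used to retrieve database images. As a minimal sketch of that retrieval step, assuming one global descriptor per image has already been extracted, the Python snippet below ranks database images by cosine similarity to the query. The function name `top_k_by_cosine`, the random vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def top_k_by_cosine(query_vec, db_vecs, k=5):
    """Return indices and scores of the k database vectors closest to the query.

    query_vec: (D,) global descriptor of the query image (hypothetical input).
    db_vecs:   (N, D) global descriptors of the geo-tagged database images.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-sims)[:k]
    return order, sims[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.normal(size=(100, 128))  # stand-in for 128-D descriptors
    query = database[42] + 0.01 * rng.normal(size=128)  # near-duplicate of entry 42
    indices, scores = top_k_by_cosine(query, database, k=3)
    print(indices, scores)
```

With descriptors this small, a brute-force matrix product is usually enough for a modest database; an approximate nearest-neighbor index would be a separate design decision.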
Abstract
The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also introduces results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with…
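The re-ranking stage can be illustrated with a short Python sketch. The reviews below note that local matching relies on mutual nearest neighbors; the snippet assumes local descriptors are already available for the query and for each shortlisted database image, and orders the shortlist by the number of mutual matches. The function names, the toy descriptors, and scoring by raw match count are assumptions for illustration; the paper's exact descriptor extraction and scoring may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def count_mutual_matches(query_desc, cand_desc):
    """Count mutual nearest-neighbor pairs between two sets of local descriptors.

    query_desc: (Nq, D) local descriptors of the query image.
    cand_desc:  (Nc, D) local descriptors of one candidate image.
    """
    sim = l2_normalize(query_desc) @ l2_normalize(cand_desc).T  # (Nq, Nc) cosine similarities
    nn_q_to_c = sim.argmax(axis=1)  # best candidate descriptor for each query descriptor
    nn_c_to_q = sim.argmax(axis=0)  # best query descriptor for each candidate descriptor
    query_idx = np.arange(sim.shape[0])
    # A pair is mutual when the candidate's best query match points back to the same query row.
    mutual = nn_c_to_q[nn_q_to_c] == query_idx
    return int(mutual.sum())


def rerank(query_desc, candidates):
    """Sort (db_id, descriptors) candidates by descending mutual-match count."""
    scored = [(count_mutual_matches(query_desc, desc), db_id) for db_id, desc in candidates]
    scored.sort(key=lambda item: -item[0])  # stable sort: ties keep the initial retrieval order
    return [db_id for _, db_id in scored]


if __name__ == "__main__":
    query = np.eye(4)                            # four distinct toy descriptors
    good = np.eye(4)                             # same place: every descriptor has a partner
    bad = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))  # unrelated image: one repeated descriptor
    print(rerank(query, [("bad", bad), ("good", good)]))
```

Because the sort is stable, candidates with equal match counts keep the order produced by the first-stage global retrieval.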
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed method is easy to follow. It leverages the foundation model’s feature representation capability to improve the effectiveness of image retrieval and re-ranking.
2. The proposed method outperforms some existing approaches on several VPR benchmarks.
1. Since the proposed method mainly leverages the strong representation power of a foundation model (DINOv2), the technical novelty of the proposed method in VPR is somewhat incremental.
2. Certain design choices, such as using the V_l matrix for keypoint descriptors, lack clear motivation and justification, particularly in explaining why such choices are effective.
3. Although the proposed method outperforms several existing methods, the improvement could come solely from using a stronger foun
1. The methods proposed in the paper are generally novel.
2. The experiments provided in the paper are sufficient. The authors provide the results of other SOTA methods on datasets such as SF-XL. As far as I know, completing such evaluation experiments requires a lot of time and effort.
3. The performance of the proposed method is really good. It achieves excellent results even with compact global features.
1. The methodological contribution of this paper is a little weak. Its training strategy follows the EigenPlaces work. The global feature is the class token (following a linear layer) of the ViT model. The local matching in the re-ranking process is based on mutual nearest-neighbor search, which also already exists. I think the authors should state that the previous SelaVPR work also does not require spatial verification in re-ranking. The biggest contribution of this work seems to be t
- S1: The paper is clear and easy to read.
- S2: Simple, yet effective and robust method that improves the performance over the DINOv2 baseline and other state-of-the-art methods, while reducing the required dimensionality of the used features.
- S3: The method finetunes DINOv2 to achieve better performance but is still effective even in a zero-shot setting.
- W1: Even though the results of the three proposed variants of the method (EffoVPR-ZS, EffoVPR-G and EffoVPR-R) are provided, the paper is missing their comparison in a single table, to directly see how large the effect of finetuning and re-ranking is. It would be good to see this comparison in a single table to understand (and compare) the direct benefits of the different components of the proposed approach.
- W2: The notation for Queries, Keys and Values and corresponding local features would
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
