D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Zheyuan Zhang; Jiwei Zhang; Boyu Zhou; Linzhimeng Duan; Hong Chen

arXiv:2511.12528 · cs.CV · December 30, 2025

TL;DR

This paper introduces D$^{2}$-VPR, a lightweight, knowledge-distilled visual foundation model for visual place recognition that maintains high accuracy while significantly reducing model size and computational requirements.

Contribution

It proposes a novel framework combining knowledge distillation, a Distillation Recovery Module, and a Top-Down-attention-based Deformable Aggregator for efficient VPR.
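As a rough illustration of the knowledge-distillation ingredient, the sketch below computes a generic feature-matching loss: the mean squared error between a student's token features and a frozen teacher's. This is a common textbook formulation and an assumption here, not the loss used in the paper; the function name `feature_distillation_loss` and the toy feature values are hypothetical.

```python
def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher token features.

    Both arguments are lists of tokens, each token a list of floats.
    Generic feature-matching distillation (an assumption, not the
    paper's actual objective).
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of tokens")
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_feats, teacher_feats):
        if len(s_tok) != len(t_tok):
            raise ValueError("token feature dimensions must match")
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("feature lists must not be empty")
    return total / count


# Toy example: two tokens with two feature channels each (made-up values).
teacher = [[1.0, 2.0], [3.0, 4.0]]
student = [[1.0, 2.5], [2.0, 4.0]]
loss = feature_distillation_loss(student, teacher)
```

Minimizing a loss of this kind pushes the smaller student network to reproduce the teacher's features, which is the general idea behind shrinking a foundation model while keeping its representation quality.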

Findings

01. Reduces model parameters by approximately 64.2%.

02. Achieves competitive performance with state-of-the-art methods.

03. Improves adaptability to irregular structures through deformable aggregation.

Abstract

Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2's exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose D$^{2}$-VPR, a Distillation- and Deformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we…
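The retrieval formulation in the abstract's first sentence can be sketched as a nearest-neighbor search over global image descriptors. The snippet below is a minimal illustration, assuming each image has already been encoded into a fixed-length descriptor and that cosine similarity is the matching score; the function names, the three-dimensional descriptors, and the coordinates are all hypothetical and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve_location(query_desc, ref_descs, ref_geotags):
    """Return the geotag of the reference most similar to the query.

    Similarity is the cosine between L2-normalized descriptors. This is a
    brute-force search; large databases would typically use an index.
    """
    q = l2_normalize(query_desc)
    best_idx, best_sim = -1, -math.inf
    for i, ref in enumerate(ref_descs):
        r = l2_normalize(ref)
        sim = sum(a * b for a, b in zip(q, r))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return ref_geotags[best_idx], best_sim


# Toy database: three reference descriptors with (latitude, longitude) tags.
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
tags = [(48.8584, 2.2945), (40.6892, -74.0445), (51.5007, -0.1246)]
location, similarity = retrieve_location([0.5, 0.9, 0.1], refs, tags)
```

In this toy case the query is closest to the third reference, so its geotag is returned as the estimated location. In a real system the descriptors would come from the backbone and aggregator, which is where model size and compute cost matter.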

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization