EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

Tong Jin; Feng Lu; Shuyu Hu; Chun Yuan; Yunpeng Liu

arXiv:2412.00784 · cs.CV · May 27, 2025

TL;DR

EDTformer introduces an efficient transformer decoder architecture for visual place recognition, leveraging deep feature aggregation and a novel backbone enhancement to improve accuracy and robustness over existing methods.

Contribution

The paper proposes a new decoder transformer architecture for VPR, utilizing deep feature decoding and a backbone enhancement technique called LoPA for improved global representations.

Findings

1. Outperforms single-stage VPR methods on multiple benchmarks.
2. Surpasses two-stage methods with re-ranking in accuracy and efficiency.
3. Effective use of deep features and backbone refinement improves robustness.
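
For context on the single-stage versus two-stage distinction in the findings: a single-stage method ranks database images by the similarity of their global descriptors alone, while a two-stage method additionally re-ranks the top candidates. The following is a minimal pure-Python sketch of the single-stage ranking step only. The function names, place IDs, and 3-D descriptors are hypothetical and chosen for illustration; none of it is taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve(query, database, top_k=5):
    """Rank database entries by cosine similarity to the query descriptor."""
    q = l2_normalize(query)
    scored = []
    for place_id, descriptor in database.items():
        d = l2_normalize(descriptor)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, place_id))
    # Highest similarity first; no re-ranking stage follows.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [place_id for _, place_id in scored[:top_k]]


# Toy geo-tagged database: hypothetical place IDs with made-up descriptors.
database = {
    "bridge": [0.9, 0.1, 0.0],
    "station": [0.0, 1.0, 0.2],
    "square": [0.5, 0.5, 0.5],
}
ranked = retrieve([1.0, 0.2, 0.0], database, top_k=2)
print(ranked)  # ['bridge', 'square']
```

In practice the descriptors would come from an aggregation model and the search would run over a large index rather than a Python dictionary.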

Abstract

Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches focus on aggregating the deep features extracted from a backbone using currently prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that its strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations.…
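
The aggregation scheme described in the abstract (stacked decoder blocks followed by two linear layers) might be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say what "simplified" means, so the block here is assumed to be cross-attention from learnable queries to backbone features plus a small MLP, and the two linear layers are assumed to reduce the channel and query dimensions in turn. All class names, parameter names, and default sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    """Assumed block: queries cross-attend to backbone features, then an MLP."""

    def __init__(self, dim, num_heads=8, mlp_ratio=2):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, queries, features):
        # queries: (B, Q, D), features: (B, N, D)
        attended, _ = self.cross_attn(self.norm_q(queries), features, features)
        queries = queries + attended  # residual connection
        return queries + self.mlp(self.norm_mlp(queries))


class DecoderAggregator(nn.Module):
    """Hypothetical aggregator: stacked blocks, then two linear layers."""

    def __init__(self, dim=768, num_queries=64, depth=2, num_heads=8,
                 out_channels=128, out_tokens=32):
        super().__init__()
        # Learnable queries shared across all images (an assumption).
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(
            [DecoderBlock(dim, num_heads) for _ in range(depth)]
        )
        self.channel_proj = nn.Linear(dim, out_channels)       # linear layer 1
        self.token_proj = nn.Linear(num_queries, out_tokens)   # linear layer 2

    def forward(self, features):
        # features: (B, N, D) patch features from a backbone
        q = self.queries.expand(features.shape[0], -1, -1)
        for block in self.blocks:
            q = block(q, features)
        q = self.channel_proj(q)                  # (B, Q, out_channels)
        q = self.token_proj(q.transpose(1, 2))    # (B, out_channels, out_tokens)
        return F.normalize(q.flatten(1), dim=-1)  # unit-length global descriptor


torch.manual_seed(0)
model = DecoderAggregator(dim=64, num_queries=16, depth=2, num_heads=4,
                          out_channels=8, out_tokens=4)
descriptors = model(torch.randn(3, 50, 64))  # 3 images, 50 patch tokens each
print(descriptors.shape)  # torch.Size([3, 32])
```

Each image yields one L2-normalized vector whose length is `out_channels * out_tokens`, which can then be compared against database descriptors by dot product. The official repository listed under Code & Models is the authority on the actual block design and dimensions.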

Code & Models

Repositories

tong-jin01/edtformer (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Gaze Tracking and Assistive Technology · Image Processing Techniques and Applications

Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Adam · Softmax · Label Smoothing · Dropout · Sparse Evolutionary Training · Dense Connections · Layer Normalization