EDTformer: An Efficient Decoder Transformer for Visual Place Recognition
Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu

TL;DR
EDTformer introduces an efficient transformer decoder architecture for visual place recognition, leveraging deep feature aggregation and a novel backbone enhancement to improve accuracy and robustness over existing methods.
Contribution
The paper proposes a new decoder transformer architecture for VPR, utilizing deep feature decoding and a backbone enhancement technique called LoPA for improved global representations.
Findings
Outperforms single-stage VPR methods on multiple benchmarks.
Surpasses two-stage methods with re-ranking in accuracy and efficiency.
Effective use of deep features and backbone refinement improves robustness.
Abstract
Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches focus on aggregating deep features extracted from a backbone using currently prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that its strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations.…
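The retrieval step described in the abstract (matching a query image's global representation against a geo-tagged database) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `l2_normalize` and `retrieve`, the use of cosine similarity, the 3-dimensional toy descriptors, and the coordinates are all assumptions made for illustration. In practice the descriptors would come from an aggregation model such as EDTformer.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve(query, database, top_k=1):
    """Rank database entries by cosine similarity to the query descriptor.

    `database` is a list of (descriptor, geotag) pairs. Returns the top_k
    entries as (similarity, index, geotag) tuples, most similar first.
    """
    q = l2_normalize(query)
    scored = []
    for idx, (desc, geotag) in enumerate(database):
        d = l2_normalize(desc)
        sim = sum(a * b for a, b in zip(q, d))  # cosine similarity
        scored.append((sim, idx, geotag))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical): three database places with made-up descriptors
# and (latitude, longitude) geotags.
database = [
    ([1.0, 0.0, 0.0], (48.8584, 2.2945)),
    ([0.0, 1.0, 0.0], (40.6892, -74.0445)),
    ([0.0, 0.0, 1.0], (35.6586, 139.7454)),
]
query = [0.1, 0.9, 0.2]

similarity, index, geotag = retrieve(query, database, top_k=1)[0]
print(index, geotag)
```

The geotag of the best match serves as the estimated location of the query. A real system would typically replace the linear scan with an approximate nearest-neighbor index once the database grows large.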
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Gaze Tracking and Assistive Technology · Image Processing Techniques and Applications
Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Adam · Softmax · Label Smoothing · Dropout · Sparse Evolutionary Training · Dense Connections · Layer Normalization
