Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era
Feng Lu, Tong Jin, Canming Ye, Yunpeng Liu, Xiangyuan Lan, Chun Yuan

TL;DR
This paper proposes implicit aggregation for visual place recognition: learnable aggregation tokens inside the transformer backbone take the place of a dedicated aggregator, and the method reports state-of-the-art results.
Contribution
The paper introduces learnable aggregation tokens within the transformer backbone, so the global descriptor for place recognition comes from the backbone itself rather than from a separate aggregation module.
Findings
Outperforms state-of-the-art methods on VPR datasets
Achieves higher efficiency in global descriptor computation
Ranks 1st on the MSLS challenge leaderboard
Abstract
Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly…
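The token mechanism described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract is truncated before it says how the tokens are processed and read out, so several parts are assumptions: the toy `self_attention` function (single head, identity projections) standing in for a real transformer block, the choice to concatenate and L2-normalize the output aggregation tokens into the descriptor, and all names and sizes (`implicit_aggregate`, `num_agg`, `dim`). The aggregation tokens are randomly initialized here, whereas in a real model they would be learned during training.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy stand-in for a transformer block (assumption): single-head
    self-attention with identity projections and a residual connection."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        mixed = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)]
        out.append([x + y for x, y in zip(q, mixed)])
    return out


def implicit_aggregate(patch_tokens, agg_tokens, num_blocks=2):
    """Prepend aggregation tokens to the patch tokens, run the remaining
    blocks on the joint sequence, and read the descriptor from the
    aggregation-token positions (readout scheme is an assumption)."""
    tokens = agg_tokens + patch_tokens  # aggregation tokens come first
    for _ in range(num_blocks):
        tokens = self_attention(tokens)
    agg_out = tokens[:len(agg_tokens)]
    flat = [v for tok in agg_out for v in tok]  # concatenate token outputs
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]  # L2-normalized global descriptor


rng = random.Random(0)
dim, num_patches, num_agg = 8, 16, 4
# Patch tokens as they would arrive from earlier backbone blocks (random here).
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
# Aggregation tokens: learnable parameters in a real model, random here.
agg_tokens = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_agg)]

descriptor = implicit_aggregate(patch_tokens, agg_tokens)
print(len(descriptor))  # num_agg * dim = 32
```

In this sketch the descriptor length is fixed by the number of aggregation tokens and the token dimension, and no separate aggregation module appears after the backbone.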
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Robotics and Sensor-Based Localization
