Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

Feng Lu; Tong Jin; Canming Ye; Yunpeng Liu; Xiangyuan Lan; Chun Yuan

arXiv:2511.06024 · cs.CV · January 19, 2026

TL;DR

This paper proposes a novel approach for visual place recognition using implicit aggregation with learnable tokens in transformers, eliminating the need for dedicated aggregators and achieving state-of-the-art results.

Contribution

The paper introduces learnable aggregation tokens within the transformer backbone, so that the global descriptor for place recognition is obtained from the backbone itself rather than from a separate aggregator.

Findings

1. Outperforms state-of-the-art methods on VPR datasets
2. Achieves higher efficiency in global descriptor computation
3. Ranks 1st on the MSLS challenge leaderboard

Abstract

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly…
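The token flow described in the abstract can be illustrated with a minimal sketch. Everything here is a toy stand-in rather than the authors' implementation: `self_attention` uses a single head with identity projections, the function name `implicit_aggregation` and the parameters `depth` and `insert_at` are hypothetical, and the readout step (keeping only the aggregation tokens, concatenating them, and L2-normalizing) is an assumption, since the abstract above is cut off before it describes how the descriptor is formed.

```python
import math
import random


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections and a
    residual connection. A real transformer block also has learned
    projections, multiple heads, layer norm, and an MLP."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * v[d] for e, v in zip(exps, tokens)) for d in range(dim)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out


def implicit_aggregation(patch_tokens, agg_tokens, depth, insert_at):
    """Prepend aggregation tokens before block `insert_at`, then let all
    tokens pass through the remaining blocks together."""
    if not 0 <= insert_at < depth:
        raise ValueError("insert_at must index one of the blocks")
    x = list(patch_tokens)
    for block in range(depth):
        if block == insert_at:
            x = list(agg_tokens) + x  # aggregation tokens go first
        x = self_attention(x)
    # Assumed readout: keep only the aggregation tokens, flatten, L2-normalize.
    flat = [v for token in x[:len(agg_tokens)] for v in token]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat]


random.seed(0)
dim, num_patches, num_agg = 8, 16, 4
# Random values stand in for backbone patch features and learned tokens.
patch_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
agg_tokens = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_agg)]

descriptor = implicit_aggregation(patch_tokens, agg_tokens, depth=3, insert_at=1)
print(len(descriptor))  # 32, i.e. num_agg * dim
```

In this sketch the descriptor length is fixed by the number of aggregation tokens and the token dimension, not by the number of patches, and no module sits after the last block. That is the sense in which the aggregation is implicit.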

Code & Models

Model (Hugging Face): fenglu96/ImAge4VPR

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Robotics and Sensor-Based Localization