DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition

Haiyang Jiang; Songhao Piao; Chao Gao; Lei Yu; Liguo Chen

arXiv:2507.18444·cs.CV·July 25, 2025

Open Access

TL;DR

DSFormer introduces a dual-scale Transformer framework with a novel clustering strategy to improve visual place recognition robustness and efficiency under environmental and viewpoint variations.

Contribution

The paper presents a dual-scale Transformer with cross-learning and a block clustering strategy, enhancing feature representation and data organization for VPR.

Findings

01 Achieves state-of-the-art performance on benchmark datasets.

02 Reduces training data volume by approximately 30%.

03 Outperforms existing methods such as DELG and R2Former in accuracy and efficiency.

Abstract

Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing…
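
The cross-learning idea in the abstract (self-attention within each scale, then one shared cross-attention used in both directions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The single-head attention, the residual connections, the token counts, the common embedding size `dim`, and the names `attention`, `dual_scale_block`, and `init_params` are all assumptions made for illustration. The sketch also assumes both CNN feature maps have already been flattened into tokens and projected to the same dimension.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `queries` attend to `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_scale_block(fine, coarse, params):
    """Hypothetical dual-scale block (illustrative, not the paper's exact design)."""
    # Self-attention within each scale, with separate weights per scale.
    fine = fine + attention(fine, fine, *params["self_fine"])
    coarse = coarse + attention(coarse, coarse, *params["self_coarse"])
    # Shared cross-attention: the same weights serve both directions,
    # so information flows fine -> coarse and coarse -> fine.
    fine_out = fine + attention(fine, coarse, *params["cross"])
    coarse_out = coarse + attention(coarse, fine, *params["cross"])
    return fine_out, coarse_out


def init_params(dim, rng):
    """Random (untrained) projection weights; three matrices per attention."""
    def triple():
        return tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    return {"self_fine": triple(), "self_coarse": triple(), "cross": triple()}


rng = np.random.default_rng(0)
dim = 16                                      # assumed common embedding size
fine = rng.standard_normal((14 * 14, dim))    # assumed tokens from the second-to-last CNN layer
coarse = rng.standard_normal((7 * 7, dim))    # assumed tokens from the last CNN layer
params = init_params(dim, rng)

fine_out, coarse_out = dual_scale_block(fine, coarse, params)
print(fine_out.shape, coarse_out.shape)       # (196, 16) (49, 16)
```

Each scale keeps its own token count, so the two outputs can later be pooled or aggregated separately. Reusing one set of cross-attention weights for both directions is one plausible reading of "shared cross-attention"; the paper itself should be consulted for the actual layer layout.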

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging