A Faster, Lighter and Stronger Deep Learning-Based Approach for Place Recognition
Rui Huang, Ze Huang, Songzhi Su

TL;DR
This paper introduces a novel deep learning approach for visual place recognition that is faster, lighter, and more accurate, utilizing a new backbone network and a trainable feature matcher to outperform existing methods.
Contribution
The authors propose RepVGG-lite as a new backbone network and a trainable attention-based feature matcher, significantly reducing model size and inference time while improving accuracy in place recognition.
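To make the idea of an attention-based matcher concrete, here is a minimal, non-authoritative sketch in plain Python. It is not the authors' matcher: the learned projection weights are omitted, and the helper names (`add_position`, `cross_attention_score`), the position weighting, and the cosine-based scoring are all assumptions chosen for illustration. It only shows how attention can compare two sets of patch descriptors while taking patch location into account.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def add_position(desc, xy, weight=0.5):
    # Hypothetical encoding: append weighted, normalized patch coordinates
    # to the appearance descriptor so attention sees both cues.
    return list(desc) + [weight * xy[0], weight * xy[1]]


def cross_attention_score(query_feats, ref_feats):
    # Scaled dot-product attention from each query patch to all reference
    # patches, WITHOUT trainable projections (an assumption of this sketch).
    dim = len(query_feats[0])
    scale = 1.0 / math.sqrt(dim)
    total = 0.0
    for q in query_feats:
        weights = softmax([dot(q, r) * scale for r in ref_feats])
        attended = [sum(w * r[k] for w, r in zip(weights, ref_feats))
                    for k in range(dim)]
        denom = math.sqrt(dot(q, q)) * math.sqrt(dot(attended, attended))
        # Cosine similarity between the query patch and what it attended to.
        total += dot(q, attended) / denom if denom > 0 else 0.0
    return total / len(query_feats)


# Toy data: two patches per image, 2-D appearance plus (x, y) in [0, 1].
query = [add_position([1.0, 0.0], (0.0, 0.0)),
         add_position([0.0, 1.0], (1.0, 1.0))]
same_place = [add_position([1.0, 0.0], (0.0, 0.0)),
              add_position([0.0, 1.0], (1.0, 1.0))]
other_place = [add_position([-1.0, 0.0], (0.0, 0.0)),
               add_position([0.0, -1.0], (1.0, 1.0))]

same_score = cross_attention_score(query, same_place)
other_score = cross_attention_score(query, other_place)
```

In this toy setup the matching pair scores higher than the mismatched pair. A trainable matcher would additionally learn query, key, and value projections from data.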
Findings
14x fewer parameters than Patch-NetVLAD
6.8x lower FLOPs than Patch-NetVLAD
0.5% higher Recall@1 than Patch-NetVLAD
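For readers unfamiliar with the metric, Recall@N in place recognition is usually the fraction of queries for which at least one of the top N retrieved reference images lies close enough to the query's true location. The sketch below illustrates that definition. The function name, the 2-D positions, and the 25 m threshold are assumptions for illustration; this page does not state the evaluation protocol used in the paper.

```python
import math


def recall_at_n(retrieved, query_positions, ref_positions, n=1, threshold_m=25.0):
    """Fraction of queries with a correct match among the top-n results.

    retrieved[i] is the ranked list of reference indices for query i.
    A match counts as correct if it lies within threshold_m metres of the
    query position (threshold chosen here as an assumption).
    """
    hits = 0
    for q_idx, ranked in enumerate(retrieved):
        qx, qy = query_positions[q_idx]
        for r_idx in ranked[:n]:
            rx, ry = ref_positions[r_idx]
            if math.hypot(qx - rx, qy - ry) <= threshold_m:
                hits += 1
                break  # one correct match is enough for this query
    return hits / len(retrieved)


# Toy example: four queries, four reference images, positions in metres.
query_positions = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0), (300.0, 0.0)]
ref_positions = [(5.0, 0.0), (110.0, 0.0), (400.0, 0.0), (290.0, 0.0)]
retrieved = [[0, 1], [2, 1], [2, 3], [3, 0]]

r1 = recall_at_n(retrieved, query_positions, ref_positions, n=1)  # 0.5
r2 = recall_at_n(retrieved, query_positions, ref_positions, n=2)  # 0.75
```

Here two of the four queries are correct at rank 1, and a third becomes correct once the second-ranked result is allowed.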
Abstract
Visual Place Recognition is an essential component of systems for camera localization and loop closure detection, and it has attracted widespread interest in multiple domains such as computer vision, robotics and AR/VR. In this work, we propose a faster, lighter and stronger approach that produces models with fewer parameters and spends less time in the inference stage. We design RepVGG-lite as the backbone network of our architecture; it is more discriminative than other general-purpose networks in the Place Recognition task, and it is faster while achieving higher performance. In the feature extraction stage, we extract patch-level descriptors at only a single scale from the global descriptors. We then design a trainable feature matcher, based on the attention mechanism, that exploits both the spatial relationships of the features and their visual appearance. Comprehensive…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Automated Road and Building Extraction
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
