LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition
Teng Wang, Lingquan Meng, Lei Cheng, Changyin Sun

TL;DR
This paper introduces a multi-modal approach combining image data and text descriptions using large vision-language models to improve visual place recognition, especially under challenging environmental conditions.
Contribution
It proposes a novel multi-modal VPR framework that fuses visual and textual features with adaptive attention mechanisms, outperforming existing methods.
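As a rough illustration of what attention-based fusion of visual and textual features can look like, the sketch below weights a set of text-token features by their similarity to an image feature, pools them, and adds the result to the image feature. This is a generic scaled dot-product attention pattern, not the paper's architecture: the function names (`fuse`, `softmax`, `l2_normalize`), the residual-sum combination, and the toy dimensions are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(image_feat, text_feats):
    """Hypothetical fusion step: the image feature acts as the query,
    the text-token features act as keys and values."""
    d = len(image_feat)
    # Scaled dot-product scores between the image query and each text token.
    scores = [dot(image_feat, t) / math.sqrt(d) for t in text_feats]
    weights = softmax(scores)
    # Attention-weighted average of the text-token features.
    text_ctx = [sum(w * t[i] for w, t in zip(weights, text_feats)) for i in range(d)]
    # Assumed combination rule: residual sum followed by L2 normalization.
    fused = l2_normalize([v + c for v, c in zip(image_feat, text_ctx)])
    return fused, weights

# Toy example: one 4-d image feature and two 4-d text-token features.
image_feat = [1.0, 0.0, 0.0, 0.0]
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused, weights = fuse(image_feat, text_feats)
```

In this toy input the first text token is aligned with the image feature, so it receives the larger attention weight. A learned model would add projection matrices for queries, keys, and values, which are omitted here.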
Findings
Outperforms state-of-the-art methods significantly.
Uses smaller feature descriptors for efficient retrieval.
Enhances robustness against environmental variations.
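To make the retrieval finding concrete, the following minimal sketch shows how place recognition is commonly cast as nearest-neighbor search over global descriptors, which is why smaller descriptors reduce storage and comparison cost. It assumes cosine similarity and a brute-force scan; the function names (`cosine`, `retrieve`) and the 2-d toy descriptors are hypothetical and do not come from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(query, database, top_k=1):
    """Return the indices of the top_k database descriptors most similar to the query."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine(query, database[i]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy database of three reference-place descriptors and one query descriptor.
database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]
top_matches = retrieve(query, database, top_k=2)
```

Each comparison costs time proportional to the descriptor length, so halving the descriptor size roughly halves both the memory footprint of the database and the work per query in a brute-force scan like this one.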
Abstract
Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods to transform deep features into robust and compact global representations. Unfortunately, satisfactory results cannot be achieved under challenging conditions. We start from a new perspective and attempt to build discriminative global representations by fusing image data and text descriptions of the visual scene. The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible means of generating text descriptions of images; (2) The text descriptions, which provide high-level scene understanding, show strong robustness against environment…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need
