LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition

Teng Wang; Lingquan Meng; Lei Cheng; Changyin Sun

arXiv:2407.06730·cs.CV·July 10, 2024

Open Access

TL;DR

This paper introduces a multi-modal approach combining image data and text descriptions using large vision-language models to improve visual place recognition, especially under challenging environmental conditions.

Contribution

It proposes a novel multi-modal VPR framework that fuses visual and textual features with adaptive attention mechanisms, outperforming existing methods.
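As a rough illustration of what attention-based fusion of visual and textual features can look like, the sketch below lets each visual token attend over the text tokens, adds the attended text back as a residual, and pools the result into one L2-normalized global descriptor. This is a generic cross-attention sketch under stated assumptions, not the architecture from the paper: the function names (`softmax`, `dot`, `fuse`), the unlearned scaled dot-product weights, the residual sum, and the mean pooling are all hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(visual_tokens, text_tokens):
    """Fuse visual and text tokens into one global descriptor.

    Hypothetical sketch: no learned projections, which a real model would have.
    visual_tokens: list of d-dimensional vectors (e.g. image patch features)
    text_tokens:   list of d-dimensional vectors (e.g. description features)
    """
    d = len(visual_tokens[0])
    fused = []
    for v in visual_tokens:
        # Scaled dot-product attention of one visual token over all text tokens.
        weights = softmax([dot(v, t) / math.sqrt(d) for t in text_tokens])
        attended = [
            sum(w * t[i] for w, t in zip(weights, text_tokens)) for i in range(d)
        ]
        # Residual connection: keep the visual signal, add the text context.
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    # Mean-pool the fused tokens, then L2-normalize for cosine retrieval.
    pooled = [sum(row[i] for row in fused) / len(fused) for i in range(d)]
    norm = math.sqrt(dot(pooled, pooled))
    return [p / norm for p in pooled] if norm > 0 else pooled


descriptor = fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

The attention weights here depend on the input, which is one simple sense in which the fusion is adaptive; a trained model would additionally learn query, key, and value projections.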

Findings

01 Outperforms state-of-the-art methods significantly.

02 Uses smaller feature descriptors for efficient retrieval.

03 Enhances robustness against environmental variations.
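To make the link between descriptor size and retrieval cost concrete, the sketch below shows the usual way global descriptors are used in place recognition: each database image is represented by one vector, and a query is matched by cosine similarity. The work per comparison grows with the descriptor dimension, so a smaller descriptor means cheaper search and less storage. The helper names (`l2_normalize`, `retrieve`) and the brute-force search are assumptions for illustration; the paper's actual retrieval setup is not described in this summary.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve(query, database, top_k=1):
    """Return indices of the top_k database descriptors most similar to query.

    Brute-force cosine similarity; cost is O(len(database) * dimension).
    """
    q = l2_normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, idx))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


matches = retrieve([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]], top_k=2)
```

In this toy example the second database entry points almost exactly along the query direction, so it ranks first, followed by the third entry.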

Abstract

Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods to transform deep features into robust and compact global representations. Unfortunately, satisfactory results cannot be achieved under challenging conditions. We start from a new perspective and attempt to build discriminative global representations by fusing image data and text descriptions of the visual scene. The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible way to generate text descriptions of images; (2) The text descriptions, which provide high-level scene understanding, show strong robustness against environment…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need