Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

Zonglin Lyu; Juexiao Zhang; Mingxuan Lu; Yiming Li; Chen Feng

arXiv:2406.17520 · cs.CV · June 26, 2024 · 1 citation

TL;DR

This paper combines off-the-shelf vision foundation models with multimodal large language models for visual place recognition in robotics: vision-based retrieval proposes candidate locations, and language-based reasoning selects among them, without VPR-specific supervised training.

Contribution

The work presents a new multimodal LLM-based framework for visual place recognition that integrates visual features and language reasoning, avoiding VPR-specific supervised training.

Findings

01 Effective place recognition on three datasets

02 No VPR-specific supervised training needed

03 Combines visual features with language reasoning

Abstract

Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our results on…
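
The two-stage design described in the abstract (retrieve candidates by visual similarity, then have an MLLM compare the observation with each candidate and pick one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, the default of three candidates, and the `describe_diff` and `choose_best` callables (stand-ins for MLLM prompts) are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(
    query_feat: Sequence[float],
    db_feats: Sequence[Sequence[float]],
    k: int = 3,
) -> List[int]:
    """Stage 1: return indices of the k database features most similar to the query.

    The features are assumed to come from an off-the-shelf vision
    foundation model; how they are extracted is outside this sketch.
    """
    ranked = sorted(
        range(len(db_feats)),
        key=lambda i: cosine(query_feat, db_feats[i]),
        reverse=True,
    )
    return ranked[:k]


def recognize_place(
    query_feat: Sequence[float],
    db_feats: Sequence[Sequence[float]],
    describe_diff: Callable[[int], str],
    choose_best: Callable[[Dict[int, str]], int],
    k: int = 3,
) -> int:
    """Stage 2: describe query-vs-candidate differences pairwise, then decide.

    describe_diff(i) is a hypothetical wrapper around an MLLM prompt that
    compares the current observation with candidate i; choose_best is a
    hypothetical wrapper that reasons over all descriptions and returns
    the index of the selected candidate.
    """
    candidates = retrieve_candidates(query_feat, db_feats, k)
    descriptions = {i: describe_diff(i) for i in candidates}
    return choose_best(descriptions)
```

In use, `describe_diff` and `choose_best` would wrap calls to whichever MLLM is available; passing them in as callables keeps the retrieval stage testable without a model.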

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Semantic Web and Ontologies