Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition

Tianyi Shang; Zhenyu Li; Pengjie Xu; Jinwei Qiao; Gang Chen; Zihan Ruan; Weijun Hu

arXiv:2502.14195 · cs.CV · March 10, 2025

TL;DR

This paper introduces Text4VPR, a multi-view text-vision registration method that recognizes places from textual descriptions alone. It reaches 57% top-1 accuracy on the Street360Loc dataset, demonstrating that locations can be identified solely from language descriptions.

Contribution

Text4VPR is the first approach to match textual descriptions against multi-view images for place recognition. It uses a frozen T5 language model for text embeddings and Sinkhorn-based cluster assignment to aggregate visual descriptors.

Findings

01

Achieved 57% top-1 accuracy on the Street360Loc dataset.

02

Demonstrated the feasibility of text-based place recognition.

03

Established a robust baseline for future research.

Abstract

Mobile robots necessitate advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision by proposing a multi-view (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, which is the first method that exclusively utilizes textual descriptions to match a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training…
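
The token-to-cluster assignment mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`sinkhorn_assign`, `aggregate`), the cosine-similarity scores, the uniform cluster mass, the random cluster centres, and all sizes are assumptions chosen for illustration. It shows how Sinkhorn normalisation with a temperature turns similarity scores into a soft assignment, which is then used to pool local tokens into one descriptor.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis (dims kept)."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn_assign(scores, temperature=0.1, n_iters=50):
    """Soft-assign N tokens to K clusters (hypothetical helper).

    scores: (N, K) token-to-cluster similarities.
    Returns an (N, K) matrix whose rows sum to 1; the column step
    pushes every cluster towards an equal share (N / K) of the tokens.
    """
    n_tokens, n_clusters = scores.shape
    log_p = scores / temperature  # lower temperature -> sharper assignment
    log_col_mass = np.log(n_tokens / n_clusters)
    for _ in range(n_iters):
        # Column step: rescale so each cluster carries mass N / K.
        log_p = log_p - _logsumexp(log_p, axis=0) + log_col_mass
        # Row step: rescale so each token carries mass 1.
        log_p = log_p - _logsumexp(log_p, axis=1)
    return np.exp(log_p)


def aggregate(tokens, centres, temperature=0.1):
    """Pool (N, D) local tokens into one L2-normalised (K * D,) descriptor."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = centres / np.linalg.norm(centres, axis=1, keepdims=True)
    assign = sinkhorn_assign(t @ c.T, temperature)  # (N, K) cosine scores
    clusters = assign.T @ tokens                    # (K, D) weighted sums
    desc = clusters.reshape(-1)
    return desc / np.linalg.norm(desc), assign


# Toy data: 196 local tokens of dimension 32, 8 clusters (all assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))
centres = rng.normal(size=(8, 32))
descriptor, assign = aggregate(tokens, centres)
print(descriptor.shape)
print(assign.sum(axis=1)[:3])
```

Working in the log domain keeps the iteration stable at small temperatures, where exponentiating `scores / temperature` directly could overflow.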

Code & Models

Repositories

nuozimiaowu/Text4VPR (PyTorch, official)

Taxonomy

Topics

Handwritten Text Recognition Techniques

Methods

Attention Is All You Need · Byte Pair Encoding · SentencePiece · Linear Layer · Inverse Square Root Schedule · Adafactor · Layer Normalization · Residual Connection · Dense Connections