Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition
Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, Weijun Hu

TL;DR
This paper introduces Text4VPR, a novel multi-view text-vision registration method that uses textual descriptions for place recognition, achieving high accuracy and demonstrating the feasibility of location identification solely from language descriptions.
Contribution
It is the first approach to match textual descriptions with multi-view images for place recognition, utilizing a frozen language model and novel alignment techniques.
Findings
Achieved 57% top-1 accuracy on the Street360Loc dataset.
Demonstrated the feasibility of text-based place recognition.
Established a robust baseline for future research.
Abstract
Mobile robots require advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision by proposing a multi-view (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, which is the first method that exclusively utilizes textual descriptions to match a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training…
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · SentencePiece · Linear Layer · Inverse Square Root Schedule · Adafactor · Layer Normalization · Residual Connection · Dense Connections
