LaVPR: Benchmarking Language and Vision for Place Recognition
Ofer Idan, Dan Badur, Yosi Keller, Yoli Shavit

TL;DR
LaVPR introduces a large-scale benchmark dataset with natural language descriptions for visual place recognition, demonstrating improved robustness and enabling language-based localization, especially in challenging environmental conditions.
Contribution
The paper presents LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 natural-language descriptions, and uses it to study two paradigms: multi-modal fusion for robustness and cross-modal retrieval for language-based localization.
Findings
Language descriptions yield consistent gains in visually degraded conditions, with the largest impact on smaller backbones.
Adding language lets compact models rival much larger vision-only architectures.
Baseline cross-modal retrieval outperforms standard contrastive methods.
Abstract
Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using…
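To make the cross-modal retrieval setting concrete, here is a minimal sketch of how language-based localization could be scored: each text description is embedded, database images are ranked by cosine similarity, and a query counts as a hit if a correct place appears in the top K. This is an illustration under stated assumptions, not the paper's baseline. The function names, the toy 2-D embeddings, and the choice of Recall@K as the metric are all hypothetical and do not come from the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_rank(query, database):
    """Return database indices sorted by descending cosine similarity to the query."""
    q = l2_normalize(query)
    scores = []
    for idx, item in enumerate(database):
        d = l2_normalize(item)
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    # Sort by similarity (high to low); break ties by index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]


def recall_at_k(text_embeddings, image_embeddings, ground_truth, k):
    """Fraction of text queries whose top-k retrieved images include a correct place."""
    hits = 0
    for query, positives in zip(text_embeddings, ground_truth):
        top_k = cosine_rank(query, image_embeddings)[:k]
        if any(i in positives for i in top_k):
            hits += 1
    return hits / len(text_embeddings)


# Toy data (hypothetical): three database images and two text queries,
# assumed to already live in a shared embedding space.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embeddings = [[0.9, 0.1], [0.1, 0.9]]
ground_truth = [{0}, {2}]  # indices of correct database images per query

print(recall_at_k(text_embeddings, image_embeddings, ground_truth, k=1))
print(recall_at_k(text_embeddings, image_embeddings, ground_truth, k=2))
```

In practice the embeddings would come from trained text and image encoders; the sketch only shows the ranking and scoring step that follows.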
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
