FM-Loc: Using Foundation Models for Improved Vision-based Localization
Reihaneh Mirjalili, Michael Krawez, Wolfram Burgard

TL;DR
FM-Loc is a vision-based localization method that uses the foundation models CLIP and GPT-3 to build semantic image descriptors. The descriptors are robust to environmental changes, and the method requires no training or fine-tuning.
Contribution
The paper presents a new localization approach using foundation models for semantic scene understanding, eliminating the need for training or fine-tuning.
Findings
Effective in indoor environments with viewpoint and object placement changes
No training or fine-tuning required for the approach
Demonstrates robustness in real-world scenarios
Abstract
Visual place recognition is essential for vision-based robot localization and SLAM. Despite the tremendous progress made in recent years, place recognition in changing environments remains challenging. A promising approach to cope with appearance variations is to leverage high-level semantic features like objects or place categories. In this paper, we propose FM-Loc which is a novel image-based localization approach based on Foundation Models that uses the Large Language Model GPT-3 in combination with the Visual-Language Model CLIP to construct a semantic image descriptor that is robust to severe changes in scene geometry and camera viewpoint. We deploy CLIP to detect objects in an image, GPT-3 to suggest potential room labels based on the detected objects, and CLIP again to propose the most likely location label. The object labels and the scene label constitute an image descriptor…
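The three-step descriptor construction described in the abstract (detect objects, suggest room labels, pick the most likely label) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The names `build_descriptor`, `score_labels`, `suggest_rooms`, and `Descriptor` are hypothetical, the two model calls are replaced by toy stand-ins so the block runs without CLIP or GPT-3, and keeping the `top_k` highest-scoring labels is an assumed selection rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical interfaces (assumptions, not a real library API):
# a CLIP-like scorer returns one similarity score per candidate label,
# and an LLM-like suggester proposes room names for a list of objects.
Scorer = Callable[[object, List[str]], Dict[str, float]]
Suggester = Callable[[List[str]], List[str]]


@dataclass(frozen=True)
class Descriptor:
    objects: FrozenSet[str]  # object labels found in the image
    scene: str               # most likely room label


def build_descriptor(image, vocabulary: List[str], score_labels: Scorer,
                     suggest_rooms: Suggester, top_k: int = 3) -> Descriptor:
    # Step 1: score every vocabulary label against the image and keep
    # the top_k best as "detected" objects (assumed selection rule).
    scores = score_labels(image, vocabulary)
    detected = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Step 2: ask the language model for candidate room labels.
    rooms = suggest_rooms(detected)
    # Step 3: score the candidate rooms against the image and keep the best.
    room_scores = score_labels(image, rooms)
    scene = max(room_scores, key=room_scores.get)
    return Descriptor(frozenset(detected), scene)


# Toy stand-ins so the sketch runs without any model: the "image" is a
# dict that already maps labels to scores.
def toy_scorer(image, labels):
    return {label: image.get(label, 0.0) for label in labels}


def toy_suggester(objects):
    rooms = []
    if "stove" in objects or "fridge" in objects:
        rooms.append("kitchen")
    if "sofa" in objects:
        rooms.append("living room")
    return rooms or ["unknown"]


toy_image = {"stove": 0.9, "fridge": 0.8, "sofa": 0.1,
             "kitchen": 0.7, "living room": 0.2}
descriptor = build_descriptor(toy_image, ["stove", "fridge", "sofa", "bed"],
                              toy_scorer, toy_suggester, top_k=2)
```

In a real system the two callables would wrap a vision-language model and a language model. Passing them in as parameters keeps the descriptor logic independent of any particular model API.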
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
