The Wallpaper is Ugly: Indoor Localization using Vision and Language
Seth Pate, Lawson L.S. Wong

TL;DR
This paper presents an indoor localization method built on pretrained vision-language models. It matches a natural language query against images of a mapped environment to estimate the user's location, generalizes to unseen environments, and in one case (finetuned CLIP) outperformed humans in the authors' evaluation.
Contribution
The paper builds on pretrained vision-language models to learn a similarity score between text descriptions and images of locations, and uses that score to localize a user from a natural language query, including in environments not seen during training.
Findings
One model, finetuned CLIP, outperformed humans in the authors' evaluation.
The method generalizes to environments, text, and images not seen during training.
The learned text-image similarity score identifies the locations that best match a language query, which gives an estimate of the user's location.
Abstract
We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment. Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment. This score allows us to identify locations that best match the language query, estimating the user's location. Our approach is capable of localizing on environments, text, and images that were not seen during training. One model, finetuned CLIP, outperformed humans in our evaluation.
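The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text query and the location images have already been embedded into a shared vector space by some vision-language encoder, and it uses small hand-written vectors in place of real embeddings. The function names, the location names, and the choice to score a location by its best-matching image are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_locations(
    query_embedding: Sequence[float],
    location_embeddings: Dict[str, List[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Rank locations by similarity between the query and their images.

    Assumption for this sketch: a location's score is the maximum
    similarity over its images. Locations with no images are skipped.
    """
    scored = []
    for name, image_embeddings in location_embeddings.items():
        if not image_embeddings:
            continue
        best = max(cosine_similarity(query_embedding, e) for e in image_embeddings)
        scored.append((name, best))
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for encoder outputs (hypothetical data).
query = [0.9, 0.1, 0.0]
locations = {
    "kitchen": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "hallway": [[0.0, 1.0, 0.0]],
    "bedroom": [[0.0, 0.1, 1.0]],
}

ranking = rank_locations(query, locations)
best_location = ranking[0][0]
```

Here `best_location` is the estimated user location, and the rest of `ranking` gives the remaining candidates in decreasing order of similarity. In a real system the toy vectors would be replaced by embeddings from a pretrained or finetuned model such as CLIP.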
Taxonomy
Topics: Robotics and Sensor-Based Localization · Indoor and Outdoor Localization Technologies
Methods: Contrastive Language-Image Pre-training
