The Wallpaper is Ugly: Indoor Localization using Vision and Language

Seth Pate; Lawson L.S. Wong

arXiv:2410.03900·cs.CV·October 8, 2024

Open Access

TL;DR

This paper localizes a user in a mapped indoor environment by scoring how well a natural language description matches images of each location, building on pretrained vision-language models. The approach generalizes to unseen environments, and one model, finetuned CLIP, outperformed humans in the authors' evaluation.

Contribution

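The paper builds on pretrained vision-language models to learn a similarity score between text descriptions and images of locations, then uses that score to estimate the user's location from a natural language query. The approach works on environments, text, and images that were not seen during training.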

Findings

01

One model, finetuned CLIP, outperformed humans in the authors' evaluation.

02

The method generalizes to environments, text, and images not seen during training.

03

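A learned similarity score between text descriptions and location images identifies the locations that best match a language query, which yields an estimate of the user's location.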

Abstract

We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment. Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment. This score allows us to identify locations that best match the language query, estimating the user's location. Our approach is capable of localizing on environments, text, and images that were not seen during training. One model, finetuned CLIP, outperformed humans in our evaluation.
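
The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the text query and the location images have already been embedded into a shared vector space (for example by a CLIP-style encoder, which is not shown here), uses cosine similarity as the score, and takes the best-scoring image as each location's score. The function names, the toy vectors, and the max-over-images aggregation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_scores(text_vec, image_mat):
    """Cosine similarity between one text embedding and each image embedding."""
    text_unit = text_vec / np.linalg.norm(text_vec)
    image_unit = image_mat / np.linalg.norm(image_mat, axis=1, keepdims=True)
    return image_unit @ text_unit


def localize(text_vec, image_mat, image_locations):
    """Rank locations by the best-matching image at each location.

    Assumption: a location's score is the maximum score over its images.
    The paper may aggregate differently.
    """
    scores = cosine_scores(text_vec, image_mat)
    best = {}
    for location, score in zip(image_locations, scores):
        best[location] = max(best.get(location, -np.inf), float(score))
    # Highest score first: the top entry is the estimated user location.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_mat = np.array([
    [1.0, 0.0, 0.0],  # hallway, image 1
    [0.9, 0.1, 0.0],  # hallway, image 2
    [0.0, 1.0, 0.0],  # kitchen
    [0.0, 0.0, 1.0],  # office
])
image_locations = ["hallway", "hallway", "kitchen", "office"]
text_vec = np.array([0.1, 0.9, 0.0])  # stands in for an embedded query

ranked = localize(text_vec, image_mat, image_locations)
print(ranked[0][0])  # kitchen
```

Because the ranking depends only on embeddings, the same routine applies unchanged to a new environment once its images are embedded, which is consistent with the abstract's claim of localizing on environments not seen during training.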

Taxonomy

Topics: Robotics and Sensor-Based Localization · Indoor and Outdoor Localization Technologies

Methods: Contrastive Language-Image Pre-training