VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, Bernadette Bucher

TL;DR
VLFM introduces a zero-shot semantic navigation method that uses vision-language models to identify and reach unseen objects in novel environments, achieving state-of-the-art results on simulated datasets and running successfully on a real-world robot.
Contribution
The paper presents VLFM, a novel approach combining occupancy maps and vision-language models for zero-shot semantic navigation in unseen environments.
Findings
Achieves state-of-the-art SPL on Gibson, HM3D, and MP3D datasets.
Successfully deployed on a real-world robot (Boston Dynamics Spot).
Demonstrates effective zero-shot navigation without prior environment knowledge.
Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM…
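The frontier-selection step described in the abstract can be illustrated with a minimal sketch. It assumes a small 2D occupancy grid and a value map that has already been computed; how the vision-language model produces those values is not shown. The cell encoding, the 4-connected frontier definition, and the names `find_frontiers` and `best_frontier` are hypothetical choices for illustration, not the authors' implementation.

```python
# Hypothetical cell encoding for the occupancy grid (an assumption for this sketch).
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return (row, col) of free cells that touch at least one unknown cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            # 4-connected neighbourhood: down, up, right, left.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


def best_frontier(grid, value_map):
    """Pick the frontier cell with the highest value, or None if none exist."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None
    return max(frontiers, key=lambda rc: value_map[rc[0]][rc[1]])


# Toy map: the right-hand column is partly unexplored.
grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
# Toy value map: higher means "more promising for the target object".
value_map = [
    [0.1, 0.3, 0.0],
    [0.2, 0.0, 0.0],
    [0.1, 0.4, 0.8],
]

print(find_frontiers(grid))
print(best_frontier(grid, value_map))
```

In this toy setup, exploration would head for whichever frontier cell the value map scores highest. A real system would also need to plan a path to that cell and update both maps as new observations arrive.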
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robotics and Sensor-Based Localization
