ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Xinxin Zhao; Wenzhe Cai; Likun Tang; Teng Wang

arXiv:2410.09874 · cs.RO · October 15, 2024 · 2 cites

TL;DR

ImagineNav lets vision-language models apply their spatial perception and planning abilities by imagining future views. This turns navigation planning into a view selection task, and the method outperforms existing approaches on open-vocabulary object navigation benchmarks.

Contribution

This work introduces ImagineNav, a novel framework that enhances VLMs with scene imagination for mapless visual navigation using only onboard RGB/RGB-D inputs.

Findings

1. Outperforms existing methods on open-vocabulary object navigation benchmarks.
2. Effectively transforms navigation planning into a view selection problem.
3. Demonstrates the potential of VLMs for embodied navigation tasks.

Abstract

Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability needed to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, the planning process of LLMs is confined to text, and text alone struggles to represent spatial occupancy and geometric layout, both of which are important for making rational navigation decisions. In this work, we seek to unleash the spatial perception and planning ability of Vision-Language Models (VLMs), and explore whether a VLM, given only RGB/RGB-D streams from an on-board camera, can efficiently complete visual navigation tasks in a mapless manner. We achieve this by developing the imagination-powered navigation framework ImagineNav, which imagines the future observation images at valuable robot…
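To make the "navigation as view selection" idea concrete, here is a minimal Python sketch of a single decision step. It is an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `select_next_waypoint`, `imagine_view`, and `choose_view` are hypothetical, and the two toy functions merely stand in for an image-imagination model and a VLM query.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate waypoint, expressed relative to the robot."""
    distance_m: float
    heading_deg: float


def select_next_waypoint(
    observation: object,
    goal: str,
    candidates: Sequence[Candidate],
    imagine_view: Callable[[object, Candidate], object],
    choose_view: Callable[[str, Sequence[object]], int],
) -> Candidate:
    """Pick the waypoint whose imagined view the selector prefers.

    imagine_view: assumed stand-in for a model that predicts what the
        robot would observe from a candidate waypoint.
    choose_view: assumed stand-in for a VLM prompt that returns the
        index of the most promising imagined view for the goal.
    """
    if not candidates:
        raise ValueError("at least one candidate waypoint is required")
    imagined = [imagine_view(observation, c) for c in candidates]
    index = choose_view(goal, imagined)
    if not 0 <= index < len(candidates):
        raise ValueError(f"selector returned an invalid index: {index}")
    return candidates[index]


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_imagine(observation: object, candidate: Candidate) -> str:
    # Pretend that a chair is visible only when turning right.
    seen = "chair" if candidate.heading_deg > 0 else "wall"
    return f"heading {candidate.heading_deg}: {seen}"


def toy_choose(goal: str, views: Sequence[object]) -> int:
    # Return the first view mentioning the goal, else fall back to view 0.
    for i, view in enumerate(views):
        if goal in str(view):
            return i
    return 0
```

In a real system the two callables would wrap a view-synthesis model and a VLM query; passing them in as parameters keeps the selection logic independent of any particular model.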

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition