ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang; Xinxin Zhao; Wenzhe Cai; Changyin Sun

arXiv:2512.17435 · cs.RO · May 1, 2026

TL;DR

ImagineNav++ leverages vision-language models with scene imagination and memory mechanisms to enable mapless visual navigation, achieving state-of-the-art results in object and instance navigation tasks.

Contribution

This work introduces a novel imagination-powered navigation framework that transforms spatial reasoning into visual viewpoint selection using VLMs, enhancing mapless navigation capabilities.

Findings

01 Achieves state-of-the-art performance in mapless object and instance navigation benchmarks.

02 Outperforms most map-based methods in open-vocabulary navigation tasks.

03 Demonstrates the effectiveness of scene imagination and memory in VLM-based spatial reasoning.

Abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry--critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a…
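
The core idea in the abstract, imagining an observation for each candidate view and then asking a VLM to pick the best image, can be illustrated with a minimal sketch. This is not the authors' implementation: `CandidateView`, `plan_step`, `toy_imagine`, and `toy_select` are hypothetical names introduced here, and the two toy functions are trivial stand-ins for a learned view-synthesis model and a VLM query.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder image type: a 2D grid of brightness values (assumption for this sketch).
Image = List[List[float]]


@dataclass(frozen=True)
class CandidateView:
    """Hypothetical candidate robot pose, relative to the current pose."""
    forward_m: float  # distance to move forward, in metres
    yaw_deg: float    # heading change, in degrees


def plan_step(
    observation: Image,
    goal: str,
    candidates: Sequence[CandidateView],
    imagine: Callable[[Image, CandidateView], Image],
    select_best: Callable[[Sequence[Image], str], int],
) -> CandidateView:
    """Reduce one planning step to best-view image selection.

    `imagine` stands in for a model that predicts what the robot would see
    from a candidate view; `select_best` stands in for a VLM that returns
    the index of the most promising image for the given goal.
    """
    if not candidates:
        raise ValueError("at least one candidate view is required")
    imagined = [imagine(observation, view) for view in candidates]
    index = select_best(imagined, goal)
    if not 0 <= index < len(candidates):
        raise ValueError(f"selector returned invalid index {index}")
    return candidates[index]


def toy_imagine(observation: Image, view: CandidateView) -> Image:
    """Toy stand-in: brighten the image in proportion to forward distance."""
    return [[pixel + view.forward_m for pixel in row] for row in observation]


def toy_select(images: Sequence[Image], goal: str) -> int:
    """Toy stand-in for a VLM: pick the image with the highest mean brightness.

    The goal string is ignored here; a real selector would condition on it.
    """
    def mean(img: Image) -> float:
        return sum(sum(row) for row in img) / (len(img) * len(img[0]))

    return max(range(len(images)), key=lambda i: mean(images[i]))


if __name__ == "__main__":
    obs = [[0.1, 0.2], [0.3, 0.4]]
    views = [
        CandidateView(1.0, -30.0),
        CandidateView(2.0, 0.0),
        CandidateView(1.5, 30.0),
    ]
    print(plan_step(obs, "chair", views, toy_imagine, toy_select))
```

Passing the imagination model and the selector in as callables keeps the planning step independent of any particular model, so either stand-in could be swapped for a real component without changing `plan_step`.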

Videos

ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination (SlidesLive)