VISTA: Generative Visual Imagination for Vision-and-Language Navigation

Yanjia Huang; Mingyang Wu; Renjie Li; Zhengzhong Tu

arXiv:2505.07868·cs.RO·February 4, 2026

Open Access

TL;DR

VISTA introduces a generative imagination framework using diffusion models to improve vision-and-language navigation, achieving state-of-the-art results by enabling agents to imagine and align visual goals with observations.

Contribution

The paper proposes VISTA, a novel 'imagine-and-align' approach utilizing diffusion models for visual imagination and structured reasoning in VLN tasks, surpassing previous methods.

Findings

1. +3.6% Success Rate on the R2R benchmark
2. Sets new state-of-the-art on R2R and RoboTHOR
3. Highlights the importance of imagination and alignment in navigation

Abstract

Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an 'observe-and-reason' schema: agents observe the environment and choose the next action from the visual observations of their surroundings. These approaches often struggle in long-horizon scenarios because immediate observations are limited and a gap separates the vision and language modalities. To overcome this, we present VISTA, a novel framework that employs an 'imagine-and-align' navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current…
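The abstract is cut off before it describes how the Perceptual Alignment Filter works, so the following is only a toy illustration of the general 'imagine-and-align' idea, not the authors' method. It assumes the imagined goal and each candidate view have already been encoded as feature vectors (the encoder, the diffusion model, and the cosine-similarity scoring are all assumptions made for this sketch), and it simply picks the view that best matches the imagined goal. The names `cosine_similarity`, `select_aligned_view`, and `threshold` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_aligned_view(
    imagined_goal: Sequence[float],
    candidate_views: List[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[Optional[int], float]:
    """Return (index, score) of the view closest to the imagined goal.

    The index is None when no view reaches the threshold, which a caller
    could treat as a signal to keep exploring or to re-imagine the goal.
    """
    if not candidate_views:
        return None, 0.0
    scores = [cosine_similarity(imagined_goal, view) for view in candidate_views]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy usage: three candidate view embeddings, one close to the imagined goal.
imagined = [1.0, 0.0, 0.0]
views = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
index, score = select_aligned_view(imagined, views)
```

In this toy run the second view (index 1) is selected because it points in nearly the same direction as the imagined goal. A real system would compare learned image features instead of hand-written vectors.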

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning