VISTAv2: World Imagination for Indoor Vision-and-Language Navigation

Yanjia Huang; Xianshun Jiang; Xiangbo Gao; Mingyang Wu; Zhengzhong Tu

arXiv:2512.00041·cs.RO·December 2, 2025

TL;DR

VISTAv2 introduces a generative world model for indoor vision-and-language navigation that predicts egocentric future views and constructs an online value map for improved planning and navigation accuracy.

Contribution

The paper presents an action-conditioned imagination framework that integrates a vision-language scorer with an online value map, with the aim of improving robustness and interpretability in VLN tasks.

Findings

01. VISTAv2 outperforms strong baselines on MP3D and RoboTHOR datasets.

02. Action-conditioned imagination and online value fusion are critical for performance.

03. The model achieves efficient inference on a single GPU.

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow language instructions while acting in continuous real-world spaces. Prior image-imagination-based VLN work shows benefits for discrete panoramas, but it lacks online, action-conditioned predictions and does not produce explicit planning values; moreover, many methods replace the planner with long-horizon objectives that are brittle and slow. To bridge this gap, we propose VISTAv2, a generative world model that rolls out egocentric future views conditioned on past observations, candidate action sequences, and instructions, and projects them into an online value map for planning. Unlike prior approaches, VISTAv2 does not replace the planner. The online value map is fused at score level with the base objective, providing reachability and risk-aware guidance. Concretely, we employ an action-aware Conditional Diffusion Transformer…
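
The abstract says the online value map is fused at score level with the base objective rather than replacing the planner. The excerpt does not give the fusion rule, so the following is only a minimal sketch of one plausible reading: a weighted sum of normalized scores over candidate action sequences. The function names (`min_max_normalize`, `fuse_scores`, `select_candidate`), the `weight` parameter, and the min-max normalization are all illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    base_scores: Sequence[float],
    value_scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Hypothetical score-level fusion: base + weight * value, per candidate.

    base_scores:  scores from the existing planner objective (assumed given).
    value_scores: scores read from the imagined value map (assumed given).
    weight:       assumed scalar trade-off; weight = 0 recovers the base planner.
    """
    if len(base_scores) != len(value_scores):
        raise ValueError("each candidate needs one base and one value score")
    base_n = min_max_normalize(base_scores)
    value_n = min_max_normalize(value_scores)
    return [b + weight * v for b, v in zip(base_n, value_n)]


def select_candidate(
    base_scores: Sequence[float],
    value_scores: Sequence[float],
    weight: float = 0.5,
) -> int:
    """Return the index of the candidate with the highest fused score."""
    fused = fuse_scores(base_scores, value_scores, weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

In this sketch the planner's own ranking is kept and only nudged by the value term, which mirrors the abstract's statement that the planner is not replaced. Setting `weight` to zero returns the base planner's choice unchanged.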

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI