VISTA: Generative Visual Imagination for Vision-and-Language Navigation
Yanjia Huang, Mingyang Wu, Renjie Li, Zhengzhong Tu

TL;DR
VISTA introduces a generative imagination framework using diffusion models to improve vision-and-language navigation, achieving state-of-the-art results by enabling agents to imagine and align visual goals with observations.
Contribution
The paper proposes VISTA, an 'imagine-and-align' approach that uses pre-trained diffusion models for visual imagination and structured reasoning in VLN tasks, and reports state-of-the-art results on R2R and RoboTHOR.
Findings
+3.6% Success Rate on R2R benchmark
Sets new state-of-the-art on R2R and RoboTHOR
Highlights importance of imagination and alignment in navigation
Abstract
Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an 'observe-and-reason' schema: agents observe the environment and decide on the next action based on visual observations of their surroundings. These approaches often struggle in long-horizon scenarios because immediate observations are limited and a gap separates the vision and language modalities. To overcome this, we present VISTA, a novel framework that employs an 'imagine-and-align' navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current…
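As a rough illustration of the 'imagine-and-align' idea described in the abstract, the sketch below imagines a goal feature from the current observation and the instruction, then picks the candidate view that best matches it. This is not the authors' implementation. The names `imagine_and_align_step` and `toy_imagine_goal`, the use of plain feature vectors, and cosine similarity as the alignment score are all assumptions made for illustration. In particular, `toy_imagine_goal` is a hypothetical stand-in for a pre-trained diffusion model, and the scoring step is only a placeholder for the Perceptual Alignment Filter, whose details are not given in the text above.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def imagine_and_align_step(
    observation_feat: Vector,
    instruction: str,
    candidate_views: Dict[str, Vector],
    imagine_goal: Callable[[Vector, str], Vector],
) -> str:
    """One hypothetical navigation step: imagine a goal, then align.

    `imagine_goal` is an assumed interface for a generative model that maps
    (observation, instruction) to a feature vector of the imagined goal view.
    """
    imagined = imagine_goal(observation_feat, instruction)
    # Assumed alignment rule: score each candidate view against the imagination.
    scores = {
        action: cosine_similarity(imagined, feat)
        for action, feat in candidate_views.items()
    }
    # Choose the action whose view is most similar to the imagined goal.
    return max(scores, key=lambda action: scores[action])


def toy_imagine_goal(observation_feat: Vector, instruction: str) -> Vector:
    """Toy stand-in for a diffusion model (ignores the observation)."""
    return [1.0, 0.0] if "kitchen" in instruction else [0.0, 1.0]


candidates = {"forward": [0.9, 0.1], "turn_left": [0.1, 0.9]}
chosen = imagine_and_align_step(
    [0.5, 0.5], "go to the kitchen", candidates, toy_imagine_goal
)
print(chosen)
```

With these toy vectors, the imagined "kitchen" feature is closest to the `forward` view, so that action is selected. A real system would replace the toy function with an image generator and the vectors with learned visual features.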
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
