NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants
Yiran Qin, Ao Sun, Yuze Hong, Benyou Wang, Ruimao Zhang

TL;DR
NavigateDiff leverages vision-language models and diffusion networks to enable zero-shot visual navigation, improving robot adaptability and efficiency in unfamiliar environments without extensive retraining.
Contribution
The paper introduces NavigateDiff, a novel approach that combines large vision-language models with diffusion networks to predict future observations and guide zero-shot navigation.
Findings
Enhanced navigation robustness in diverse environments
Effective generalization to unseen scenes
Improved efficiency over traditional RL methods
Abstract
Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decoration and layout. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, which is time-consuming and inefficient. To address these challenges, we try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent's potential observations in the next step, which can assist robots in generating robust actions. Furthermore, to adapt to the temporal nature of navigation, we introduce temporal historical information to ensure that…
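The predict-then-act loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D world, the integer "observations", and the names `predict_next_observation`, `choose_action`, and `navigate` are all hypothetical stand-ins. The predictor stub takes the place of the vision-language model plus diffusion network, and the bounded history buffer is an assumed stand-in for the temporal historical information.

```python
from collections import deque
from typing import Deque, List


def predict_next_observation(history: Deque[int], goal: int) -> int:
    """Hypothetical stand-in for the visual predictor.

    Returns the observation expected one step ahead. The real system would
    generate a future image; here an observation is just a 1-D position.
    """
    current = history[-1]
    if current == goal:
        return current
    return current + (1 if goal > current else -1)


def choose_action(current: int, predicted: int) -> int:
    """Hypothetical stand-in for the policy: move toward the predicted observation."""
    return predicted - current  # -1, 0, or +1 in this toy world


def navigate(start: int, goal: int, max_steps: int = 20, history_len: int = 4) -> List[int]:
    """Run the predict-then-act loop and return the visited positions."""
    # Bounded buffer of recent observations (assumed form of temporal history).
    history: Deque[int] = deque([start], maxlen=history_len)
    trajectory = [start]
    for _ in range(max_steps):
        predicted = predict_next_observation(history, goal)
        action = choose_action(history[-1], predicted)
        if action == 0:  # predicted observation matches the current one: stop
            break
        next_obs = history[-1] + action
        history.append(next_obs)
        trajectory.append(next_obs)
    return trajectory
```

Calling `navigate(0, 3)` walks the toy agent from position 0 to position 3 one step at a time. The point of the sketch is only the control flow: predict the next observation first, then derive the action from it.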
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Diffusion
