ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics
Jie Chen, Yuxin Cai, Yizhuo Wang, Ruofei Bai, Yuhong Cao, Jun Li, Yau Wei Yun, and Guillaume Sartoretti

TL;DR
ImagiNav introduces a modular, scalable approach for embodied navigation that leverages generative visual prediction and inverse dynamics, enabling robots to learn navigation from diverse, unlabeled real-world videos without robot-specific training.
Contribution
The paper presents a novel modular framework that decouples visual planning from actuation, utilizing in-the-wild videos and a hierarchy of models for zero-shot robot navigation.
Findings
Strong zero-shot transfer to robot navigation.
Effective utilization of in-the-wild navigation videos.
No need for robot demonstrations during training.
Abstract
Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. Yet, Vision-Language Navigation has relied on end-to-end policies trained on expensive, embodiment-specific robot data. While recent foundation models trained on vast simulation data show promise, scaling and generalization remain difficult because simulation offers limited scene diversity and visual fidelity. To address this gap, we propose ImagiNav, a novel modular paradigm that decouples visual planning from robot actuation, enabling the direct utilization of diverse in-the-wild navigation videos. Our framework operates as a hierarchy: a Vision-Language Model first decomposes instructions into textual subgoals; a finetuned generative video model then imagines the future video trajectory towards each subgoal; finally, an inverse dynamics model extracts the…
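The three-stage hierarchy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `HierarchicalNavigator`, its three callables, and the toy 1-D stand-ins are all hypothetical, and real models would replace each stub. Because the abstract is cut off before stating what the inverse dynamics model extracts, the sketch assumes the common usage in which such a model infers the action linking two consecutive frames. Chaining subgoals from the last imagined frame, rather than re-observing the scene, is also an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = float   # toy stand-in for an image: a 1-D position
Action = str    # toy stand-in for a robot command


@dataclass
class HierarchicalNavigator:
    """Hypothetical wiring of the three stages; every name here is illustrative."""
    decompose: Callable[[str], List[str]]               # stands in for the VLM
    imagine: Callable[[Frame, str], List[Frame]]        # stands in for the video model
    inverse_dynamics: Callable[[Frame, Frame], Action]  # stands in for the IDM

    def plan(self, instruction: str, observation: Frame) -> List[Action]:
        actions: List[Action] = []
        current = observation
        for subgoal in self.decompose(instruction):
            # Stage 2: predict future frames toward this subgoal.
            frames = self.imagine(current, subgoal)
            # Stage 3: assumed usage, one action per consecutive frame pair.
            previous = current
            for frame in frames:
                actions.append(self.inverse_dynamics(previous, frame))
                previous = frame
            # Assumption: continue open-loop from the last imagined frame.
            current = previous
        return actions


# Toy stubs so the sketch runs end to end.
def toy_decompose(instruction: str) -> List[str]:
    return [part.strip() for part in instruction.split(" then ")]


def toy_imagine(current: Frame, subgoal: str) -> List[Frame]:
    if subgoal == "forward":
        return [current + 1.0, current + 2.0]
    return [current - 1.0]


def toy_inverse_dynamics(before: Frame, after: Frame) -> Action:
    return "forward" if after > before else "backward"


navigator = HierarchicalNavigator(toy_decompose, toy_imagine, toy_inverse_dynamics)
plan = navigator.plan("forward then back", 0.0)
print(plan)  # ['forward', 'forward', 'backward']
```

The point of the structure is that only the last callable depends on the robot's embodiment; the first two operate purely on language and video, which is what lets the visual planner be trained on videos that carry no action labels.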
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
