ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics

Jie Chen; Yuxin Cai; Yizhuo Wang; Ruofei Bai; Yuhong Cao; Jun Li; Yau Wei Yun; and Guillaume Sartoretti

arXiv:2603.13833·cs.RO·March 17, 2026

TL;DR

ImagiNav introduces a modular, scalable approach for embodied navigation that leverages generative visual prediction and inverse dynamics, enabling robots to learn navigation from diverse, unlabeled real-world videos without robot-specific training.

Contribution

The paper presents a novel modular framework that decouples visual planning from actuation, utilizing in-the-wild videos and a hierarchy of models for zero-shot robot navigation.

Findings

01. Strong zero-shot transfer to robot navigation.

02. Effective utilization of in-the-wild navigation videos.

03. No need for robot demonstrations during training.

Abstract

Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. Yet, Vision-Language Navigation has relied on end-to-end policies trained on expensive, embodiment-specific robot data. While recent foundation models trained on vast simulation data show promise, the challenge of scaling and generalizing due to the limited scene diversity and visual fidelity in simulation persists. To address this gap, we propose ImagiNav, a novel modular paradigm that decouples visual planning from robot actuation, enabling the direct utilization of diverse in-the-wild navigation videos. Our framework operates as a hierarchy: a Vision-Language Model first decomposes instructions into textual subgoals; a finetuned generative video model then imagines the future video trajectory towards that subgoal; finally, an inverse dynamics model extracts the…
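The hierarchy described in the abstract can be pictured as three pluggable stages. The following is only a minimal sketch of that data flow, not the authors' implementation. Every name in it (`HierarchicalNavigator`, `decompose`, `imagine`, `inverse_dynamics`, and the toy stand-ins) is hypothetical. The abstract is cut off before it says what the inverse dynamics model extracts, so the sketch assumes the conventional meaning of the term: one action inferred per pair of consecutive frames. It also runs open-loop for simplicity, which the source does not state.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder types: a real system would use images and robot commands.
Frame = float
Action = float


@dataclass
class HierarchicalNavigator:
    """Three-stage pipeline: instruction -> subgoals -> imagined frames -> actions."""
    decompose: Callable[[str], List[str]]              # stands in for the Vision-Language Model
    imagine: Callable[[Frame, str], List[Frame]]       # stands in for the generative video model
    inverse_dynamics: Callable[[Frame, Frame], Action] # assumed: action between two frames

    def plan(self, instruction: str, observation: Frame) -> List[Action]:
        actions: List[Action] = []
        current = observation
        for subgoal in self.decompose(instruction):
            # Imagine a short future trajectory toward this subgoal.
            frames = self.imagine(current, subgoal)
            prev = current
            for nxt in frames:
                # Recover the action that would move the robot from prev to nxt.
                actions.append(self.inverse_dynamics(prev, nxt))
                prev = nxt
            # Open-loop assumption: treat the last imagined frame as the new state.
            current = prev
        return actions


# Toy stand-ins so the sketch runs without any models.
def toy_decompose(instruction: str) -> List[str]:
    return [part.strip() for part in instruction.split(" then ")]


def toy_imagine(frame: Frame, subgoal: str) -> List[Frame]:
    return [frame + step for step in (1, 2, 3)]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    return nxt - prev


navigator = HierarchicalNavigator(toy_decompose, toy_imagine, toy_inverse_dynamics)
plan = navigator.plan("go forward then turn left", 0.0)
print(len(plan))
```

Because each stage is only a callable, the planner never needs robot-specific data to produce imagined frames. Only the last stage ties the plan to an embodiment, which is the decoupling the abstract emphasizes.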

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI